Request to optimize recognizers matching #141

xor2003 · 2022-05-30T17:34:14Z

parglare version: 0.15
Python version: 3.10
Operating System:

Description

Could you speedup recognisers matching?
I have 10MB source files and >100 number of recognisers.
Profiling telling 60% time spent on it.

I was thinking of doing it myself. Not sure I can.
Is it important the order of recognisers matching?
Do you have idea how the algorithm should be modified?

igordejanovic · 2022-05-31T12:26:49Z

I did a couple rounds of recognition call optimization in the past. I guess that 60% of time spent in recognizers is pretty fine if you don't have complex actions attached you can expect that most of time is spent in choping the input string.

I would be happy to review a PR if you find a way to improve performance.

You can start by looking here. There is one optimization implemented based on precalculating the cut-off place, e.g. if recognizer succeeds and it is marked with finish flag further recognizers are not tried. See here.

If you are doing this in a commercial setup and need consulting you can contact me directly at igor dot dejanovic at gmail.

xor2003 · 2022-11-06T10:16:40Z

Don't realy remmeber the implementation details but
I think you can merge recognisers in a single regexp in reverse order.
Let say recognisers are:
a. , cd, ef, xy

import re
string='cdef'
re.match('(xy)|(ef)|(cd)|(ab)',string).lastindex

As result you will get the index of first recogniser matched:
3
which is "cd"

xor2003 · 2022-11-06T14:11:37Z

Could not make it work:

...
        #if head.state.regex is None:
        recognizers = []
        for id, symbol in enumerate(actions):
            if isinstance(symbol.recognizer, StringRecognizer):
                value = re.escape(symbol.recognizer.value_cmp)
                if symbol.recognizer.ignore_case:
                    value = f"(?i){value}(?-i)"
                recognizers.append(f"({value})")
            elif isinstance(symbol.recognizer, RegExRecognizer):
                value = symbol.recognizer._regex
                val = value
                value = re.sub(r"(?<!\\)\((?!\?)", "(?:",value, )  # Make groups non-capturing
                if val != value:
                    print(val, value)
                if symbol.recognizer.ignore_case:
                    value = f"(?i){value}(?-i)"
                recognizers.append(f"({value})")
            else:
                recognizers.append(f"(NEVERMATCH)")
        #print(recognizers)
        regex = re.compile("|".join(recognizers))
        m = re.match(regex, input_str[position:])
        if m:
            groups = m.groups()
            groups = groups[:len(recognizers)]
            #assert len(recognizers) == len(groups)
        else:
            groups = []

        """
        regex_reversed = "|".join(reversed(recognizers))
        lastindex = re.match(regex_reversed, input_str_lower).lastindex
        idx = len(recognizers) - lastindex
        symbol = actions[idx]
        """

        for idx, grp in enumerate(groups):
            if grp is None:
                continue
            symbol = list(actions.keys())[idx]

            if symbol.prior < last_prior and tokens:
                break
...

igordejanovic · 2022-11-06T17:20:15Z

I don't think that can work because different groups of recognizers are tried in different order at different parser states. This approach is sometimes called "context-aware" lexing. That means in order to optimize by concatenating regexes you would have to make all possible variations for all possible parser states and track which one to use in each state. Not to mention that it is possible to have multiple recognizers match if there is lexical ambiguity, which is perfectly allowed in GLR parsing.

igordejanovic added the performance label May 31, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Request to optimize recognizers matching #141

Request to optimize recognizers matching #141

xor2003 commented May 30, 2022

igordejanovic commented May 31, 2022

xor2003 commented Nov 6, 2022 •

edited

Loading

xor2003 commented Nov 6, 2022 •

edited

Loading

igordejanovic commented Nov 6, 2022

Request to optimize recognizers matching #141

Request to optimize recognizers matching #141

Comments

xor2003 commented May 30, 2022

Description

igordejanovic commented May 31, 2022

xor2003 commented Nov 6, 2022 • edited Loading

xor2003 commented Nov 6, 2022 • edited Loading

igordejanovic commented Nov 6, 2022

xor2003 commented Nov 6, 2022 •

edited

Loading

xor2003 commented Nov 6, 2022 •

edited

Loading