
Understanding Earley parser performance edge cases #1492

Open
dolsysmith opened this issue Nov 28, 2024 · 4 comments

@dolsysmith

First of all, thank you very much for your work on Lark. It's helped us solve an otherwise intractable problem in a data-migration project: parsing co-author strings (from publication metadata).

I'm posing this question in the hopes of improving my understanding of the Earley parser -- and, if possible, improving the grammar I've written for this task.

The edge cases in question occur with a frequency < 1% (in a dataset of 30K+ co-author strings); in these cases, the parser's memory usage explodes, and sometimes the computation never finishes (i.e., not before macOS kills the process). I work around this by running the parser in a separate Python process that terminates after a set timeout; the results below are based on a timeout of 2 minutes.
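Concretely, the harness looks roughly like this (a sketch: the toy GRAMMAR below is a stand-in for my real grammar, and parse_with_timeout is just an illustrative name):

```python
# Rough sketch of the timeout harness; GRAMMAR is a toy stand-in for
# the real author grammar, and the names here are illustrative.
import multiprocessing as mp
import queue

from lark import Lark

GRAMMAR = r"""
start: WORD+
WORD: /[A-Za-z-]+/
%ignore /[ ,]/
"""

def _worker(text, out):
    parser = Lark(GRAMMAR, parser='earley')
    out.put(parser.parse(text))

def parse_with_timeout(text, timeout=120):
    out = mp.Queue()
    proc = mp.Process(target=_worker, args=(text, out))
    proc.start()
    try:
        return out.get(timeout=timeout)  # wait for a tree, or give up
    except queue.Empty:
        return None                      # parse didn't finish in time
    finally:
        if proc.is_alive():
            proc.terminate()             # kill the runaway parse
        proc.join()
```

(On macOS the spawn start method is used, so calls to parse_with_timeout need to sit under an if __name__ == '__main__': guard.)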

  • The maximum string length in the dataset is 1000 characters. The timeouts don't happen for strings under 200 characters in length.
  • 75% of the successful parses complete in less than 0.01 seconds.
  • Above ~150 characters, the performance does fall off. The following string is parsed in 94 seconds:
Merialdi M, Widmer M, Gulmezoglu AM, Abdel-Aleem H, Bega G, Benachi A, Carroli G, Cecatti JG, Diemert A, Gonzalez R, Hecher K, Jensen LN, Johnsen SL, Kiserud T, Kriplani A, Lumbiganon P, Tabor A, Talegawkar SA, Tshefu A, Wojdyla D, Platt L
  • The following (similar) string can't be parsed in under 2 minutes:
Liu CM, Aziz M, Park DE, Wu Z, Stegger M, Li M, Wang Y, Schmidlin K, Johnson TJ, Koch BJ, Hungate BA, Nordstrom L, Gauld L, Weaver B, Rolland D, Statham S, Hall B, Sariya S, Davis GS, Keim PS, Johnson JR, Price LB
  • The grammar I'm using is here.
  • The range of cases that the grammar covers is specified in these test cases.

From what I've read (in this repo and elsewhere), I realize that I could avoid the performance issues by switching to the LALR parser. But the data are rife with ambiguity (these being manually entered co-author strings), and the Earley parser lets me handle an impressive range of variations. Performance is not a critical issue; as I said, I'm mainly interested in understanding why it falls off so dramatically past a certain string length, and whether there's anything I can do to improve my grammar.

Thank you!

@erezsh
Member

erezsh commented Nov 28, 2024

Hi @dolsysmith

I just ran your examples using `p = Lark(AUTHOR_GRAMMAR, regex=True)`.

Both of them run in less than a second.

Sample from the result:

start
  multiple_authors
    an_init_lfo
      author_name
        Liu
        None
      initials
        C
        M
    an_init_lfo
      author_name
        Aziz
        None
      initials
        M
        None
        None
    an_init_lfo
      author_name
        Park
        None
      initials
        D
        E
...

Can you make sure you're using the latest Lark version, from the `lark` package? (i.e. `pip install lark`)

@erezsh
Member

erezsh commented Nov 28, 2024

P.S. It is possible that using `ambiguity='explicit'` is what's causing the slowdown, as it generates every possible derivation. It's much better to work directly with the SPPF: https://lark-parser.readthedocs.io/en/latest/forest.html
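For example, something along these lines (a sketch with a toy ambiguous grammar standing in for your AUTHOR_GRAMMAR, and "most nodes wins" as just one possible selection criterion):

```python
# Working on the SPPF directly: resolve each ambiguity in the forest
# instead of materializing every alternative tree.
from lark import Lark
from lark.parsers.earley_forest import TreeForestTransformer, handles_ambiguity

GRAMMAR = r"""
start: expr
expr: expr "+" expr
    | NUMBER
NUMBER: /\d+/
%ignore " "
"""

class PickLargest(TreeForestTransformer):
    # Called on ambiguous 'expr' nodes with a list of candidate
    # (already transformed) derivations; keep exactly one of them.
    @handles_ambiguity
    def expr(self, derivations):
        return max(derivations, key=lambda t: len(list(t.iter_subtrees())))

parser = Lark(GRAMMAR, ambiguity='forest')  # parse() returns the SPPF root
forest = parser.parse("1 + 2 + 3")
tree = PickLargest(resolve_ambiguity=False).transform(forest)
print(tree.pretty())
```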

@dolsysmith
Author

I will try that...thank you!

@chanicpanic
Contributor

@dolsysmith I noticed that you are resolving ambiguities manually by selecting trees with more nodes. The same effect can be achieved automatically and more efficiently by giving all rules and terminals in your grammar the same positive priority (say, 1) and using `ambiguity='resolve'`.
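For example (a sketch with hypothetical rule names, just to illustrate the syntax):

```python
# Illustrative only: a ".1" suffix assigns priority 1 to each rule and
# terminal; with ambiguity='resolve', the Earley parser then keeps the
# derivation with the highest total priority, i.e. the one with more nodes.
from lark import Lark

GRAMMAR = r"""
start.1: an_init_lfo ("," an_init_lfo)*
an_init_lfo.1: author_name initials
author_name.1: WORD
initials.1: LETTER LETTER?
WORD.1: /[A-Z][a-z]+/
LETTER.1: /[A-Z]/
%ignore " "
"""

parser = Lark(GRAMMAR, ambiguity='resolve')  # 'resolve' is the default
tree = parser.parse("Liu CM, Aziz M")
```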
