
Understanding Earley parser performance edge cases #1492

Open
dolsysmith opened this issue Nov 28, 2024 · 4 comments

@dolsysmith

First of all, thank you very much for your work on Lark. It's helped us solve an otherwise intractable problem in a data-migration project: parsing co-author strings (from publication metadata).

I'm posing this question in the hopes of improving my understanding of the Earley parser -- and, if possible, improving the grammar I've written for this task.

The edge cases in question occur with a frequency < 1% (in a dataset of 30K+ co-author strings); in these cases, the parser's memory usage explodes, and sometimes the computation never finishes (i.e., not before macOS kills the process). I work around this by running the parser in a separate Python process that terminates after a set timeout; the results below are based on a timeout of 2 minutes.
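Concretely, the harness looks roughly like this (a sketch: the toy GRAMMAR below is a stand-in for my real grammar, and parse_with_timeout is just an illustrative name):

```python
# Rough sketch of the timeout harness; GRAMMAR is a toy stand-in for
# the real author grammar, and the names here are illustrative.
import multiprocessing as mp
import queue

from lark import Lark

GRAMMAR = r"""
start: WORD+
WORD: /[A-Za-z-]+/
%ignore /[ ,]/
"""

def _worker(text, out):
    parser = Lark(GRAMMAR, parser='earley')
    out.put(parser.parse(text))

def parse_with_timeout(text, timeout=120):
    out = mp.Queue()
    proc = mp.Process(target=_worker, args=(text, out))
    proc.start()
    try:
        return out.get(timeout=timeout)  # wait for a tree, or give up
    except queue.Empty:
        return None                      # parse didn't finish in time
    finally:
        if proc.is_alive():
            proc.terminate()             # kill the runaway parse
        proc.join()
```

(On macOS the spawn start method is used, so calls to parse_with_timeout need to sit under an if __name__ == '__main__': guard.)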

  • The maximum string length in the dataset is 1000 characters. The timeouts don't happen for strings under 200 characters in length.
  • 75% of the successful parses complete in less than 0.01 seconds.
  • Above ~150 characters, the performance does fall off. The following string is parsed in 94 seconds:
Merialdi M, Widmer M, Gulmezoglu AM, Abdel-Aleem H, Bega G, Benachi A, Carroli G, Cecatti JG, Diemert A, Gonzalez R, Hecher K, Jensen LN, Johnsen SL, Kiserud T, Kriplani A, Lumbiganon P, Tabor A, Talegawkar SA, Tshefu A, Wojdyla D, Platt L
  • The following (similar) string can't be parsed in under 2 minutes:
Liu CM, Aziz M, Park DE, Wu Z, Stegger M, Li M, Wang Y, Schmidlin K, Johnson TJ, Koch BJ, Hungate BA, Nordstrom L, Gauld L, Weaver B, Rolland D, Statham S, Hall B, Sariya S, Davis GS, Keim PS, Johnson JR, Price LB
  • The grammar I'm using is here.
  • The range of cases that the grammar covers is specified in these test cases.

From what I've read (in this repo and elsewhere), I realize that I could avoid the performance issues by switching to the LALR parser. But the data are rife with ambiguity (these being manually entered co-author strings), and the Earley parser lets me handle an impressive range of variations. Performance is not a critical issue; as I said, I'm mainly interested in understanding why it falls off so dramatically past a certain string length, and whether there's anything I can do to improve my grammar.

Thank you!

@erezsh
Member

erezsh commented Nov 28, 2024

Hi @dolsysmith

I just ran your examples using `p = Lark(AUTHOR_GRAMMAR, regex=True)`.

Both of them run in less than a second.

Sample from the result:

start
  multiple_authors
    an_init_lfo
      author_name
        Liu
        None
      initials
        C
        M
    an_init_lfo
      author_name
        Aziz
        None
      initials
        M
        None
        None
    an_init_lfo
      author_name
        Park
        None
      initials
        D
        E
...

Can you make sure you're using the latest Lark version, from the `lark` package? (i.e. `pip install lark`)

@erezsh
Member

erezsh commented Nov 28, 2024

P.S. It is possible that using `ambiguity='explicit'` is what's causing the slowdown, as it generates every possible derivation. It's much better to work directly with the SPPF: https://lark-parser.readthedocs.io/en/latest/forest.html
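For example, something along these lines (a sketch with a toy ambiguous grammar standing in for your AUTHOR_GRAMMAR, and "most nodes wins" as just one possible selection criterion):

```python
# Working on the SPPF directly: resolve each ambiguity in the forest
# instead of materializing every alternative tree.
from lark import Lark
from lark.parsers.earley_forest import TreeForestTransformer, handles_ambiguity

GRAMMAR = r"""
start: expr
expr: expr "+" expr
    | NUMBER
NUMBER: /\d+/
%ignore " "
"""

class PickLargest(TreeForestTransformer):
    # Called on ambiguous 'expr' nodes with a list of candidate
    # (already transformed) derivations; keep exactly one of them.
    @handles_ambiguity
    def expr(self, derivations):
        return max(derivations, key=lambda t: len(list(t.iter_subtrees())))

parser = Lark(GRAMMAR, ambiguity='forest')  # parse() returns the SPPF root
forest = parser.parse("1 + 2 + 3")
tree = PickLargest(resolve_ambiguity=False).transform(forest)
print(tree.pretty())
```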

@dolsysmith
Author

I will try that...thank you!

@chanicpanic
Contributor

@dolsysmith I noticed that you are resolving ambiguities manually by selecting trees with more nodes. The same effect can be achieved automatically and more efficiently by giving all rules and terminals in your grammar the same positive priority (say, 1) and using `ambiguity='resolve'`.
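For example (a sketch with hypothetical rule names, just to illustrate the syntax):

```python
# Illustrative only: a ".1" suffix assigns priority 1 to each rule and
# terminal; with ambiguity='resolve', the Earley parser then keeps the
# derivation with the highest total priority, i.e. the one with more nodes.
from lark import Lark

GRAMMAR = r"""
start.1: an_init_lfo ("," an_init_lfo)*
an_init_lfo.1: author_name initials
author_name.1: WORD
initials.1: LETTER LETTER?
WORD.1: /[A-Z][a-z]+/
LETTER.1: /[A-Z]/
%ignore " "
"""

parser = Lark(GRAMMAR, ambiguity='resolve')  # 'resolve' is the default
tree = parser.parse("Liu CM, Aziz M")
```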
