Understanding Earley parser performance edge cases #1492
Comments
Hi @dolsysmith, I just ran your examples using … Both of them run in less than a second. Sample from the result:
Can you make sure you're using the latest Lark version, using the …
P.S. It is possible that using …
I will try that... thank you!
@dolsysmith I noticed that you are resolving ambiguities manually by selecting trees with more nodes. The same effect can be achieved automatically and more efficiently by giving all rules and terminals in your grammar the same positive priority (say 1) and using …
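A minimal sketch of what the suggested priority approach could look like. The rule and terminal names below are hypothetical (the questioner's actual grammar is not shown); the `.1` suffix is Lark's syntax for attaching a priority to a rule or terminal, and my understanding is that with the Earley parser's default automatic ambiguity resolution, trees whose matched rules carry a higher total priority win, which favors trees with more nodes when every rule has the same positive priority:

```
// Hypothetical co-author grammar fragment: every rule and terminal
// carries the same positive priority (1), so Lark's automatic
// ambiguity resolution prefers the tree that matches more rules.
author.1: surname "," initials
surname.1: NAME
initials.1: INITIAL+

NAME.1: /[A-Z][a-z]+/
INITIAL.1: /[A-Z]\.?/
```

This replaces the manual post-hoc tree comparison with resolution done inside the parser, which also avoids materializing every ambiguous tree.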
First of all, thank you very much for your work on Lark. It's helped us solve an otherwise intractable problem in a data-migration project: parsing co-author strings (from publication metadata).
I'm posing this question in the hopes of improving my understanding of the Earley parser -- and, if possible, improving the grammar I've written for this task.
The edge cases in question occur with a frequency of < 1% (in a dataset of 30K+ co-author strings); in these cases, the parser's memory usage explodes, and in some the computation never finishes (i.e., not before macOS kills the process). I circumvent this by running the parser in a separate Python process that terminates after a set timeout; the results below are based on a timeout of 2 minutes.
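The separate-process-with-timeout workaround described above can be sketched with the standard library alone. This is a hedged reconstruction, not the questioner's actual code: `fake_parse` is a hypothetical stand-in for the real Earley parse of a co-author string, and the queue-based plumbing is one common way to get a result back from the child.

```python
import multiprocessing as mp
import queue as stdlib_queue


def parse_with_timeout(worker, text, timeout):
    """Run worker(text, result_queue) in a child process.

    Returns the worker's result, or None if it did not finish
    within `timeout` seconds (the child is then terminated).
    """
    result_queue = mp.Queue()
    proc = mp.Process(target=worker, args=(text, result_queue))
    proc.start()
    proc.join(timeout)
    if proc.is_alive():
        proc.terminate()  # runaway parse: kill it rather than wait forever
        proc.join()
        return None
    try:
        return result_queue.get(timeout=1)
    except stdlib_queue.Empty:
        return None


# Hypothetical stand-in for the real Earley parse of a co-author string.
def fake_parse(text, result_queue):
    result_queue.put([a.strip() for a in text.split(";")])


if __name__ == "__main__":
    authors = parse_with_timeout(fake_parse, "Smith, J.; Doe, A.", timeout=120)
    print(authors)  # -> ['Smith, J.', 'Doe, A.']
```

The key design point is that `terminate()` reclaims the exploded memory along with the process, which a same-process timeout (e.g. a signal-based alarm) would not guarantee.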
From what I've read (on this repo and elsewhere), I realize that I could avoid the performance issues by switching to the LALR parser. But the data are rife with ambiguity (these being manually entered co-author strings), and the Earley parser lets me handle an impressive range of variations. And the performance is not a critical issue; as I said, I'm just interested in understanding why the performance falls off so dramatically past a certain length of string, and if there's anything I can do to improve my grammar.
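One plausible way to see why performance can fall off a cliff past a certain string length, rather than degrade gradually: Earley is worst-case cubic in input length on ambiguous grammars, and the number of distinct parse trees the ambiguity machinery must account for can grow combinatorially. For instance, a string with n ambiguous join points (e.g. n separators that could each attach either way) admits a number of binary parse trees given by the Catalan numbers. A quick stdlib sketch of that growth, illustrative and not specific to Lark or the questioner's grammar:

```python
from math import comb


def catalan(n):
    """Number of distinct binary parse trees over n ambiguous join points."""
    return comb(2 * n, n) // (n + 1)


# The count explodes long before the input gets very long.
for n in (5, 10, 20, 30):
    print(n, catalan(n))  # catalan(10) is already 16796
```

So a co-author string only modestly longer than its neighbors can carry vastly more ambiguity for the parser to track, which is consistent with the sharp threshold the questioner observes.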
Thank you!