Symbol extraction improvements: merging with scripts and font formats #222

Merged
merged 5 commits into from
Feb 5, 2021

Conversation

@andrewhead andrewhead commented Feb 1, 2021

See TODOs 2 and 11 in issue #192 on symbol segmentation.

The main problem this PR addresses is the precise extraction of complex symbols from MathML.

LaTeX symbols are detected by:

  1. Extracting equations
  2. Parsing those equations into MathML using KaTeX
  3. Cleaning up the MathML equations
  4. Visiting the elements in the MathML to find elements corresponding to symbols
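
As a rough illustration of step 4, here is a minimal sketch using only Python's standard library. The `SYMBOL_TAGS` set and the `find_symbols` helper are hypothetical names chosen for this example; the pipeline's actual visitor and its list of symbol-bearing tags may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical choice of tags treated as symbols; the real pipeline may use a different set.
SYMBOL_TAGS = {"mi", "msub", "msup", "msubsup"}


def find_symbols(element):
    """Yield every element, in document order, whose tag is in SYMBOL_TAGS."""
    if element.tag in SYMBOL_TAGS:
        yield element
    for child in element:
        yield from find_symbols(child)


mathml = "<mrow><msub><mi>x</mi><mi>i</mi></msub><mo>+</mo><mi>y</mi></mrow>"
symbols = [
    ET.tostring(symbol, encoding="unicode")
    for symbol in find_symbols(ET.fromstring(mathml))
]
# symbols holds the composite <msub> first, then its children x and i, then y.
```

Note that a composite symbol such as x_i is reported along with its parts, which is why the shape of the tree matters for what counts as one symbol.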

The problem is that the MathML produced by KaTeX is sometimes a little messy. For instance, the word ReLU is parsed into four symbols (<mi>R</mi><mi>e</mi><mi>L</mi><mi>U</mi>). This means that if we want ReLU to be detected as a single symbol, we need to merge consecutive nodes into longer words.
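
The basic merge can be sketched as follows. This is a simplified illustration, not the code in this PR: the helper name `merge_consecutive_identifiers` is made up for the example, and it naively merges every run of adjacent `<mi>` elements with identical attributes, whereas a real implementation needs to decide when adjacent letters form a word rather than a product of variables.

```python
import copy
import xml.etree.ElementTree as ET


def merge_consecutive_identifiers(row):
    """Return a copy of `row` in which runs of adjacent <mi> children are merged."""
    merged = ET.Element(row.tag, dict(row.attrib))
    for child in row:
        previous = merged[-1] if len(merged) > 0 else None
        if (
            child.tag == "mi"
            and previous is not None
            and previous.tag == "mi"
            and previous.attrib == child.attrib
        ):
            # Extend the identifier that was already emitted instead of adding a new node.
            previous.text = (previous.text or "") + (child.text or "")
        else:
            merged.append(copy.deepcopy(child))
    return merged


row = ET.fromstring(
    "<mrow><mi>R</mi><mi>e</mi><mi>L</mi><mi>U</mi><mo>(</mo><mi>x</mi><mo>)</mo></mrow>"
)
result = ET.tostring(merge_consecutive_identifiers(row), encoding="unicode")
# result: <mrow><mi>ReLU</mi><mo>(</mo><mi>x</mi><mo>)</mo></mrow>
```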

That much has already been done, but the existing behavior missed a couple of corner cases, which this PR addresses. Specifically, it supports:

  1. Merging the bases for script elements (e.g., turning the parse tree for word_i, <mi>w</mi><mi>o</mi><mi>r</mi><msub><mi>d</mi><mi>i</mi></msub>, where 'd' is the base, into the simpler <msub><mi>word</mi><mi>i</mi></msub>, where 'word' is the base).
  2. When all the characters in a row are part of a single formatted word (e.g., from the macro \mathcal{word}), making sure that the merged node <mi mathvariant='script'>word</mi> is given the style attribute mathvariant, and that its character offsets include the macro (and not just 'word').
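
Both corner cases can be sketched with the standard library. The function names `merge_script_bases` and `merge_formatted_row` are hypothetical and the logic is a simplified approximation of what the PR describes. Character offset handling is left out, because the way the pipeline stores offsets is not described here.

```python
import copy
import xml.etree.ElementTree as ET

SCRIPT_TAGS = {"msub", "msup", "msubsup"}


def merge_script_bases(row):
    """Return a copy of `row` with <mi> runs folded into the base of a following script element."""
    merged = ET.Element(row.tag, dict(row.attrib))
    for child in row:
        node = copy.deepcopy(child)
        if node.tag in SCRIPT_TAGS and len(node) > 0 and node[0].tag == "mi":
            base = node[0]
            # Pull the preceding <mi> siblings with matching attributes into the base.
            while (
                len(merged) > 0
                and merged[-1].tag == "mi"
                and merged[-1].attrib == base.attrib
            ):
                previous = merged[-1]
                base.text = (previous.text or "") + (base.text or "")
                merged.remove(previous)
        merged.append(node)
    return merged


def merge_formatted_row(row):
    """Merge a row made only of <mi> children sharing one mathvariant; return None otherwise."""
    children = list(row)
    variants = {child.get("mathvariant") for child in children}
    if not children or any(child.tag != "mi" for child in children) or len(variants) != 1:
        return None
    merged = ET.Element("mi")
    variant = variants.pop()
    if variant is not None:
        # Carry the shared style over to the merged node.
        merged.set("mathvariant", variant)
    merged.text = "".join(child.text or "" for child in children)
    return merged


script_row = ET.fromstring(
    "<mrow><mi>w</mi><mi>o</mi><mi>r</mi><msub><mi>d</mi><mi>i</mi></msub></mrow>"
)
script_result = ET.tostring(merge_script_bases(script_row), encoding="unicode")
# script_result: <mrow><msub><mi>word</mi><mi>i</mi></msub></mrow>

formatted_row = ET.fromstring(
    '<mrow><mi mathvariant="script">w</mi><mi mathvariant="script">o</mi>'
    '<mi mathvariant="script">r</mi><mi mathvariant="script">d</mi></mrow>'
)
formatted_result = ET.tostring(merge_formatted_row(formatted_row), encoding="unicode")
# formatted_result: <mi mathvariant="script">word</mi>
```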

@andrewhead andrewhead added pipeline Data processing pipeline symbols An issue or task related to symbols labels Feb 1, 2021
@andrewhead andrewhead requested review from rreas and kyleclo February 1, 2021 21:47
@andrewhead andrewhead added this to the LaTeX Updates for Alpha milestone Feb 1, 2021
@andrewhead andrewhead mentioned this pull request Feb 2, 2021
10 tasks
@andrewhead andrewhead merged commit d698e29 into master Feb 5, 2021
@andrewhead andrewhead deleted the apply-format-to-merged-symbol branch February 5, 2021 00:31
@andrewhead andrewhead restored the apply-format-to-merged-symbol branch February 19, 2021 19:20