Symbol extraction improvements: merging with scripts and font formats #222
See TODOs 2 and 11 in issue #192 on symbol segmentation.
The main issue this PR seeks to solve is the precise extraction of complex symbols from MathML.
LaTeX symbols are detected by:
The problem is that the KaTeX output is sometimes a little messy. For instance, the word `ReLU` is parsed into four symbols (`<mi>R</mi><mi>e</mi><mi>L</mi><mi>U</mi>`). This means that if we want `ReLU` to be detected as a single symbol, we need to merge consecutive nodes into longer words. That much has already been done, but the existing behavior missed a couple of corner cases, which this PR addresses. Specifically, it supports:
- Subscripted words like `word_i`: merging `<mi>w</mi><mi>o</mi><mi>r</mi><msub><mi>d</mi><mi>i</mi></msub>` (where 'd' is the base) into the simpler `<msub><mi>word</mi><mi>i</mi></msub>` (where 'word' is the base).
- Words inside font macros like `\mathcal{word}`: making sure that the merged node `<mi mathvariant='script'>word</mi>` is given the style attribute `mathvariant`, and that its character offsets include the macro (and not just 'word').
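To make the two merging cases concrete, here is a minimal sketch using only Python's standard `xml.etree.ElementTree`. It is not the code in this PR: the function name `merge_identifiers` is hypothetical, it assumes the identifiers are direct children of one row element, and it does not track character offsets.

```python
import copy
import xml.etree.ElementTree as ET


def merge_identifiers(row: ET.Element) -> ET.Element:
    """Return a copy of `row` with runs of sibling <mi> nodes merged into words.

    Hypothetical helper for illustration; assumes all nodes to merge are
    direct children of `row` and ignores character offsets.
    """
    merged = ET.Element(row.tag, dict(row.attrib))
    for child in row:
        prev = merged[-1] if len(merged) else None

        # Find the <mi> whose text would extend the previous word: either the
        # child itself, or the <mi> base of an <msub>.
        if child.tag == "mi":
            base = child
        elif child.tag == "msub" and len(child) == 2 and child[0].tag == "mi":
            base = child[0]
        else:
            base = None

        # Only merge nodes that share the same style (e.g. mathvariant='script'),
        # so that the merged node can keep that attribute.
        can_merge = (
            prev is not None
            and prev.tag == "mi"
            and base is not None
            and prev.get("mathvariant") == base.get("mathvariant")
        )
        if not can_merge:
            merged.append(copy.deepcopy(child))
            continue

        word = (prev.text or "") + (base.text or "")
        if child.tag == "mi":
            # Plain case: extend the previous <mi> in place.
            prev.text = word
        else:
            # Subscript case: the whole word becomes the base of the <msub>.
            merged.remove(prev)
            new_sub = copy.deepcopy(child)
            new_sub[0].text = word
            merged.append(new_sub)
    return merged


row = ET.fromstring(
    "<mrow><mi>w</mi><mi>o</mi><mi>r</mi><msub><mi>d</mi><mi>i</mi></msub></mrow>"
)
print(ET.tostring(merge_identifiers(row), encoding="unicode"))
# prints: <mrow><msub><mi>word</mi><mi>i</mi></msub></mrow>
```

Comparing `mathvariant` before merging is one simple way to keep a script word from being glued to a neighboring plain identifier. A real implementation would also need to widen the merged node's character offsets to cover the enclosing macro.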