Symbol extraction improvements: merging with scripts and font formats #222

andrewhead · 2021-02-01T21:47:01Z

See TODOs 2 and 11 in issue #192 on symbol segmentation.

The main issue this PR seeks to solve is the precise extraction of complex symbols from MathML.

LaTeX symbols are detected by:

1. Extracting equations
2. Parsing those equations into MathML using KaTeX
3. Cleaning up the MathML equations
4. Visiting the elements in the MathML to find elements corresponding to symbols
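
For reviewers unfamiliar with the last step, here is a minimal sketch of what "visiting the elements" could look like. It is not the pipeline's actual code: the function name `find_symbols` and the set of tags treated as symbols are assumptions made for illustration, and it uses the standard library's `xml.etree.ElementTree` rather than whatever parser the pipeline uses.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# Assumption: identifiers and script elements are the tags that count as symbols.
SYMBOL_TAGS = {"mi", "msub", "msup", "msubsup"}


def find_symbols(mathml: str) -> List[Tuple[str, str]]:
    """Return (tag, text) pairs for each symbol-like element, in document order."""
    root = ET.fromstring(mathml)
    return [
        (element.tag, "".join(element.itertext()))
        for element in root.iter()
        if element.tag in SYMBOL_TAGS
    ]


symbols = find_symbols(
    "<mrow><msub><mi>x</mi><mi>i</mi></msub><mo>+</mo><mi>y</mi></mrow>"
)
```

Note that the script element and its children are both reported, so a composite symbol like x_i shows up alongside its parts.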

The problem is that the MathML KaTeX produces is sometimes a little messy. For instance, the word ReLU is parsed into four symbols (<mi>R</mi><mi>e</mi><mi>L</mi><mi>U</mi>). This means that if we want ReLU to be detected as a single symbol, we need to merge consecutive nodes into longer words.
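
To make the merging concrete, here is a rough sketch of collapsing adjacent `<mi>` nodes into one. The function name `merge_identifiers` is hypothetical, and the rule used here (merge every run of adjacent `<mi>` siblings) is a simplifying assumption; the real pipeline may be more selective about which identifiers it joins.

```python
import xml.etree.ElementTree as ET


def merge_identifiers(row: ET.Element) -> None:
    """Merge each run of adjacent <mi> children of `row` into one <mi>, in place."""
    previous = None
    for child in list(row):  # iterate over a copy, since children are removed
        if previous is not None and previous.tag == "mi" and child.tag == "mi":
            # Append this letter to the running identifier and drop the node.
            previous.text = (previous.text or "") + (child.text or "")
            row.remove(child)
        else:
            previous = child


row = ET.fromstring(
    "<mrow><mi>R</mi><mi>e</mi><mi>L</mi><mi>U</mi><mo>(</mo><mi>x</mi><mo>)</mo></mrow>"
)
merge_identifiers(row)
merged = ET.tostring(row, encoding="unicode")
```

With this input, the four letters end up in a single `<mi>ReLU</mi>` node, while the `x` between the parentheses is left alone because an operator separates it from the run.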

That much has already been done, but the existing behavior missed a couple of corner cases, which this PR addresses. Specifically, it supports:

- Merging the bases for script elements, e.g., turning the parse tree for word_i, <mi>w</mi><mi>o</mi><mi>r</mi><msub><mi>d</mi><mi>i</mi></msub> (where 'd' is the base), into the simpler <msub><mi>word</mi><mi>i</mi></msub> (where 'word' is the base).
- When all the characters in a row are part of a single formatted word (e.g., from the macro \mathcal{word}), making sure that the merged node <mi mathvariant='script'>word</mi> is given the style attribute mathvariant, and that its character offsets include the macro (and not just 'word').
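
As an illustration of the first case, the sketch below folds the letters that immediately precede a script element into that element's base. It is not the code in this PR: `merge_into_script_base` is a hypothetical name, it assumes the base is the first child of the script element, and it ignores style attributes and character offsets.

```python
import xml.etree.ElementTree as ET

SCRIPT_TAGS = {"msub", "msup", "msubsup"}


def merge_into_script_base(row: ET.Element) -> None:
    """Fold <mi> siblings directly before a script element into its base, in place."""
    children = list(row)
    for index, child in enumerate(children):
        # Only handle script elements whose base (first child) is an identifier.
        if child.tag not in SCRIPT_TAGS or len(child) == 0 or child[0].tag != "mi":
            continue
        base = child[0]
        # Walk backwards over the <mi> siblings immediately before the script.
        start = index
        while start > 0 and children[start - 1].tag == "mi":
            start -= 1
        preceding = children[start:index]
        base.text = "".join(mi.text or "" for mi in preceding) + (base.text or "")
        for mi in preceding:
            row.remove(mi)


row = ET.fromstring(
    "<mrow><mi>w</mi><mi>o</mi><mi>r</mi><msub><mi>d</mi><mi>i</mi></msub></mrow>"
)
merge_into_script_base(row)
result = ET.tostring(row, encoding="unicode")
```

Here the base of the `<msub>` goes from 'd' to 'word', matching the before and after trees described in the first bullet.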

andrewhead added 5 commits February 1, 2021 13:45

TeX: Ensure operators (%, |, /) have 'mo' tag in MathML trees

d8e1fa1

TeX: Add skeletons of tests for contiguous symbol merging

769e714

TeX: When parsing equations, merge styled identifiers with style information

79f4fd4

TeX: When parsing equations, merge consecutive operators into a single symbol

4e3fbb2

TeX: When parsing symbols, merge preceding letters into base for script

2bf34e7

andrewhead added the pipeline (Data processing pipeline) and symbols (An issue or task related to symbols) labels Feb 1, 2021

andrewhead requested review from rreas and kyleclo February 1, 2021 21:47

andrewhead added this to the LaTeX Updates for Alpha milestone Feb 1, 2021

andrewhead mentioned this pull request Feb 2, 2021

Symbol segmentation errors #192

Open

10 tasks

rreas approved these changes Feb 4, 2021

andrewhead merged commit d698e29 into master Feb 5, 2021

andrewhead deleted the apply-format-to-merged-symbol branch February 5, 2021 00:31

andrewhead restored the apply-format-to-merged-symbol branch February 19, 2021 19:20
