Symbol segmentation errors #192

andrewhead · 2021-01-05T23:15:19Z

Description: A recent run of the pipeline on 12 papers revealed several issues in how the boundaries of some symbols were detected in LaTeX. This issue catalogs those problems to provide a docket of upcoming work. Potential solutions are documented at the bottom of this comment.

Progress:

Key:

💡 : for an issue I believe I know the solution to.
🤔: for an issue that I think may be related to a larger systemic pipeline issue and which probably is not the fault of the equation parser
😶: for an issue that will not be addressed anytime soon (either because I think it is not actually a problem, or because it does not make sense to address at this time)

I recommend addressing the below issues in this order (roughly by decreasing importance, and selecting for those issues that I think are feasible to fix): 7, 2, 10, (5, 3, 4, 8), 11, 13, 4, 3

SYMBOL_START or SYMBOL_END or EQUATION_DEPTH_[N]_* appears somewhere in sentence

1805.09016 ("We assume that we have two precomputed vector spaces")

Bad bounds of formatted symbol 💡
The following problems are probably due to formatted 'mrow's:

1704.06851 ("\mathbf{g(e_{t-1})}")
1704.06851 ("\mathbf { f(.) }")
1902.02671 ("\mathrm{MH}(\dot)")
1801.01615 ("\textrm { SE(3) }" missing "\textrm")
1904.03582 ("\rm Labe l_{A}", rm is not applied, line 16)
1902.02671 ("\mathrm{softmax}")
1905.05476 ("E^\mathrm {src}")
1905.05476 ("E^\mathrm{mono}_\mathrm { child }")
1805.09016 ("\textrm { softmax} ( \mathbf{z}_{i} \cdot P) ]]")

Issue parsing KaTeX built-in macros 💡

1805.09016 ("\mathbf{a}{i}" (bounds stops after '', in equation with '\in \R_d'))

mathds💡

1801.01615 ("\mathds {R}^{N^U{\times}3}")

Text font commands not included in symbol span (textbf, textrm) 💡

1804.00792 ("\textbf { x }" missing "textbf", line 71)
1801.01615 ("E _ \textrm{keypoints }" subscript missing, line 121)

~~Missing formatted base (might already be fixed?) 💡~~

1801.01615 ("\boldsymbol{\theta}^ U", \theta base is not detected)
1904.03582 ("\bm{H}^ {l}")
1709.07902 ("\bm{z}_1", line 205)

"." parsed as symbol 💡

1704.06851
1902.02671

"*" parsed as symbol (maybe) 💡

1709.07902 ("\bm{\mu } _ 2^*")

"||" parsed as symbol 💡

1709.07902

"|" parsed as symbol 💡

1803.00443 ("|^{-1}_2")

"%" parsed as symbol 💡

1803.00443
1904.03582 (line 202)

"/" parsed as symbol 💡

1803.00443 ("1 / 5...")

"th" parsed as symbol (not sure we wnat to fix this one, might have undesired side effects): 😶

1803.00443

"if" parsed as symbol: 😶

1904.03582 ("...\mbox { if }...", line 140)

"argmin" parsed as symbol ("\mathrm { argmin }...", line 73) 😶

"/" grouped into a symbol: 💡

1709.07902 ("/(\tilde{N} + \sigma^2_{\bm{z}_1})")
1902.02671 ("d/n")
1804.00792 ("//({dim_b}))", line 105)
1905.10887 ("/2", line 191)

")" grouped into a symbol:💡

1804.00792 ("/({dim_b})", line 105)

"exp" grouped into a symbol: 💡

1704.06851 ("{V} \exp(\mathbf{U_i}^T\mathbf{f(c_{t-1})}+b_i)")

"|_2" grouped into a symbol: 💡

1905.05476 ("E^\mathrm{src}_\mathrm{parent}(f') |_2")
1905.10887 ("| \mu_x - \mu_y |^2", line 191)

"=" grouped into a symbol: 🤔

1709.07902 ("\alpha = 0", line 233; puzzlingly, it is not grouped for an identical equation that appears on the line just before... check the original TeX source to see if the two cases are different in the original TeX)

"log" grouped into symbol: 🤔 (I was not able to see an erroneous symbol extracted in the data output for this paper, only in the Brat annotation files)

1904.03582 ("y^{c}\log(\sigma(\hat{y}^{c}) )")

~~Accent not grouped with symbol:~~ (not an error---leftarrow and rightarrow are not hats, but rather arrows shown on the same baseline) 😶

1905.05475 ("\leftarrow W", sentence "...Initialize the child model with 4 and start the NMT...")
1803.00443 ("\rightarrow \mathbb{R}", only the mathbb symbol found, not with rightarrow over it)
1904.03582 ("\rightarrow Labe l_{B}")
1804.00792 ("\leftarrow \gamma", line 188)

Subscript grouped with next symbols: 💡

1611.09842 ("q", "log", "mathcal(F)"... grouped together in "...\mathcal{H}(\mathbf{X_2}) _ { h , w , q} \log (\mathcal{F}(\mathbf{X_1})...")

Subscript only applied to last letter in sequence of letters: 💡

1904.03582 ("Labe l_{B}" split into "Labe" and "l_{B}")

Very strange grouping that doesn't seem to follow a pattern: 🤔

1804.00792 ("\textbf { <<<t} + (1-\gamma)>>>", group appears in "<<</>>>", line 188)

Missing symbol:

1705.06566 ("\bm{z}^ l_{\lambda\mu}", exponent was found, not whole symbol... simpler case is right afterwards with '\bm{z}^g') 💡
- (note that I think I have already fixed this case in the KaTeX code)
1803.00443 ("\mathrm{d }", in sentence "For the ReLU nonlinearity, the Taylor approximation is locally exact...")
1904.03582 ("\tau", end of line 133) 🤔

Papers inspected:

1704.06851
1805.09016
1905.05475
1705.06566
1803.00443
1902.02671
1611.09842
1801.01615
1904.03582
1709.07902
1804.00792
1905.10887

How to fix

Fixes for specific problems are listed roughly in order of greatest to least importance.

"___" parsed as symbol

Disable the detection of ".", "*", "||", "|", "%", "/" as symbols; in particular, don't accept them as identifiers in the equation parser.
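A minimal sketch of such a filter, assuming identifier candidates reach the equation parser as plain strings. The names `NON_SYMBOL_TOKENS` and `is_symbol_candidate` are hypothetical and are not taken from the pipeline code.

```python
# Hypothetical names; the real equation parser may organize this differently.
# The token list mirrors the punctuation cataloged in this issue.
NON_SYMBOL_TOKENS = frozenset({".", "*", "||", "|", "%", "/"})


def is_symbol_candidate(token_text: str) -> bool:
    """Return False for empty text and for punctuation that should never be a symbol."""
    stripped = token_text.strip()
    return bool(stripped) and stripped not in NON_SYMBOL_TOKENS


candidates = ["x", ".", "||", "\u03b1", "%", "/"]
symbols = [c for c in candidates if is_symbol_candidate(c)]
```

Keeping the rejected tokens in one set makes it easy to add further punctuation later without touching the parsing logic.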

Grouping punctuation marks (e.g., '/', '(', etc.) with identifiers

For the problems of grouping symbols together, it might be possible to fix these by not merging parts of rows where punctuation appears, and instead keeping the punctuation separate. This depends, however, on how KaTeX splits up the symbols to begin with (i.e., if it already merges punctuation into identifiers, it will be harder to separate them out).

The MathMlElementMerger should be modified such that it does not attempt to merge '(', '/', etc. Breaking on nodes that are not mi or mn elements, as it currently does, is not enough, because parentheses and slashes are mi elements.
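The merging rule could be sketched roughly as follows. This is not the actual MathMlElementMerger: nodes are simplified to `(tag, text)` pairs, and `UNMERGEABLE_TEXT`, `is_mergeable`, and `merge_row` are hypothetical names. The set of unmergeable characters is an assumption and would need to be extended to cover the other punctuation seen in the papers.

```python
from typing import List, Tuple

Node = Tuple[str, str]  # (MathML tag, text content); a stand-in for real nodes

# Assumed set of characters that must stay separate even when tagged as <mi>.
UNMERGEABLE_TEXT = frozenset({"(", ")", "/"})


def is_mergeable(node: Node) -> bool:
    tag, text = node
    return tag in ("mi", "mn") and text not in UNMERGEABLE_TEXT


def merge_row(nodes: List[Node]) -> List[Node]:
    """Merge runs of adjacent mergeable nodes; leave every other node untouched."""
    merged: List[Node] = []
    run: List[Node] = []

    def flush() -> None:
        if len(run) == 1:
            merged.append(run[0])  # a lone node keeps its original tag
        elif run:
            merged.append(("mi", "".join(text for _, text in run)))
        run.clear()

    for node in nodes:
        if is_mergeable(node):
            run.append(node)
        else:
            flush()
            merged.append(node)
    flush()
    return merged


row = [("mi", "d"), ("mi", "i"), ("mi", "m"), ("mi", "/"), ("mi", "n")]
result = merge_row(row)
```

Here the letters of "dim" are merged into one identifier, while the slash and the following "n" remain separate nodes.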

mathds

mathds is not recognized by KaTeX. Additional rules need to be provided to KaTeX telling it how to parse the symbol (i.e., that it is a font-styling macro, and what font variant to apply).

textrm / textbf

Text font assignment is handled by a different module than math font assignment, even though the two might have the same visual effect. The math font functions contain rules that adjust the character positions of identifiers using the positions of font macros. Transfer these rules over to the text font functions (i.e., those defined in text.js in KaTeX).

Bad bounds of formatted symbol

Case 1: Applying styles to rows (e.g., functions).

I think formatted functions might consistently be segmented incorrectly. My guess is that we don't adjust the children of a row to have the same start and end positions as the row, which is the element the styling is applied to. This sounds like something to fix in the routine that parses functions in parse_equation.py.

Subscript only applied to last letter in sequence of letters

When merging MathML elements, merge <mi> nodes that appear just before msub or msup nodes into those nodes. For instance, merge

<mrow>
  <mi>f</mi>
  <mi>u</mi>
  <mi>n</mi>
  <msub>
    <mi>c</mi>
    <mi>x</mi>
  </msub>
</mrow>

into

<msub>
  <mi>func</mi>
  <mi>x</mi>
</msub>
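A rough sketch of this merge using Python's standard `xml.etree.ElementTree`. The function name `merge_identifiers_into_script` is hypothetical. The sketch works on bare MathML without namespaces, leaves the enclosing mrow in place, and ignores the s2:start / s2:end position attributes, which would presumably also need to be widened to cover the merged letters.

```python
import xml.etree.ElementTree as ET


def merge_identifiers_into_script(row: ET.Element) -> None:
    """Fold <mi> siblings that directly precede an <msub>/<msup> into its base, in place."""
    children = list(row)
    for index, child in enumerate(children):
        if child.tag not in ("msub", "msup") or len(child) == 0:
            continue
        base = child[0]
        if base.tag != "mi":
            continue
        # Walk backwards over the run of <mi> siblings just before the script node.
        prefix = []
        cursor = index - 1
        while cursor >= 0 and children[cursor].tag == "mi":
            prefix.append(children[cursor])
            cursor -= 1
        for node in prefix:
            row.remove(node)
        prefix.reverse()  # restore left-to-right order
        base.text = "".join(node.text or "" for node in prefix) + (base.text or "")


row = ET.fromstring(
    "<mrow><mi>f</mi><mi>u</mi><mi>n</mi>"
    "<msub><mi>c</mi><mi>x</mi></msub></mrow>"
)
merge_identifiers_into_script(row)
merged_xml = ET.tostring(row, encoding="unicode")
```

Note that this merges every run of identifiers before a script node, so a product such as "a b_x" would also be merged. Whether that is acceptable depends on how often such cases appear in practice.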

Subscript grouped with next symbols

There is a bug in our equation parser. Consider the test case of this initial MathML document, produced by parsing the equation f(x)_{a,b,c}\log(f(x)) with KaTeX:

<span class="katex">
    <math xmlns="http://www.w3.org/1998/Math/MathML">
        <semantics>
            <mrow>
                <mi s2:start="0" s2:end="1" s2:index="0">f</mi>
                <mo stretchy="false" s2:start="1" s2:end="2" s2:index="1">(</mo>
                <mi s2:start="2" s2:end="3" s2:index="2">x</mi>
                <msub s2:start="3" s2:end="12" s2:index="10">
                    <mo stretchy="false" s2:start="3" s2:end="4" s2:index="3">)</mo>
                    <mrow s2:start="5" s2:end="12" s2:index="9">
                        <mi s2:start="6" s2:end="7" s2:index="4">a</mi>
                        <mo separator="true" s2:start="7" s2:end="8" s2:index="5">,</mo>
                        <mi s2:start="8" s2:end="9" s2:index="6">b</mi>
                        <mo separator="true" s2:start="9" s2:end="10" s2:index="7">,</mo>
                        <mi s2:start="10" s2:end="11" s2:index="8">c</mi>
                    </mrow>
                </msub>
                <mi>log</mi>
                <mo>⁡</mo>
                <mo stretchy="false" s2:start="16" s2:end="17" s2:index="11">(</mo>
                <mi s2:start="17" s2:end="18" s2:index="12">f</mi>
                <mo stretchy="false" s2:start="18" s2:end="19" s2:index="13">(</mo>
                <mi s2:start="19" s2:end="20" s2:index="14">x</mi>
                <mo stretchy="false" s2:start="20" s2:end="21" s2:index="15">)</mo>
                <mo stretchy="false" s2:start="21" s2:end="22" s2:index="16">)</mo>
            </mrow>
            <annotation encoding="application/x-tex">f(x)_{a,b,c}\log(f(x))</annotation>
        </semantics>
    </math>
</span>

Something in our parser flattens the structure of the equation, such that it becomes possible to merge together 'c', 'log', and some of the characters following it. I do not yet know what causes this. There is evidence of the flattening, though: when the contents of the outer mrow are printed mid-parse, 'a', 'b', and 'c' have been moved into the outer row.

The fix to this problem will likely involve breakpoint debugging to find where the flattening occurs, and prevent it.
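One way to narrow down where the flattening happens is a small check that can be called after each parser stage. The sketch below is an assumption about how such a check might look, not existing pipeline code: `script_identifiers` and `leaked_identifiers` are hypothetical names, and the MathML is a simplified copy of the test case with the position attributes and namespace removed so that it parses on its own. Because it compares identifier text only, it would report a false positive for an equation like x_x.

```python
import xml.etree.ElementTree as ET

# Simplified version of the KaTeX output for f(x)_{a,b,c}\log(f(x)).
MATHML = (
    "<mrow>"
    "<mi>f</mi><mo>(</mo><mi>x</mi>"
    "<msub><mo>)</mo><mrow><mi>a</mi><mo>,</mo><mi>b</mi><mo>,</mo><mi>c</mi></mrow></msub>"
    "<mi>log</mi><mo>(</mo><mi>f</mi><mo>(</mo><mi>x</mi><mo>)</mo><mo>)</mo>"
    "</mrow>"
)


def script_identifiers(row):
    """Collect identifier text found inside any <msub>/<msup> under `row`."""
    found = set()
    for node in row.iter():
        if node.tag in ("msub", "msup"):
            found.update(mi.text for mi in node.iter("mi") if mi.text)
    return found


def leaked_identifiers(original_row, current_row):
    """Return script identifiers that now sit directly in the outer row."""
    nested = script_identifiers(original_row)
    direct = {child.text for child in current_row if child.tag == "mi"}
    return sorted(nested & direct)


original = ET.fromstring(MATHML)

# Simulate the suspected bug: move the subscript contents into the outer row.
flattened = ET.fromstring(MATHML)
msub = flattened.find("msub")
inner = msub.find("mrow")
position = list(flattened).index(msub)
for offset, node in enumerate(list(inner), start=1):
    inner.remove(node)
    flattened.insert(position + offset, node)

leaks_before = leaked_identifiers(original, original)
leaks_after = leaked_identifiers(original, flattened)
```

The first stage after which the check returns a non-empty list is a good place to set a breakpoint.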

"exp" grouped into a symbol

In KaTeX, when a special function is parsed (e.g., '\log', '\max', '\min', '\exp'):

- Tag the identifier with a special attribute indicating that the identifier was meant to stand alone (i.e., not be grouped with others).
- Also tag the identifier with start and end positions, which are currently not available for these identifiers.

Issues parsing built-in KaTeX macros

In a function like expandOnce within the MacroExpander, assign the expanded tokens the same source positions as the input tokens (both the command name and the args). The args can be obtained from the line that calls consumeArgs.


andrewhead · 2021-01-21T02:55:33Z

Other emergent tasks:

Turn 'd's for derivatives into operator symbols.

andrewhead added this to the LaTeX Updates for Alpha milestone Jan 5, 2021

andrewhead self-assigned this Jan 8, 2021

andrewhead mentioned this issue Feb 1, 2021

Symbol extraction improvements: merging with scripts and font formats #222

Merged
