Symbol segmentation errors #192

Open
7 of 10 tasks
andrewhead opened this issue Jan 5, 2021 · 1 comment
Labels:

  • bad-entity-detection: An issue or task related to an entity that was detected in the wrong place
  • bug: Something isn't working
  • entity-localization: An issue or task related to entity localization
  • pipeline: Data processing pipeline
  • symbols: An issue or task related to symbols


andrewhead commented Jan 5, 2021

Description: A recent run of the pipeline on 12 papers revealed several issues in how the boundaries of some symbols were detected in LaTeX. This issue catalogs those problems to provide a docket of upcoming work. Potential solutions are documented at the bottom of this comment.

Progress:

Key:

  • 💡 : for an issue I believe I know the solution to.
  • 🤔: for an issue that I think may be related to a larger systemic pipeline issue and which probably is not the fault of the equation parser
  • 😶: for an issue that will not be addressed anytime soon (either because I think it is not actually a problem, or because it does not make sense to address at this time)

I recommend addressing the below issues in this order (roughly by decreasing importance, and selecting for those issues that I think are feasible to fix): 7, 2, 10, (5, 3, 4, 8), 11, 13, 4, 3

  1. SYMBOL_START or SYMBOL_END or EQUATION_DEPTH_[N]_* appears somewhere in sentence
  • 1805.09016 ("We assume that we have two precomputed vector spaces")
  1. Bad bounds of formatted symbol 💡
    The following problems are probably due to formatted 'mrow's:
  • 1704.06851 ("\mathbf{g(e_{t-1})}")
  • 1704.06851 ("\mathbf { f(.) }")
  • 1902.02671 ("\mathrm{MH}(\dot)")
  • 1801.01615 ("\textrm { SE(3) }" missing "\textrm")
  • 1904.03582 ("\rm Labe l_{A}", rm is not applied, line 16)
  • 1902.02671 ("\mathrm{softmax}")
  • 1905.05476 ("E^\mathrm {src}")
  • 1905.05476 ("E^\mathrm{mono}_\mathrm { child }")
  • 1805.09016 ("\textrm { softmax} ( \mathbf{z}_{i} \cdot P) ]]")
  1. Issue parsing KaTeX built-in macros 💡
  • 1805.09016 ("\mathbf{a}{i}" (bounds stops after '', in equation with '\in \R_d'))
  1. mathds 💡
  • 1801.01615 ("\mathds {R}^{N^U{\times}3}")
  1. Text font commands not included in symbol span (textbf, textrm) 💡
  • 1804.00792 ("\textbf { x }" missing "textbf", line 71)
  • 1801.01615 ("E _ \textrm{keypoints }" subscript missing, line 121)
  1. Missing formatted base (might already be fixed?) 💡
  • 1801.01615 ("\boldsymbol{\theta}^ U", \theta base is not detected)
  • 1904.03582 ("\bm{H}^ {l}")
  • 1709.07902 ("\bm{z}_1", line 205)
  1. "." parsed as symbol 💡
  • 1704.06851
  • 1902.02671

"*" parsed as symbol (maybe) 💡

  • 1709.07902 ("\bm{\mu } _ 2^*")

"||" parsed as symbol 💡

  • 1709.07902

"|" parsed as symbol 💡

  • 1803.00443 ("|^{-1}_2")

"%" parsed as symbol 💡

  • 1803.00443
  • 1904.03582 (line 202)

"/" parsed as symbol 💡

  • 1803.00443 ("1 / 5...")

"th" parsed as symbol (not sure we want to fix this one, might have undesired side effects): 😶

  • 1803.00443

"if" parsed as symbol: 😶

  • 1904.03582 ("...\mbox { if }...", line 140)

"argmin" parsed as symbol ("\mathrm { argmin }...", line 73) 😶

"/" grouped into a symbol: 💡

  • 1709.07902 ("/(\tilde{N} + \sigma^2_{\bm{z}_1})")
  • 1902.02671 ("d/n")
  • 1804.00792 ("//({dim_b}))", line 105)
  • 1905.10887 ("/2", line 191)

")" grouped into a symbol: 💡

  • 1804.00792 ("/({dim_b})", line 105)

"exp" grouped into a symbol: 💡

  • 1704.06851 ("{V} \exp(\mathbf{U_i}^T\mathbf{f(c_{t-1})}+b_i)")

"|_2" grouped into a symbol: 💡

  • 1905.05476 ("E^\mathrm{src}_\mathrm{parent}(f') |_2")
  • 1905.10887 ("| \mu_x - \mu_y |^2", line 191)

"=" grouped into a symbol: 🤔

  • 1709.07902 ("\alpha = 0", line 233; puzzlingly, it is not grouped for an identical equation that appears on the line just before... check the original TeX source to see if the two cases are different in the original TeX)
  1. "log" grouped into symbol: 🤔 (I was not able to see an erroneous symbol extracted in the data output for this paper, only in the Brat annotation files)
  • 1904.03582 ("y^{c}\log(\sigma(\hat{y}^{c}) )")
  1. Accent not grouped with symbol: (not an error---leftarrow and rightarrow are not hats, but rather arrows shown on the same baseline) 😶
  • 1905.05475 ("\leftarrow W", sentence "...Initialize the child model with 4 and start the NMT...")
  • 1803.00443 ("\rightarrow \mathbb{R}", only the mathbb symbol found, not with rightarrow over it)
  • 1904.03582 ("\rightarrow Labe l_{B}")
  • 1804.00792 ("\leftarrow \gamma", line 188)
  1. Subscript grouped with next symbols: 💡
  • 1611.09842 ("q", "log", "mathcal(F)"... grouped together in "...\mathcal{H}(\mathbf{X_2}) _ { h , w , q} \log (\mathcal{F}(\mathbf{X_1})...")
  1. Subscript only applied to last letter in sequence of letters: 💡
  • 1904.03582 ("Labe l_{B}" split into "Labe" and "l_{B}")
  1. Very strange grouping that doesn't seem to follow a pattern: 🤔
  • 1804.00792 ("\textbf { <<<t} + (1-\gamma)>>>", group appears in "<<</>>>", line 188)
  1. Missing symbol:
  • 1705.06566 ("\bm{z}^ l_{\lambda\mu}", exponent was found, not whole symbol... simpler case is right afterwards with '\bm{z}^g') 💡
    • (note that I think I have already fixed this case in the KaTeX code)
  • 1803.00443 ("\mathrm{d }", in sentence "For the ReLU nonlinearity, the Taylor approximation is locally exact...")
  • 1904.03582 ("\tau", end of line 133) 🤔

Papers inspected:

  • 1704.06851
  • 1805.09016
  • 1905.05475
  • 1705.06566
  • 1803.00443
  • 1902.02671
  • 1611.09842
  • 1801.01615
  • 1904.03582
  • 1709.07902
  • 1804.00792
  • 1905.10887

How to fix

Fixes for specific problems are listed roughly in order of greatest to least importance.

"___" parsed as symbol

Disable the detection of ".", "*", "||", "|", "%", "/" as symbols; in particular, don't accept them as identifiers in the equation parser.
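As a rough illustration of the rule above, here is a minimal Python sketch of a filter that rejects these tokens as identifiers. The names `NON_SYMBOL_TOKENS`, `is_identifier`, and `filter_identifiers` are hypothetical and do not come from the pipeline; the real check would live wherever the equation parser decides what counts as an identifier.

```python
# Hypothetical sketch: these names are illustrative, not part of the pipeline.
# Token texts that should never be accepted as symbol identifiers.
NON_SYMBOL_TOKENS = {".", "*", "||", "|", "%", "/"}


def is_identifier(token_text: str) -> bool:
    """Return True if the token text may be treated as a symbol identifier."""
    text = token_text.strip()
    return bool(text) and text not in NON_SYMBOL_TOKENS


def filter_identifiers(token_texts):
    """Keep only the token texts that are acceptable identifiers."""
    return [text for text in token_texts if is_identifier(text)]


print(filter_identifiers(["x", ".", "||", "y", "%", "/"]))
```

Keeping the rejected tokens in a single set makes it easy to add or remove entries (for example, "th" or "if") if it later turns out that those should be excluded too.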

Grouping punctuation marks (e.g., '/', '(', etc.) with identifiers

For the problems of grouping symbols together, it might be possible to fix these by not merging parts of rows where punctuation appears, and instead keeping the punctuation separate. This depends, however, on how KaTeX splits up the symbols to begin with (i.e., if it already merges punctuation into identifiers, it will be harder to separate them out).

The MathMlElementMerger should be modified such that it does not attempt to merge '(', '/', etc. It is not enough that it currently breaks on nodes that are not mi or mn elements, because parentheses and slashes can themselves be mi elements.
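The idea can be sketched as follows, using Python's standard `xml.etree.ElementTree` rather than the actual MathMlElementMerger. The `PUNCTUATION` set and the helpers `is_mergeable` and `group_mergeable` are assumptions made for illustration: a node is only a merge candidate if it is an mi or mn element whose text is not punctuation, so punctuation always ends the current run.

```python
import xml.etree.ElementTree as ET

# Assumed set of punctuation that must never be merged into an identifier.
PUNCTUATION = {"(", ")", "/", "|", ","}


def is_mergeable(element):
    """An <mi>/<mn> node with non-punctuation text is a merge candidate."""
    tag = element.tag.split("}")[-1]  # drop any XML namespace prefix
    text = (element.text or "").strip()
    return tag in ("mi", "mn") and text != "" and text not in PUNCTUATION


def group_mergeable(elements):
    """Split siblings into runs of mergeable nodes; other nodes stand alone."""
    groups, current = [], []
    for element in elements:
        if is_mergeable(element):
            current.append(element)
            continue
        if current:
            groups.append(current)
            current = []
        groups.append([element])
    if current:
        groups.append(current)
    return groups


row = ET.fromstring("<mrow><mi>d</mi><mi>/</mi><mi>n</mi></mrow>")
print([[e.text for e in group] for group in group_mergeable(list(row))])
```

With this grouping, "d/n" yields three separate groups even though the slash is an mi element, which is the case the tag-only check misses.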

mathds

mathds is not recognized by KaTeX. Additional rules need to be provided to KaTeX telling it how to parse the symbol (i.e., that it is a font-styling macro, and what font variant to apply).

textrm / textbf

Text font assignment is handled by a different module than math font assignment, even though the two might have the same visual effect. Port the rules that adjust the character positions of identifiers based on font macro positions from the math font functions to the text font functions (i.e., those defined in text.js in KaTeX).

Bad bounds of formatted symbol

Case 1: Applying styles to rows (e.g., functions).

I think that formatted functions might consistently get the wrong bounds, because I’m guessing that we don’t adjust the children of a row to have the same start and end position as the row (which is what the styling is applied to). This sounds like something to fix in the function-parsing code in parse_equation.py.

Subscript only applied to last letter in sequence of letters

When merging MathML elements, merge <mi> nodes that appear just before msub or msup nodes into those nodes. For instance, merge

<mrow>
  <mi>f</mi>
  <mi>u</mi>
  <mi>n</mi>
  <msub>
    <mi>c</mi>
    <mi>x</mi>
  </msub>
</mrow>

into

<msub>
  <mi>func</mi>
  <mi>x</mi>
</msub>
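A minimal sketch of this merge, assuming the row is available as an `xml.etree.ElementTree` element without namespaces. The function name `merge_letters_into_script` is hypothetical. Note two simplifications: the sketch leaves the outer mrow in place with the msub as its only child, and it folds every directly preceding mi into the base, which may be too eager for expressions such as "a b x_i" where the letters are separate symbols.

```python
import xml.etree.ElementTree as ET


def merge_letters_into_script(row):
    """Fold <mi> siblings that directly precede an <msub>/<msup> into its base."""
    children = list(row)
    merged = []
    for child in children:
        is_script = child.tag in ("msub", "msup")
        if is_script and len(child) > 0 and child[0].tag == "mi":
            # Collect the run of <mi> nodes that ends right before this node.
            prefix = []
            while merged and merged[-1].tag == "mi":
                prefix.insert(0, merged.pop())
            base = child[0]
            base.text = "".join(p.text or "" for p in prefix) + (base.text or "")
        merged.append(child)
    # Replace the row's children with the merged sequence.
    for child in children:
        row.remove(child)
    row.extend(merged)
    return row


row = ET.fromstring(
    "<mrow><mi>f</mi><mi>u</mi><mi>n</mi>"
    "<msub><mi>c</mi><mi>x</mi></msub></mrow>"
)
print(ET.tostring(merge_letters_into_script(row), encoding="unicode"))
```

A real implementation would also need to extend the merged base's start position to that of the first folded letter, so that the symbol's character range covers the whole name.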

Subscript grouped with next symbols

There is a bug in our equation parser. Consider the test case of this initial MathML document, produced by parsing the equation f(x)_{a,b,c}\log(f(x)) with KaTeX:

<span class="katex">
    <math xmlns="http://www.w3.org/1998/Math/MathML">
        <semantics>
            <mrow>
                <mi s2:start="0" s2:end="1" s2:index="0">f</mi>
                <mo stretchy="false" s2:start="1" s2:end="2" s2:index="1">(</mo>
                <mi s2:start="2" s2:end="3" s2:index="2">x</mi>
                <msub s2:start="3" s2:end="12" s2:index="10">
                    <mo stretchy="false" s2:start="3" s2:end="4" s2:index="3">)</mo>
                    <mrow s2:start="5" s2:end="12" s2:index="9">
                        <mi s2:start="6" s2:end="7" s2:index="4">a</mi>
                        <mo separator="true" s2:start="7" s2:end="8" s2:index="5">,</mo>
                        <mi s2:start="8" s2:end="9" s2:index="6">b</mi>
                        <mo separator="true" s2:start="9" s2:end="10" s2:index="7">,</mo>
                        <mi s2:start="10" s2:end="11" s2:index="8">c</mi>
                    </mrow>
                </msub>
                <mi>log</mi>
                <mo>⁡</mo>
                <mo stretchy="false" s2:start="16" s2:end="17" s2:index="11">(</mo>
                <mi s2:start="17" s2:end="18" s2:index="12">f</mi>
                <mo stretchy="false" s2:start="18" s2:end="19" s2:index="13">(</mo>
                <mi s2:start="19" s2:end="20" s2:index="14">x</mi>
                <mo stretchy="false" s2:start="20" s2:end="21" s2:index="15">)</mo>
                <mo stretchy="false" s2:start="21" s2:end="22" s2:index="16">)</mo>
            </mrow>
            <annotation encoding="application/x-tex">f(x)_{a,b,c}\log(f(x))</annotation>
        </semantics>
    </math>
</span>

Something in our parser flattens the structure of the equation, such that it becomes possible to merge together 'c', 'log', and some of the characters following it. I do not know what it is yet. There is evidence of the flattening, though: when the contents of the outer mrow are printed mid-parse, 'a', 'b', and 'c' have been moved into the outer row.

[image: contents of the outer mrow printed mid-parse]

The fix to this problem will likely involve breakpoint debugging to find where the flattening occurs, and prevent it.
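While hunting for the flattening step, a small invariant check can help narrow down where it happens: call it on the outer row at several points in the parse and see when it first fails. The sketch below is only an illustration. The helper names are hypothetical, the MathML is a trimmed version of the document above without the s2 attributes, and the `FLATTENED` tree is a guess at what the broken state looks like.

```python
import xml.etree.ElementTree as ET


def direct_child_texts(row):
    """Text of the row's immediate leaf children only (no descendants)."""
    return [(child.text or "").strip() for child in row if len(child) == 0]


def subscript_leaked(row, subscript_texts):
    """True if an identifier that belongs in a subscript sits directly in the row."""
    texts = direct_child_texts(row)
    return any(text in texts for text in subscript_texts)


# Trimmed version of the expected structure for f(x)_{a,b,c}\log(...).
EXPECTED = ET.fromstring(
    "<mrow><mi>f</mi><mo>(</mo><mi>x</mi>"
    "<msub><mo>)</mo><mrow><mi>a</mi><mo>,</mo><mi>b</mi><mo>,</mo>"
    "<mi>c</mi></mrow></msub><mi>log</mi></mrow>"
)
# Assumed shape of the flattened row (illustrative guess only).
FLATTENED = ET.fromstring(
    "<mrow><mi>f</mi><mo>(</mo><mi>x</mi><mo>)</mo>"
    "<mi>a</mi><mo>,</mo><mi>b</mi><mo>,</mo><mi>c</mi><mi>log</mi></mrow>"
)

print(subscript_leaked(EXPECTED, ["a", "b", "c"]))
print(subscript_leaked(FLATTENED, ["a", "b", "c"]))
```

The same check could later serve as a regression test once the flattening is fixed.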

"exp" grouped into a symbol

In KaTeX, when a special function is parsed (e.g., '\log', '\max', '\min', '\exp'):

  • Tag the identifier with a special attribute indicating that that identifier was meant to stand alone (i.e., not be grouped with others)
  • Also, tag the identifier with start and end positions, which are currently not available for these identifiers

Issues parsing built-in KaTeX macros

In a function like expandOnce within the MacroExpander, assign the expanded tokens the same source positions as the input tokens (both the command name and the args). The args can be obtained from the line that calls consumeArgs.

@andrewhead added the bug, pipeline, entity-localization, symbols, and bad-entity-detection labels Jan 5, 2021
@andrewhead andrewhead added this to the LaTeX Updates for Alpha milestone Jan 5, 2021
@andrewhead andrewhead self-assigned this Jan 8, 2021
@andrewhead

Other emergent tasks:

  • Turn 'd's for derivatives into operator symbols.
