Symbol segmentation errors #192
Labels
bad-entity-detection
An issue or task related to an entity that was detected in the wrong place
bug
Something isn't working
entity-localization
An issue or task related to entity localization
pipeline
Data processing pipeline
symbols
An issue or task related to symbols
Description: A recent run of the pipeline on 12 papers revealed several issues in how the boundaries of some symbols were detected in LaTeX. This issue catalogs those problems to provide a docket of upcoming work. Potential solutions are documented at the bottom of this comment.
Progress:
Key:
I recommend addressing the issues below in this order (roughly by decreasing importance, and selecting for the issues that I think are feasible to fix): 7, 2, 10, (5, 3, 4, 8), 11, 13, 4, 3
The following problems are probably due to formatted 'mrow's:
Missing formatted base (might already be fixed?) 💡
"*" parsed as symbol (maybe) 💡
"||" parsed as symbol 💡
"|" parsed as symbol 💡
"%" parsed as symbol 💡
"/" parsed as symbol 💡
"th" parsed as symbol (not sure we want to fix this one; it might have undesired side effects): 😶
"if" parsed as symbol: 😶
"argmin" parsed as symbol ("\mathrm { argmin }...", line 73) 😶
"/" grouped into a symbol: 💡
")" grouped into a symbol: 💡
"exp" grouped into a symbol: 💡
"|_2" grouped into a symbol: 💡
"=" grouped into a symbol: 🤔
Accent not grouped with symbol (not an error: leftarrow and rightarrow are not hats, but rather arrows shown on the same baseline) 😶
Papers inspected:
How to fix
Fixes for specific problems are listed roughly in order of greatest to least importance.
"___" parsed as symbol
Disable the detection of ".", "*", "||", "|", "%", "/" as symbols; in particular, don't accept them as identifiers in the equation parser.
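A minimal sketch of such a filter, assuming the equation parser sees each candidate identifier as a plain string. The names NON_SYMBOL_TOKENS and is_identifier are hypothetical and are not taken from the pipeline:

```python
# Hypothetical blocklist: the tokens this issue says should not become symbols.
NON_SYMBOL_TOKENS = {".", "*", "||", "|", "%", "/"}


def is_identifier(token_text: str) -> bool:
    """Return True if the token text should be accepted as an identifier."""
    text = token_text.strip()
    return bool(text) and text not in NON_SYMBOL_TOKENS


# Example: only the letters survive the filter.
tokens = ["x", "*", "y", "||", "w", "%", "/"]
identifiers = [t for t in tokens if is_identifier(t)]
print(identifiers)
```

A blocklist keeps the change small, but it only covers the tokens seen in this run; other punctuation would need to be added as it turns up.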
Grouping punctuation marks (e.g., '/', '(', etc.) with identifiers
For the problems of grouping symbols together, it might be possible to fix these by not merging parts of rows where punctuation appears, and instead keeping the punctuation separate. This depends, however, on how KaTeX splits up the symbols to begin with (i.e., if it already merges punctuation into identifiers, it will be harder to separate them out).
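One way to picture the "keep punctuation separate" idea is the sketch below. It assumes the row's children arrive as separate MathML elements and uses the standard library's ElementTree rather than whatever the pipeline uses. The PUNCTUATION set, is_mergeable, and merge_runs are hypothetical names for illustration only:

```python
import xml.etree.ElementTree as ET

# Hypothetical set of characters that should always end a run of merged nodes.
PUNCTUATION = {"(", ")", "/", "|", ",", "="}


def is_mergeable(node):
    """An <mi> or <mn> node may join a run only if its text is not punctuation."""
    text = (node.text or "").strip()
    return node.tag in ("mi", "mn") and text not in PUNCTUATION


def merge_runs(row):
    """Join consecutive mergeable children of a row; anything else ends the run."""
    runs, current = [], []
    for child in row:
        if is_mergeable(child):
            current.append((child.text or "").strip())
        elif current:
            runs.append("".join(current))
            current = []
    if current:
        runs.append("".join(current))
    return runs


row = ET.fromstring(
    "<mrow><mi>a</mi><mi>b</mi><mi>/</mi><mi>c</mi>"
    "<mi>(</mi><mi>x</mi><mi>)</mi></mrow>"
)
print(merge_runs(row))
```

Checking the element's text as well as its tag is the point here: a tag-only check cannot tell a slash in an mi element from a letter.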
The MathMlElementMerger should be modified such that it does not attempt to merge '(', '/', etc. It is not enough for it to break on nodes that are not mi or mn elements, as it currently does, because parentheses and slashes are mi elements.
mathds
mathds is not recognized by KaTeX. Additional rules need to be provided to KaTeX telling it how to parse the symbol (i.e., that it is a font-styling macro, and what font variant to apply).
textrm / textbf
Text font assignment is handled by a different module than math font assignment, even though they might have the same visual effect. Transfer the rules for adjusting the character positions of identifiers using font macro positions from the math font functions over to the text font functions (i.e., those defined in text.js in KaTeX).
Bad bounds of formatted symbol
Case 1: Applying styles to rows (e.g., functions).
I think that where there are formatted functions, this might consistently get it wrong, because I’m guessing that we don’t adjust the children of a row to have the same start and end position as the row (which is what the styling is applied to). This sounds like something to fix in the function-parsing function in parse_equation.py.
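A rough sketch of what that adjustment could look like, under one possible reading: the first child is stretched to the row's start and the last child to the row's end, so that the children together cover the styled span. The Node class and align_children_to_row are hypothetical and do not reflect the actual data structures in parse_equation.py:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical parsed node with character offsets into the TeX source."""

    tag: str
    start: int
    end: int
    children: List["Node"] = field(default_factory=list)


def align_children_to_row(row: Node) -> None:
    """Stretch the outermost children so they span the same range as the row."""
    if not row.children:
        return
    row.children[0].start = row.start
    row.children[-1].end = row.end


# "\mathbf{ab}": the styled row covers characters 0-11, the letters only 8-10.
row = Node("mrow", 0, 11, [Node("mi", 8, 9), Node("mi", 9, 10)])
align_children_to_row(row)
print([(c.start, c.end) for c in row.children])
```

If the row is later merged into a single symbol, the merged symbol then inherits bounds that include the styling macro and its braces.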
Subscript only applied to last letter in sequence of letters
When merging MathML elements, merge <mi> nodes that appear just before msub or msup nodes into those nodes. For instance, merge into
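The merge described above might be sketched as follows, assuming it means folding a run of mi siblings into the base of the msub or msup that follows them. This uses the standard library's ElementTree, and merge_letters_into_scripts is a hypothetical name, not a function in the pipeline:

```python
import xml.etree.ElementTree as ET


def merge_letters_into_scripts(row):
    """Fold <mi> siblings that directly precede an <msub>/<msup> into its base."""
    new_children = []
    for child in list(row):
        has_mi_base = len(child) > 0 and child[0].tag == "mi"
        if child.tag in ("msub", "msup") and has_mi_base:
            prefix = []
            # Walk backwards over the <mi> nodes collected just before this one.
            while new_children and new_children[-1].tag == "mi":
                prefix.insert(0, new_children.pop().text or "")
            child[0].text = "".join(prefix) + (child[0].text or "")
        new_children.append(child)
    row[:] = new_children
    return row


row = ET.fromstring(
    "<mrow><mi>a</mi><mi>b</mi><msub><mi>c</mi><mn>1</mn></msub></mrow>"
)
merge_letters_into_scripts(row)
print(ET.tostring(row, encoding="unicode"))
```

One caveat with this approach: it cannot distinguish a multi-letter name from two separate variables written next to each other, so it would also merge the x in "x y_1" into the subscripted symbol.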
Subscript grouped with next symbols
There is a bug in our equation parser. Consider the test case of this initial MathML document, produced by parsing the equation f(x)_{a,b,c}\log(f(x)) with KaTeX:
Something in our parser flattens the structure of the equation, such that it becomes possible to merge together 'c', 'log', and some of the characters following it. I do not know what it is yet, but there is evidence of the flattening: when the contents of the outer mrow are printed mid-parse, 'a', 'b', and 'c' have been moved into the outer row.
The fix to this problem will likely involve breakpoint debugging to find where the flattening occurs, and preventing it.
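While hunting for the flattening, a small check like the one below could be called between parser steps to see which step moves tokens into the outer row. The MathML here is hand-written and simplified, not actual KaTeX output, and direct_leaf_texts and leaked_tokens are hypothetical helper names:

```python
import xml.etree.ElementTree as ET


def direct_leaf_texts(row):
    """Texts of childless elements that sit directly inside the given row."""
    return [child.text for child in row if len(child) == 0 and child.text]


def leaked_tokens(before, after):
    """Leaf texts that are direct children of `after` but were not in `before`."""
    original = direct_leaf_texts(before)
    return [t for t in direct_leaf_texts(after) if t not in original]


# Hand-written stand-ins: subscript contents nested, then moved to the outer row.
nested = ET.fromstring(
    "<mrow><msub><mi>f</mi><mrow><mi>a</mi><mo>,</mo><mi>b</mi>"
    "<mo>,</mo><mi>c</mi></mrow></msub><mi>log</mi></mrow>"
)
flattened = ET.fromstring(
    "<mrow><mi>f</mi><mi>a</mi><mo>,</mo><mi>b</mi>"
    "<mo>,</mo><mi>c</mi><mi>log</mi></mrow>"
)
print(leaked_tokens(nested, flattened))
```

An empty result means the outer row gained no new leaf tokens during that step; a non-empty one points at the step to inspect with a breakpoint.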
"exp" grouped into a symbol
In KaTeX, when a special function is parsed (e.g., '\log', '\max', '\min', '\exp') tag...
Issues parsing built-in KaTeX macros
In a function like expandOnce within the MacroExpander, assign the expanded tokens the same source positions as the input tokens (both the command name and the args). The args can be obtained from the line that calls consumeArgs.