Paper-processing issues in the latest processing batch of papers #178

Open
andrewhead opened this issue Dec 19, 2020 · 3 comments

Comments

@andrewhead
Contributor

andrewhead commented Dec 19, 2020

@rreas recently processed approximately 100 recent machine learning papers. In this issue, I'm cataloging all of the issues I've seen in the processed papers I've inspected. Data quality has increased dramatically with recent changes, but we still have a long way to go. The list below should help in prioritizing which changes to address first.

General

Complete failure (no citations, no symbols, no definitions, nothing):

  • 1906.00414v2 (citations are mysteriously working now)
  • 1805.08092v1
  • 1712.05773v2
  • 1804.08286v1
  • 1902.03680v3
  • 1706.03850v3
  • 1802.07740v2

(It seems that if there is no entity data, the "Loading citation data..." loading bar never disappears.)

Massive entity:

  • 1805.08660v1
  • 1908.00300v1 (page 3)
  • 1704.05838v1 (page 4)
  • 1711.08028v4 (page 11)

An annotation erroneously spans across columns:

  • 1702.01287v1

Annotation bounding box includes clipping of figure:

  • 1704.05838v1 (page 4, "\lambda_1")

Annotation breaks across columns, and tooltip shows at the top of the second column:

  • 1903.00621v1 ("classification subnet", page 3)

Citations

Citation clicked in one column highlights in another column:

  • 1906.01502v1

A large number of citations aren't detected:

  • 1702.01287v1
  • 1908.00300v1
  • 1811.12359v4

A lot of citations resolve to the wrong paper:

  • 1702.01287v1
  • 1908.00300v1

On initial load of paper, once citations have been fetched, the underlines don't appear:

  • 1903.00621v1
  • 1811.12359v4

Multiple citations to the same paper are grouped into one annotation:

  • 1707.00683v3 (page 2, citation [17])
  • 1711.08028v4 (page 1, "Santoro et al., 2017"; page 4, "Rae et al. 2016")
  • 1806.09231v2 (page 8, [1]; vertical on page 10, [34])
  • 1905.10887v2 (page 1, [3])

Symbols

Symbol parsing is off (maybe an issue with macros?):

  • 1906.01502v1 (E_{train})

Very few / no symbols detected:
(It seems that, in particular, there's something going wrong with processing anything besides the citations in the NeurIPS papers. Maybe it's an issue with colorization of entities.)

  • 1706.08482v1
  • 1705.06566v2
  • 1806.02371v1
  • 1901.10159v1
  • 1811.12359v4
  • 1709.07902v1
  • 1806.09231v2
  • 1906.04604v1

Clicking on a symbol causes the interface to crash:

  • 1702.01287v1
  • 1706.08482v1
  • 1905.10887v2

Display equation is not clickable:

  • 1908.00300v1
  • 1704.05838v1 (equation 1)
  • 1903.00621v1 (equation 1)

LaTeX doesn't render for symbol:

  • 1906.01502v1 (v_{DE_i}^{(l)})

All entities in one column are slightly vertically offset:

  • 1908.00300v1 (page 4)

Symbol "e" isn't detected:

  • 1905.05475v2

Symbol includes underscore that should be merged with the other symbols:

  • 1805.08660v1

Sentences

Declutter doesn't work in algorithm (i.e., pseudocode) listings:

  • 1805.08660v1

Declutter doesn't work in figure captions / subfigure captions:

  • 1903.00621v1

Declutter doesn't work at very beginning of abstract:

  • 1903.00621v1

Some other sentences are missing from declutter:

  • 1711.08028v4 (use of the word "relational" in a couple of sentences in the abstract)

Terms

Term span extends into a word outside of it (e.g., section header, citation):

  • 1701.02810v2
  • 1704.05838v1

False positives from abbreviation that matches common word:

  • 1905.05475v2 ("be", page 1)

Citation is marked as a term:

  • 1704.05838v1 ([4] on page 2)

Term includes citation:

  • 1903.00621v1
  • 1711.08028v4

Definitions

No definitions found:

  • 1906.01502v1
  • 1705.06566v2
  • 1906.04604v1

Very few definitions found (this is not a complete list; many papers have only a few definitions found):

  • 1707.00683v3
  • 1806.09231v2

Definition starts in the middle of a word:

  • 1711.08028v4

No definition shows when an underlined term/symbol is clicked:

  • 1711.08028v4 ("Sudoku", e.g., on page 2, also "j" on page 3)

Reviewed

(Two papers were chosen from each year of {2017, 2018, 2019} for each of ICML, NeurIPS, CVPR, and ACL, for a total of 24 papers, or maybe a couple more than that.)

Examples of pretty good papers:

@andrewhead
Contributor Author

andrewhead commented Dec 19, 2020

Here is the list of papers that rreas processed, for continuing the above analysis. Access the reader for a specific paper at https://scholarphi.semanticscholar.org/?file=https://arxiv.org/pdf/.pdf&preset=demo

1701.07481v3
1702.01287v1
1701.02810v2
1704.06851v1
1704.07535v1
1704.07073v1
1704.06877v2
1702.01569v2
1705.01020v1
1704.07489v2
1805.08660v1
1805.08092v1
1805.09016v1
1805.06939v2
1805.04833v1
1805.05388v1
1805.00631v3
1805.01817v1
1805.10163v1
1805.04174v1
1908.00300v1
1905.05475v2
1907.08854v3
1906.01195v1
1808.04339v2
1906.00414v2
1906.01502v1
1901.11504v2
1905.05460v2
1905.06319v2
1706.08482v1
1704.05838v1
1703.08359v1
1611.09842v3
1612.00606v1
1611.00471v2
1701.02096v2
1702.07432v1
1611.07675v1
1703.02719v1
1804.08286v1
1712.05773v2
1804.00650v1
1711.07846v3
1803.03345v2
1803.10082v1
1801.01615v1
1803.10368v2
1803.05619v2
1711.06454v6
1903.00621v1
1902.03680v3
1904.03582v1
1812.01855v2
1904.05647v1
1811.08585v2
1903.01534v1
1812.01187v2
1906.01314v2
1806.01482v2
1611.01929v4
1612.00188v5
1610.09585v4
1706.03358v3
1706.03850v3
1705.06566v2
1703.00837v2
1705.02426v2
1611.06585v2
1609.07152v3
1802.01737v2
1802.04223v2
1711.00851v3
1712.05055v2
1803.00443v1
1802.07740v2
1806.02371v1
1802.09640v3
1802.08246v3
1805.11183v1
1901.10159v1
1811.12359v4
1905.07121v2
1810.02912v2
1812.10613v3
1811.12470v4
1805.10204v1
1811.04551v5
1902.02671v2
1901.00210v4
1707.00683v3
1709.07902v1
1705.07673v2
1705.08086v2
1704.00648v2
1710.02224v3
1705.10412v2
1705.04405v2
1705.08947v2
1705.10461v3
1711.08028v4
1806.09231v2
1806.07572v4
1805.08328v2
1804.00792v2
1805.08162v1
1812.04224v1
1809.01185v2
1802.09583v2
1805.07467v2
1906.04604v1
1905.10887v2
1906.01140v2
1810.12065v4
1811.01727v3
1906.08632v2
1905.13177v1
1903.03894v4
1905.13289v2
1908.04742v3

@andrewhead
Contributor Author

andrewhead commented Dec 28, 2020

One reason that citations are resolving to the incorrect paper is that, in some papers, the .bbl file gives the title of a reference as the second argument to an \href command. Our BBL parser discards these titles. As a result, the matching of references to papers from the Semantic Scholar API does not use the title, which produces matches to incorrect papers. I have observed this at least for the citations to Caglayan et al. and Calixto et al. in paper 1702.01287v1.

This also appears to be the cause of some failed resolutions for paper 1908.00300v1. I speculate that this format of including hyperlinks for reference entries is common to the ACL reference format.
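
To make the \href case concrete, here is a minimal sketch of how the second argument could be recovered from a bibliography entry. It is not the pipeline's actual BBL parser: the function names `read_braced_group` and `extract_href_texts` are hypothetical, and the sample entry is made up rather than copied from either paper. It assumes both \href arguments are written as balanced brace groups.

```python
from typing import List, Optional, Tuple


def read_braced_group(text: str, start: int) -> Optional[Tuple[str, int]]:
    """Read a balanced {...} group beginning at or after text[start].

    Leading whitespace is skipped. Returns (contents, index just past the
    closing brace), or None if no balanced group starts there.
    """
    i = start
    while i < len(text) and text[i].isspace():
        i += 1
    if i >= len(text) or text[i] != "{":
        return None
    depth = 0
    j = i
    while j < len(text):
        ch = text[j]
        if ch == "\\":
            j += 2  # skip escaped characters such as \{ or \}
            continue
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return text[i + 1 : j], j + 1
        j += 1
    return None  # unbalanced braces


def extract_href_texts(bibitem: str) -> List[str]:
    """Return the second argument of every \\href{url}{text} in a bibitem."""
    texts: List[str] = []
    pos = 0
    while True:
        pos = bibitem.find("\\href", pos)
        if pos == -1:
            break
        url = read_braced_group(bibitem, pos + len("\\href"))
        if url is None:
            pos += len("\\href")
            continue
        label = read_braced_group(bibitem, url[1])
        if label is None:
            pos = url[1]
            continue
        texts.append(label[0].strip())
        pos = label[1]
    return texts


# Hypothetical entry in the style described above (not taken from a real paper).
sample = r"""\bibitem[{Doe et~al.(2017)}]{doe2017}
Jane Doe and John Roe. 2017.
\newblock \href{https://example.org/paper}{A {Hypothetical} title with braces}.
\newblock In \emph{Proceedings of an Example Venue}."""

print(extract_href_texts(sample))
```

Brace matching is used instead of a regular expression because titles in .bbl files often contain nested braces for case protection, which a simple pattern would truncate. Whether the recovered text is really the title (rather than, say, a DOI label) would still need to be checked per reference style.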

@andrewhead
Contributor Author

I looked into the papers that were missing symbols, and determined two main reasons symbols were missing:

  • Macros (4 papers: 1706.08482v1 (a ton!); 1806.02371v1 (a handful); 1811.12359v4 (a handful); 1806.09231v2 (a handful))
  • Processing probably took too long to run; it works fine when run without time limits (4 papers: 1705.06566v2, 1901.10159v1, 1709.07902v1, 1906.04604v1)
