Combining `--multilang` and paragraph-level annotations #45

jelmervdl · 2023-11-03T14:42:21Z

I've been trying to keep `--multilang` working (that is, CLD2 splitting a document into up to three documents, each with a different language label) while adding classifiers and JSON support. But should we?

Does anyone use --multilang? I think in HPLT we want to avoid breaking up documents, so it won't be used by us.

How is `--multilang` supposed to work with the `--identify-paragraphs` option? The current implementation treats each split-off document as a document in its own right, so you can only reproduce these stand-off annotations if you use exactly the same langid, so that the split happens in exactly the same places. This sounds like a bug to me.

Do we want to keep `--multilang` support when adding other paragraph-level annotations, such as the name of the block element (tag) that delineated each paragraph? It is more cumbersome to implement, because the boundaries of the langid chunks are wherever CLD2 puts them, not the paragraph boundaries that warc2text introduces when parsing HTML.
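To make the boundary mismatch concrete, here is a minimal Python sketch. It is not warc2text code and does not call CLD2: the `chunks` offsets are hand-picked, hypothetical values standing in for whatever a language identifier might return, and `paragraph_spans` and `straddling_paragraphs` are illustrative helper names. It only shows that a paragraph can end up straddling two language chunks, at which point a per-paragraph annotation no longer maps onto a single split-off document.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # half-open [start, end) character offsets


def paragraph_spans(text: str) -> List[Span]:
    """Return the span of each newline-delimited paragraph (newline excluded)."""
    spans, start = [], 0
    for line in text.split("\n"):
        spans.append((start, start + len(line)))
        start += len(line) + 1  # skip the newline separator
    return spans


def straddling_paragraphs(paragraphs: List[Span], chunks: List[Span]) -> List[int]:
    """Return indices of paragraphs that overlap more than one language chunk."""
    result = []
    for i, (ps, pe) in enumerate(paragraphs):
        overlaps = sum(1 for cs, ce in chunks if cs < pe and ps < ce)
        if overlaps > 1:
            result.append(i)
    return result


text = "Hello world\nBonjour le monde and more English\nHallo wereld"
paragraphs = paragraph_spans(text)

# Hypothetical language-chunk boundaries (assumption, not real CLD2 output):
# the second boundary falls in the middle of the second paragraph.
chunks = [(0, 12), (12, 28), (28, len(text))]

print(straddling_paragraphs(paragraphs, chunks))
```

In this toy input only the second paragraph crosses a chunk boundary. Any annotation attached to it (a paragraph id, a block tag) would have to be either duplicated across both split-off documents or assigned to one of them arbitrarily.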

Related to #35.
