Further extend the JSONL output to contain all text and metadata so that it forms the complete output, without base64 encoding of the document. We'll need proper JSON escaping to deal with non-Unicode data, or a guarantee that all text coming out of warc2text is always valid Unicode. See alternative output format based on JSONlines #34 and Add `--jsonl` option #35.
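To guarantee the latter, the text could be sanitized before serialization. A minimal sketch, assuming lossy replacement is acceptable (the function name is made up; this is not warc2text code): any byte sequence that is not valid UTF-8 is replaced with U+FFFD so the string can always be written as valid JSON.

```cpp
#include <cstddef>
#include <string>

// Replace invalid UTF-8 sequences with U+FFFD so the result can always be
// serialized as valid JSON. Only checks sequence lengths and continuation
// bytes, not overlong encodings or surrogates.
std::string sanitize_utf8(const std::string &in) {
    static const char replacement[] = "\xEF\xBF\xBD"; // U+FFFD
    std::string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size();) {
        unsigned char c = static_cast<unsigned char>(in[i]);
        std::size_t len = c < 0x80 ? 1
                        : (c & 0xE0) == 0xC0 ? 2
                        : (c & 0xF0) == 0xE0 ? 3
                        : (c & 0xF8) == 0xF0 ? 4 : 0;
        bool ok = len != 0 && i + len <= in.size();
        for (std::size_t j = 1; ok && j < len; ++j)
            ok = (static_cast<unsigned char>(in[i + j]) & 0xC0) == 0x80;
        if (ok) {
            out.append(in, i, len);
            i += len;
        } else {
            out.append(replacement, 3);
            ++i;
        }
    }
    return out;
}
```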
For each text segment (i.e. line) in the text, also mark the block-level tag it was found in. This should help identify the short `<li>` and `<td>` data, although I would not be surprised if we end up seeing a lot of `<div>`. Track html tags #46
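Purely as an illustration of the shape a record could take once both points above are in, with per-line tags alongside the text (the field names here are made up, nothing is decided):

```json
{"url": "http://example.com/", "lang": "en", "text": "First item\nSecond item\nA longer paragraph.", "tags": ["li", "li", "p"]}
```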
Output the byte offset at which the gzip-compressed WARC record begins.
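For .warc.gz files where every record is its own gzip member (the usual layout), this is just the position in the compressed file right before the reader starts inflating the next member; with that offset stored as metadata, a single record can later be re-read by seeking there and inflating one member. A rough sketch, assuming the reader exposes the underlying `FILE*` (names are illustrative, not the actual warc2text reader API):

```cpp
#include <cstdint>
#include <cstdio>

// Sketch only: capture where the next gzip member (i.e. the next WARC record)
// starts in the compressed file, before handing the stream to the inflater.
std::uint64_t current_record_offset(std::FILE *warc_gz) {
    long pos = std::ftell(warc_gz);
    return pos < 0 ? 0 : static_cast<std::uint64_t>(pos);
}
```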
Replace fasttext with fastertext. It's free speed, except that that repo is currently missing the `string_view` modification.
Add an option to skip langid entirely, and just write a single output. We can then do langid downstream if we decide to. The idea being that any mistake we make with langid in warc2text is irrecoverable: once a document is wrongly classified, the only correction we can do is remove it at the end. We don't have a method of moving the document into the correct stream. We discussed improving the langid inside warc2text, but the argument was that developing good langid in just C++ was harder.
Right now you could decide to ignore the language attribute in the JSON output, since that doesn't get split into multiple files anyway. I don't think current lang-id is slow enough to add a special bypass option for it.
Add an option à la pdf-pass to write the robots.txt responses to a separate WARC. Also include 404s etc., so we know which domains were asked but did not give us a robots.txt (which we'll interpret as crawling allowed). Shunt robots.txt responses to separate warc #41
Boilerplate detection like trafilatura's might work, but it is relatively expensive since it needs to build a proper DOM tree, and it would be a lot of work to port over to C++. We will first try some simpler rule- or classification-based document prefix/suffix removal on the text data itself.
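As a strawman for what "simpler" could look like, even something along these lines captures the prefix/suffix idea (purely illustrative; the heuristics and thresholds are made up):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Heuristic: short lines and lines without sentence-ending punctuation are
// treated as likely boilerplate (menus, footers, cookie banners).
static bool looks_like_boilerplate(const std::string &line) {
    if (line.size() < 30) return true;
    char last = line.back();
    return last != '.' && last != '!' && last != '?';
}

// Strip runs of boilerplate-looking lines from the start and end of the
// document, leaving the body untouched.
std::vector<std::string> trim_document(std::vector<std::string> lines) {
    std::size_t begin = 0, end = lines.size();
    while (begin < end && looks_like_boilerplate(lines[begin])) ++begin;
    while (end > begin && looks_like_boilerplate(lines[end - 1])) --end;
    return {lines.begin() + begin, lines.begin() + end};
}
```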