- Fix inconsistent rule naming for not_too_long and missing_columns.
- Fix division by 0 error on empty sentences.
- Fixed rules that were giving false positives on empty sentences (no titles, wrong language).
- For performance, long sentences (>1024 chars) are ignored by default and only "not_too_long" is output. Added "--dont_ignore_long" flag to override this behaviour.
- `monocleaner-hardrules` now supports `--run_all_rules`.
- `monocleaner-hardrules` is now a standalone script.
- Updated README.
- `monocleaner-download` quiet mode.
- Precompile punctuation normalization regular expressions for better speed.
- Update FastSpell to 0.9.1.
- Add option to detect Serbo-Croatian script with FastSpell.
- Update FastSpell to 0.8.
- Better coverage for Icelandic.
- Automatic installation of dictionaries.
- Call FastSpell only once when `--add_lang_ident` is enabled.
- Migrate to pyproject and src/ code structure.
- Discarding sentences as wrong_language when detect script is enabled.
- Discarding sentences as wrong_language when hardrules is disabled.
- Always printing lang id regardless of whether `--add_lang_ident` is true or false.
- Request FastSpell to tag all Serbo-Croatian variants under `hbs`.
- `--add_lang_ident` to append a column with identified language code.
- FastSpell mode to aggressive.
- Monocleaner training script to train fluency filter models.
- FastSpell for language identification.
- Monolingual character-based fluency filter.
- Monolingual hardrules version.