Fix unusual multiword-entries #34
Comments
I think a lot of these phrases are originally from eng-deu. I can do some grepping in an hour or two and verify this. |
Yeah, I can envision many of them coming from specific language pairs, and some may even have been added from Finnish too, but we need a future-proof solution for language-specific hacks to the monodix, e.g. -separables. I listed a handful of examples, but there are dozens if not hundreds in -eng, not all of them equally questionable... |
See #35 for the ~900 that are only in eng-deu. The biggest users of -eng multiwords seem to be fin-eng, isl-eng, and eng-deu. |
These seem good, but I guess they should be reviewed by a native speaker ;-)
A large stash of the eng-fin ones were probably added semi-automatically from untrimmed debug output or other questionable sources and can simply be deleted... but if you have some scripts in place to easily generate a list, I could have a look. |
Here are the multiwords not affected by #35 and which bidixes they appear in. |
Hmm, yeah, it's a real mishmash of things, ranging from acceptable lexical units to random combinations of adjacent words... I don't know if there's any good heuristic for deciding whether they go to the language-specific or the monolingual part, other than going through the list by hand. Maybe someone can come up with tactics? |
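For anyone who wants to regenerate such a list for manual review, here is a minimal sketch of one way to do it. It is not the script used in this thread. It assumes the usual .dix conventions: monodix `<e>` entries carry an `lm` attribute, bidix sides are `<l>` and `<r>`, blanks inside an entry are written `<b/>`, and `<g>` wraps an invariable part. The function names and the tiny sample dictionaries are made up for illustration.

```python
import xml.etree.ElementTree as ET


def multiword_lemmas(monodix_xml):
    """Return the lemmas (lm attributes) in a monodix that contain a space."""
    root = ET.fromstring(monodix_xml)
    return {e.get("lm") for e in root.iter("e")
            if e.get("lm") and " " in e.get("lm").strip()}


def surface(elem):
    """Rebuild the lemma text of a bidix <l>/<r>, turning <b/> into a space.

    Tags such as <s n="..."/> contribute nothing; <g> groups are descended into.
    """
    parts = [elem.text or ""]
    for child in elem:
        if child.tag == "b":
            parts.append(" ")
        elif child.tag == "g":
            parts.append(surface(child))
        parts.append(child.tail or "")
    return "".join(parts)


def bidix_lemmas(bidix_xml, side):
    """Collect lemma strings from one side ("l" or "r") of a bidix."""
    root = ET.fromstring(bidix_xml)
    return {surface(el).strip() for el in root.iter(side)}


def where_used(monodix_xml, bidixes):
    """Map each multiword lemma to the names of the bidixes that contain it.

    bidixes: {pair name: (bidix XML string, side that holds this language)}
    """
    usage = {lm: [] for lm in sorted(multiword_lemmas(monodix_xml))}
    for name, (xml, side) in bidixes.items():
        lemmas = bidix_lemmas(xml, side)
        for lm in usage:
            if lm in lemmas:
                usage[lm].append(name)
    return usage


# Invented sample data, only to show the shape of the input.
MONO = """<dictionary><section id="main" type="standard">
<e lm="as well as"><i>as<b/>well<b/>as</i><par n="as__cnjcoo"/></e>
<e lm="house"><i>house</i><par n="house__n"/></e>
</section></dictionary>"""

BIDIXES = {
    "eng-deu": ("""<dictionary><section id="main" type="standard">
<e><p><l>as<b/>well<b/>as<s n="cnjcoo"/></l><r>sowie<s n="cnjcoo"/></r></p></e>
</section></dictionary>""", "l"),
    "fin-eng": ("""<dictionary><section id="main" type="standard">
<e><p><l>talo<s n="n"/></l><r>house<s n="n"/></r></p></e>
</section></dictionary>""", "r"),
}

usage = where_used(MONO, BIDIXES)
for lemma, pairs in usage.items():
    print(lemma, "\t", ", ".join(pairs) or "(unused)")
```

Entries that show up in exactly one bidix would be the natural candidates for a language-specific dictionary, and entries used nowhere are candidates for deletion, but either way the list still needs a human pass.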
There are a large number of entries in the current apertium-eng.eng.dix that contain whitespace and are not typical lexical entries. These should be moved to the various -separable dictionaries or just removed altogether. Examples: