Wikipedia [ modifier | modifier le code ] grossly confusing fr-en model #67
This is an issue with ParaCrawl, where often one sentence on the source side is aligned to about 20 different sentences on the target side, one of which is the correct translation, and the rest being metadata like the one you see. We ran the French ParaCrawl through the deduplicator on both sides, and apparently what happened is that it just kept the first src–trg pair, which often turned out to be a genuine source sentence aligned to junk metadata on the target side, and there you have the result.
The problem is so prevalent that dedup threw away 35% of the fr-en ParaCrawl.
The deduper is designed to keep the first input and remove the subsequent duplicates. I think you want to keep the one with the highest bicleaner score?
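The keep-highest-score variant suggested here could be sketched as follows. This is a minimal illustration, not the actual deduper's code, and it assumes a hypothetical tab-separated layout of source, target, and bicleaner score per line:

```python
# Sketch: deduplicate a parallel corpus by source sentence, keeping the pair
# with the highest bicleaner score rather than the first pair encountered.
# Assumed (hypothetical) line layout: source \t target \t score

def dedup_by_best_score(lines):
    best = {}  # source sentence -> (score, target sentence)
    for line in lines:
        src, trg, score = line.rstrip("\n").split("\t")
        score = float(score)
        if src not in best or score > best[src][0]:
            best[src] = (score, trg)
    return [f"{src}\t{trg}\t{score}" for src, (score, trg) in best.items()]

# Toy example: the first pair is exactly the failure mode described above,
# a genuine source sentence aligned to Wikipedia edit-link metadata.
corpus = [
    "Il habite ici.\t[ modifier | modifier le code ]\t0.12",
    "Il habite ici.\tHe lives here.\t0.93",
]
print(dedup_by_best_score(corpus))
```

A first-seen-wins deduper would keep the metadata pair here; scoring by bicleaner keeps the real translation instead.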
Yes, but we were in a pinch ;/. Or we could have translated and computed BLEU scores against the synthetic translation. Or anything else but what we did. Tbh I didn't expect the model to remember those cases so well...
https://fr.wikipedia.org/wiki/Droupadi_Murmu is one example, but this is all over French Wikipedia. Did the data cleaning remove `|`?
Source has `[ modifier | modifier le code ]` (French Wikipedia's edit links); the target shows a very confused model (screenshots omitted).
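One way such segments could be filtered out, a minimal sketch rather than what the actual cleaning pipeline does, is to drop any pair whose target side matches the Wikipedia edit-link pattern:

```python
import re

# Hypothetical filter: reject sentence pairs whose target side contains
# French Wikipedia edit-link boilerplate like "[ modifier | modifier le code ]"
# or the shorter "[ modifier ]".
EDIT_LINK = re.compile(r"\[\s*modifier(\s*\|\s*modifier le code)?\s*\]")

def keep_pair(src, trg):
    """Return True if the pair should be kept (no edit-link metadata in target)."""
    return not EDIT_LINK.search(trg)

print(keep_pair("Il habite ici.", "He lives here."))
print(keep_pair("Il habite ici.", "[ modifier | modifier le code ]"))
```

A regex this narrow only catches the one pattern discussed in this issue; a real cleaning step would need a broader list of wiki-markup artifacts on both sides.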