
Wikipedia [ modifier | modifier le code ] grossly confusing fr-en model #67

Open
kpu opened this issue Jul 30, 2022 · 4 comments

@kpu
Member

kpu commented Jul 30, 2022

Example: https://fr.wikipedia.org/wiki/Droupadi_Murmu, but this is all over French Wikipedia. Did the data cleaning remove the | character?

Source has [ modifier | modifier le code ]:
[screenshot: src]

Target has a very confused model:
[screenshot: tgt]
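
One possible workaround for the symptom described above is to strip the section-edit links from the input before it reaches the model. The following is only a sketch: the `EDIT_MARKER` pattern and the `strip_edit_markers` helper are hypothetical names, not part of any existing cleaning pipeline, and the pattern also accepts the marker with the `|` missing in case an earlier cleaning step dropped it.

```python
import re

# Hypothetical pattern for the French Wikipedia section-edit links.
# The pipe is optional so the marker still matches if cleaning removed "|".
EDIT_MARKER = re.compile(r"\[\s*modifier\s*\|?\s*modifier le code\s*\]")


def strip_edit_markers(line: str) -> str:
    """Remove edit-link markers and collapse the leftover whitespace."""
    cleaned = EDIT_MARKER.sub(" ", line)
    return " ".join(cleaned.split())


if __name__ == "__main__":
    print(strip_edit_markers("Biographie [ modifier | modifier le code ]"))
```

This only hides the marker at the input side; it does not fix the training data that taught the model the bad mapping.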

@XapaJIaMnu
Contributor

This is an issue with ParaCrawl, where one sentence on the source side is often aligned to about 20 different sentences on the target side. One of them is the correct translation and the rest are metadata like what you see here. We ran the Fr ParaCrawl data through the deduplicator on both sides, and apparently it just kept the first src-trg pair it saw, which often turned out to be a genuine source sentence aligned to crap metadata on the target side. That is how you get this result.

@XapaJIaMnu
Contributor

The problem is so prevalent that dedup threw away 35% of fr-en ParaCrawl.

@kpu
Member Author

kpu commented Jul 30, 2022

The deduper is designed to keep the first input and remove the subsequent ones. I think you want the one with the highest Bicleaner score?
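
To make the difference concrete, here is a minimal sketch of the two policies. It assumes each pair already carries a quality score (for example, one produced by Bicleaner) and, as a simplification, it deduplicates on the source side only. The function names and the scores in the example are made up for illustration; this is not the actual deduper's code.

```python
from typing import Dict, Iterable, List, Tuple

# (source sentence, target sentence, quality score); the layout is an assumption.
Pair = Tuple[str, str, float]


def dedup_keep_first(pairs: Iterable[Pair]) -> List[Pair]:
    """Keep the first pair seen for each source sentence."""
    seen: Dict[str, Pair] = {}
    for pair in pairs:
        seen.setdefault(pair[0], pair)
    return list(seen.values())


def dedup_keep_best(pairs: Iterable[Pair]) -> List[Pair]:
    """Keep the highest-scoring pair for each source sentence."""
    best: Dict[str, Pair] = {}
    for pair in pairs:
        src, _, score = pair
        if src not in best or score > best[src][2]:
            best[src] = pair
    return list(best.values())


if __name__ == "__main__":
    # Illustrative data only: the scores are invented.
    pairs = [
        ("Le chat dort.", "[ edit | edit source ]", 0.12),
        ("Le chat dort.", "The cat is sleeping.", 0.91),
    ]
    print(dedup_keep_first(pairs))  # keeps the metadata pair
    print(dedup_keep_best(pairs))   # keeps the real translation
```

With the first policy, whichever alignment happens to come first wins, which matches the failure described in this thread. The second policy needs a full pass with a score available for every pair.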

@XapaJIaMnu
Contributor

XapaJIaMnu commented Jul 30, 2022

Yes, but we were in a pinch ;/. Or translate the source and compute BLEU scores against the synthetic translation. Or anything else but what we did.

Tbh I didn't expect the model to remember those cases so well...
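
The synthetic-translation idea could look roughly like the sketch below: given a machine translation of the source, keep the candidate target that is closest to it. To stay dependency-free, this uses a plain unigram F1 overlap as a crude stand-in for BLEU, so it is not BLEU itself. `token_f1` and `pick_closest` are hypothetical helpers, and the example sentences are invented.

```python
from collections import Counter
from typing import List


def token_f1(hypothesis: str, reference: str) -> float:
    """Unigram F1 overlap on lowercased whitespace tokens (a stand-in for BLEU)."""
    hyp = Counter(hypothesis.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((hyp & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def pick_closest(candidates: List[str], synthetic: str) -> str:
    """Return the candidate target most similar to the synthetic translation."""
    return max(candidates, key=lambda c: token_f1(c, synthetic))


if __name__ == "__main__":
    candidates = ["[ edit | edit source ]", "The cat sleeps on the sofa."]
    synthetic = "The cat is sleeping on the sofa."  # assumed MT output
    print(pick_closest(candidates, synthetic))
```

In practice a real BLEU or chrF implementation would replace `token_f1`; the selection logic stays the same.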
