Martin Popel pointed out a problem with doing it this way. Say we have 10,000 pairs of Yes -> Ja in the data and one Yes -> Fuck off: each of them makes it into the TMX as a single entry. If someone later wants to deduplicate on the source side of the sentence pairs and has to decide which pair to keep, having the frequency information might be quite helpful.
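To illustrate why the counts matter, here is a minimal sketch of source-side deduplication that keeps the most frequent target for each source sentence. It is not code from the pipeline: the function name `dedupe_by_source` and the toy data are hypothetical, and it assumes the pair frequencies are still available at this stage (which is exactly what the TMX currently loses). "Nein" stands in for the rare wrong translation.

```python
from collections import Counter


def dedupe_by_source(pairs):
    """Return {source: (target, frequency)}, keeping the most frequent target.

    `pairs` is an iterable of (source, target) tuples. Hypothetical helper,
    not part of bitextor-buildTMX.py.
    """
    counts = Counter(pairs)  # (source, target) -> number of occurrences
    best = {}
    for (source, target), freq in counts.items():
        # Keep this target only if it beats the best one seen so far.
        if source not in best or freq > best[source][1]:
            best[source] = (target, freq)
    return best


# Toy data: one frequent translation and one rare, wrong one.
pairs = [("Yes", "Ja")] * 10000 + [("Yes", "Nein")]
result = dedupe_by_source(pairs)
```

Without the frequencies, both pairs look equally valid and the choice between them is arbitrary.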
Bifixer still hasn't switched to separate source and target hashes, so at least in the current pipeline those two sentence pairs would end up in separate tu entries.
cirrus-scripts/bitextor-buildTMX.py
Lines 180 to 184 in 61765e3