We lose frequency information in deduplication #30

jelmervdl · 2023-06-20T15:12:51Z

Lines 180 to 184 in 61765e3

    elif prev_hash == line_hash and options.dedup:
        urls1.update(fieldsdict['url1'].split(' '))
        urls2.update(fieldsdict['url2'].split(' '))
        if 'collection' in fieldsdict.keys():
            collections.add(fieldsdict['collection'])

Martin Popel pointed out that if we do it this way, say we have 10,000 pairs of Yes -> Ja in the data and one Yes -> Fuck off, each makes it into the TMX as a single entry. When someone then wants to deduplicate on the source side of the sentence pairs and has to decide which pair to keep, the frequency information would be quite helpful.
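To make the idea concrete, here is a minimal sketch of how the merge could keep a count alongside the URL sets. It is not the script's actual code: the function name `merge_duplicates`, the `(line_hash, fieldsdict)` input shape, and the `count` key are all assumptions made for illustration. Only the `url1`, `url2` and `collection` fields come from the branch quoted above.

```python
from itertools import groupby
from operator import itemgetter


def merge_duplicates(records):
    """Merge consecutive records that share a hash, keeping a duplicate count.

    `records` is assumed to be an iterable of (line_hash, fieldsdict) tuples
    that is already sorted so identical hashes are adjacent.
    """
    for line_hash, group in groupby(records, key=itemgetter(0)):
        urls1, urls2, collections = set(), set(), set()
        count = 0
        for _, fieldsdict in group:
            urls1.update(fieldsdict['url1'].split(' '))
            urls2.update(fieldsdict['url2'].split(' '))
            if 'collection' in fieldsdict:
                collections.add(fieldsdict['collection'])
            count += 1  # the frequency that the current merge throws away
        yield {
            'hash': line_hash,
            'urls1': urls1,
            'urls2': urls2,
            'collections': collections,
            'count': count,  # hypothetical field, not in the current output
        }
```

With a count like this available, someone deduplicating on the source side could keep the most frequent target for each source sentence. How the count would be written into the TMX is left open here.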


ZJaume · 2023-06-20T16:46:19Z

Bifixer still hasn't switched to hashing source and target separately, so at least in the current pipeline, those two sentence pairs would end up in separate tu entries.
