Inconsistent tokenization #60

MrLogarithm · 2020-06-22T15:29:48Z

In the ED IIIb data from Girsu, the tokenization is not consistent. Examples include:

udu nita (P221436) vs. udu-nita (P010556)
ugula ki-siki-ka (P221485) vs. ugula ki siki-ka (P221319)
ziz2-bala-bi (P020272) vs. ziz2 bala-bi (P355602)
lu2 esz2 gid2 (P247610) vs. lu2 esz2-gid2 (P221317) vs. lu2-esz2-gid2 (P217545)
bar-bi gal2-me (P221708) vs. bar-bi-gal2-me (P221331)
lu2 a kum2 (P221716) vs. lu2-a-kum2 (P221333) vs. lu2 a-kum2 (P221451)
lu2 e2-sza3-ga-me (P020184) vs. lu2-e2-sza-ga-me (P227557)
ki-siki-ka me (P221316) vs. ki-siki-ka-me (P221317) vs. ki siki-ka-me (P221319)

A shell script could probably enumerate more examples.
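As a rough sketch of what such a script might do (written in Python here rather than shell), the idea is to reduce every short run of words to its bare sign sequence and flag any sequence that shows up with more than one segmentation. The function name `find_variants` and the `max_words` parameter are hypothetical. The input is assumed to be transliteration lines already stripped of ATF line numbers, damage marks, and comments.

```python
from collections import defaultdict


def find_variants(lines, max_words=3):
    """Return sign sequences that occur with more than one segmentation.

    Assumes each line is clean transliteration: words separated by spaces,
    signs within a word joined by hyphens.
    """
    surfaces = defaultdict(set)
    for line in lines:
        words = line.split()
        for start in range(len(words)):
            last = min(start + max_words, len(words))
            for end in range(start + 1, last + 1):
                span = words[start:end]
                # Key: the signs alone, ignoring where spaces/hyphens fall.
                key = tuple(sign for word in span for sign in word.split("-"))
                surfaces[key].add(" ".join(span))
    return {key: sorted(forms) for key, forms in surfaces.items() if len(forms) > 1}


if __name__ == "__main__":
    sample = [
        "udu nita",
        "udu-nita",
        "lu2 esz2 gid2",
        "lu2 esz2-gid2",
        "lu2-esz2-gid2",
    ]
    for key, forms in find_variants(sample).items():
        print(" vs. ".join(forms))
```

Sub-spans are reported as well (for instance `esz2 gid2` vs. `esz2-gid2` in the sample), so the output would still need review by hand. Spelling differences such as `sza3` vs. `sza` are not caught, since only segmentation is compared.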

Is there a principled way to decide which tokenizations are correct and harmonize all of the spellings?

epageperron · 2020-06-22T15:32:12Z

Yes, an Assyriologist must look at each set of variants, make a decision, and update all the ATF. Our bulk upload on the site is broken right now for some obscure reason, so I'll try to fix it soon, and then we can proceed with harmonizing those. Thanks!
