Inconsistent tokenization #60

Open
MrLogarithm opened this issue Jun 22, 2020 · 1 comment

Comments

@MrLogarithm
Member

In the ED IIIb data from Girsu, the tokenization is not consistent. Examples include:

  • udu nita (P221436) vs. udu-nita (P010556)
  • ugula ki-siki-ka (P221485) vs. ugula ki siki-ka (P221319)
  • ziz2-bala-bi (P020272) vs. ziz2 bala-bi (P355602)
  • lu2 esz2 gid2 (P247610) vs. lu2 esz2-gid2 (P221317) vs. lu2-esz2-gid2 (P217545)
  • bar-bi gal2-me (P221708) vs. bar-bi-gal2-me (P221331)
  • lu2 a kum2 (P221716) vs. lu2-a-kum2 (P221333) vs. lu2 a-kum2 (P221451)
  • lu2 e2-sza3-ga-me (P020184) vs. lu2-e2-sza-ga-me (P227557)
  • ki-siki-ka me (P221316) vs. ki-siki-ka-me (P221317) vs. ki siki-ka-me (P221319)

A shell script could probably enumerate more examples.
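As a rough sketch of how that enumeration might work (in Python rather than shell, for readability): treat spaces and hyphens as interchangeable separators, reduce every run of up to three adjacent words to its bare sign sequence, and report any sign sequence that appears with more than one segmentation. The function name `find_tokenization_variants` and the `max_words` parameter are hypothetical, and the sketch assumes the input is already-cleaned transliteration text, one line per string, with ATF line numbers, damage flags, and brackets stripped beforehand.

```python
from collections import defaultdict


def find_tokenization_variants(lines, max_words=3):
    """Return sign sequences that occur with more than one segmentation.

    Hypothetical helper: `lines` is assumed to be cleaned transliteration
    text (no ATF line numbers, damage flags, or brackets).
    """
    variants = defaultdict(set)
    for line in lines:
        words = line.split()
        for start in range(len(words)):
            # Look at every run of 1..max_words adjacent words.
            last = min(start + max_words, len(words))
            for end in range(start + 1, last + 1):
                span = words[start:end]
                surface = " ".join(span)
                # The key ignores whether signs were joined by "-" or " ".
                key = tuple(sign for word in span for sign in word.split("-"))
                variants[key].add(surface)
    # Keep only sign sequences written in at least two different ways.
    return {key: forms for key, forms in variants.items() if len(forms) > 1}


sample = ["udu nita", "udu-nita"]
for signs, forms in find_tokenization_variants(sample).items():
    print(" ".join(signs), "->", sorted(forms))
```

Note that this only surfaces candidates. It also reports sub-spans of longer matches (for instance `esz2 gid2` vs. `esz2-gid2` alongside the full `lu2 esz2 gid2` group), and it cannot say which segmentation is the correct one.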

Is there a principled way to decide which tokenizations are correct and harmonize all of the spellings?

@epageperron
Member

Yes, an Assyriologist must look at both, make a decision, and update all the ATF. Our bulk upload on the site is broken right now for some obscure reason, so I'll try to fix it soon, and then we can proceed with harmonizing those. Thanks!
