Indian Name Dataset #252

Open
ramSeraph opened this issue Sep 1, 2024 · 1 comment

ramSeraph commented Sep 1, 2024

Indian electoral rolls containing the names of all Indian voters are available in multiple places:

https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/MUEGDT

https://zenodo.org/communities/india-religion-politics-raw/records?q=&l=list&p=1&s=10&sort=newest

These datasets have the names of most Indian voters (I'm not sure which languages they are in, as I haven't actually seen them).

Both of them are access restricted, but you folks might get access if you request it.

This data also has PII, even though it is indeed published by the ECI for public consumption. Care needs to be taken in filtering out the address information and the voter ID information.
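A minimal sketch of what that filtering could look like once the rolls are in record form. The field names (`voter_id`, `house_number`, `address`) are hypothetical, and the voter ID pattern (three letters followed by seven digits) is an assumption that should be checked against the actual data, since older IDs may not match it.

```python
import re

# Assumed voter ID shape: three letters followed by seven digits.
# Verify against the real rolls before relying on this.
EPIC_PATTERN = re.compile(r"\b[A-Z]{3}[0-9]{7}\b")

# Hypothetical field names; real extracts will use different ones.
PII_FIELDS = {"voter_id", "house_number", "address"}


def strip_pii(record):
    """Return a copy of record with PII fields dropped and ID-like tokens removed."""
    cleaned = {}
    for key, value in record.items():
        if key in PII_FIELDS:
            continue  # drop address and voter ID fields entirely
        if isinstance(value, str):
            # Remove voter IDs that leaked into free-text fields.
            value = EPIC_PATTERN.sub("", value).strip()
        cleaned[key] = value
    return cleaned


sample = {
    "name": "Example Name",
    "relative_name": "Example Parent ABC1234567",
    "voter_id": "ABC1234567",
    "address": "12, Example Street",
}
print(strip_pii(sample))
```

Dropping whole fields is the safer default here; the regex pass is only a backstop for IDs that end up inside name fields after OCR.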

Alternatively, I have this year's data as PDFs. From what I have seen, it has the names of Indian voters in local languages and, for some states, in English and a third language (not sure whether Bhashini was used for the transliteration). However, the names would need to be extracted with OCR, and the original dataset is about 5 TB.

If you folks think this is a useful dataset, I can provide access.

@ramSeraph
Author

The Malayalam names part of the dataset is also available at https://huggingface.co/datasets/santhosh/english-malayalam-names
