working on the word list #37
Good idea. Currently the best balance between obscure words and the length of the string you have to remember seems to be four words per location; with that, the wordlist only has to contain 4096 words. You can even get away with three words if the place is nearby (Battery Park: lawful-lazily-josef-tended, Brooklyn Bridge: lawful-sheila-novel-dodge). The main problem is finding that many simple English words. Running this on
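A quick sketch of why 4096 words suffice (function names here are illustrative, not the project's actual API): each word drawn from a 4096-entry list carries 12 bits, so four words encode 48 bits of location index, and three words encode 36 bits.

```python
# Illustrative index <-> phrase mapping, assuming a 4096-word list.
WORDLIST_SIZE = 4096  # 2**12, so each word encodes 12 bits

def index_to_phrase(index, wordlist, n_words=4):
    """Convert an integer location index into a hyphenated phrase."""
    assert 0 <= index < WORDLIST_SIZE ** n_words
    words = []
    for _ in range(n_words):
        index, rem = divmod(index, WORDLIST_SIZE)
        words.append(wordlist[rem])
    return "-".join(words)

def phrase_to_index(phrase, wordlist):
    """Invert index_to_phrase: recover the integer location index."""
    positions = {w: i for i, w in enumerate(wordlist)}
    index = 0
    for word in reversed(phrase.split("-")):
        index = index * WORDLIST_SIZE + positions[word]
    return index
```

The round trip is exact, which is why the list size and word count fix the addressable area: 4096^4 = 2^48 distinct locations with four words, 2^36 with three.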
Surprised it removes so many.
How is the google-ngram-list generated? Maybe tuning the corpus from which we take frequencies could help. What would also be fun (though maybe slightly against the original spirit): if we had a ranked list saying how good / memorable each word is, could we find "better" words for more highly populated areas?
And lastly, using specific patterns of verbs, nouns, adjectives, and adverbs will also strongly affect how memorable a phrase is, IMHO.
something like http://watchout4snakes.com/wo4snakes/Random/RandomPhrase |
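The part-of-speech pattern idea could be sketched like this (the word lists below are tiny placeholders; a real implementation would need 4096 words per slot to keep the same address space, and the slot order is just one possible pattern):

```python
# Hedged sketch: generate phrases following a fixed adjective-noun-verb
# pattern, in the spirit of watchout4snakes' RandomPhrase.
import random

ADJECTIVES = ["lawful", "lazy", "novel"]   # placeholder lists; a real
NOUNS = ["bridge", "park", "harbor"]       # scheme needs 4096 entries
VERBS = ["dodges", "tends", "wanders"]     # per part-of-speech slot

def random_phrase(rng=random):
    """Pick one word per part-of-speech slot: adjective-noun-verb."""
    return "-".join(rng.choice(lst) for lst in (ADJECTIVES, NOUNS, VERBS))
```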
To create the google-ngram, follow the instructions in the second part of … More popular words for more populated areas is a good idea, but would require some changes in the algorithm that converts … Building sentence-like four-word combinations would be nice; without changing the algorithm you'd need 4096 unique words for each type of word. We failed to find enough words last time we tried: NLTK's part-of-speech tagging didn't seem to help much in automating the grouping of the ngram corpus into verbs, nouns, adverbs, etc. What other large English-language word corpora are there? Definitely worth trying out popular words for popular areas and sentence-like structures.
I think the Google n-grams are based on Project Gutenberg. I'm not sure how good a representation of the English language that is, or whether frequency is really a good measure of "good". One could try running n-grams on Wikipedia, or on Amazon reviews ;). Maybe in the end hand-editing 4096 words would be easiest... still a hassle.
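Deriving a frequency-ranked candidate list from any plain-text corpus (a Wikipedia dump, review data, etc.) is straightforward to prototype; this is a minimal sketch with deliberately naive tokenisation, and the length bounds are arbitrary choices, not anything the project prescribes:

```python
# Sketch: rank words by frequency in an arbitrary text corpus.
import re
from collections import Counter

def top_words(text, n=4096, min_len=3, max_len=8):
    """Return the n most frequent lowercase alphabetic words,
    filtered to a length range suitable for memorable phrases."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if min_len <= len(w) <= max_len)
    return [w for w, _ in counts.most_common(n)]
```

Swapping in a different corpus is then just a matter of changing the input text, which makes it easy to compare Gutenberg-flavoured against Wikipedia-flavoured lists.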
Yet another source of words could be the 5LNC (five-letter name code) identifiers used to name aircraft waypoints. They aren't required to be real words, but they must be "pronounceable", even by non-English speakers. The best list of 5LNCs in use that I could find requires extracting a PDF from https://icard.icao.int/ICARD_5LNC/5LNCMainFrameset/5LNCApplicationFrame/DownloadPage.do?NVCMD=ShowDownloadPage, which isn't ideal.
Another obscure link that can list allocated 5LNCs: https://icard.icao.int/ICARD_5LNC/5LNCMainFrameset/5LNCApplicationFrame/5LNCCombinePageLoad.do?NVCMD=Loading5LNCCombinePage |
How about removing words that have Levenshtein distance < 2:
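One way to implement that filter (a greedy sketch, not a reference implementation): keep a word only if its edit distance to every already-kept word is at least 2, so near-duplicates like singular/plural pairs are dropped.

```python
# Sketch: prune a word list so every pair of kept words differs by
# at least two edits (Levenshtein distance >= 2).
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def filter_similar(words, min_dist=2):
    """Greedily keep words at least min_dist edits from all kept words."""
    kept = []
    for w in words:
        if all(levenshtein(w, k) >= min_dist for k in kept):
            kept.append(w)
    return kept
```

Note the result depends on input order (earlier words win ties), and the pairwise scan is O(n^2) distance computations, which is fine for a few thousand candidates.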