
Language Models

Wannaphong Phatthiyaphaibun edited this page Dec 19, 2020 · 1 revision

Moved from https://github.com/PyThaiNLP/pythainlp/issues/344

Almost all models we use now (see the list in #298) were trained privately by different contributors, with code in notebooks or scripts that may be private, or open source but difficult to follow.

To make PyThaiNLP more transparent and more customizable by users, we should try to put training scripts or instructions (pointers are enough) somewhere in the repo.

Known scripts/notebooks and data

| Model | Filename | Training Script | Training Data |
| --- | --- | --- | --- |
| CRF-Cut | sentenceseg-ted.model | https://colab.research.google.com/drive/12nszk-N5LwpHzitlYvhNWVUDSBj30Z1Y | https://github.com/vistec-AI/ted_crawler |
| Enhanced Thai Character Cluster (ETCC) | etcc.txt | https://colab.research.google.com/drive/1UTQgxxMRxOr9Jp1B1jcq1frBNvorhtBQ | https://colab.research.google.com/drive/1UTQgxxMRxOr9Jp1B1jcq1frBNvorhtBQ |
| Language model (Thai Wikipedia) | thwiki_lm.pth | ? | ? |
| Thai Grapheme-to-Phoneme (Thai G2P) | thaig2p-0.1.tar | https://github.com/wannaphong/thai-g2p/blob/master/train.ipynb | https://github.com/wannaphong/thai-g2p/blob/master/wiktionary-11-2-2020.tsv |
| Thai word vector | thai2vec.bin | https://github.com/cstorm125/thai2fit | ? |
| Sentence segmentation (TED) | sentenceseg-ted.model | https://github.com/vistec-AI/ted_crawler | TED Thai subtitles |
| Named-Entity Recognition | data.model | https://github.com/wannaphongcom/thai-ner | ? |
| Thai Wikipedia (for?) | thwiki_itos.pkl | ? | ? |
| POS Tagger | ud_thai-pud_pt_tagger.dill | https://github.com/PyThaiNLP/pythainlp_notebook/tree/master/postag | ? |
| Thai Romanization | thai2rom-pytorch-attn-v0.1.tar | https://github.com/artificiala/thai-romanization/blob/master/notebook/thai_romanize_pytorch_seq2seq_attention.ipynb | https://github.com/wannaphong/thai-romanization |
| Thai Romanization v2 | thai2rom-v2.hdf5 | ? | ? |
| Thai Romanization | thai2rom-pytorch.tar | https://github.com/artificiala/thai-romanization | https://github.com/wannaphongcom/thai-romanization/ |
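
One possible way to keep track of the remaining gaps is to hold the list as data and report every entry whose script or data pointer is still "?". The sketch below is only an illustration: the `MODELS` list and the `missing_pointers` helper are hypothetical names, not part of PyThaiNLP, and only a few rows from the list above are copied in.

```python
# Hypothetical registry mirroring a few rows of the list above.
# "?" marks a pointer that is still unknown, exactly as in the list.
MODELS = [
    {
        "model": "CRF-Cut",
        "filename": "sentenceseg-ted.model",
        "script": "https://colab.research.google.com/drive/12nszk-N5LwpHzitlYvhNWVUDSBj30Z1Y",
        "data": "https://github.com/vistec-AI/ted_crawler",
    },
    {
        "model": "Language model (Thai Wikipedia)",
        "filename": "thwiki_lm.pth",
        "script": "?",
        "data": "?",
    },
    {
        "model": "Thai word vector",
        "filename": "thai2vec.bin",
        "script": "https://github.com/cstorm125/thai2fit",
        "data": "?",
    },
    {
        "model": "Thai Romanization v2",
        "filename": "thai2rom-v2.hdf5",
        "script": "?",
        "data": "?",
    },
]


def missing_pointers(models):
    """Return (filename, missing_fields) for rows with an unknown script or data pointer."""
    gaps = []
    for row in models:
        missing = [key for key in ("script", "data") if row[key] == "?"]
        if missing:
            gaps.append((row["filename"], missing))
    return gaps


if __name__ == "__main__":
    for filename, missing in missing_pointers(MODELS):
        print(f"{filename}: missing {', '.join(missing)}")
```

Rows with both pointers filled in, such as CRF-Cut, are skipped, so the output is limited to the entries that still need a contributor to supply a script or dataset link.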