SPDX-FileCopyrightText | SPDX-License-Identifier |
---|---|
2024 PyThaiNLP Project | Apache-2.0 |
Python binding for nlpO3, a Thai natural language processing library in Rust.
To install:
pip install nlpo3
- Thai word tokenizer
- segment() - uses a maximal-matching dictionary-based tokenization algorithm and honors Thai Character Cluster boundaries
- 2.5x faster than a similar pure Python implementation (PyThaiNLP's newmm)
- load_dict() - loads a dictionary from a plain text file (one word per line)
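To give a feel for what dictionary-based maximal matching means, here is a minimal pure-Python sketch. It is not nlpO3's implementation: it ignores Thai Character Cluster boundaries, and the function name `max_match` is hypothetical. It assumes that "best" means the fewest unknown characters first, then the fewest tokens.

```python
def max_match(text, words):
    """Split text into tokens, preferring dictionary words.

    Illustrative only: minimises (unknown characters, token count)
    with dynamic programming. Not nlpO3's actual algorithm.
    """
    words = {w for w in words if w}
    longest = max((len(w) for w in words), default=1)
    n = len(text)
    inf = (float("inf"), float("inf"))
    best = [inf] * (n + 1)   # best[i]: cost of the best split of text[:i]
    back = [0] * (n + 1)     # back[i]: start index of the last token in that split
    best[0] = (0, 0)
    for start in range(n):
        unknown, count = best[start]
        for end in range(start + 1, min(n, start + longest) + 1):
            if text[start:end] in words:
                candidate = (unknown, count + 1)
            elif end == start + 1:
                # Fallback: an out-of-dictionary character becomes its own token.
                candidate = (unknown + 1, count + 1)
            else:
                continue
            if candidate < best[end]:
                best[end] = candidate
                back[end] = start
    # Walk the back-pointers from the end of the text to recover the tokens.
    tokens = []
    end = n
    while end > 0:
        start = back[end]
        tokens.append(text[start:end])
        end = start
    return tokens[::-1]


if __name__ == "__main__":
    print(max_match("สวัสดีครับ", ["สวัสดี", "ครับ"]))
```

Unlike a greedy longest-prefix match, this sketch picks `ab` + `cd` for the text `abcd` when the dictionary holds `a`, `ab`, `abc` and `cd`, because that split leaves no unknown characters.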
Load the file path/to/dict.file into memory and assign the name dict_name to it.
Then tokenize a text with the dict_name dictionary:
from nlpo3 import load_dict, segment
load_dict("path/to/dict.file", "dict_name")
segment("สวัสดีครับ", "dict_name")
It returns a list of strings:
['สวัสดี', 'ครับ']
(result depends on words included in the dictionary)
Use multithread mode with the same dict_name dictionary:
segment("สวัสดีครับ", dict_name="dict_name", parallel=True)
Use safe mode to avoid long wait times in edge cases where the text has many ambiguous word boundaries:
segment("สวัสดีครับ", dict_name="dict_name", safe=True)
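The calls above can be combined into a small end-to-end script. This is a sketch under assumptions: the helper `write_dict_file`, the function `demo`, and the dictionary name `tiny_dict` are made up for illustration, and the two-word dictionary exists only to keep the example self-contained. Only `load_dict()` and `segment()` come from nlpO3.

```python
import tempfile
from pathlib import Path


def write_dict_file(words, path):
    """Write words to path as UTF-8, one word per line; return how many were written."""
    unique = sorted({w.strip() for w in words if w.strip()})
    Path(path).write_text("\n".join(unique) + "\n", encoding="utf-8")
    return len(unique)


def demo():
    # Imported here so the helper above stays usable without nlpo3 installed.
    from nlpo3 import load_dict, segment

    with tempfile.TemporaryDirectory() as tmp:
        dict_path = Path(tmp) / "tiny_dict.txt"
        write_dict_file(["สวัสดี", "ครับ"], dict_path)
        # The dictionary is loaded into memory under the name "tiny_dict".
        load_dict(str(dict_path), "tiny_dict")
        return segment("สวัสดีครับ", "tiny_dict")


if __name__ == "__main__":
    try:
        print(demo())
    except ImportError:
        print("nlpo3 is not installed; run `pip install nlpo3` first")
```

For real use, replace the temporary file with a full word list, since the result depends on the words included in the dictionary.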
- To keep the library small, nlpO3 does not assume which dictionary the user would like to use, and it does not come with a dictionary.
- A dictionary is needed for the dictionary-based word tokenizer.
- For a tokenization dictionary, try:
- words_th.txt from PyThaiNLP
- ~62,000 words
- CC0-1.0
- word break dictionary from libthai
- consists of dictionaries in different categories, with a make script
- LGPL-2.1
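Since load_dict() expects a plain text file with one word per line, word lists that come as several files can be merged into a single file first. The sketch below is one way to do that; the function name `merge_word_lists` and the file names in the usage line are hypothetical, and it assumes the source files are already UTF-8 text with one word per line.

```python
from pathlib import Path


def merge_word_lists(sources, target):
    """Merge word-list files into target, one unique word per line.

    Blank lines and surrounding whitespace are dropped; the first
    occurrence of each word decides its position in the output.
    Returns the number of words written.
    """
    seen = set()
    merged = []
    for source in sources:
        for line in Path(source).read_text(encoding="utf-8").splitlines():
            word = line.strip()
            if word and word not in seen:
                seen.add(word)
                merged.append(word)
    Path(target).write_text("".join(w + "\n" for w in merged), encoding="utf-8")
    return len(merged)


if __name__ == "__main__":
    import sys

    # Usage: python merge_dicts.py merged.txt part1.txt part2.txt ...
    if len(sys.argv) >= 3:
        print(merge_word_lists(sys.argv[2:], sys.argv[1]), "words written")
```

The merged file can then be passed to load_dict() like any other dictionary file.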
- Rust 2018 Edition
- Python 3.7 or newer (PyO3's minimum supported version)
- Python Development Headers
- Ubuntu:
sudo apt-get install python3-dev
- macOS: No action needed
- PyO3 - already included in Cargo.toml
- setuptools-rust
python -m pip install --upgrade build
python -m build
This should generate a wheel file in the dist/ directory, which can be installed by pip.
To install a wheel from a local directory:
pip install dist/nlpo3-1.3.1-cp311-cp311-macosx_12_0_x86_64.whl
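The wheel file name encodes which interpreter and platform the wheel targets, which helps when pip refuses to install a locally built file. The following sketch splits a wheel file name into its parts according to the standard wheel naming convention (`name-version[-build]-python-abi-platform.whl`). The function names are hypothetical and the parsing is deliberately simplified; for production use, the `packaging` library offers a more thorough parser.

```python
import sys


def parse_wheel_filename(filename):
    """Split a wheel file name into name, version and compatibility tags."""
    base = filename.rsplit("/", 1)[-1]
    if not base.endswith(".whl"):
        raise ValueError(f"not a wheel file: {filename}")
    parts = base[: -len(".whl")].split("-")
    # Five parts normally, six when an optional build tag is present.
    if len(parts) not in (5, 6):
        raise ValueError(f"unexpected wheel file name: {filename}")
    return {
        "name": parts[0],
        "version": parts[1],
        "python": parts[-3],
        "abi": parts[-2],
        "platform": parts[-1],
    }


def current_cpython_tag():
    """Return the CPython tag of the running interpreter, e.g. 'cp311'."""
    return f"cp{sys.version_info.major}{sys.version_info.minor}"


if __name__ == "__main__":
    info = parse_wheel_filename("dist/nlpo3-1.3.1-cp311-cp311-macosx_12_0_x86_64.whl")
    print(info)
    print("matches this interpreter:", info["python"] == current_cpython_tag())
```

In the file name above, `cp311` means CPython 3.11 and `macosx_12_0_x86_64` means macOS 12 or newer on x86_64, so that wheel would not install on, for example, Python 3.12 or Linux.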
To run a Python unit test:
cd tests
python -m unittest
Please report issues at https://github.com/PyThaiNLP/nlpo3/issues
nlpO3 Python binding is copyrighted by its authors and licensed under the terms of the Apache Software License 2.0 (Apache-2.0). See the LICENSE file for details.
A pre-built binary package is available from PyPI for these platforms:
Python | OS | Architecture | Has binary wheel? |
---|---|---|---|
3.13 | Windows | x86 | ✅ |
3.13 | Windows | AMD64 | ✅ |
3.13 | macOS | x86_64 | ✅ |
3.13 | macOS | arm64 | ✅ |
3.13 | manylinux | x86_64 | ✅ |
3.13 | manylinux | i686 | ✅ |
3.13 | musllinux | x86_64 | ✅ |
3.12 | Windows | x86 | ✅ |
3.12 | Windows | AMD64 | ✅ |
3.12 | macOS | x86_64 | ✅ |
3.12 | macOS | arm64 | ✅ |
3.12 | manylinux | x86_64 | ✅ |
3.12 | manylinux | i686 | ✅ |
3.12 | musllinux | x86_64 | ✅ |
3.11 | Windows | x86 | ✅ |
3.11 | Windows | AMD64 | ✅ |
3.11 | macOS | x86_64 | ✅ |
3.11 | macOS | arm64 | ✅ |
3.11 | manylinux | x86_64 | ✅ |
3.11 | manylinux | i686 | ✅ |
3.11 | musllinux | x86_64 | ✅ |
3.10 | Windows | x86 | ✅ |
3.10 | Windows | AMD64 | ✅ |
3.10 | macOS | x86_64 | ✅ |
3.10 | macOS | arm64 | ✅ |
3.10 | manylinux | x86_64 | ✅ |
3.10 | manylinux | i686 | ✅ |
3.10 | musllinux | x86_64 | ✅ |
3.9 | Windows | x86 | ✅ |
3.9 | Windows | AMD64 | ✅ |
3.9 | macOS | x86_64 | ✅ |
3.9 | macOS | arm64 | ✅ |
3.9 | manylinux | x86_64 | ✅ |
3.9 | manylinux | i686 | ✅ |
3.9 | musllinux | x86_64 | ✅ |
3.8 | Windows | x86 | ✅ |
3.8 | Windows | AMD64 | ✅ |
3.8 | macOS | x86_64 | ✅ |
3.8 | macOS | arm64 | ✅ |
3.8 | manylinux | x86_64 | ✅ |
3.8 | manylinux | i686 | ✅ |
3.8 | musllinux | x86_64 | ✅ |
3.7 | Windows | x86 | ✅ |
3.7 | Windows | AMD64 | ✅ |
3.7 | macOS | x86_64 | ✅ |
3.7 | macOS | arm64 | ❌ |
3.7 | manylinux | x86_64 | ✅ |
3.7 | manylinux | i686 | ✅ |
3.7 | musllinux | x86_64 | ✅ |
PyPy 3.10 | Windows | x86 | ❌ |
PyPy 3.10 | Windows | AMD64 | ✅ |
PyPy 3.10 | macOS | x86_64 | ✅ |
PyPy 3.10 | macOS | arm64 | ✅ |
PyPy 3.10 | manylinux | x86_64 | ✅ |
PyPy 3.10 | manylinux | i686 | ✅ |
PyPy 3.9 | Windows | x86 | ❌ |
PyPy 3.9 | Windows | AMD64 | ✅ |
PyPy 3.9 | macOS | x86_64 | ✅ |
PyPy 3.9 | macOS | arm64 | ✅ |
PyPy 3.9 | manylinux | x86_64 | ✅ |
PyPy 3.9 | manylinux | i686 | ✅ |
PyPy 3.8 | Windows | x86 | ❌ |
PyPy 3.8 | Windows | AMD64 | ✅ |
PyPy 3.8 | macOS | x86_64 | ✅ |
PyPy 3.8 | macOS | arm64 | ✅ |
PyPy 3.8 | manylinux | x86_64 | ✅ |
PyPy 3.8 | manylinux | i686 | ✅ |
PyPy 3.7 | Windows | x86 | ❌ |
PyPy 3.7 | Windows | AMD64 | ✅ |
PyPy 3.7 | macOS | x86_64 | ✅ |
PyPy 3.7 | macOS | arm64 | ❌ |
PyPy 3.7 | manylinux | x86_64 | ✅ |
PyPy 3.7 | manylinux | i686 | ✅ |