SPDX-FileCopyrightText: 2024 PyThaiNLP Project
SPDX-License-Identifier: Apache-2.0

nlpO3 Python binding


Python binding for nlpO3, a Thai natural language processing library in Rust.

To install:

pip install nlpo3


Features

  • Thai word tokenizer
    • segment() - uses a maximal-matching, dictionary-based tokenization algorithm and honors Thai Character Cluster boundaries
      • 2.5x faster than a similar pure Python implementation (PyThaiNLP's newmm)
    • load_dict() - loads a dictionary from a plain text file (one word per line)
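
To give a rough idea of what dictionary-based maximal matching means, here is a minimal pure Python sketch. It is not nlpO3's implementation: it ignores Thai Character Cluster boundaries, and the function name `maximal_matching` and the cost values are assumptions made for illustration. It picks the split with the fewest tokens, preferring dictionary words over unknown single characters.

```python
def maximal_matching(text, words):
    """Split text into the fewest tokens, preferring dictionary words.

    Simplified illustration only: no Thai Character Cluster handling.
    """
    n = len(text)
    max_len = max(1, max((len(w) for w in words), default=1))
    inf = float("inf")
    # best[i] = (cost, start index of the last token) for text[:i]
    best = [(inf, -1)] * (n + 1)
    best[0] = (0, -1)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in words:
                cost = best[start][0] + 1
            elif end - start == 1:
                # Unknown single character: allowed, but costs more
                # than a dictionary word (assumed penalty of 2).
                cost = best[start][0] + 2
            else:
                continue
            if cost < best[end][0]:
                best[end] = (cost, start)
    # Walk back from the end to recover the tokens.
    tokens = []
    i = n
    while i > 0:
        start = best[i][1]
        tokens.append(text[start:i])
        i = start
    return tokens[::-1]


print(maximal_matching("สวัสดีครับ", {"สวัสดี", "ครับ"}))
```

With a dictionary that contains both words, the sketch splits the greeting into two tokens. Characters not covered by any dictionary word come out as single-character tokens.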

Use

Load the file path/to/dict.file into memory and assign it the name dict_name.

Then tokenize a text with the dict_name dictionary:

from nlpo3 import load_dict, segment

load_dict("path/to/dict.file", "dict_name")
segment("สวัสดีครับ", "dict_name")

It will return a list of strings:

['สวัสดี', 'ครับ']

(result depends on words included in the dictionary)
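
Because the result depends on the dictionary, a small throwaway dictionary can be handy for experiments. The following is a sketch under a few assumptions: the helper `write_dict_file`, the file name `tiny_dict.txt`, and the dictionary name `tiny_dict` are all hypothetical, and UTF-8 is assumed as the file encoding. The nlpo3 calls run only if the package is installed.

```python
import tempfile
from pathlib import Path


def write_dict_file(words, path):
    """Write words to path, one per line, skipping blanks and duplicates."""
    seen = set()
    lines = []
    for word in words:
        word = word.strip()
        if word and word not in seen:
            seen.add(word)
            lines.append(word)
    # Assumption: the dictionary file is read as UTF-8 text.
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)


# Hypothetical throwaway dictionary in a temporary directory.
dict_path = Path(tempfile.mkdtemp()) / "tiny_dict.txt"
count = write_dict_file(["สวัสดี", "ครับ", "ครับ", " "], dict_path)

try:
    from nlpo3 import load_dict, segment
except ImportError:
    # nlpo3 is not installed; only the dictionary file is produced.
    pass
else:
    load_dict(str(dict_path), "tiny_dict")
    print(segment("สวัสดีครับ", "tiny_dict"))
```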

To use multithread mode with the same dict_name dictionary:

segment("สวัสดีครับ", dict_name="dict_name", parallel=True)

Use safe mode to avoid long waiting times in some edge cases, such as text with many ambiguous word boundaries:

segment("สวัสดีครับ", dict_name="dict_name", safe=True)

Dictionary

  • In the interest of library size, nlpO3 does not assume which dictionary the user would like to use, and it does not ship with one.
  • A dictionary is needed for the dictionary-based word tokenizer.
  • For tokenization dictionary, try

Build

Requirements

  • Rust 2018 Edition
  • Python 3.7 or newer (PyO3's minimum supported version)
  • Python Development Headers
    • Ubuntu: sudo apt-get install python3-dev
    • macOS: No action needed
  • PyO3 - already included in Cargo.toml
  • setuptools-rust

Steps

python -m pip install --upgrade build
python -m build

This should generate a wheel file in the dist/ directory, which can be installed with pip.

To install a wheel from a local directory:

pip install dist/nlpo3-1.3.1-cp311-cp311-macosx_12_0_x86_64.whl 
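
The wheel file name encodes which interpreter and platform the wheel targets, following the standard wheel naming scheme (distribution, version, optional build tag, Python tag, ABI tag, platform tag). The sketch below splits a file name into those parts so you can check it against your environment. The `WheelName` type and the `parse_wheel_filename` function are hypothetical helpers written for this example, not part of nlpo3.

```python
from typing import NamedTuple, Optional


class WheelName(NamedTuple):
    distribution: str
    version: str
    build: Optional[str]
    python_tag: str
    abi_tag: str
    platform_tag: str


def parse_wheel_filename(filename: str) -> WheelName:
    """Split a wheel file name into its dash-separated components."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel file name: {filename!r}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) == 5:
        dist, version, py, abi, plat = parts
        build = None
    elif len(parts) == 6:
        # The optional build tag sits between version and Python tag.
        dist, version, build, py, abi, plat = parts
    else:
        raise ValueError(f"unexpected wheel file name: {filename!r}")
    return WheelName(dist, version, build, py, abi, plat)


info = parse_wheel_filename("nlpo3-1.3.1-cp311-cp311-macosx_12_0_x86_64.whl")
print(info.python_tag, info.abi_tag, info.platform_tag)
# prints: cp311 cp311 macosx_12_0_x86_64
```

Here cp311 means CPython 3.11, and the platform tag indicates macOS 12.0 or newer on x86_64.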

Test

To run the Python unit tests:

cd tests
python -m unittest

Issues

Please report issues at https://github.com/PyThaiNLP/nlpo3/issues

License

nlpO3 Python binding is copyrighted by its authors and licensed under the terms of the Apache License 2.0 (Apache-2.0). See the LICENSE file for details.

Binary wheels

A pre-built binary package is available from PyPI for these platforms:

Python     OS          Architecture  Has binary wheel?
3.13       Windows     x86
3.13       Windows     AMD64
3.13       macOS       x86_64
3.13       macOS       arm64
3.13       manylinux   x86_64
3.13       manylinux   i686
3.13       musllinux   x86_64
3.12       Windows     x86
3.12       Windows     AMD64
3.12       macOS       x86_64
3.12       macOS       arm64
3.12       manylinux   x86_64
3.12       manylinux   i686
3.12       musllinux   x86_64
3.11       Windows     x86
3.11       Windows     AMD64
3.11       macOS       x86_64
3.11       macOS       arm64
3.11       manylinux   x86_64
3.11       manylinux   i686
3.11       musllinux   x86_64
3.10       Windows     x86
3.10       Windows     AMD64
3.10       macOS       x86_64
3.10       macOS       arm64
3.10       manylinux   x86_64
3.10       manylinux   i686
3.10       musllinux   x86_64
3.9        Windows     x86
3.9        Windows     AMD64
3.9        macOS       x86_64
3.9        macOS       arm64
3.9        manylinux   x86_64
3.9        manylinux   i686
3.9        musllinux   x86_64
3.8        Windows     x86
3.8        Windows     AMD64
3.8        macOS       x86_64
3.8        macOS       arm64
3.8        manylinux   x86_64
3.8        manylinux   i686
3.8        musllinux   x86_64
3.7        Windows     x86
3.7        Windows     AMD64
3.7        macOS       x86_64
3.7        macOS       arm64
3.7        manylinux   x86_64
3.7        manylinux   i686
3.7        musllinux   x86_64
PyPy 3.10  Windows     x86
PyPy 3.10  Windows     AMD64
PyPy 3.10  macOS       x86_64
PyPy 3.10  macOS       arm64
PyPy 3.10  manylinux   x86_64
PyPy 3.10  manylinux   i686
PyPy 3.9   Windows     x86
PyPy 3.9   Windows     AMD64
PyPy 3.9   macOS       x86_64
PyPy 3.9   macOS       arm64
PyPy 3.9   manylinux   x86_64
PyPy 3.9   manylinux   i686
PyPy 3.8   Windows     x86
PyPy 3.8   Windows     AMD64
PyPy 3.8   macOS       x86_64
PyPy 3.8   macOS       arm64
PyPy 3.8   manylinux   x86_64
PyPy 3.8   manylinux   i686
PyPy 3.7   Windows     x86
PyPy 3.7   Windows     AMD64
PyPy 3.7   macOS       x86_64
PyPy 3.7   macOS       arm64
PyPy 3.7   manylinux   x86_64
PyPy 3.7   manylinux   i686