bug: clause_tokenize does not work properly #1011

panyutsriwirote · 2024-12-02T08:40:59Z

Description

As per the documentation of the tokenize submodule (https://pythainlp.org/docs/5.0/api/tokenize.html), clause_tokenize should work as follows:

from pythainlp.tokenize import clause_tokenize
clause_tokenize(["ฉัน","นอน","และ","คุณ","เล่น","มือถือ","ส่วน","น้อง","เขียน","โปรแกรม"])
# [['ฉัน', 'นอน'],
# ['และ', 'คุณ', 'เล่น', 'มือถือ'],
# ['ส่วน', 'น้อง', 'เขียน', 'โปรแกรม']]

However, running it in a fresh conda environment produces the following:

from pythainlp.tokenize import clause_tokenize
clause_tokenize(["ฉัน","นอน","และ","คุณ","เล่น","มือถือ","ส่วน","น้อง","เขียน","โปรแกรม"])
# [['ฉัน', 'นอน', 'และ', 'คุณ', 'เล่น', 'มือถือ', 'ส่วน', 'น้อง', 'เขียน', 'โปรแกรม']]

The model was downloaded immediately before running the code above, so a stale model cache should not be the cause.

Expected results

The returned list should consist of 3 clauses:
[['ฉัน', 'นอน'], ['และ', 'คุณ', 'เล่น', 'มือถือ'], ['ส่วน', 'น้อง', 'เขียน', 'โปรแกรม']]

Current results

The returned list consists of only 1 clause:
[['ฉัน', 'นอน', 'และ', 'คุณ', 'เล่น', 'มือถือ', 'ส่วน', 'น้อง', 'เขียน', 'โปรแกรม']]
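
The difference between the two results can also be checked programmatically. The sketch below is only an illustration: `flatten` and `describe_mismatch` are hypothetical helpers, not part of PyThaiNLP, and both lists are copied from this report rather than produced by calling the library.

```python
# Hypothetical helpers (not part of PyThaiNLP): compare the documented
# clause segmentation with the one reported in this issue.

# Segmentation shown in the documentation.
EXPECTED = [
    ["ฉัน", "นอน"],
    ["และ", "คุณ", "เล่น", "มือถือ"],
    ["ส่วน", "น้อง", "เขียน", "โปรแกรม"],
]

# Segmentation reported in this issue (a single clause).
ACTUAL = [
    ["ฉัน", "นอน", "และ", "คุณ", "เล่น", "มือถือ", "ส่วน", "น้อง", "เขียน", "โปรแกรม"],
]


def flatten(clauses):
    """Return the tokens of all clauses as one flat list, in order."""
    return [token for clause in clauses for token in clause]


def describe_mismatch(expected, actual):
    """Summarize how two clause segmentations of the same input differ."""
    return {
        "same_tokens": flatten(expected) == flatten(actual),
        "expected_clauses": len(expected),
        "actual_clauses": len(actual),
        "matches": expected == actual,
    }


report = describe_mismatch(EXPECTED, ACTUAL)
print(report)
```

To check a live installation instead, `ACTUAL` could be replaced with the value returned by `clause_tokenize(flatten(EXPECTED))`.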

Steps to reproduce

Create a new conda environment

conda create -n temp_env python=3.10

Install pythainlp and python-crfsuite

conda activate temp_env
python -m pip install pythainlp python-crfsuite

Try running the following code

python
>>> from pythainlp.tokenize import clause_tokenize
>>> clause_tokenize(["ฉัน","นอน","และ","คุณ","เล่น","มือถือ","ส่วน","น้อง","เขียน","โปรแกรม"])
[['ฉัน', 'นอน', 'และ', 'คุณ', 'เล่น', 'มือถือ', 'ส่วน', 'น้อง', 'เขียน', 'โปรแกรม']]

PyThaiNLP version

5.0.4

Python version

3.10.15

Operating system and version

Windows 11 Pro 23H2

More info

python-crfsuite version: 0.9.11

I also ran the same code on Ubuntu 24.04.1 LTS under WSL2 and got the same result.
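
For anyone reproducing this on another machine, the version details listed above can be gathered with the standard library alone. This is a minimal sketch: `collect_versions` is a hypothetical helper, and the distribution names `pythainlp` and `python-crfsuite` are assumed to be the ones used in the pip command in this report.

```python
import platform
from importlib.metadata import PackageNotFoundError, version


def collect_versions(packages=("pythainlp", "python-crfsuite")):
    """Return Python, platform, and installed package versions as a dict.

    A package that is not installed is reported as None instead of
    raising an error.
    """
    info = {
        "python": platform.python_version(),
        "platform": platform.platform(),
    }
    for name in packages:
        try:
            info[name] = version(name)
        except PackageNotFoundError:
            info[name] = None
    return info


if __name__ == "__main__":
    for key, value in collect_versions().items():
        print(f"{key}: {value}")
```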

Possible solution

No response

Files

No response

github-actions · 2024-12-02T08:41:24Z

Hello @panyutsriwirote, thank you for your interest in our work!

If this is a bug report, please provide screenshots and minimal code that reproduces your issue; otherwise we cannot help you.

wannaphong · 2024-12-03T11:15:11Z

I have checked. It isn't a bug.

Thank you for reporting.

wannaphong added the bug (bugs in the library) and documentation (improve documentation and test cases) labels and removed the bug (bugs in the library) label on Dec 3, 2024

wannaphong mentioned this issue Dec 3, 2024

Remove clause_tokenize #1018

Open
