bug: clause_tokenize does not work properly #1011

Open
panyutsriwirote opened this issue Dec 2, 2024 · 2 comments
Labels
documentation improve documentation and test cases

Comments

@panyutsriwirote

Description

As per the documentation of the tokenize submodule (https://pythainlp.org/docs/5.0/api/tokenize.html), clause_tokenize should work as follows:

from pythainlp.tokenize import clause_tokenize
clause_tokenize(["ฉัน","นอน","และ","คุณ","เล่น","มือถือ","ส่วน","น้อง","เขียน","โปรแกรม"])
# [['ฉัน', 'นอน'],
# ['และ', 'คุณ', 'เล่น', 'มือถือ'],
# ['ส่วน', 'น้อง', 'เขียน', 'โปรแกรม']]

However, running it in a fresh conda environment returns the following:

from pythainlp.tokenize import clause_tokenize
clause_tokenize(["ฉัน","นอน","และ","คุณ","เล่น","มือถือ","ส่วน","น้อง","เขียน","โปรแกรม"])
# [['ฉัน', 'นอน', 'และ', 'คุณ', 'เล่น', 'มือถือ', 'ส่วน', 'น้อง', 'เขียน', 'โปรแกรม']]

The model was downloaded immediately before running the code above, so a stale model cache should not be the cause.

Expected results

The returned list should consist of 3 clauses:
[['ฉัน', 'นอน'], ['และ', 'คุณ', 'เล่น', 'มือถือ'], ['ส่วน', 'น้อง', 'เขียน', 'โปรแกรม']]

Current results

The returned list consists of only 1 clause:
[['ฉัน', 'นอน', 'และ', 'คุณ', 'เล่น', 'มือถือ', 'ส่วน', 'น้อง', 'เขียน', 'โปรแกรม']]
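The difference between the two results can be checked mechanically. The following is a minimal sketch, not part of PyThaiNLP: `compare_clauses` and `check_installed` are hypothetical helper names introduced here, and the token lists are copied from this report. The comparison itself needs no third-party packages; only `check_installed` calls `clause_tokenize`, and only if it can be imported.

```python
# Input tokens and documented output, copied from this report.
TOKENS = ["ฉัน", "นอน", "และ", "คุณ", "เล่น", "มือถือ", "ส่วน", "น้อง", "เขียน", "โปรแกรม"]
EXPECTED = [
    ["ฉัน", "นอน"],
    ["และ", "คุณ", "เล่น", "มือถือ"],
    ["ส่วน", "น้อง", "เขียน", "โปรแกรม"],
]


def compare_clauses(expected, actual):
    """Hypothetical helper: summarize how two clause segmentations differ."""

    def flatten(clauses):
        return [token for clause in clauses for token in clause]

    return {
        # True when both segmentations cover the same tokens in the same order
        "same_tokens": flatten(expected) == flatten(actual),
        "expected_clauses": len(expected),
        "actual_clauses": len(actual),
        "match": expected == actual,
    }


def check_installed():
    """Hypothetical helper: run the documented example on the installed PyThaiNLP.

    Returns None when pythainlp (or a dependency such as python-crfsuite)
    cannot be imported.
    """
    try:
        from pythainlp.tokenize import clause_tokenize

        actual = clause_tokenize(TOKENS)
    except ImportError:
        return None
    return compare_clauses(EXPECTED, actual)
```

Applied to the result reported here, `compare_clauses(EXPECTED, [TOKENS])` shows that the tokens are preserved but only one clause is returned instead of three.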

Steps to reproduce

  1. Create a new conda environment
conda create -n temp_env python=3.10
  2. Install pythainlp and python-crfsuite
conda activate temp_env
python -m pip install pythainlp python-crfsuite
  3. Try running the following code
python
>>> from pythainlp.tokenize import clause_tokenize
>>> clause_tokenize(["ฉัน","นอน","และ","คุณ","เล่น","มือถือ","ส่วน","น้อง","เขียน","โปรแกรม"])
[['ฉัน', 'นอน', 'และ', 'คุณ', 'เล่น', 'มือถือ', 'ส่วน', 'น้อง', 'เขียน', 'โปรแกรม']]

PyThaiNLP version

5.0.4

Python version

3.10.15

Operating system and version

Windows 11 Pro 23H2

More info

python-crfsuite version: 0.9.11

I also tried running the same code on the WSL2 distro for Ubuntu 24.04.1 LTS and got the same result.
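For anyone comparing against the environment listed above, the versions can be collected in one step. This is a small sketch using only the standard library; `installed_versions` is a hypothetical helper name, and the package names are the ones installed in the reproduction steps.

```python
import platform
from importlib import metadata


def installed_versions(packages):
    """Hypothetical helper: map each package name to its installed version.

    Packages that are not installed map to None.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Package names taken from the reproduction steps in this report.
report = installed_versions(["pythainlp", "python-crfsuite"])
report["python"] = platform.python_version()
report["platform"] = platform.platform()

for key, value in report.items():
    print(f"{key}: {value}")
```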

Possible solution

No response

Files

No response

github-actions bot commented Dec 2, 2024

Hello @panyutsriwirote, thank you for your interest in our work!

If this is a bug report, please provide screenshots and minimal code to reproduce your issue; otherwise we cannot help you.

Hello @panyutsriwirote, thank you for your interest in our work.

If this is a bug report, please attach screenshots, error messages, and the shortest code that triggers the problem so that we can help.

@wannaphong added the "bug" (bugs in the library) and "documentation" (improve documentation and test cases) labels, and removed the "bug" label, on Dec 3, 2024
@wannaphong
Member

I checked. It isn't a bug.

Thank you for reporting.
