Japanese UTF-8 encoding detected as TIS-620 (Windows-874 (Thai)) #16

Open
hpwamr opened this issue Jan 15, 2020 · 1 comment

hpwamr commented Jan 15, 2020

Hello,
For the development of Notepad3, we use the UCHARDET Charset Detector.

In issue #1831 we ran into poor detection of a Japanese "UTF-8" file: UCHARDET detects it as TIS-620 (Windows-874 (Thai)) with a reliability level of 99%. 😕

These text editors detect it as UTF-8 and display it correctly:

  • Notepad++, Editpad Lite 7, Editplus, Notepad2, Notepad2e, Notepad2-mod,
    Notepad2-zfuliu and VS Code.

Here is the incorrect detection as "TIS-620":

{
  "manifest_version": 2,
  "name": "k view",
  "version": "0.5",
  "description": "ใ��ใ�นใ��ใ€�",
  "browser_action": {
    "default_icon": { "19": "round-done-button.png" }
  },
}

Here is the correct detection as "UTF-8":

{
  "manifest_version": 2,
  "name": "k view",
  "version": "0.5",
  "description": "テスト。",
  "browser_action": {
    "default_icon": { "19": "round-done-button.png" }
  },
}
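
The Thai-looking text in the first listing is what the UTF-8 bytes of the description look like when they are read as Windows-874. A minimal Python sketch of that misreading, using Python's built-in `cp874` codec as a stand-in for the editor's TIS-620 decoding (an assumption; Notepad3's own decoder may treat unmapped bytes differently):

```python
# Encode the sample description as UTF-8, then misread the bytes as Windows-874.
text = "テスト。"
raw = text.encode("utf-8")

# Each character here is a 3-byte UTF-8 sequence starting with 0xE3, which
# Windows-874 maps to the Thai vowel "ใ". Bytes that have no Windows-874
# mapping are replaced with U+FFFD because of errors="replace".
misread = raw.decode("cp874", errors="replace")

print(raw.hex(" "))         # the raw UTF-8 bytes
print(misread)              # the Thai-looking mojibake
print(raw.decode("utf-8"))  # the intended Japanese text
```

Because every lead byte is 0xE3, a short sample like this looks like a run of the same valid Thai letter to a single-byte detector.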

The original sample is attached: Error Detection encoding_utf-8 (issue #1831).zip

Thanks in advance for your attention.
Have a nice day.
hpwamr

Feel free to test the BETA version "Notepad3Portable_5.20.116.2708_BETA.paf.exe.7z" or higher.
See "Notepad3 BETA-channel access #1129" or here Notepad3Portable_5.20.116.2708_BETA.paf.exe.7z.

Note: "Notepad3Portable BETA" can be used in "2 flavors" (with or without the extension ".7z").

Your comments and suggestions are always welcome... 😃

Joungkyun (Owner) commented May 24, 2021

Although it is an issue of uchardet, it is also an issue of libchardet because it uses the same algorithm as uchardet.

The string is too short for sampling.
If the length of the remaining string with ASCII characters removed is less than 10, accurate sampling is unlikely.
For example, ススト。 is recognized as TIS-620, but ススト。ススト。 is recognized as UTF-8.
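
To see how little material the detector has to work with, the non-ASCII part of each sample can be measured. The sketch below is only an illustration, not libchardet or uchardet code: the helper name `non_ascii_stats` is hypothetical, and since the comment above does not say whether "length" means characters or bytes, it reports both.

```python
from typing import Tuple


def non_ascii_stats(text: str) -> Tuple[int, int]:
    """Return (character count, UTF-8 byte count) of the non-ASCII part of text."""
    # Drop ASCII characters, keeping only what a detector could sample from.
    non_ascii = "".join(ch for ch in text if ord(ch) > 0x7F)
    return len(non_ascii), len(non_ascii.encode("utf-8"))


# The two strings from the comment above.
for sample in ("ススト。", "ススト。ススト。"):
    chars, utf8_bytes = non_ascii_stats(sample)
    print(f"{sample}: {chars} non-ASCII characters, {utf8_bytes} UTF-8 bytes")
```

The same helper can be run on the whole manifest text from the report; all of its ASCII keys and values are ignored, leaving only the four characters of the description.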
