Japanese UTF-8 encoding detected as TIS-620 (Windows-874 (Thai)) #16

hpwamr · 2020-01-15T18:13:59Z

Hello,
For the development of Notepad3, we use the UCHARDET Charset Detector.

In issue #1831 we ran into poor detection of Japanese "UTF-8" text: UCHARDET reports it as TIS-620 (Windows-874 (Thai)) with a reliability level of 99%. 😕

These text editors detect it as UTF-8 and display it correctly:

Notepad++, Editpad Lite 7, Editplus, Notepad2, Notepad2e, Notepad2-mod,
Notepad2-zfuliu and VS Code!

Here is the incorrect detection as "TIS-620":

{
  "manifest_version": 2,
  "name": "k view",
  "version": "0.5",
  "description": "ใ��ใ�นใ��ใ€�",
  "browser_action": {
    "default_icon": { "19": "round-done-button.png" }
  },
}

Here is the correct detection as "UTF-8":

{
  "manifest_version": 2,
  "name": "k view",
  "version": "0.5",
  "description": "テスト。",
  "browser_action": {
    "default_icon": { "19": "round-done-button.png" }
  },
}

Attached is the original sample: Error Detection encoding_utf-8 (issue #1831).zip
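For readers who want to reproduce the two renderings above without the attachment, here is a minimal Python sketch. It assumes the sample's "description" value is the four characters shown in the correct rendering and uses Python's built-in `cp874` codec as a stand-in for the TIS-620 / Windows-874 interpretation; the helper name `misread_as_thai` is hypothetical and this is not how UCHARDET itself works.

```python
def misread_as_thai(text: str) -> str:
    """Encode text as UTF-8, then decode the same bytes as Windows-874."""
    raw = text.encode("utf-8")
    # Bytes with no Windows-874 mapping are shown as U+FFFD.
    return raw.decode("cp874", errors="replace")


if __name__ == "__main__":
    sample = "テスト。"
    raw = sample.encode("utf-8")
    print(raw.hex(" "))             # the raw bytes a detector has to judge
    print(misread_as_thai(sample))  # the Thai-looking rendering
    print(raw.decode("utf-8"))      # the intended rendering
```

Every character in the sample starts with the UTF-8 lead byte 0xE3, which Windows-874 maps to a Thai letter, so a short run of such characters can look like plausible Thai to a statistical detector.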

Thanks in advance for your attention.
Have a nice day.
hpwamr

Feel free to test the BETA version "Notepad3Portable_5.20.116.2708_BETA.paf.exe.7z" or higher.
See "Notepad3 BETA-channel access #1129" or here Notepad3Portable_5.20.116.2708_BETA.paf.exe.7z.

Note: "Notepad3Portable BETA" can be used in "2 flavors" (with or without the extension ".7z").

Your comments and suggestions are always welcome... 😃

Joungkyun · 2021-05-24T12:47:55Z

Although it is an issue of uchardet, it is also an issue of libchardet because it uses the same algorithm as uchardet.

The string is too short for sampling.
If the length of the remaining string with ASCII characters removed is less than 10, accurate sampling is unlikely.
For example, ススト。 is recognized as TIS-620, but ススト。ススト。 is recognized as UTF-8.
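One possible caller-side workaround for short inputs, sketched in Python: when the sample contains only a few non-ASCII bytes and those bytes are strictly valid UTF-8, prefer UTF-8 over the detector's guess. This is an assumption about how an application might guard the result, not what uchardet or libchardet do internally; the function names and the `MIN_SAMPLE_BYTES` threshold are hypothetical and would need tuning.

```python
MIN_SAMPLE_BYTES = 30  # hypothetical threshold, not taken from uchardet/libchardet


def non_ascii_byte_count(data: bytes) -> int:
    """Count the bytes that remain once ASCII is ignored."""
    return sum(1 for b in data if b >= 0x80)


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


def prefer_utf8_for_short_input(data: bytes, detected: str) -> str:
    """Override the detector's guess only for small, valid-UTF-8 samples."""
    n = non_ascii_byte_count(data)
    if 0 < n < MIN_SAMPLE_BYTES and is_valid_utf8(data):
        return "UTF-8"
    return detected


if __name__ == "__main__":
    short = "ススト。".encode("utf-8")
    # "TIS-620" stands in for whatever the detector reported.
    print(prefer_utf8_for_short_input(short, "TIS-620"))
```

The check leans on the fact that text in a single-byte Thai encoding rarely forms valid UTF-8 multi-byte sequences, so real TIS-620 input normally fails the strict decode and keeps the detector's answer.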

Joungkyun mentioned this issue May 24, 2021

Single UTF-8 character detected as Windows-1258 #17

Joungkyun added the detection label May 24, 2021
