Single UTF-8 character detected as Windows-1258 #17

Open
hpwamr opened this issue Jan 15, 2020 · 1 comment

hpwamr commented Jan 15, 2020

Hello,
For the development of Notepad3, we use the UCHARDET Charset Detector.

In issue #1848 we ran into a problem where a text containing a single non-ASCII "UTF-8" character is detected by UCHARDET as Windows-1258, with a reliability level of 72%. 😕

Here it is the French "é" character (in "Précis:")!

In the following sample, it is the character "¶" that is wrongly detected and displayed as "ΒΆ".

I would like to add to this issue a well-known text, build_np3portableapp.cmd, encoded in UTF-8
with ONLY ONE non-ASCII character ("delims=¶" on line 33) in this shortened batch file.

- This text is opened incorrectly as "ISO-8859-7 (Greek)" by Notepad3: "delims=ΒΆ"
- This text is opened correctly as "UTF-8" by Notepad3 if I add an encoding tag ":: encoding: UTF-8"
- This text is opened correctly as "UTF-8" by Notepad++, EditPad Lite 7, EditPlus, Notepad2,
  Notepad2e, Notepad2-mod, Notepad2-zufuliu and VS Code!
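
For readers who want to see why "¶" turns into "ΒΆ", here is a minimal Python sketch of the byte-level effect. It only uses Python's built-in codecs and does not call UCHARDET; the helper name `misread` is made up for this illustration.

```python
# Illustration only: the same bytes, read under different encodings.
# "misread" is a hypothetical helper name, not part of Notepad3 or UCHARDET.

def misread(text: str, wrong_encoding: str) -> str:
    """Encode text as UTF-8, then decode those bytes with another encoding."""
    return text.encode("utf-8").decode(wrong_encoding)

# "¶" (U+00B6) is two bytes in UTF-8: 0xC2 0xB6.
PILCROW_UTF8 = "¶".encode("utf-8")

# Read as ISO-8859-7 (Greek), each byte becomes its own character:
# 0xC2 -> "Β" and 0xB6 -> "Ά", which is the "ΒΆ" reported above.
as_greek = misread("¶", "iso8859_7")

# "é" (U+00E9) is 0xC3 0xA9 in UTF-8; read as Windows-1258 it also
# becomes two unrelated characters instead of one.
as_cp1258 = misread("é", "cp1258")
```

Both byte pairs are valid in the single-byte code pages, so a detector that sees only one such pair has very little evidence to prefer UTF-8.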

Attached are the 2 samples: Error Detection Single UTF-8 (issue #1848).zip

Thanks in advance for your attention.
Have a nice day.
hpwamr

Feel free to test the BETA version "Notepad3Portable_5.20.116.2708_BETA.paf.exe.7z" or higher.
See "Notepad3 BETA-channel access #1129" or here Notepad3Portable_5.20.116.2708_BETA.paf.exe.7z.

Note: "Notepad3Portable BETA" can be used in "2 flavors" (with or without the extension ".7z").

Your comments and suggestions are always welcome... 😃

@Joungkyun
Owner

As in #16, the text that has to be judged (here a single non-ASCII character) is too short for reliable detection.

Note that the Windows-1258 issue does not occur with libchardet. This is due to a difference in the Vietnamese language tables between libchardet and uchardet.
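
Since very short inputs give a statistical detector too little to work with, one possible workaround on the caller's side is to check for strictly valid UTF-8 before consulting the detector. The sketch below is only an assumption about how an application could do that; it is not how uchardet, libchardet or Notepad3 are implemented, and `looks_like_utf8` is a hypothetical name.

```python
# Hypothetical pre-check, not part of uchardet, libchardet or Notepad3.

def looks_like_utf8(data: bytes) -> bool:
    """Return True if data is strictly valid UTF-8 and has a non-ASCII byte."""
    try:
        data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return False
    # Pure ASCII is valid UTF-8 too, but it says nothing about the encoding.
    return any(b >= 0x80 for b in data)

# The fragment from the report, as UTF-8 bytes: "delims=" + 0xC2 0xB6.
utf8_sample = b"delims=\xc2\xb6"

# A genuine ISO-8859-7 text with a lone 0xB6 ("Ά") is not valid UTF-8,
# because 0xB6 is a continuation byte with no lead byte before it.
greek_sample = "delims=Ά".encode("iso8859_7")

utf8_ok = looks_like_utf8(utf8_sample)    # True
greek_ok = looks_like_utf8(greek_sample)  # False
```

The trade-off is that a legacy-encoded file can happen to form valid UTF-8 sequences, so a check like this shifts the guess towards UTF-8 rather than removing the ambiguity.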
