Instead of the encoding 'iso-8859-1' is detected 'iso-8859-15' #77

Open
rstm-sf opened this issue Nov 6, 2019 · 7 comments

Comments

@rstm-sf
Collaborator

rstm-sf commented Nov 6, 2019

Hello!

The file is encoded as 'iso-8859-1', but 'iso-8859-15' is detected instead.

Test file: iso-8859-1.txt from the uchardet tests.

Status Log

Get confidence:
-- new match found: confidence 0.01, index 0, charset windows-1251.
-- new match found: confidence 0.05902827, index 6, charset iso-8859-7.
-- new match found: confidence 0.067115635, index 13, charset tis-620.
-- new match found: confidence 0.3858822, index 15, charset iso-8859-1.
-- new match found: confidence 0.40375984, index 18, charset iso-8859-1.
-- new match found: confidence 0.41295946, index 21, charset iso-8859-2.
-- new match found: confidence 0.42356956, index 23, charset iso-8859-1.
-- new match found: confidence 0.8360017, index 32, charset iso-8859-15.
Get confidence done.
SBCS Group Prober --------begin status
SBCS 0.01: [windows-1251]
SBCS: 0.01 [windows-1251]

SBCS 0.01: [koi8-r]
SBCS: 0.01 [koi8-r]

SBCS 0.01: [iso-8859-5]
SBCS: 0.01 [iso-8859-5]

SBCS 0.01: [x-mac-cyrillic]
SBCS: 0.01 [x-mac-cyrillic]

SBCS 0.01: [ibm866]
SBCS: 0.01 [ibm866]

SBCS 0.01: [ibm855]
SBCS: 0.01 [ibm855]

SBCS 0.05902827: [iso-8859-7]
SBCS: 0.05902827 [iso-8859-7]

SBCS 0.05902827: [windows-1253]
SBCS: 0.05902827 [windows-1253]

SBCS 0.01: [iso-8859-5]
SBCS: 0.01 [iso-8859-5]

SBCS 0: [windows-1251]
SBCS: 0.00 [windows-1251]

SBCS 0: [windows-1255]
HEB: 0 - 0 [Logical-Visual score]

SBCS 0: [windows-1255]
SBCS: 0.00 [windows-1255]

SBCS 0: [windows-1255]
SBCS: 0.00 [windows-1255]

SBCS 0.067115635: [tis-620]
SBCS: 0.06711563 [tis-620]

SBCS 0.067115635: [iso-8859-11]
SBCS: 0.06711563 [iso-8859-11]

SBCS 0.3858822: [iso-8859-1]
SBCS: 0.3858822 [iso-8859-1]

SBCS 0.3858822: [iso-8859-15]
SBCS: 0.3858822 [iso-8859-15]

SBCS 0.3858822: [windows-1252]
SBCS: 0.3858822 [windows-1252]

SBCS 0.40375984: [iso-8859-1]
SBCS: 0.4037598 [iso-8859-1]

SBCS 0.40375984: [iso-8859-15]
SBCS: 0.4037598 [iso-8859-15]

SBCS 0.40375984: [windows-1252]
SBCS: 0.4037598 [windows-1252]

SBCS 0.41295946: [iso-8859-2]
SBCS: 0.4129595 [iso-8859-2]

SBCS 0.41295946: [windows-1250]
SBCS: 0.4129595 [windows-1250]

SBCS 0.42356956: [iso-8859-1]
SBCS: 0.4235696 [iso-8859-1]

SBCS 0.42356956: [windows-1252]
SBCS: 0.4235696 [windows-1252]

SBCS 0.41898435: [iso-8859-3]
SBCS: 0.4189844 [iso-8859-3]

SBCS 0.38790238: [iso-8859-3]
SBCS: 0.3879024 [iso-8859-3]

SBCS 0.38790238: [iso-8859-9]
SBCS: 0.3879024 [iso-8859-9]

SBCS inactive: [iso-8859-6] (i.e. confidence is too low).
SBCS 0: [windows-1256]
SBCS: 0.00 [windows-1256]

SBCS 0.16577692: [viscii]
SBCS: 0.1657769 [viscii]

SBCS 0.18163893: [windows-1258]
SBCS: 0.1816389 [windows-1258]

SBCS 0.8360017: [iso-8859-15]
SBCS: 0.8360017 [iso-8859-15]

SBCS 0.8360017: [iso-8859-1]
SBCS: 0.8360017 [iso-8859-1]

SBCS 0.8360017: [windows-1252]
SBCS: 0.8360017 [windows-1252]

SBCS 0.43422332: [iso-8859-13]
SBCS: 0.4342233 [iso-8859-13]

SBCS 0.40545458: [iso-8859-10]
SBCS: 0.4054546 [iso-8859-10]

SBCS 0.40545458: [iso-8859-4]
SBCS: 0.4054546 [iso-8859-4]

SBCS 0.42485002: [iso-8859-13]
SBCS: 0.42485 [iso-8859-13]

SBCS 0.42485002: [iso-8859-10]
SBCS: 0.42485 [iso-8859-10]

SBCS 0.42485002: [iso-8859-4]
SBCS: 0.42485 [iso-8859-4]

SBCS 0.366608: [iso-8859-1]
SBCS: 0.366608 [iso-8859-1]

SBCS 0.366608: [iso-8859-9]
SBCS: 0.366608 [iso-8859-9]

SBCS 0.366608: [iso-8859-15]
SBCS: 0.366608 [iso-8859-15]

SBCS 0.366608: [windows-1252]
SBCS: 0.366608 [windows-1252]

SBCS 0.36032423: [iso-8859-3]
SBCS: 0.3603242 [iso-8859-3]

SBCS 0.3647504: [windows-1250]
SBCS: 0.3647504 [windows-1250]

SBCS 0.3647504: [iso-8859-2]
SBCS: 0.3647504 [iso-8859-2]

SBCS 0.42094523: [MAC-CENTRALEUROPE]
SBCS: 0.4209452 [MAC-CENTRALEUROPE]

SBCS 0.40236503: [ibm852]
SBCS: 0.402365 [ibm852]

SBCS 0.32631624: [windows-1250]
SBCS: 0.3263162 [windows-1250]

SBCS 0.32631624: [iso-8859-2]
SBCS: 0.3263162 [iso-8859-2]

SBCS 0.40557358: [MAC-CENTRALEUROPE]
SBCS: 0.4055736 [MAC-CENTRALEUROPE]

SBCS 0.36612508: [ibm852]
SBCS: 0.3661251 [ibm852]

SBCS 0.35397846: [windows-1250]
SBCS: 0.3539785 [windows-1250]

SBCS 0.35397846: [iso-8859-2]
SBCS: 0.3539785 [iso-8859-2]

SBCS 0.41416448: [iso-8859-13]
SBCS: 0.4141645 [iso-8859-13]

SBCS 0.33398414: [iso-8859-16]
SBCS: 0.3339841 [iso-8859-16]

SBCS 0.3964395: [MAC-CENTRALEUROPE]
SBCS: 0.3964395 [MAC-CENTRALEUROPE]

SBCS 0.43202174: [ibm852]
SBCS: 0.4320217 [ibm852]

SBCS 0.42139196: [iso-8859-1]
SBCS: 0.421392 [iso-8859-1]

SBCS 0.42139196: [iso-8859-4]
SBCS: 0.421392 [iso-8859-4]

SBCS 0.42139196: [iso-8859-9]
SBCS: 0.421392 [iso-8859-9]

SBCS 0.42139196: [iso-8859-13]
SBCS: 0.421392 [iso-8859-13]

SBCS 0.42139196: [iso-8859-15]
SBCS: 0.421392 [iso-8859-15]

SBCS 0.42139196: [windows-1252]
SBCS: 0.421392 [windows-1252]

SBCS 0.42121872: [iso-8859-1]
SBCS: 0.4212187 [iso-8859-1]

SBCS 0.42121872: [iso-8859-3]
SBCS: 0.4212187 [iso-8859-3]

SBCS 0.42121872: [iso-8859-9]
SBCS: 0.4212187 [iso-8859-9]

SBCS 0.42121872: [iso-8859-15]
SBCS: 0.4212187 [iso-8859-15]

SBCS 0.42121872: [windows-1252]
SBCS: 0.4212187 [windows-1252]

SBCS 0.36684126: [windows-1250]
SBCS: 0.3668413 [windows-1250]

SBCS 0.36684126: [iso-8859-2]
SBCS: 0.3668413 [iso-8859-2]

SBCS 0.40297794: [iso-8859-13]
SBCS: 0.4029779 [iso-8859-13]

SBCS 0.37994418: [iso-8859-16]
SBCS: 0.3799442 [iso-8859-16]

SBCS 0.40297794: [MAC-CENTRALEUROPE]
SBCS: 0.4029779 [MAC-CENTRALEUROPE]

SBCS 0.4339976: [ibm852]
SBCS: 0.4339976 [ibm852]

SBCS 0.42192674: [windows-1252]
SBCS: 0.4219267 [windows-1252]

SBCS 0.42192674: [windows-1257]
SBCS: 0.4219267 [windows-1257]

SBCS 0.42192674: [iso-8859-4]
SBCS: 0.4219267 [iso-8859-4]

SBCS 0.42192674: [iso-8859-13]
SBCS: 0.4219267 [iso-8859-13]

SBCS 0.42192674: [iso-8859-15]
SBCS: 0.4219267 [iso-8859-15]

SBCS 0.38324198: [iso-8859-1]
SBCS: 0.383242 [iso-8859-1]

SBCS 0.38324198: [iso-8859-9]
SBCS: 0.383242 [iso-8859-9]

SBCS 0.38324198: [iso-8859-15]
SBCS: 0.383242 [iso-8859-15]

SBCS 0.38324198: [windows-1252]
SBCS: 0.383242 [windows-1252]

SBCS 0.40346685: [windows-1250]
SBCS: 0.4034669 [windows-1250]

SBCS 0.40346685: [iso-8859-2]
SBCS: 0.4034669 [iso-8859-2]

SBCS 0.40346685: [iso-8859-16]
SBCS: 0.4034669 [iso-8859-16]

SBCS 0.4482638: [ibm852]
SBCS: 0.4482638 [ibm852]

SBCS 0.4214702: [windows-1250]
SBCS: 0.4214702 [windows-1250]

SBCS 0.4214702: [iso-8859-2]
SBCS: 0.4214702 [iso-8859-2]

SBCS 0.4214702: [iso-8859-16]
SBCS: 0.4214702 [iso-8859-16]

SBCS 0.4214702: [MAC-CENTRALEUROPE]
SBCS: 0.4214702 [MAC-CENTRALEUROPE]

SBCS 0.4533166: [ibm852]
SBCS: 0.4533166 [ibm852]

SBCS 0.60846615: [iso-8859-1]
SBCS: 0.6084661 [iso-8859-1]

SBCS 0.60846615: [iso-8859-4]
SBCS: 0.6084661 [iso-8859-4]

SBCS 0.60846615: [iso-8859-9]
SBCS: 0.6084661 [iso-8859-9]

SBCS 0.60846615: [iso-8859-15]
SBCS: 0.6084661 [iso-8859-15]

SBCS 0.60846615: [windows-1252]
SBCS: 0.6084661 [windows-1252]

SBCS Group found best match [iso-8859-15] confidence 0.8360017.

@rstm-sf rstm-sf changed the title Instead of the encoding 'iso-8859-1' is defined 'iso-8859-15' Instead of the encoding 'iso-8859-1' is detected 'iso-8859-15' Nov 9, 2019
@rstm-sf
Collaborator Author

rstm-sf commented Jan 12, 2020

In the Status Log, the following three probers report exactly the same confidence:

SBCS 0.8360017: [iso-8859-15]
SBCS: 0.8360017 [iso-8859-15]

SBCS 0.8360017: [iso-8859-1]
SBCS: 0.8360017 [iso-8859-1]

SBCS 0.8360017: [windows-1252]
SBCS: 0.8360017 [windows-1252]

All three belong to a single language:

// Danish
probers[32] = new SingleByteCharSetProber(new Iso_8859_15_DanishModel());
probers[33] = new SingleByteCharSetProber(new Iso_8859_1_DanishModel());
probers[34] = new SingleByteCharSetProber(new Windows_1252_DanishModel());

The same pattern of identical confidence values also appears in the log for other languages.
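
The log is consistent with a tie between probers being decided purely by prober order. Here is a minimal Python sketch of that idea, assuming the group prober replaces its best match only when a confidence is strictly greater than the current best. That rule is inferred from the "new match found" lines in the Status Log, not confirmed from the source, and `pick_best` and the tuple layout are hypothetical:

```python
# Hypothetical model of the selection loop: (index, charset, confidence).
# The values are taken from the Status Log and the prober list above.
candidates = [
    (23, "iso-8859-1", 0.42356956),
    (32, "iso-8859-15", 0.8360017),
    (33, "iso-8859-1", 0.8360017),
    (34, "windows-1252", 0.8360017),
]


def pick_best(probers):
    """Return the first prober that reaches the highest confidence."""
    best = None
    for index, charset, confidence in probers:
        # Assumed rule: a tie never replaces the current best match.
        if best is None or confidence > best[2]:
            best = (index, charset, confidence)
    return best


print(pick_best(candidates))
```

Under that assumption the prober at index 32 wins only because it is registered before the other two, which would match the "best match [iso-8859-15]" line at the end of the log.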

@rstm-sf
Collaborator Author

rstm-sf commented Jan 12, 2020

As I understand it, these encodings are so similar that they easily produce the same statistics:
https://en.wikipedia.org/wiki/ISO-8859-1#Similar_character_sets
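
To make the similarity concrete, the following Python sketch lists the byte values that decode differently under the two encodings. It relies only on the standard library codecs `iso-8859-1` and `iso-8859-15`; the variable names are illustrative:

```python
# Compare how each byte value decodes under the two encodings.
differing = [
    b for b in range(256)
    if bytes([b]).decode("iso-8859-1") != bytes([b]).decode("iso-8859-15")
]

for b in differing:
    latin1 = bytes([b]).decode("iso-8859-1")
    latin9 = bytes([b]).decode("iso-8859-15")
    print(f"0x{b:02X}: {latin1!r} vs {latin9!r}")
```

Only a handful of byte values differ. A text that never uses them decodes identically under both encodings, which is presumably why the models report the same confidence.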

@rstm-sf
Collaborator Author

rstm-sf commented Jan 12, 2020

Can we come up with a workaround or will we have to do as in #80?

@rstm-sf
Collaborator Author

rstm-sf commented Jan 19, 2020

It seems that, to keep the ability to detect more encodings in the future, we need to change the API so that it returns a collection of results. That way we can return all encodings that score the same.
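
A rough Python sketch of what such a result could look like, not the library's API: `Candidate`, `detect_all` and the `tolerance` parameter are hypothetical names, and the confidence values are copied from the Status Log:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    charset: str
    confidence: float


def detect_all(candidates, tolerance=1e-6):
    """Return every candidate whose confidence ties with the best one."""
    if not candidates:
        return []
    best = max(c.confidence for c in candidates)
    # Keep the original order so the caller still sees the first match first.
    return [c for c in candidates if best - c.confidence <= tolerance]


results = detect_all([
    Candidate("iso-8859-15", 0.8360017),
    Candidate("iso-8859-1", 0.8360017),
    Candidate("windows-1252", 0.8360017),
    Candidate("ibm852", 0.4533166),
])
print([c.charset for c in results])
```

The caller would then decide which of the tied encodings to prefer, instead of the detector silently picking one.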

@304NotModified
Member

So we could fix this with a breaking change?

@rstm-sf
Collaborator Author

rstm-sf commented Nov 21, 2020

As far as I remember, my last idea was to look at how the model coefficients are compiled, to make detection more accurate, but that does not seem to be an easy task.

The proposed option of returning similar encodings is only a possible workaround.

@304NotModified 304NotModified added this to the 3.0 milestone Jul 13, 2021
@304NotModified
Member

@rstm-sf could we fix this for 3.0?
