Variants of CJK encodings do not match variants specified by WHATWG, even when a more similar Python codec exists. #26

harjitmoe · 2021-02-02T23:21:32Z

This is a confusing topic, since most people who learn that Shift JIS exists do not also want to learn about multiple competing Shift JIS versions.

However:

WHATWG's index jis0208 includes "formerly proprietary extensions from IBM and NEC". Python's codec for Shift JIS including these extensions is "cp932", aka "ms-kanji". Python's "shift_jis" codec excludes these extensions. Sadly, Python does not offer EUC-JP or ISO-2022-JP codecs including these extensions.
WHATWG's index Big5 includes "the Hong Kong Supplementary Character Set and other common extensions". Python's "big5" codec follows BIG5.TXT, which does not include these extensions but does include a less common extension for hiragana and katakana. That extension is incompatible with (and actually collides with) the hiragana and katakana extension included by the ETEN, IBM and WHATWG versions of Big5. Python's "big5hkscs" codec is not identical to the WHATWG behaviour: there are a small number of edge cases, it does not treat codes with lead bytes below 0xA1 as decode-only, and its decoder does not accept absolutely every code that WHATWG's does. It is nonetheless far closer to the WHATWG behaviour than Python's "big5" codec, especially in the decoder. The encoders still differ considerably in which codes they exclude, but the output of Python's "big5hkscs" encoder will basically always be interpreted correctly by WHATWG's "big5" decoder, which cannot be said of the output of Python's "big5" encoder.
WHATWG's index EUC-KR consists of "the KS X 1001 standard and the Unified Hangul Code, more commonly known together as Windows Codepage 949". Python's codec for exactly this is "cp949", aka "uhc". By contrast, Python's "euc-kr" codec does not include the Unified Hangul Code extensions, and instead transforms the characters in question to and from KS X 1001 combining sequences (which work differently to Unicode combining sequences; hence, the characters in question do not exhibit combining behaviour when decoded one-by-one to Unicode). The WHATWG decoder for EUC-KR does not recognise or transform back these sequences.
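
The encoder side of these differences can be checked with the standard library alone. The following is only a minimal sketch: `try_encode` is a hypothetical helper written for this illustration, and the two characters are taken from the examples in this issue.

```python
def try_encode(text, codec):
    """Return the encoded bytes, or None if the codec cannot encode the text."""
    try:
        return text.encode(codec)
    except UnicodeEncodeError:
        return None


# (character, narrower Python codec, wider Python codec)
pairs = [
    ("①", "shift_jis", "cp932"),  # character from the NEC extensions
    ("똠", "euc_kr", "cp949"),     # character from the Unified Hangul Code
]

for char, narrow, wide in pairs:
    # Print what each codec produces so the two can be compared side by side.
    print(char, narrow, try_encode(char, narrow), wide, try_encode(char, wide))
```

The wider codec should give a two-byte code in both cases, while the narrower one either fails or produces something different.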

Some illustrative examples where differences occur:

>>> webencodings.decode(b'\x87\x82\x87@ \xedB', "windows-31j") # Should be "№① 鍈"
('�ｇ@ �B', <Encoding shift_jis>)
>>> webencodings.decode(b'\xc7g\xc6\xf1\xc6\xfd\xc7g\xc6\xf1\xc6\xfd', "big5-hkscs") # Should be "むかしむかし"
('ハろウハろウ', <Encoding big5>)
>>> webencodings.decode(b'\x8cc\xb9\xe6\xb0\xa2\xc7\xcf', "windows-949") # Should be "똠방각하"
('�c방각하', <Encoding euc-kr>)
>>>
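
For comparison, the same byte strings can be decoded directly with Python's built-in codecs, bypassing webencodings. This is a rough sketch, not a conformance test: the expected strings are the ones given in the comments of the session above, and `errors="replace"` is an assumption chosen so that undecodable bytes show up as U+FFFD instead of raising.

```python
samples = [
    # (bytes from the session above, narrower codec, wider codec, expected text)
    (b"\x87\x82\x87@ \xedB", "shift_jis", "cp932", "№① 鍈"),
    (b"\xc7g\xc6\xf1\xc6\xfd\xc7g\xc6\xf1\xc6\xfd", "big5", "big5hkscs", "むかしむかし"),
    (b"\x8cc\xb9\xe6\xb0\xa2\xc7\xcf", "euc_kr", "cp949", "똠방각하"),
]

results = []
for raw, narrow, wide, expected in samples:
    # Decode the same bytes with both codecs and keep the results for comparison.
    narrow_text = raw.decode(narrow, errors="replace")
    wide_text = raw.decode(wide, errors="replace")
    results.append((narrow_text, wide_text, expected))
    print(f"{narrow}: {narrow_text!r}  {wide}: {wide_text!r}  expected: {expected!r}")
```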

A number of other differences exist, and a fully conformant implementation of the WHATWG Encoding Standard is not possible in Python without re-implementing several of the encodings (including most of the CJK ones, as well as e.g. KOI8-U). Even so, the degree of conformance, and in particular of compatibility, would be considerably improved for much less effort by:

Using Python's "ms-kanji" codec for WHATWG's Shift JIS, not Python's "shift_jis" codec.
Using Python's "big5hkscs" codec for WHATWG's Big5, not Python's "big5" codec.
Using Python's "uhc" codec for WHATWG's EUC-KR, not Python's "euc-kr" codec.
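
The three substitutions above could be expressed as a small override table. This is a hedged sketch only: `PYTHON_CODEC_OVERRIDES` and `python_codec_for` are hypothetical names made up for illustration and are not claimed to match how webencodings is actually organised. Only `codecs.lookup` is a real standard-library call.

```python
import codecs

# Hypothetical override table: WHATWG encoding name -> Python codec name.
PYTHON_CODEC_OVERRIDES = {
    "shift_jis": "ms-kanji",  # alias of Python's cp932
    "big5": "big5hkscs",
    "euc-kr": "uhc",          # alias of Python's cp949
}


def python_codec_for(whatwg_name):
    """Return the CodecInfo to use for a WHATWG encoding name."""
    # Fall back to the WHATWG name itself when no override is listed.
    python_name = PYTHON_CODEC_OVERRIDES.get(whatwg_name, whatwg_name)
    return codecs.lookup(python_name)


for name in PYTHON_CODEC_OVERRIDES:
    print(name, "->", python_codec_for(name).name)
```

Resolving through `codecs.lookup` also confirms that each alias exists in the running interpreter, since an unknown name raises `LookupError`.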

sorcio · 2022-05-03T22:06:49Z

I'm not sure this project is receiving maintenance, but html5lib still depends on it, so...

I made a PR (#31) to address this. Thanks for the thorough analysis, it helped me a lot!

Incidentally, CPython might be using a slightly outdated character mapping. I think big5-hkscs was not updated for HKSCS-2008 since the scripts to generate the mappings for traditional Chinese, and the corresponding source data, seem to be lost through many transitions (see python/cpython#84508).

This shouldn't block the fix, because big5-hkscs is definitely a lot closer to the WHATWG definition than big5.

sorcio linked a pull request May 3, 2022 that will close this issue

remap to use correct python cjk codecs #31

Open
