Hoist "forbidden domain code point" check into "domain to ASCII" #818

hsivonen · 2024-02-02T13:29:16Z

What is the issue with the URL Standard?

When reading https://url.spec.whatwg.org/#concept-domain-to-ascii in isolation from https://url.spec.whatwg.org/#concept-host-parser (and without reading ICU4C's uts46.cpp first), it's not at all apparent that 1) STD3 rules are really a post-processing step to UTS 46 mapping, despite UTS 46 making it look like a pre-processing step, and that 2) the URL Standard's forbidden domain code point check is a similar but different post-processing step that takes place instead of STD3 post-processing.

The spec could be improved by hoisting the forbidden domain code point check from under https://url.spec.whatwg.org/#concept-host-parser into https://url.spec.whatwg.org/#concept-domain-to-ascii and adding a note that it is an ASCII filtering step that happens instead of STD3 filtering for compatibility with (whatever it is for compatibility with).

Even better if the Note listed what the difference between STD3 filtering and "forbidden domain code point" filtering is (16 rather surprising ASCII characters by my manual check) and the rationale for the differences.
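
For anyone who wants to reproduce the comparison rather than do it by hand, here is a minimal sketch. The forbidden host code point set is transcribed from my reading of the URL Standard and should be treated as an assumption to verify against the current spec text; the names `FORBIDDEN_HOST`, `FORBIDDEN_DOMAIN`, and `only_std3_denies` are made up for illustration.

```python
import string

# Assumption: transcribed from the URL Standard's "forbidden host code point"
# definition; verify against the current spec text before relying on it.
FORBIDDEN_HOST = set("\x00\t\n\r #/:<>?@[\\]^|")

# "Forbidden domain code point": forbidden host code points, C0 controls,
# U+0025 (%), and U+007F DELETE.
FORBIDDEN_DOMAIN = FORBIDDEN_HOST | {chr(c) for c in range(0x20)} | {"%", "\x7f"}

ALL_ASCII = {chr(c) for c in range(0x80)}

# STD3 filtering keeps only letters, digits, and hyphen; the full stop is the
# label separator, so it is treated as allowed here too.
STD3_ALLOWED = set(string.ascii_letters + string.digits + "-.")
STD3_DENIED = ALL_ASCII - STD3_ALLOWED

# ASCII code points that STD3 rejects but the URL Standard lets through.
only_std3_denies = sorted(STD3_DENIED - FORBIDDEN_DOMAIN)
print(" ".join(only_std3_denies))
```

The printed list depends on the spec revision the set was copied from, so it may not match a count taken against an older or newer version.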

annevk · 2024-02-02T13:58:21Z

This relates to #397. Moving the check seems like an obvious improvement we can make right away. The remainder is quite a bit harder to do.

hsivonen · 2024-03-01T09:26:57Z

After looking into this more, I think the right abstraction would be for UTS 46 to take an ASCII deny list instead of taking a boolean flag for STD3 rules.

What the deny list can modify should probably be constrained so that denying ASCII letters, digits, hyphen or full-stop would not be allowed. I think it would simplify data quite a bit if the caller of UTS 46 was not permitted to allow the ASCII space. (I am not aware of use cases for permitting ASCII space in domain name-like things, and the characteristics of the output get weird if space is allowed.)

But whether the rest of ASCII is allowed or denied could be customizable by the caller of UTS 46, and I think acting on that deny list should belong in the UTS 46 algorithms and not in the algorithms in URL.

So far, I'm not aware of more than two relevant configurations: the STD3 list (deny everything that I didn't list as must-allow above) and the WHATWG list ("forbidden domain code point"). So far, in the code I'm writing, I'm supporting only these two options.
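
A rough sketch of what such a constrained deny-list parameter could look like. This is not taken from UTS 46 or from any existing implementation; `make_deny_list`, `passes`, and the constants are hypothetical names, and the constraints are simply the ones proposed above (letters, digits, hyphen, and full stop can never be denied; space can never be allowed).

```python
import string

ASCII = frozenset(chr(c) for c in range(0x80))
# Code points a caller may never deny, per the constraint proposed above.
MUST_ALLOW = frozenset(string.ascii_letters + string.digits + "-.")
# Code points a caller may never allow (only ASCII space in this sketch).
MUST_DENY = frozenset(" ")


def make_deny_list(denied):
    """Validate a caller-supplied ASCII deny list (hypothetical helper)."""
    deny = frozenset(denied)
    if not deny <= ASCII:
        raise ValueError("deny list must contain only ASCII code points")
    if deny & MUST_ALLOW:
        raise ValueError("letters, digits, hyphen, and full stop cannot be denied")
    if not MUST_DENY <= deny:
        raise ValueError("ASCII space must be denied")
    return deny


# STD3 configuration: deny everything outside the must-allow set.
STD3_DENY_LIST = make_deny_list(ASCII - MUST_ALLOW)


def passes(text, deny):
    """Return True if text contains no code point from the deny list."""
    return not any(ch in deny for ch in text)


print(passes("example.com", STD3_DENY_LIST))   # no denied code points
print(passes("ex_ample.com", STD3_DENY_LIST))  # underscore is denied by STD3
```

The WHATWG configuration would be a second call to `make_deny_list` with the forbidden domain code points as its argument.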

I'm thinking of sending UTS 46 feedback to this effect. @annevk, what do you think?

annevk · 2024-03-01T16:11:12Z

Any kind of restructuring that's editorial but can lead to more efficient implementations seems fair game and I'm supportive of that being pursued.

hsivonen · 2024-04-04T12:01:25Z

For reference, I sent this feedback:

When implementing UTS 46, the most time-consuming wrong path was trying to design data structures for UTS 46 data assuming that the data needs to have distinct data entries for disallowed_STD3_valid and disallowed_STD3_mapped before discovering that these can be handled as valid and mapped with an ASCII deny list applied afterwards.

I suggest refactoring the spec so that:

disallowed_STD3_valid and disallowed_STD3_mapped become simply valid and mapped in the data and the spec says when to apply an ASCII deny list

instead of a boolean UseSTD3ASCIIRules the algorithm would take an ASCII deny list.

UTS 46 itself could define an STD3 ASCII deny list and the WHATWG URL Standard could use forbidden domain code point https://url.spec.whatwg.org/#forbidden-domain-code-point as an ASCII deny list parameter to UTS 46.

It would probably be appropriate to make informative remarks that a) putting ASCII letters, digits, or hyphen on the deny list would break things and b) in the validation phase, the ASCII period can be put on the deny list to handle that validity constraint as part of the ASCII deny list check.
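
To illustrate the suggested refactoring, a toy sketch: the two STD3-specific statuses are folded into plain valid and mapped, and the deny list runs on the mapped output. The table rows are made up for illustration and are not copied from IdnaMappingTable.txt; `TOY_TABLE`, `collapse`, and `map_then_filter` are hypothetical names.

```python
# Toy rows only: statuses are illustrative, not copied from IdnaMappingTable.txt.
TOY_TABLE = {
    "a": ("valid", None),
    "A": ("mapped", "a"),
    "_": ("disallowed_STD3_valid", None),
    "\uff3f": ("disallowed_STD3_mapped", "_"),  # FULLWIDTH LOW LINE
}


def collapse(status):
    """Fold the STD3-specific statuses into plain valid/mapped."""
    return status.removeprefix("disallowed_STD3_")  # Python 3.9+


def map_then_filter(text, deny):
    """Map each code point, then apply the ASCII deny list to the result."""
    out = []
    for ch in text:
        status, target = TOY_TABLE[ch]  # KeyError for code points not in the toy table
        out.append(target if collapse(status) == "mapped" else ch)
    mapped = "".join(out)
    # The deny list is a post-processing step on the mapped output.
    if any(ch in deny for ch in mapped):
        raise ValueError("denied ASCII code point in mapped output")
    return mapped


print(map_then_filter("A\uff3fa", frozenset()))
```

With this shape the data only needs two statuses for these rows, and the STD3 versus WHATWG behavior is selected entirely by the `deny` argument.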

annevk · 2024-11-29T08:51:49Z

I've created a PR to make this part of domain to ASCII and add the note. Perhaps in a future UTS46 revision it can be refactored further.

Fixes #818.

hsivonen added the "topic: idna" and "editorial" (changes that do not affect how the standard is understood) labels Feb 5, 2024

annevk added a commit that referenced this issue Nov 29, 2024

Editorial: check forbidden domain code points in domain to ASCII

d04a7dd

Fixes #818.

annevk mentioned this issue Nov 29, 2024

Editorial: check forbidden domain code points in domain to ASCII #841

Merged

annevk added a commit that referenced this issue Nov 29, 2024

Editorial: check forbidden domain code points in domain to ASCII

c3d173f

Fixes #818.

annevk closed this as completed in #841 Nov 29, 2024
