Hoist "forbidden domain code point" check into "domain to ASCII" #818
Comments
This relates to #397. Moving the check seems like an obvious improvement we can make right away. The remainder is quite a bit harder to do.
After looking into this more, I think the right abstraction would be for UTS 46 to take an ASCII deny list instead of a boolean flag for STD3 rules. What the deny list can modify should probably be constrained so that denying ASCII letters, digits, hyphen, or full stop is not allowed. I think it would simplify the data quite a bit if the caller of UTS 46 were not permitted to allow the ASCII space. (I am not aware of use cases for permitting ASCII space in domain name-like things, and the characteristics of the output get weird if space is allowed.) But whether the rest of ASCII is allowed or denied could be customizable by the caller of UTS 46, and I think acting on that deny list belongs in the UTS 46 algorithms and not in the algorithms in URL. So far, I'm not aware of more than two relevant configurations: the STD3 list (deny everything that I didn't list as must-allow above) and the WHATWG list ("forbidden domain code point"). In the code I'm writing, I'm supporting only these two options. I'm thinking of sending UTS 46 feedback to this effect. @annevk, what do you think?
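To make the proposed constraints concrete, here is a minimal Python sketch of what such a deny-list parameter might look like. The class name `AsciiDenyList`, its `permits` method, and the constants are hypothetical and not taken from UTS 46, the URL Standard, or any existing library. The sketch encodes only the rules stated in the comment above (letters, digits, hyphen, and full stop can never be denied; space can never be allowed) and builds the STD3 configuration as "deny everything that is not must-allow".

```python
import string

ASCII = frozenset(chr(c) for c in range(128))
# Code points a caller may never deny: ASCII letters, digits, hyphen, full stop.
MUST_ALLOW = frozenset(string.ascii_letters + string.digits + "-.")
# Code points a caller may never allow: the comment above suggests ASCII space.
MUST_DENY = frozenset(" ")


class AsciiDenyList:
    """Hypothetical stand-in for a deny-list parameter passed to UTS 46."""

    def __init__(self, denied):
        denied = frozenset(denied)
        if not denied <= ASCII:
            raise ValueError("deny list may only contain ASCII code points")
        if denied & MUST_ALLOW:
            raise ValueError("letters, digits, hyphen, and full stop cannot be denied")
        if not MUST_DENY <= denied:
            raise ValueError("ASCII space cannot be allowed")
        self.denied = denied

    def permits(self, ascii_domain):
        """Post-processing check: True if no denied code point is present."""
        return not any(ch in self.denied for ch in ascii_domain)


# STD3 configuration as described above: deny all ASCII that is not must-allow.
STD3 = AsciiDenyList(ASCII - MUST_ALLOW)
```

Under this sketch, `STD3.permits("example.com")` is true and `STD3.permits("a_b.example")` is false. A second, WHATWG-flavoured instance would be constructed the same way from the URL Standard's forbidden domain code points.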
Any kind of restructuring that's editorial but can lead to more efficient implementations seems fair game and I'm supportive of that being pursued.
For reference, I sent this feedback:
I've created a PR to make this part of domain to ASCII and add the note. Perhaps in a future UTS 46 revision it can be refactored further.
What is the issue with the URL Standard?
When reading https://url.spec.whatwg.org/#concept-domain-to-ascii in isolation from https://url.spec.whatwg.org/#concept-host-parser (and without reading ICU4C's uts46.cpp first), it's not at all apparent that 1) STD3 rules are really a post-processing step to UTS 46 mapping, despite UTS 46 making it look like a pre-processing step, and that 2) the URL Standard's forbidden domain code point check is a similar but different post-processing step that takes place instead of STD3 post-processing.
The spec could be improved by hoisting the forbidden domain code point check from under https://url.spec.whatwg.org/#concept-host-parser into https://url.spec.whatwg.org/#concept-domain-to-ascii and adding a note that it is an ASCII filtering step that happens instead of STD3 filtering for compatibility with (whatever it is for compatibility with).
Even better if the note listed the differences between STD3 filtering and "forbidden domain code point" filtering (16 rather surprising ASCII characters by my manual check) and the rationale for them.