Hoist "forbidden domain code point" check into "domain to ASCII" #818

Closed
hsivonen opened this issue Feb 2, 2024 · 5 comments · Fixed by #841
Labels
editorial (Changes that do not affect how the standard is understood), topic: idna

Comments

@hsivonen
Member

hsivonen commented Feb 2, 2024

What is the issue with the URL Standard?

When reading https://url.spec.whatwg.org/#concept-domain-to-ascii in isolation from https://url.spec.whatwg.org/#concept-host-parser (and without reading ICU4C's uts46.cpp first), it's not at all apparent that 1) the STD3 rules are really a post-processing step applied after UTS 46 mapping, even though UTS 46 makes them look like a pre-processing step, and that 2) the URL Standard's forbidden domain code point check is a similar but different post-processing step that takes place instead of STD3 post-processing.

The spec could be improved by hoisting the forbidden domain code point check from under https://url.spec.whatwg.org/#concept-host-parser into https://url.spec.whatwg.org/#concept-domain-to-ascii and adding a note that it is an ASCII filtering step that happens instead of STD3 filtering for compatibility with (whatever it is for compatibility with).

Even better if the Note listed what the difference between STD3 filtering and "forbidden domain code point" filtering is (16 rather surprising ASCII characters by my manual check) and the rationale for the differences.
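
To make the comparison concrete, here is a rough Python sketch that computes which ASCII characters STD3 filtering denies but "forbidden domain code point" filtering lets through. The two sets are assumptions transcribed by hand from the URL Standard's definitions and from the "letters, digits, hyphen, full stop" reading of STD3, so they should be checked against the current specs, and the number of characters the sketch yields may not match the manual count above. All names are made up for illustration.

```python
import string

# Every ASCII code point, U+0000 through U+007F.
ASCII = frozenset(chr(c) for c in range(0x80))

# ASSUMPTION: STD3 filtering denies all ASCII except letters, digits,
# hyphen, and full stop.
LDH_AND_DOT = frozenset(string.ascii_letters + string.digits + "-.")
STD3_DENIED = ASCII - LDH_AND_DOT

# ASSUMPTION: hand-transcribed from the URL Standard's "forbidden host
# code point" and "forbidden domain code point" definitions.
FORBIDDEN_HOST = frozenset("\x00\t\n\r #/:<>?@[\\]^|")
C0_CONTROL = frozenset(chr(c) for c in range(0x20))
FORBIDDEN_DOMAIN = FORBIDDEN_HOST | C0_CONTROL | frozenset("%\x7f")


def std3_only_denied():
    """ASCII characters denied by STD3 but not by the WHATWG list, sorted."""
    return sorted(STD3_DENIED - FORBIDDEN_DOMAIN)
```

Calling `std3_only_denied()` returns the characters in code point order; under these assumptions the underscore is among them and the percent sign is not.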

@annevk
Member

annevk commented Feb 2, 2024

This relates to #397. Moving the check seems like an obvious improvement we can make right away. The remainder is quite a bit harder to do.

@hsivonen added the "topic: idna" and "editorial" (Changes that do not affect how the standard is understood) labels on Feb 5, 2024
@hsivonen
Member Author

hsivonen commented Mar 1, 2024

After looking into this more, I think the right abstraction would be for UTS 46 to take an ASCII deny list instead of taking a boolean flag for STD3 rules.

What the deny list can modify should probably be constrained so that denying ASCII letters, digits, hyphen or full-stop would not be allowed. I think it would simplify data quite a bit if the caller of UTS 46 was not permitted to allow the ASCII space. (I am not aware of use cases for permitting ASCII space in domain name-like things, and the characteristics of the output get weird if space is allowed.)

But whether the rest of ASCII is allowed or denied could be customizable by the caller of UTS 46, and I think acting on that deny list should belong in the UTS 46 algorithms and not in the algorithms in URL.
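
As an illustration only, the constraints described above could be enforced when a caller constructs a deny list. This is a minimal Python sketch, not anything from UTS 46 or the URL Standard; `make_ascii_deny_list` and `MUST_ALLOW` are hypothetical names.

```python
import string

# Characters the caller may never deny: ASCII letters, digits, hyphen,
# and full stop.
MUST_ALLOW = frozenset(string.ascii_letters + string.digits + "-.")


def make_ascii_deny_list(denied: str) -> frozenset:
    """Validate a caller-supplied string of denied characters (hypothetical API)."""
    deny = frozenset(denied)
    if any(ord(ch) > 0x7F for ch in deny):
        raise ValueError("deny list may only contain ASCII characters")
    if deny & MUST_ALLOW:
        raise ValueError("letters, digits, hyphen, and full stop cannot be denied")
    if " " not in deny:
        # The caller is not permitted to allow the ASCII space.
        raise ValueError("the ASCII space must always be denied")
    return deny
```

For example, `make_ascii_deny_list(" %")` is accepted, while `make_ascii_deny_list("a ")` and `make_ascii_deny_list("%")` both raise `ValueError`.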

So far, I'm not aware of more than two relevant configurations: the STD3 list (deny everything that I didn't list as must-allow above) and the WHATWG list ("forbidden domain code point"). So far, in the code I'm writing, I'm supporting only these two options.

I'm thinking of sending UTS 46 feedback to this effect. @annevk, what do you think?

@annevk
Member

annevk commented Mar 1, 2024

Any kind of restructuring that's editorial but can lead to more efficient implementations seems fair game and I'm supportive of that being pursued.

@hsivonen
Member Author

hsivonen commented Apr 4, 2024

For reference, I sent this feedback:

When implementing UTS 46, the most time-consuming wrong path was trying to design data structures for UTS 46 data assuming that the data needs to have distinct data entries for disallowed_STD3_valid and disallowed_STD3_mapped before discovering that these can be handled as valid and mapped with an ASCII deny list applied afterwards.

I suggest refactoring the spec so that:

  1. disallowed_STD3_valid and disallowed_STD3_mapped become simply valid and mapped in the data and the spec says when to apply an ASCII deny list
  2. instead of a boolean UseSTD3ASCIIRules the algorithm would take an ASCII deny list.

UTS 46 itself could define an STD3 ASCII deny list and the WHATWG URL Standard could use forbidden domain code point https://url.spec.whatwg.org/#forbidden-domain-code-point as an ASCII deny list parameter to UTS 46.

It would probably be appropriate to make informative remarks that a) putting ASCII letters, digits, or hyphen on the deny list would break things and b) in the validation phase, the ASCII period can be put on the deny list to handle that validity constraint as part of the ASCII deny list check.
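
Remark (b) might look roughly like the following Python sketch, in which the per-label validation check simply extends whatever deny list the caller passed with the ASCII period. The function name and the tiny example deny list are hypothetical and only meant to show the shape of the idea.

```python
# Hypothetical caller-supplied deny list, kept tiny for illustration only.
EXAMPLE_DENY_LIST = frozenset(" %")


def label_passes_deny_check(label: str, deny_list: frozenset) -> bool:
    """Check a single label against the deny list during validation."""
    # A period is not permitted inside a label, so adding it to the deny
    # list here folds that validity constraint into the same ASCII check.
    validation_deny_list = deny_list | {"."}
    return not any(ch in validation_deny_list for ch in label)
```

With this example list, `label_passes_deny_check("a_b", EXAMPLE_DENY_LIST)` is true, while labels such as `"a.b"` or `"a b"` are rejected.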

@annevk
Member

annevk commented Nov 29, 2024

I've created a PR to make this part of domain to ASCII and add the note. Perhaps in a future UTS46 revision it can be refactored further.
