CLDR-18129 Investigate and fix (where necessary) invalid codes #4215

macchiati · 2024-11-27T18:32:58Z

common/dtd/ldml.dtd

for @match language@type, use nlocale (locales for names)

common/dtd/ldmlSupplemental.dtd

all @match changes:
languageAlias@type, use bcp47-wellformed (locales for names), and list allowed ill-formed values
likelySubtag@from&to, use xlocale (which allows for compatibility cases like "iw")
pluralRules@locales, use xlocale but also allow "sh"
use validity/…/all for a few items (now that the default doesn't allow deprecated)

common/main/en.xml
common/main/fi.xml
common/main/la.xml
common/main/nl.xml
exemplars/main/rna.xml
common/supplemental/supplementalData.xml
common/supplemental/supplementalMetadata.xml
common/validity/language.xml

Tighten down the test to avoid deprecated languages, removing and/or changing some items.
Add XLocaleMatchValue, NLocaleMatchValue to allow for special cases

tools/cldr-code/src/main/java/org/unicode/cldr/util/MatchValue.java

tighten up validity/locale, validity/language, ... to disallow deprecated in the default case
introduced validity/locale/all, validity/language/all, ... for the special cases where we want to allow deprecated
introduced validity/xlocale with exceptions for certain deprecated codes, for plural rules and likelySubtags
introduced validity/nlocale with exceptions for certain deprecated codes, for language names
allow for setting the status for each locale field separately
provide for subclassing to extend the language check
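
The difference between the default case and the new /all case can be pictured with a minimal sketch. This is not the code in MatchValue.java: the class name `ValiditySketch`, the `Status` enum, and the sample data are all assumptions made up for illustration.

```java
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

public class ValiditySketch {
    /** Hypothetical status values; the real ones are defined by CLDR's validity data. */
    public enum Status { regular, special, deprecated }

    /** Illustrative sample data only, not read from common/validity/language.xml. */
    public static final Map<String, Status> LANGUAGE_STATUS = Map.of(
            "en", Status.regular,
            "he", Status.regular,
            "iw", Status.deprecated,
            "und", Status.special);

    /** Default case: every status except deprecated. */
    public static final Set<Status> DEFAULT =
            EnumSet.complementOf(EnumSet.of(Status.deprecated));

    /** The "/all" case: every status, including deprecated. */
    public static final Set<Status> ALL = EnumSet.allOf(Status.class);

    /** A code is valid if it is known and its status is in the allowed set. */
    public static boolean isValid(String code, Set<Status> allowed) {
        Status status = LANGUAGE_STATUS.get(code);
        return status != null && allowed.contains(status);
    }

    public static void main(String[] args) {
        System.out.println("iw, default: " + isValid("iw", DEFAULT));
        System.out.println("iw, all:     " + isValid("iw", ALL));
    }
}
```

In this sketch, a deprecated code such as "iw" fails the default check and passes only when the caller explicitly asks for all statuses, which mirrors the intent of making the strict behavior the default.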

tools/cldr-code/src/main/java/org/unicode/cldr/util/SupplementalDataInfo.java
tools/cldr-code/src/main/java/org/unicode/cldr/util/Validity.java

Allow for testing validity against past versions

tools/cldr-code/src/main/java/org/unicode/cldr/util/UnitConverter.java

clean up some code that didn't allow for old versions

tools/cldr-code/src/test/java/org/unicode/cldr/unittest/TestAttributeValues.java

Enhance testing to print (warnln) details of which items caused failures.

tools/cldr-code/src/test/java/org/unicode/cldr/unittest/TestBasic.java

Add compatibility test for bcp47 to ensure that identifiers never disappear
This PR completes the ticket.
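
The idea behind the bcp47 compatibility test (identifiers never disappear between versions) can be sketched as a set difference. The class `IdentifierStabilitySketch`, the method `missingFrom`, and the sample identifiers are hypothetical; the actual test in TestBasic.java loads its data from CLDR and may be structured differently.

```java
import java.util.Set;
import java.util.TreeSet;

public class IdentifierStabilitySketch {
    /** Returns the identifiers present in the older set but absent from the newer one. */
    public static Set<String> missingFrom(Set<String> older, Set<String> newer) {
        Set<String> missing = new TreeSet<>(older);
        missing.removeAll(newer);
        return missing;
    }

    public static void main(String[] args) {
        // Hypothetical sample data; a real test would load both sets from CLDR releases.
        Set<String> older = Set.of("ca-gregory", "ca-islamic", "nu-latn");
        Set<String> newer = Set.of("ca-gregory", "nu-latn", "nu-arab");

        Set<String> missing = missingFrom(older, newer);
        if (!missing.isEmpty()) {
            // A unit test would report a failure here instead of printing.
            System.out.println("Identifiers removed since the older version: " + missing);
        }
    }
}
```

Additions in the newer set are fine under this check; only removals are flagged.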

ALLOW_MANY_COMMITS=true

conradarcturus

Great! Thanks for finding invalid codes and improving the checker.

I left some comments that I think will improve CLDR readability for external developers/users. Happy to go back and forth on these ideas if you like.

srl295 · 2024-11-29T20:48:40Z

java code LGTM (with q's though)

srl295 · 2024-11-29T22:22:18Z

common/main/en.xml

 			<language type="cr">Cree</language>
+			<language type="cr" alt="long">Woods Cree</language>


I don't think this is a great situation. Woods Cree isn't a "longer" form of Cree.

Content (whether a document, an audio recording, etc.) isn't going to be just in Cree, it's going to be in Woods Cree, or Plains Cree, or Swampy Cree, etc. Perhaps the macrolanguage isn't helpful here and should be removed. (Pronunciation and orthography are distinct among these three, in both Latn and Cans scripts.)

For now, I kind of think the best and most consistent change would be <language type="cr">Woods Cree</language>

Suggested change

-			<language type="cr">Cree</language>
-			<language type="cr" alt="long">Woods Cree</language>
+			<language type="cr">Woods Cree</language>

That would overturn our long-standing policy for dealing with the compatibility mess that macrolanguages created.*

We don't say:
Tosk Albanian, but rather Albanian
Standard Arabic, but rather Arabic
Standard Estonian, but rather Estonian
Eastern Yiddish, but rather Yiddish
Northern Uzbek, but rather Uzbek
Iranian Persian, but rather Persian (or Farsi)
and so on.

That is, we use the code and name of the macrolanguage for the predominant encompassed code. The one exception off the top of my head is for "zh", where we have:

<language type="zh">Chinese</language>
<language type="zh" alt="long">Mandarin Chinese</language>

So that's why it was "consistent" in following that model for Cree.

* If the macrolanguage policy had been consistently applied by ISO, they'd have made German (de) a macrolanguage and given a new code to Standard German (e.g., dex?) as an encompassed language (and made the 30ish other Germans, like Colonia Tovar German, Swiss German, Low German, and so on, encompassed languages as well). And the same for many other languages (French, Hindi, ...).

Would there be some option to have CLDR decide that cr isn't (or isn't considered) a macrolanguage, and just allow cwd to stand on its own?

I hear you about consistency but it does not work with the identity of the languages encompassed under cr.

The difference with German is that people do refer to de as Deutsch whereas even the name of the language differs between cwd/crk/csw - nehenawewin, neheyawewin and nehethawewin (not respectively).

Those appear to be just the same word with slightly different pronunciations in different dialects, according to https://en.wikipedia.org/wiki/Cree_language#Dialect_criteria, which is exposed because written Cree is more phonetic than English.

It's as if we said that there were 3 different names for English: /ˈɪŋɡlɪʃ/ (UK), /ˈɪŋlɪʃ/ (US), /ˈɪŋɡləʃ/ (AU).

(They also appear to be really different dialects, not different languages, at least according to https://en.wikipedia.org/wiki/East_Cree)

We should take this up further in our next meeting: I filed https://unicode-org.atlassian.net/browse/CLDR-18142 about this.

In the meantime, I'll put out a call for taking the ticket for 47. We can always reverse the change for cr should TC decide it.

I'm fine to land and then discuss. And I know this is long standing policy and not new.

Right but one is not more general or more representative than the other. There are vocabulary differences as well.

Generality, representativity, vocabulary differences, are not relevant to choosing the predominant encompassed language.

As I said, let's discuss in the meeting.

conradarcturus

Accepting to unblock. Any comments are stylistic or part of a bigger conversation that should not block this particular change.

conradarcturus · 2024-12-02T15:45:40Z

common/main/en.xml

 			<language type="cr">Cree</language>
+			<language type="cr" alt="long">Woods Cree</language>


As you mentioned, this change isn't about Macrolanguages -- it's about validity. So I'll accept to unblock. We probably don't have enough time to properly discuss the issue before the new year anyway, so let's not get ahead of ourselves.

We should follow up in a separate ticket -- but decide the timeline for this.

conradarcturus · 2024-12-02T15:47:04Z

tools/cldr-code/src/main/java/org/unicode/cldr/util/MatchValue.java

+     * Check for the language OR certain backwards-compatible exceptions for data to support
+     * retaining variants, namely likelySubtags: "in","iw","ji","jw","mo","tl"
+     */
+    public static class XLocaleMatchValue extends LocaleMatchValue {


personal style nit (you can choose your own style): have the explanation in the name

Suggested change

-    public static class XLocaleMatchValue extends LocaleMatchValue {
+    public static class LocaleOrDeprecatedLocaleMatchValue extends LocaleMatchValue {

conradarcturus · 2024-12-02T15:49:01Z

tools/cldr-code/src/main/java/org/unicode/cldr/util/MatchValue.java

+
+        @Override
+        public String getName() {
+            return "validity/locale-for-likely";


personal style nit (you can choose your own style): don't label "why", rather label "what". Why is good in comments where it is used, but this function doesn't itself matter for likely subtags -- and it COULD matter for other usages.

Suggested change

-            return "validity/locale-for-likely";
+            return "validity/locale-or-deprecated-locale";

The problem is that or-deprecated is not true. It is only certain deprecated codes. Spelling out exactly what the criteria are is best left for documentation, since it would be quite long.
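
The naming question in this thread may be easier to weigh next to a sketch of the pattern being named: a matcher that accepts everything its parent accepts plus a short, explicit list of exceptions. The classes below are hypothetical stand-ins, not the LocaleMatchValue hierarchy in MatchValue.java; only the six exception codes and the name string are taken from the diff above, and the base data is invented for illustration.

```java
import java.util.Set;

public class ExceptionMatcherSketch {
    /** Hypothetical base matcher: accepts only codes from a fixed, non-deprecated set. */
    public static class BaseMatcher {
        // Illustrative data only.
        private static final Set<String> NON_DEPRECATED = Set.of("en", "he", "id", "yi");

        public String getName() {
            return "validity/locale";
        }

        public boolean is(String code) {
            return NON_DEPRECATED.contains(code);
        }
    }

    /** Accepts whatever the base accepts, plus the specific codes listed in the diff comment. */
    public static class WithExceptionsMatcher extends BaseMatcher {
        private static final Set<String> EXCEPTIONS = Set.of("in", "iw", "ji", "jw", "mo", "tl");

        @Override
        public String getName() {
            return "validity/locale-for-likely";
        }

        @Override
        public boolean is(String code) {
            return EXCEPTIONS.contains(code) || super.is(code);
        }
    }

    public static void main(String[] args) {
        BaseMatcher strict = new BaseMatcher();
        BaseMatcher lenient = new WithExceptionsMatcher();
        for (String code : new String[] {"en", "iw", "sh"}) {
            System.out.println(code + ": " + strict.getName() + "=" + strict.is(code)
                    + ", " + lenient.getName() + "=" + lenient.is(code));
        }
    }
}
```

In this sketch "sh" is in neither set, so it is rejected by both matchers. That is the point of the reply above: the extended matcher admits certain listed codes, not deprecated codes in general.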

macchiati · 2024-12-02T17:34:13Z

I have a separate issue for Cree (https://unicode-org.atlassian.net/browse/CLDR-18142), and a closely related one, https://unicode-org.atlassian.net/browse/CLDR-18144 (Improve formats of locales in menus).

macchiati · 2024-12-02T17:42:43Z

I also looked at updating the @match documentation. I see a couple of different paths — thoughts welcome:

https://unicode-org.atlassian.net/browse/CLDR-18146

CLDR-18129 Investigate and fix (where necessary) invalid codes

c83c899

macchiati requested a review from pedberg-icu November 27, 2024 18:33

github-actions bot assigned macchiati Nov 27, 2024

macchiati requested review from srl295 and btangmu November 27, 2024 18:33

macchiati added 3 commits November 28, 2024 06:13

CLDR-18129 Clean up validity tests, adding new @match option validity/…/all to get all Status values that are available for that LstrType

e4c3588

CLDR-18129 Hack around SupplementalDataInfo.getInstance() on old versions (keyboard, etc)

8f6da1d

CLDR-18129 Fix outliers

a1e078e

macchiati marked this pull request as ready for review November 28, 2024 22:07

CLDR-18129 Delete rna.xml in exemplars directory (the language code is deprecated, with no replacement)

573fb1d

conradarcturus requested changes Nov 29, 2024

common/main/en.xml
common/main/la.xml
common/dtd/ldml.dtd (outdated)
tools/cldr-code/src/main/java/org/unicode/cldr/util/MatchValue.java

srl295 reviewed Nov 29, 2024

common/main/fi.xml
common/main/en.xml
common/main/en.xml
common/supplemental/supplementalData.xml
exemplars/main/rna.xml

CLDR-18129 Give /[xn]locale/ better names

fca637b

macchiati requested review from srl295 and conradarcturus November 29, 2024 22:15

srl295 reviewed Nov 29, 2024

conradarcturus approved these changes Dec 2, 2024

macchiati merged commit db1ba18 into unicode-org:main Dec 2, 2024
12 checks passed

macchiati deleted the CLDR-18129-Code-to-investigate-identifier-validity branch December 2, 2024 17:32
