Fix fallback with language-likely script but region-unlikely script#7857

Open
sffc wants to merge 2 commits into unicode-org:main from sffc:fallback-alg-improvement

@sffc
Member

@sffc sffc commented Apr 8, 2026

Discovered when working on #3287

Changelog

icu_locale: fix fallback with language-likely script but region-unlikely script, such as sr-Cyrl-ME and zh-Hans-HK

@sffc sffc requested review from dminor and zbraniecki as code owners April 8, 2026 19:17
@sffc sffc requested review from Manishearth and robertbastian and removed request for dminor and zbraniecki April 8, 2026 19:17
@sffc sffc force-pushed the fallback-alg-improvement branch from 9a3657e to b997428 Compare April 8, 2026 19:19
@sffc
Member Author

sffc commented Apr 8, 2026

I think there's still more that can be done to prevent loading the same likely subtags data twice and/or prevent loading likely subtags data when it isn't needed, but let's fix the algorithm first.

@sffc sffc requested a review from a team as a code owner April 8, 2026 20:30
datetime/names/month/buddhist/v1, sr-ME/3, -> sr-Latn-XK/3
datetime/names/month/buddhist/v1, sr-ME/3s, -> sr-Latn-XK/3
datetime/names/month/buddhist/v1, sr-XK/3, 132B, 102B, 28cb8a675f91e27b
datetime/names/month/buddhist/v1, sr-XK/3s, -> sr-XK/3
Member

observation: changing the fallback algorithm might lead to unexpected results with data that was deduplicated under the previous algorithm, as well as with old binaries that are given data that was deduplicated under the new algorithm

Member Author

I don't think so. We're adding new locales but I don't think we are removing old ones, so fallback should either hit the same thing as before or hit a better thing.

Member

@robertbastian robertbastian Apr 10, 2026

The language fallback chains changed as follows:

  • sr-Cyrl-ME: from [sr-Cyrl-ME, sr-Cyrl] to [sr-Cyrl-ME, sr]
  • zh-Hans-TW: from [zh-Hans-TW, zh-Hans] to [zh-Hans-TW, zh]

This means that previously, if data was in sr-Cyrl-ME and it matched the data in sr-Cyrl, it would be removed, and we'd fall back to sr-Cyrl data at runtime. But now

  • Old data new code: sr-Cyrl-ME will fall back to sr instead at runtime, even though the data was deduplicated against sr-Cyrl
  • New data old code: sr-Cyrl-ME will fall back to sr-Cyrl at runtime, even though the data was deduplicated against sr

Member

In fact, because sr-Cyrl and zh-Hans shouldn't contain any data (they're the default scripts), the old logic wouldn't have done any deduplication. But the new logic does do deduplication, so new data old code will break because the fallback will never reach sr/zh where the data is now.
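
The "new data, old code" case described above can be sketched with a toy model. This is not ICU4X code: `Store`, `dedup`, `lookup`, `sample_source`, and the payload strings are hypothetical names introduced for illustration, and the two chains are simply the ones quoted in this thread (assumed to stop where the thread stops). It assumes datagen drops a locale's payload when it equals the first payload found further along the chain.

```rust
use std::collections::BTreeMap;

// Toy data store: locale -> payload. Hypothetical, not the ICU4X provider API.
type Store = BTreeMap<&'static str, &'static str>;

// Chains as quoted in this thread (assumption: nothing follows the last entry).
const OLD_CHAIN: &[&str] = &["sr-Cyrl-ME", "sr-Cyrl"];
const NEW_CHAIN: &[&str] = &["sr-Cyrl-ME", "sr"];

/// Remove the chain's first locale if its payload equals the payload
/// it would inherit from the rest of the chain.
fn dedup(store: &Store, chain: &[&'static str]) -> Store {
    let mut out = store.clone();
    let (locale, parents) = (chain[0], &chain[1..]);
    // First parent in the chain that actually has a payload.
    let inherited = parents.iter().find_map(|p| store.get(p));
    if inherited.is_some() && inherited == store.get(locale) {
        out.remove(locale);
    }
    out
}

/// Runtime lookup: return the first payload found while walking the chain.
fn lookup(store: &Store, chain: &[&'static str]) -> Option<&'static str> {
    chain.iter().find_map(|l| store.get(l).copied())
}

/// Source data in which sr-Cyrl is empty (default script) and
/// sr-Cyrl-ME happens to carry the same payload as sr.
fn sample_source() -> Store {
    let mut source = Store::new();
    source.insert("sr", "payload-A");
    source.insert("sr-Cyrl-ME", "payload-A");
    source
}

fn main() {
    let source = sample_source();

    // Old chain: sr-Cyrl has no payload, so nothing is deduplicated.
    let old_data = dedup(&source, OLD_CHAIN);
    println!("old data entries: {}", old_data.len()); // 2

    // New chain: sr-Cyrl-ME matches sr and is removed.
    let new_data = dedup(&source, NEW_CHAIN);
    println!("new data entries: {}", new_data.len()); // 1

    // New data + new code reaches sr; new data + old code never does.
    println!("{:?}", lookup(&new_data, NEW_CHAIN)); // Some("payload-A")
    println!("{:?}", lookup(&new_data, OLD_CHAIN)); // None
}
```

In this model the old data is unaffected by new code, since the sr-Cyrl-ME entry is still present and is found first; only the combination of newly deduplicated data with the old chain loses the payload.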

- datetime/patterns/date/buddhist/v1, <lookup>, 6444B, 1300 identifiers
- datetime/patterns/date/buddhist/v1, <total>, 67336B, 47978B, 650 unique payloads
+ datetime/patterns/date/buddhist/v1, <lookup>, 6514B, 1313 identifiers
+ datetime/patterns/date/buddhist/v1, <total>, 67911B, 48345B, 657 unique payloads
Member

why are there more unique payloads than before?

Member Author

@sffc sffc Apr 9, 2026

Before we were dropping locales like zh-Hans-HK and sr-Cyrl-ME, and now we include them.

https://github.com/unicode-org/cldr/blob/main/common/main/sr_Cyrl_ME.xml

Member

We don't "drop" locales in datagen, we deduplicate against parents. I still fail to understand how a change to the fallback algorithm can increase the number of unique data structs.

Member Author

@sffc sffc left a comment

The new locales appear to be:

  • zh-Hans-HK
  • zh-Hans-MO
  • sr-Cyrl-ME
  • ku-Latn-IQ
  • yue-Hant-CN

@robertbastian
Member

Please add a comment somewhere, either this PR or the issue, what the behaviour change here actually is, not just which locales are affected.
