Implement core|extended|outlier for display names #7831

@sffc

Split from #3260

We should implement the three-tiered approach for display name slicing.

Rough bucket definitions:

  • Core: Regions with a direct linguistic or official tie to the locale. (e.g., CA is Core for both en and fr).
  • Extended: Modern, active geopolitical entities that do not share a primary linguistic tie with the locale.
  • Outlier: Legacy codes (like the USSR), transitional regions, or highly specialized administrative areas not used in standard modern address/display logic.

Important: I am told by @macchiati that CLDR already has this data in the form of coverage levels. Not every region has the same coverage requirement for every locale.

Example: The table below organizes regions into Core (primary language/association), Extended (modern standard), and Outlier (legacy/provisional) buckets based on the specific locale. This table is for illustration purposes only! In practice we probably want ES and FR to be translated in both es and fr, for example.

| Region (Code) | English (en) | Spanish (es) | French (fr) | Chinese (zh) |
| --- | --- | --- | --- | --- |
| United States (US) | Core | Extended | Extended | Extended |
| Spain (ES) | Extended | Core | Extended | Extended |
| France (FR) | Extended | Extended | Core | Extended |
| Canada (CA) | Core | Extended | Core | Extended |
| Soviet Union (SU) | Outlier | Outlier | Outlier | Outlier |
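
To make the table concrete, here is a rough Rust sketch of how a per-locale tier lookup could drive slicing. Everything in it is hypothetical: the `Tier` enum, `tier_for`, and `regions_in_slice` are illustration-only names, not ICU4X API, and the hard-coded `match` stands in for data that would really be derived from CLDR coverage levels. It also assumes that any region not listed falls into Extended.

```rust
/// Hypothetical tier of a region display name relative to one locale.
/// Variant order matters: the derived `Ord` gives Core < Extended < Outlier.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub enum Tier {
    Core,
    Extended,
    Outlier,
}

/// Illustrative lookup mirroring the example table above.
/// Assumption: real data would come from CLDR coverage levels, not a `match`.
pub fn tier_for(locale: &str, region: &str) -> Tier {
    match (locale, region) {
        // Legacy code: Outlier for every locale.
        (_, "SU") => Tier::Outlier,
        // Regions with a direct linguistic or official tie to the locale.
        ("en", "US") | ("en", "CA") => Tier::Core,
        ("es", "ES") => Tier::Core,
        ("fr", "FR") | ("fr", "CA") => Tier::Core,
        // Assumption of this sketch: everything else is Extended.
        _ => Tier::Extended,
    }
}

/// Regions whose names a slice capped at `max_tier` would ship for `locale`.
pub fn regions_in_slice<'a>(
    locale: &str,
    regions: &[&'a str],
    max_tier: Tier,
) -> Vec<&'a str> {
    regions
        .iter()
        .copied()
        .filter(|region| tier_for(locale, region) <= max_tier)
        .collect()
}
```

With this shape, a Core-only slice for `en` would keep US and CA, while capping at `Tier::Extended` would drop only SU.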

There was some skepticism about the need for Core vs Extended, for example by @robertbastian in #3260 (comment) and @hsivonen in #3260 (comment).

Some considerations:

  • We should definitely make this "minimal" slice the non-default (opt-in) behavior. For example: new_minimal.
  • CLDR has the Core|Extended distinction in the form of coverage levels. For example, it requires all EU regions to be translated into all EU languages. We can start by inheriting that.
  • If using the minimal slicing and a name is missing, we should load the name in the primary language (perhaps by likely subtags). It's already the case that some language pickers display languages in their native language, not the UI language. We can fiddle with datagen to make sure that this string is always present.
  • We could leave this for later. However, at least Extended|Outlier seems uncontroversial, so a lot of the work is already going to be done, and I would rather ship a Minimal solution to help us get more feedback on it. Display names are the largest single piece of data in ICU4C (the second is probably unit display names), and I really want ICU4X to tread new territory when it comes to optimizing them.
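
The fallback idea in the third bullet could look roughly like the sketch below. It is only an illustration under stated assumptions: `MinimalNames`, `primary_language`, and `region_name` are hypothetical names, not ICU4X API. The `primary_language` stub stands in for a likely-subtags lookup, and the sample data assumes datagen guarantees that each region's name in its primary language is present.

```rust
/// Hypothetical store for a "minimal" slice: (locale, region, display name).
pub struct MinimalNames {
    entries: Vec<(&'static str, &'static str, &'static str)>,
}

/// Hypothetical stand-in for a likely-subtags lookup (region -> primary language).
pub fn primary_language(region: &str) -> Option<&'static str> {
    match region {
        "US" => Some("en"),
        "ES" => Some("es"),
        "FR" => Some("fr"),
        _ => None,
    }
}

impl MinimalNames {
    /// Sample data: each locale carries only its Core regions.
    pub fn sample() -> Self {
        Self {
            entries: vec![
                ("en", "US", "United States"),
                ("en", "CA", "Canada"),
                ("es", "ES", "España"),
                ("fr", "FR", "France"),
                ("fr", "CA", "Canada"),
            ],
        }
    }

    /// Exact lookup of a region name in one locale.
    fn lookup(&self, locale: &str, region: &str) -> Option<&'static str> {
        self.entries
            .iter()
            .find(|(l, r, _)| *l == locale && *r == region)
            .map(|(_, _, name)| *name)
    }

    /// Try the UI locale first, then fall back to the region's primary language.
    pub fn region_name(&self, ui_locale: &str, region: &str) -> Option<&'static str> {
        self.lookup(ui_locale, region).or_else(|| {
            let native = primary_language(region)?;
            self.lookup(native, region)
        })
    }
}
```

In this sketch, asking for ES with an `en` UI locale yields the Spanish name, since the `en` minimal slice does not carry it. A region with no known primary language and no entry returns `None`.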

Labels: C-dnames (Component: Language/Region/... Display Names)
