Prepare mktables for Unicode 15.1 and 16.0 #23133

Open
wants to merge 8 commits into
base: blead

khwilliamson
Contributor

perldelta not needed until the actual releases are incorporated.

  • This set of changes does not require a perldelta entry.

if (defined (my $bmg = property_ref('Bidi_Mirroring_Glyph'))) {
    $bmg->set_to_output_map($EXTERNAL_MAP);
    $bmg->set_range_size_1(1);
}

property_ref('Numeric_Value')->set_to_output_map($OUTPUT_ADJUSTED);

# These two properties have no short names and the file names for them
# clash in DOS 8.3. Work around this by creating shorter file names that
Contributor

Where are we still limited by 8.3?

Contributor Author

On IRC the other day, I asked if we were still limited, and the answer was yes.

Contributor

For Unicode filenames yes, but for ASCII filenames we aren't, AFAIK.

Contributor Author

I'd prefer to leave this as-is, since it is trivial to do, just in case. And I have WIP which should get rid of them altogether.

Contributor

@jkeenan jkeenan left a comment

The commit message for aa6faba has 2 misspellings. infrastructue lacks the second r. In incoroporated the second o needs removal.

@jkeenan
Contributor

jkeenan commented Apr 1, 2025

This p.r. for Unicode mktables did not make it into the March 20 dev release. Does that mean we have to defer it to the 5.43 dev cycle?

@Leont
Contributor

Leont commented Apr 1, 2025

> This p.r. for Unicode mktables did not make it into the March 20 dev release. Does that mean we have to defer it to the 5.43 dev cycle?

The change isn't really user visible, it would only affect people who would want to patch in a more recent Unicode version.

Contributor

@jkeenan jkeenan left a comment

@khwilliamson there's one unresolved conversation in this p.r. If you mark that resolved, then I think this is okay to merge.

@khwilliamson
Contributor Author

There are more commits coming

@khwilliamson khwilliamson marked this pull request as draft April 3, 2025 00:27
Add comments, and rewrap comment lines to fit 80 columns

Unicode 15.1 introduces this new property, which needs the same special
handling as plain NFKC_Casefold does.

These files are changed in 15.1 to have @missing lines, whereas they
didn't before.  This leads to some warning messages, so turn off
looking at them, as we do for a number of other files.

We handle it by ignoring this file, new to Unicode 16.0.

It consists of lists of characters that, to put it less delicately than
Unicode would like, they regret creating.

But there are no rules associated with them.  It would be nice to have a
\p{DoNotEmit} property so that applications could handle situations
where this occurs.  But I fear that if we did something like this,
Unicode would later come up with something that had the same
intention but was subtly or unsubtly different.

That has happened before, to our detriment.

So I think we should wait to see what they actually do in future releases.
@khwilliamson khwilliamson marked this pull request as ready for review April 7, 2025 23:41
This includes several new properties, some of which are considered
"provisional" by Unicode, which means they can be heavily revised or
withdrawn.

These properties are designed for use by scholars of hieroglyphics.

These new properties are automatically handled, but there is a problem.
They have no short form names.  Files are written for them based on
their names, and those files are not distinguishable on a DOS 8.3 file
system.  The solution here is to manually override the automatically
generated file names with distinguishable ones.
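
The clash described above can be sketched in a few lines of Perl. This is only an illustration, not the code in mktables: the helper names `to_8_3_base` and `find_8_3_clashes`, the property names, and the override table are all hypothetical, and the truncation rule (case-insensitive, first 8 characters of the base name) is a simplifying assumption about how a DOS 8.3 file system sees a name.

```perl
use strict;
use warnings;

# Hypothetical helper: reduce a base name to at most 8 characters, roughly
# as a DOS 8.3 file system would see it (case-insensitive, 8-char base).
sub to_8_3_base {
    my ($name) = @_;
    return lc substr($name, 0, 8);
}

# Return the groups of names that would collide after truncation.
sub find_8_3_clashes {
    my (@names) = @_;
    my %by_base;
    push @{ $by_base{ to_8_3_base($_) } }, $_ for @names;
    return grep { @$_ > 1 } map { $by_base{$_} } sort keys %by_base;
}

# Hypothetical property names, not the ones from this pull request.
my @names = qw(Example_Long_Property_A Example_Long_Property_B Other_Prop);

# Manual overrides: map a clashing name to a shorter, distinct file name.
my %override = (
    Example_Long_Property_A => 'ExLongA',
    Example_Long_Property_B => 'ExLongB',
);

my @before = find_8_3_clashes(@names);
my @after  = find_8_3_clashes(map { $override{$_} // $_ } @names);

printf "clashing groups before overrides: %d, after: %d\n",
    scalar @before, scalar @after;
```

With these made-up names, the two long names share their first 8 characters and form one clashing group; after the overrides are applied, no group remains.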

mktables does a lot of sanity checks on the data it is fed.  One of
those is to make sure that every \d group of code points is 10 long.  This
verifies that Unicode has given us enough code points to form 0-9.  It
assumes that if Unicode got this much right, the numeric values are
also 0-9.  This check has uncovered issues with the Unicode Standard in
the past.

Nowadays, they've cleaned up their act, and it's been many releases
since there have been problems.  But our checks remain, and I think they
should.

What happens in Unicode 16.0 is that there is a range of \d characters that
contains two consecutive groups of 0-9 values.  The check could be
changed to verify that the count is divisible by 10, but checking for
this particular range is a bit safer.
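
A minimal sketch of the two options, assuming digit ranges are given as inclusive [start, end] pairs. It is not the mktables implementation: `check_digit_ranges` and `%known_double` are hypothetical names, and the private-use code points are placeholders rather than the actual Unicode 16.0 range.

```perl
use strict;
use warnings;

# Hypothetical exception list: ranges known to hold two consecutive 0-9
# groups.  The code points are placeholders in a private-use area.
my %known_double = ( 'F0000-F0013' => 1 );

# Report every range that is neither exactly 10 long nor a listed exception.
sub check_digit_ranges {
    my (@ranges) = @_;      # each element is [ $start, $end ], inclusive
    my @problems;
    for my $range (@ranges) {
        my ($start, $end) = @$range;
        my $length = $end - $start + 1;
        next if $length == 10;
        my $key = sprintf "%04X-%04X", $start, $end;
        next if $length == 20 && $known_double{$key};
        push @problems, "$key has $length code points";
    }
    return @problems;
}

my @problems = check_digit_ranges(
    [ 0x0030,  0x0039  ],   # ASCII digits: exactly 10
    [ 0xF0000, 0xF0013 ],   # placeholder 20-long range, on the exception list
    [ 0xF0100, 0xF010E ],   # placeholder 15-long range: reported
);
print "$_\n" for @problems;
```

The looser alternative would replace both `next` tests with `next if $length % 10 == 0;`. That also accepts the 20-long range, but it would silently accept any other multiple of 10 as well, which is why naming the one known range is the more conservative choice.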

There is already this method for lists of Ranges, so this is just so
callers don't need to know which they are operating on.
@khwilliamson
Contributor Author

This has been repushed, with the new hieroglyphic properties now working

4 participants