Commit 8f58648

mktables: Consolidate code into a single function
Some properties in Unicode essentially form equivalence classes for all possible code points. For example, Unicode publishes the Line Break (LB) property, where each possible code point is given a type, like Alphabetic or Opening Parenthesis. All code points that act as alphabetics have the AL equivalence class. All that act like opening parentheses have the OP class.

Unicode also publishes rules on whether a break is permissible between code points of any two types. For the Line Break property, you wouldn't break a line between two alphabetics or between an opening parenthesis and an alphabetic, but you could between a Space and almost any other type, or between a closing parenthesis and many types.

Perl uses these properties to implement \b{lb} and the similar regular expression constructs. It uses a two-dimensional array where the value in the cell [x,y] tells whether a break is permissible between characters of type x and characters of type y. (Some cases can't be decided by this simple lookup; knowing the surrounding context is necessary to make a decision. Those are implemented as DFAs in regexec.c.)

Unicode used to publish such an array for the Line Break property, and still publishes some non-normative .html files that contain similar information. But to really know what to do, one has to read documents UAX#14 and UAX#29, which contain textual descriptions of the rules. These change with each new release, and are the major pain in upgrading to a new release.

In recent releases, Unicode has mostly stopped creating new equivalence classes as it has refined the rules for the boundary conditions. For example, the line boundary conditions are very different for East Asian (EA) characters than for the Western scripts. Effectively there are thus two sets of rules. But instead of creating new equivalence classes that reflect this reality, Unicode has chosen to just document it in those two UAX documents. I don't know the motivation for this.

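
To make the two-dimensional lookup described above concrete, here is a minimal C sketch. It is not the table in charclass_invlists.inc: the enum names, the function lb_lookup(), and the LB_NEEDS_CONTEXT value are all hypothetical, only four classes are shown, and the cell values are toy data chosen to match the examples in the text rather than the UAX#14 rules.

```c
#include <assert.h>

/* Hypothetical subset of Line Break classes; the real property has many more. */
enum lb_class { LB_AL, LB_OP, LB_CP, LB_SP, LB_CLASS_COUNT };

/* What a cell can say about the pair (before, after). */
enum lb_action {
    LB_NO_BREAK,       /* never break between the two classes */
    LB_BREAK,          /* a break is permitted */
    LB_NEEDS_CONTEXT   /* the pair alone can't decide; the caller must look
                          at surrounding characters (assumed placeholder) */
};

/* Toy values for illustration only, NOT the UAX#14 data.
 * Row = class of the character before the position,
 * column = class of the character after it. */
static const unsigned char lb_table[LB_CLASS_COUNT][LB_CLASS_COUNT] = {
    /*           AL           OP                CP                SP          */
    /* AL */ { LB_NO_BREAK, LB_NEEDS_CONTEXT, LB_NO_BREAK,      LB_NO_BREAK },
    /* OP */ { LB_NO_BREAK, LB_NO_BREAK,      LB_NO_BREAK,      LB_NO_BREAK },
    /* CP */ { LB_BREAK,    LB_BREAK,         LB_NO_BREAK,      LB_NO_BREAK },
    /* SP */ { LB_BREAK,    LB_BREAK,         LB_NEEDS_CONTEXT, LB_NO_BREAK },
};

/* Return the table's verdict for a position between two classes. */
static enum lb_action lb_lookup(enum lb_class before, enum lb_class after)
{
    assert((int) before >= 0 && (int) before < LB_CLASS_COUNT);
    assert((int) after  >= 0 && (int) after  < LB_CLASS_COUNT);
    return (enum lb_action) lb_table[before][after];
}
```

Most positions are settled by a single array read; only a cell that says the pair is not enough would send the caller on to slower, context-aware code.
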
But Perl wants that table to divvy up all the possible boundary conditions, so it can continue to use the array to make most of the decisions. So mktables splits the equivalence classes that Unicode provides into new ones that reflect what the UAXes say.

At first, I thought this was a one-off matter, so I wrote a few lines to handle a special case; then when the next release came out, I added a few more for another one, and so on. But Unicode 15.1 and 16.0 continue the trend, so it's become an ongoing effort.

This commit consolidates the previous one-off code snippets into one generalized function. It should be able to handle future instances without having to craft something new each time. It also creates a new data structure that mk_invlists.pl can look at, so that it doesn't have to repeat the logic found here, as it currently does.
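
The idea of splitting a published class through one table-driven function can be sketched as follows. This is an illustration in C, not the Perl in mktables: the names split_rule, refine_class(), and toy_is_east_asian(), the choice of which classes to split, and the code point range used by the predicate are all assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Published classes plus two hypothetical refined ones. */
enum cls { CLS_AL, CLS_OP, CLS_CP, CLS_EA_OP, CLS_EA_CP };

/* One entry per split: members of 'published' for which 'in_subset'
 * is true are moved to 'refined'. A table like this is data that a
 * second tool could read instead of repeating the logic. */
struct split_rule {
    enum cls published;
    enum cls refined;
    bool (*in_subset)(unsigned long cp);
};

/* Toy stand-in for an East Asian width test: it only recognizes the
 * fullwidth forms range. Real code would consult the actual property. */
static bool toy_is_east_asian(unsigned long cp)
{
    return cp >= 0xFF01UL && cp <= 0xFF60UL;
}

/* Illustrative rules only; not a claim about which classes get split. */
static const struct split_rule rules[] = {
    { CLS_OP, CLS_EA_OP, toy_is_east_asian },
    { CLS_CP, CLS_EA_CP, toy_is_east_asian },
};

/* Generalized split: apply the first matching rule, else keep the
 * published class unchanged. */
static enum cls refine_class(enum cls published, unsigned long cp)
{
    for (size_t i = 0; i < sizeof rules / sizeof rules[0]; i++) {
        assert(rules[i].in_subset != NULL);
        if (rules[i].published == published && rules[i].in_subset(cp))
            return rules[i].refined;
    }
    return published;
}
```

Under these assumptions, handling a new split in a later release means adding a row to the rules table rather than writing another special case.
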
1 parent 5b52ed1 commit 8f58648

6 files changed: +328 -78 lines changed

charclass_invlists.inc

Lines changed: 1 addition & 1 deletion
@@ -436055,7 +436055,7 @@ static const U8 WB_table[23][23] = {
  * 3f4f32ed2a577344a508114527e721d7a8b633d32f38945d47fe0c743650c585 lib/unicore/extracted/DLineBreak.txt
  * 710abf2d581ac9c57f244c0834f9d9969d9781e0396adccd330eaae658ac7d6b lib/unicore/extracted/DNumType.txt
  * 6bd30f385f3baf3ab5d5308c111a81de87bea5f494ba0ba69e8ab45263b8c34d lib/unicore/extracted/DNumValues.txt
- * 4b2ad6e7689bea5acec1b52fa813a60fdac125a5cc6901cc02be3093b1697894 lib/unicore/mktables
+ * b8f82c95893d1c0b15485624d30c0ae867e8921aa556d2d95d7018aa7292b2c3 lib/unicore/mktables
  * 55d90fdc3f902e5c0b16b3378f9eaa36e970a1c09723c33de7d47d0370044012 lib/unicore/version
  * 0a6b5ab33bb1026531f816efe81aea1a8ffcd34a27cbea37dd6a70a63d73c844 regen/charset_translations.pl
  * c7ff8e0d207d3538c7feb4a1a152b159e5e902d20293b303569ea8323e84633e regen/mk_PL_charclass.pl
