Line breaking changes from UTC-181 (#1046)
* UTC-181-A44 In LineBreak.txt and derived files, change the Line_Break assignment of U+034F COMBINING GRAPHEME JOINER from Line_Break=GL (Glue) to Line_Break=CM (Combining_Mark). For Unicode Version 17.0. [Ref. Section 6.3 of document L2/24-224]

* Regenerate UCD

* UTC-181-A142 In UCD files LineBreakTest.txt and LineBreakTest.html, add realistic tests exercising the changes to the behaviour of rules LB20a and LB21. For Unicode Version 17.0. See L2/24-224 item 6.1.

* LB20a does not work in SP CM HY HL 😿

* Regenerate UCD

* UTC-181-A138 In UCD file PropertyValueAliases.txt, add a new Line_Break property value Unambiguous_Hyphen (short alias: HH). For Unicode Version 17.0. See L2/24-224 item 6.1.

* Regenerate UCD

* GenerateEnums

* UTC-181-A139 In UCD file LineBreak.txt and derived files, assign Line_Break=Unambiguous_Hyphen to the eleven characters that have General_Category=Pd and Line_Break=Break_After in Unicode Version 16.0. For Unicode Version 17.0. See L2/24-224 item 6.1.

* UTC-181-A141 In UCD files LineBreakTest.txt and LineBreakTest.html, update rules LB12a, LB20a, LB21, and LB21a as described in L2/24-224 item 6.1. For Unicode Version 17.0.

* Regenerate UCD
eggrobin authored Feb 17, 2025
1 parent ed8b6dc commit bca50a4
Showing 10 changed files with 2,035 additions and 1,880 deletions.
26 changes: 12 additions & 14 deletions unicodetools/data/ucd/dev/LineBreak.txt
@@ -1,5 +1,5 @@
# LineBreak-17.0.0.txt
-# Date: 2025-01-27, 18:09:16 GMT
+# Date: 2025-02-14, 15:13:07 GMT
# © 2025 Unicode®, Inc.
# Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the U.S. and other countries.
# For terms of use and license, see https://www.unicode.org/terms_of_use.html
@@ -157,9 +157,7 @@
02ED ; AL # Sk MODIFIER LETTER UNASPIRATED
02EE ; AL # Lm MODIFIER LETTER DOUBLE APOSTROPHE
02EF..02FF ; AL # Sk [17] MODIFIER LETTER LOW DOWN ARROWHEAD..MODIFIER LETTER LOW LEFT ARROW
-0300..034E ; CM # Mn [79] COMBINING GRAVE ACCENT..COMBINING UPWARDS ARROW BELOW
-034F ; GL # Mn COMBINING GRAPHEME JOINER
-0350..035B ; CM # Mn [12] COMBINING RIGHT ARROWHEAD ABOVE..COMBINING ZIGZAG ABOVE
+0300..035B ; CM # Mn [92] COMBINING GRAVE ACCENT..COMBINING ZIGZAG ABOVE
035C..0362 ; GL # Mn [7] COMBINING DOUBLE BREVE BELOW..COMBINING DOUBLE RIGHTWARDS ARROW BELOW
0363..036F ; CM # Mn [13] COMBINING LATIN SMALL LETTER A..COMBINING LATIN SMALL LETTER X
0370..0373 ; AL # L& [4] GREEK CAPITAL LETTER HETA..GREEK SMALL LETTER ARCHAIC SAMPI
@@ -190,11 +188,11 @@
055A..055F ; AL # Po [6] ARMENIAN APOSTROPHE..ARMENIAN ABBREVIATION MARK
0560..0588 ; AL # Ll [41] ARMENIAN SMALL LETTER TURNED AYB..ARMENIAN SMALL LETTER YI WITH STROKE
0589 ; IS # Po ARMENIAN FULL STOP
-058A ; BA # Pd ARMENIAN HYPHEN
+058A ; HH # Pd ARMENIAN HYPHEN
058D..058E ; AL # So [2] RIGHT-FACING ARMENIAN ETERNITY SIGN..LEFT-FACING ARMENIAN ETERNITY SIGN
058F ; PR # Sc ARMENIAN DRAM SIGN
0591..05BD ; CM # Mn [45] HEBREW ACCENT ETNAHTA..HEBREW POINT METEG
-05BE ; BA # Pd HEBREW PUNCTUATION MAQAF
+05BE ; HH # Pd HEBREW PUNCTUATION MAQAF
05BF ; CM # Mn HEBREW POINT RAFE
05C0 ; AL # Po HEBREW PUNCTUATION PASEQ
05C1..05C2 ; CM # Mn [2] HEBREW POINT SHIN DOT..HEBREW POINT SIN DOT
@@ -667,7 +665,7 @@
1390..1399 ; AL # So [10] ETHIOPIC TONAL MARK YIZET..ETHIOPIC TONAL MARK KURT
13A0..13F5 ; AL # Lu [86] CHEROKEE LETTER A..CHEROKEE LETTER MV
13F8..13FD ; AL # Ll [6] CHEROKEE SMALL LETTER YE..CHEROKEE SMALL LETTER MV
-1400 ; BA # Pd CANADIAN SYLLABICS HYPHEN
+1400 ; HH # Pd CANADIAN SYLLABICS HYPHEN
1401..166C ; AL # Lo [620] CANADIAN SYLLABICS E..CANADIAN SYLLABICS CARRIER TTSA
166D ; AL # So CANADIAN SYLLABICS CHI SIGN
166E ; AL # Po CANADIAN SYLLABICS FULL STOP
@@ -899,9 +897,9 @@
200C ; CM # Cf ZERO WIDTH NON-JOINER
200D ; ZWJ# Cf ZERO WIDTH JOINER
200E..200F ; CM # Cf [2] LEFT-TO-RIGHT MARK..RIGHT-TO-LEFT MARK
-2010 ; BA # Pd HYPHEN
+2010 ; HH # Pd HYPHEN
2011 ; GL # Pd NON-BREAKING HYPHEN
-2012..2013 ; BA # Pd [2] FIGURE DASH..EN DASH
+2012..2013 ; HH # Pd [2] FIGURE DASH..EN DASH
2014 ; B2 # Pd EM DASH
2015 ; AI # Pd HORIZONTAL BAR
2016 ; AI # Po DOUBLE VERTICAL LINE
@@ -1365,7 +1363,7 @@
2E0D ; QU # Pf RIGHT RAISED OMISSION BRACKET
2E0E..2E15 ; BA # Po [8] EDITORIAL CORONIS..UPWARDS ANCORA
2E16 ; AL # Po DOTTED RIGHT-POINTING ANGLE
-2E17 ; BA # Pd DOUBLE OBLIQUE HYPHEN
+2E17 ; HH # Pd DOUBLE OBLIQUE HYPHEN
2E18 ; OP # Po INVERTED INTERROBANG
2E19 ; BA # Po PALM BRANCH
2E1A ; AL # Pd HYPHEN WITH DIAERESIS
@@ -1393,7 +1391,7 @@
2E3A..2E3B ; B2 # Pd [2] TWO-EM DASH..THREE-EM DASH
2E3C..2E3E ; BA # Po [3] STENOGRAPHIC FULL STOP..WIGGLY VERTICAL LINE
2E3F ; AL # Po CAPITULUM
-2E40 ; BA # Pd DOUBLE HYPHEN
+2E40 ; HH # Pd DOUBLE HYPHEN
2E41 ; BA # Po REVERSED COMMA
2E42 ; OP # Ps DOUBLE LOW-REVERSED-9 QUOTATION MARK
2E43..2E4A ; BA # Po [8] DASH WITH LEFT UPTURN..DOTTED SOLIDUS
@@ -1412,7 +1410,7 @@
2E5A ; CP # Pe TOP HALF RIGHT PARENTHESIS
2E5B ; OP # Ps BOTTOM HALF LEFT PARENTHESIS
2E5C ; CP # Pe BOTTOM HALF RIGHT PARENTHESIS
-2E5D ; BA # Pd OBLIQUE HYPHEN
+2E5D ; HH # Pd OBLIQUE HYPHEN
2E80..2E99 ; ID # So [26] CJK RADICAL REPEAT..CJK RADICAL RAP
2E9B..2EF3 ; ID # So [89] CJK RADICAL CHOKE..CJK RADICAL C-SIMPLIFIED TURTLE
2F00..2FD5 ; ID # So [214] KANGXI RADICAL ONE..KANGXI RADICAL FLUTE
@@ -2812,14 +2810,14 @@ FFFD ; AI # So REPLACEMENT CHARACTER
10D4F ; AL # Lo GARAY SUKUN
10D50..10D65 ; AL # Lu [22] GARAY CAPITAL LETTER A..GARAY CAPITAL LETTER OLD NA
10D69..10D6D ; CM # Mn [5] GARAY VOWEL SIGN E..GARAY CONSONANT NASALIZATION MARK
-10D6E ; BA # Pd GARAY HYPHEN
+10D6E ; HH # Pd GARAY HYPHEN
10D6F ; AL # Lm GARAY REDUPLICATION MARK
10D70..10D85 ; AL # Ll [22] GARAY SMALL LETTER A..GARAY SMALL LETTER OLD NA
10D8E..10D8F ; AL # Sm [2] GARAY PLUS SIGN..GARAY MINUS SIGN
10E60..10E7E ; AL # No [31] RUMI DIGIT ONE..RUMI FRACTION TWO THIRDS
10E80..10EA9 ; AL # Lo [42] YEZIDI LETTER ELIF..YEZIDI LETTER ET
10EAB..10EAC ; CM # Mn [2] YEZIDI COMBINING HAMZA MARK..YEZIDI COMBINING MADDA MARK
-10EAD ; BA # Pd YEZIDI HYPHENATION MARK
+10EAD ; HH # Pd YEZIDI HYPHENATION MARK
10EB0..10EB1 ; AL # Lo [2] YEZIDI LETTER LAM WITH DOT ABOVE..YEZIDI LETTER YOT WITH CIRCUMFLEX ABOVE
10EC2..10EC4 ; AL # Lo [3] ARABIC LETTER DAL WITH TWO DOTS VERTICALLY BELOW..ARABIC LETTER KAF WITH TWO DOTS VERTICALLY BELOW
10EC5 ; AL # Lm ARABIC SMALL YEH BARREE WITH TWO DOTS BELOW
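
The LineBreak.txt hunks can be cross-checked against the commit message with a short script. This is a minimal sketch, not part of unicodetools: it assumes the new-side lines listed in this diff (whitespace normalised), and `parse_line` and `code_points_with` are hypothetical helper names. It counts the code points now assigned Line_Break=HH and confirms that U+034F falls inside the merged CM range.

```python
import re

# New-side lines copied from the LineBreak.txt hunks of this commit
# (whitespace normalised). This is an excerpt, not the whole file.
NEW_LINES = """\
0300..035B ; CM # Mn [92] COMBINING GRAVE ACCENT..COMBINING ZIGZAG ABOVE
058A ; HH # Pd ARMENIAN HYPHEN
05BE ; HH # Pd HEBREW PUNCTUATION MAQAF
1400 ; HH # Pd CANADIAN SYLLABICS HYPHEN
2010 ; HH # Pd HYPHEN
2011 ; GL # Pd NON-BREAKING HYPHEN
2012..2013 ; HH # Pd [2] FIGURE DASH..EN DASH
2E17 ; HH # Pd DOUBLE OBLIQUE HYPHEN
2E40 ; HH # Pd DOUBLE HYPHEN
2E5D ; HH # Pd OBLIQUE HYPHEN
10D6E ; HH # Pd GARAY HYPHEN
10EAD ; HH # Pd YEZIDI HYPHENATION MARK
"""

# A code point or a first..last range, then ';', then the property value.
LINE_RE = re.compile(r"^([0-9A-F]{4,6})(?:\.\.([0-9A-F]{4,6}))?\s*;\s*(\w+)")


def parse_line(line):
    """Return (first, last, value) for a data line, or None for other lines."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    first = int(match.group(1), 16)
    last = int(match.group(2), 16) if match.group(2) else first
    return first, last, match.group(3)


def code_points_with(value, text):
    """List every code point in `text` whose Line_Break value equals `value`."""
    result = []
    for line in text.splitlines():
        parsed = parse_line(line)
        if parsed is not None and parsed[2] == value:
            first, last, _ = parsed
            result.extend(range(first, last + 1))
    return result


hh = code_points_with("HH", NEW_LINES)
cm = code_points_with("CM", NEW_LINES)

print(len(hh))       # 11, matching "the eleven characters" in UTC-181-A139
print(len(cm))       # 92, matching the [92] count on the merged range
print(0x034F in cm)  # True: COMBINING GRAPHEME JOINER is now covered by CM
```

The same parsing approach should carry over to the full LineBreak.txt, provided comment-only and blank lines are skipped as they are here.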
3 changes: 2 additions & 1 deletion unicodetools/data/ucd/dev/PropertyValueAliases.txt
@@ -1,5 +1,5 @@
# PropertyValueAliases-17.0.0.txt
-# Date: 2025-01-27, 18:09:29 GMT
+# Date: 2025-02-14, 15:50:28 GMT
# © 2025 Unicode®, Inc.
# Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the U.S. and other countries.
# For terms of use and license, see https://www.unicode.org/terms_of_use.html
@@ -1142,6 +1142,7 @@ lb ; EX ; Exclamation
lb ; GL ; Glue
lb ; H2 ; H2
lb ; H3 ; H3
+lb ; HH ; Unambiguous_Hyphen
lb ; HL ; Hebrew_Letter
lb ; HY ; Hyphen
lb ; ID ; Ideographic
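
For consumers of PropertyValueAliases.txt, the new entry reads like any other `lb` alias line. The following is a minimal sketch under the assumption that each line has the form `property ; short alias ; long name`, as in the hunk above; `parse_aliases` is a hypothetical helper, not a unicodetools function.

```python
# Lines copied from the PropertyValueAliases.txt hunk of this commit
# (whitespace normalised), including the newly added HH entry.
ALIAS_LINES = """\
lb ; GL ; Glue
lb ; H2 ; H2
lb ; H3 ; H3
lb ; HH ; Unambiguous_Hyphen
lb ; HL ; Hebrew_Letter
lb ; HY ; Hyphen
lb ; ID ; Ideographic
"""


def parse_aliases(text, prop="lb"):
    """Map short alias -> long name for the given property."""
    aliases = {}
    for line in text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        if len(fields) >= 3 and fields[0] == prop:
            aliases[fields[1]] = fields[2]
    return aliases


lb_aliases = parse_aliases(ALIAS_LINES)
print(lb_aliases["HH"])  # Unambiguous_Hyphen
```

In this excerpt the new line sits between H3 and HL, so the short aliases stay in alphabetical order.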
505 changes: 316 additions & 189 deletions unicodetools/data/ucd/dev/auxiliary/LineBreakTest.html

Large diffs are not rendered by default.
