[lex.charset] Introduce parent subclause [lex.char] for character sets and UCNs (#7067)

AlisdairM · web-flow · commit 6338d95ae620 · 2024-10-16T12:47:02.000Z
The grammar for universal-character-name is oddly sandwiched into the
middle of the subclause talking about the different character sets used
by the standard.  To improve the flow, extract that grammar into its own
subclause.

In the extraction, I make three other clarifying changes.  First, describe
this new subclause as 'a way to name any element of the translation
character set using just the basic character set' rather than simply
'a way to name other characters'. Then, merge the sentence on where universal
characters are prohibited into the new intro sentence describing universal
characters, to make clear that there is no contradiction between nominating
a character and how that character can be used. Finally, remove the 'one of'
in the grammar where there is only one option to choose.
diff --git a/source/lex.tex b/source/lex.tex
@@ -249,7 +249,9 @@
 \indextext{translation!phases|)}
 \end{enumerate}
 
-\rSec1[lex.charset]{Character sets}
+\rSec1[lex.char]{Characters}%
+
+\rSec2[lex.charset]{Character sets}
 
 \pnum
 \indextext{character set|(}%
@@ -326,11 +328,69 @@
 \end{floattable}
 
 \pnum
-The \grammarterm{universal-character-name} construct provides a way to name
-other characters.
+The \defnadj{basic literal}{character set} consists of
+all characters of the basic character set,
+plus the control characters specified in \tref{lex.charset.literal}.
+
+\begin{floattable}{Additional control characters in the basic literal character set}{lex.charset.literal}{ll}
+\topline
+\ohdrx{2}{character} \\ \capsep
+\ucode{0000} & \uname{null} \\
+\ucode{0007} & \uname{alert} \\
+\ucode{0008} & \uname{backspace} \\
+\ucode{000d} & \uname{carriage return} \\
+\end{floattable}
+
+\pnum
+A \defn{code unit} is an integer value
+of character type\iref{basic.fundamental}.
+Characters in a \grammarterm{character-literal}
+other than a multicharacter or non-encodable character literal or
+in a \grammarterm{string-literal} are encoded as
+a sequence of one or more code units, as determined
+by the \grammarterm{encoding-prefix}\iref{lex.ccon,lex.string};
+this is termed the respective \defnadj{literal}{encoding}.
+The \defnadj{ordinary literal}{encoding} is
+the encoding applied to an ordinary character or string literal.
+The \defnadj{wide literal}{encoding} is the encoding applied
+to a wide character or string literal.
+
+\pnum
+A literal encoding or a locale-specific encoding of one of
+the execution character sets\iref{character.seq}
+encodes each element of the basic literal character set as
+a single code unit with non-negative value,
+distinct from the code unit for any other such element.
+\begin{note}
+A character not in the basic literal character set
+can be encoded with more than one code unit;
+the value of such a code unit can be the same as
+that of a code unit for an element of the basic literal character set.
+\end{note}
+\indextext{character!null}%
+\indextext{wide-character!null}%
+The \unicode{0000}{null} character is encoded as the value \tcode{0}.
+No other element of the translation character set
+is encoded with a code unit of value \tcode{0}.
+The code unit value of each decimal digit character after the digit \tcode{0} (\ucode{0030})
+shall be one greater than the value of the previous.
+The ordinary and wide literal encodings are otherwise
+\impldef{ordinary and wide literal encodings}.
+\indextext{UTF-8}%
+\indextext{UTF-16}%
+\indextext{UTF-32}%
+For a UTF-8, UTF-16, or UTF-32 literal,
+the implementation shall encode
+the Unicode scalar value
+corresponding to each character of the translation character set
+as specified in the Unicode Standard
+for the respective Unicode encoding form.
+\indextext{character set|)}
+
+\rSec2[lex.universal.char]{Universal character names}
 
 \begin{bnf}
-\nontermdef{n-char} \textnormal{one of}\br
+\nontermdef{n-char}\br
      \textnormal{any member of the translation character set except the \unicode{007d}{right curly bracket} or new-line character}
 \end{bnf}
 
@@ -364,6 +424,22 @@
     named-universal-character
 \end{bnf}
 
+\pnum
+The \grammarterm{universal-character-name} construct provides a way to name any
+element in the translation character set using just the basic character set.
+If a \grammarterm{universal-character-name} outside
+the \grammarterm{c-char-sequence}, \grammarterm{s-char-sequence}, or
+\grammarterm{r-char-sequence} of a \grammarterm{character-literal} or
+\grammarterm{string-literal}
+(in either case, including within a \grammarterm{user-defined-literal})
+corresponds to a control character or to a character in the basic character set,
+the program is ill-formed.
+\begin{note}
+A sequence of characters resembling a \grammarterm{universal-character-name} in an
+\grammarterm{r-char-sequence}\iref{lex.string} does not form a
+\grammarterm{universal-character-name}.
+\end{note}
+
 \pnum
 A \grammarterm{universal-character-name}
 of the form \tcode{\textbackslash u} \grammarterm{hex-quad},
@@ -391,80 +467,6 @@
 None of these names or aliases have leading or trailing spaces.
 \end{note}
 
-\pnum
-If a \grammarterm{universal-character-name} outside
-the \grammarterm{c-char-sequence}, \grammarterm{s-char-sequence}, or
-\grammarterm{r-char-sequence} of
-a \grammarterm{character-literal} or \grammarterm{string-literal}
-(in either case, including within a \grammarterm{user-defined-literal})
-corresponds to a control character or
-to a character in the basic character set, the program is ill-formed.
-\begin{note}
-A sequence of characters resembling a \grammarterm{universal-character-name} in an
-\grammarterm{r-char-sequence}\iref{lex.string} does not form a
-\grammarterm{universal-character-name}.
-\end{note}
-
-\pnum
-The \defnadj{basic literal}{character set} consists of
-all characters of the basic character set,
-plus the control characters specified in \tref{lex.charset.literal}.
-
-\begin{floattable}{Additional control characters}{lex.charset.literal}{ll}
-\topline
-\ohdrx{2}{character} \\ \capsep
-\ucode{0000} & \uname{null} \\
-\ucode{0007} & \uname{alert} \\
-\ucode{0008} & \uname{backspace} \\
-\ucode{000d} & \uname{carriage return} \\
-\end{floattable}
-
-\pnum
-A \defn{code unit} is an integer value
-of character type\iref{basic.fundamental}.
-Characters in a \grammarterm{character-literal}
-other than a multicharacter or non-encodable character literal or
-in a \grammarterm{string-literal} are encoded as
-a sequence of one or more code units, as determined
-by the \grammarterm{encoding-prefix}\iref{lex.ccon,lex.string};
-this is termed the respective \defnadj{literal}{encoding}.
-The \defnadj{ordinary literal}{encoding} is
-the encoding applied to an ordinary character or string literal.
-The \defnadj{wide literal}{encoding} is the encoding applied
-to a wide character or string literal.
-
-\pnum
-A literal encoding or a locale-specific encoding of one of
-the execution character sets\iref{character.seq}
-encodes each element of the basic literal character set as
-a single code unit with non-negative value,
-distinct from the code unit for any other such element.
-\begin{note}
-A character not in the basic literal character set
-can be encoded with more than one code unit;
-the value of such a code unit can be the same as
-that of a code unit for an element of the basic literal character set.
-\end{note}
-\indextext{character!null}%
-\indextext{wide-character!null}%
-The \unicode{0000}{null} character is encoded as the value \tcode{0}.
-No other element of the translation character set
-is encoded with a code unit of value \tcode{0}.
-The code unit value of each decimal digit character after the digit \tcode{0} (\ucode{0030})
-shall be one greater than the value of the previous.
-The ordinary and wide literal encodings are otherwise
-\impldef{ordinary and wide literal encodings}.
-\indextext{UTF-8}%
-\indextext{UTF-16}%
-\indextext{UTF-32}%
-For a UTF-8, UTF-16, or UTF-32 literal,
-the implementation shall encode
-the Unicode scalar value
-corresponding to each character of the translation character set
-as specified in the Unicode Standard
-for the respective Unicode encoding form.
-\indextext{character set|)}
-
 \rSec1[lex.pptoken]{Preprocessing tokens}
 
 \indextext{token!preprocessing|(}%