Commit df16469

[lex.charset] Extract universal-character-name grammar to new subclause
The grammar for universal-character-name is oddly sandwiched into the middle of the subclause that describes the different character sets used by the standard. To improve the flow, extract that grammar into its own subclause. In the extraction, I make two other clarifying changes. First, describe the construct in this new subclause as 'a way to name any element of the translation character set using just the basic character set' rather than simply 'a way to name other characters'. Second, remove the 'one of' in the grammar where there is only one option to choose.
1 parent 3680e10 commit df16469
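
To illustrate the reworded description ('a way to name any element of the translation character set using just the basic character set'), here is a minimal C++ sketch. It is not part of the commit. The names `e_acute_short`, `e_acute_long`, and `code_units` are hypothetical and were chosen only for this example, and only the long-standing `\u` and `\U` spellings are used.

```cpp
#include <cstddef>

// U+00E9 (LATIN SMALL LETTER E WITH ACUTE), spelled using only
// characters of the basic character set.
constexpr char32_t e_acute_short = U'\u00e9';      // \u followed by a hex-quad
constexpr char32_t e_acute_long  = U'\U000000e9';  // \U followed by eight hex digits

static_assert(e_acute_short == e_acute_long,
              "both spellings name the same character");
static_assert(e_acute_short == 0xE9,
              "a UTF-32 literal holds the Unicode scalar value");

// Hypothetical helper: number of code units in a string literal,
// not counting the terminating null code unit.
template <typename CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) { return N - 1; }

// The same named character occupies a different number of code units
// depending on the encoding-prefix.
static_assert(code_units(u8"\u00e9") == 2, "UTF-8: two code units");
static_assert(code_units(u"\u00e9") == 1, "UTF-16: one code unit");
static_assert(code_units(U"\u00e9") == 1, "UTF-32: one code unit");

// U+1F600 lies outside the Basic Multilingual Plane.
static_assert(code_units(u8"\U0001F600") == 4, "UTF-8: four code units");
static_assert(code_units(u"\U0001F600") == 2, "UTF-16: surrogate pair");
static_assert(code_units(U"\U0001F600") == 1, "UTF-32: one code unit");
```

Every check is a `static_assert`, so compiling the snippet as C++11 or later is enough to verify it.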

1 file changed: +65 -63 lines changed

source/lex.tex

Lines changed: 65 additions & 63 deletions
@@ -320,11 +320,69 @@
 \end{floattable}

 \pnum
-The \grammarterm{universal-character-name} construct provides a way to name
-other characters.
+The \defnadj{basic literal}{character set} consists of
+all characters of the basic character set,
+plus the control characters specified in \tref{lex.charset.literal}.
+
+\begin{floattable}{Additional control characters in the basic literal character set}{lex.charset.literal}{ll}
+\topline
+\ohdrx{2}{character} \\ \capsep
+\ucode{0000} & \uname{null} \\
+\ucode{0007} & \uname{alert} \\
+\ucode{0008} & \uname{backspace} \\
+\ucode{000d} & \uname{carriage return} \\
+\end{floattable}
+
+\pnum
+A \defn{code unit} is an integer value
+of character type\iref{basic.fundamental}.
+Characters in a \grammarterm{character-literal}
+other than a multicharacter or non-encodable character literal or
+in a \grammarterm{string-literal} are encoded as
+a sequence of one or more code units, as determined
+by the \grammarterm{encoding-prefix}\iref{lex.ccon,lex.string};
+this is termed the respective \defnadj{literal}{encoding}.
+The \defnadj{ordinary literal}{encoding} is
+the encoding applied to an ordinary character or string literal.
+The \defnadj{wide literal}{encoding} is the encoding applied
+to a wide character or string literal.
+
+\pnum
+A literal encoding or a locale-specific encoding of one of
+the execution character sets\iref{character.seq}
+encodes each element of the basic literal character set as
+a single code unit with non-negative value,
+distinct from the code unit for any other such element.
+\begin{note}
+A character not in the basic literal character set
+can be encoded with more than one code unit;
+the value of such a code unit can be the same as
+that of a code unit for an element of the basic literal character set.
+\end{note}
+\indextext{character!null}%
+\indextext{wide-character!null}%
+The \unicode{0000}{null} character is encoded as the value \tcode{0}.
+No other element of the translation character set
+is encoded with a code unit of value \tcode{0}.
+The code unit value of each decimal digit character after the digit \tcode{0} (\ucode{0030})
+shall be one greater than the value of the previous.
+The ordinary and wide literal encodings are otherwise
+\impldef{ordinary and wide literal encodings}.
+\indextext{UTF-8}%
+\indextext{UTF-16}%
+\indextext{UTF-32}%
+For a UTF-8, UTF-16, or UTF-32 literal,
+the implementation shall encode
+the Unicode scalar value
+corresponding to each character of the translation character set
+as specified in the Unicode Standard
+for the respective Unicode encoding form.
+\indextext{character set|)}
+
+\rSec1[lex.universal.char]{Universal Character Names}

 \begin{bnf}
-\nontermdef{n-char} \textnormal{one of}\br
+\nontermdef{n-char}\br
 \textnormal{any member of the translation character set except the \unicode{007d}{right curly bracket} or new-line character}
 \end{bnf}

@@ -358,6 +416,10 @@
 named-universal-character
 \end{bnf}

+\pnum
+The \grammarterm{universal-character-name} construct provides a way to name
+any element in the translation character set using just the basic character set.
+
 \pnum
 A \grammarterm{universal-character-name}
 of the form \tcode{\textbackslash u} \grammarterm{hex-quad},
@@ -399,66 +461,6 @@
 \grammarterm{universal-character-name}.
 \end{note}

-\pnum
-The \defnadj{basic literal}{character set} consists of
-all characters of the basic character set,
-plus the control characters specified in \tref{lex.charset.literal}.
-
-\begin{floattable}{Additional control characters in the basic literal character set}{lex.charset.literal}{ll}
-\topline
-\ohdrx{2}{character} \\ \capsep
-\ucode{0000} & \uname{null} \\
-\ucode{0007} & \uname{alert} \\
-\ucode{0008} & \uname{backspace} \\
-\ucode{000d} & \uname{carriage return} \\
-\end{floattable}
-
-\pnum
-A \defn{code unit} is an integer value
-of character type\iref{basic.fundamental}.
-Characters in a \grammarterm{character-literal}
-other than a multicharacter or non-encodable character literal or
-in a \grammarterm{string-literal} are encoded as
-a sequence of one or more code units, as determined
-by the \grammarterm{encoding-prefix}\iref{lex.ccon,lex.string};
-this is termed the respective \defnadj{literal}{encoding}.
-The \defnadj{ordinary literal}{encoding} is
-the encoding applied to an ordinary character or string literal.
-The \defnadj{wide literal}{encoding} is the encoding applied
-to a wide character or string literal.
-
-\pnum
-A literal encoding or a locale-specific encoding of one of
-the execution character sets\iref{character.seq}
-encodes each element of the basic literal character set as
-a single code unit with non-negative value,
-distinct from the code unit for any other such element.
-\begin{note}
-A character not in the basic literal character set
-can be encoded with more than one code unit;
-the value of such a code unit can be the same as
-that of a code unit for an element of the basic literal character set.
-\end{note}
-\indextext{character!null}%
-\indextext{wide-character!null}%
-The \unicode{0000}{null} character is encoded as the value \tcode{0}.
-No other element of the translation character set
-is encoded with a code unit of value \tcode{0}.
-The code unit value of each decimal digit character after the digit \tcode{0} (\ucode{0030})
-shall be one greater than the value of the previous.
-The ordinary and wide literal encodings are otherwise
-\impldef{ordinary and wide literal encodings}.
-\indextext{UTF-8}%
-\indextext{UTF-16}%
-\indextext{UTF-32}%
-For a UTF-8, UTF-16, or UTF-32 literal,
-the implementation shall encode
-the Unicode scalar value
-corresponding to each character of the translation character set
-as specified in the Unicode Standard
-for the respective Unicode encoding form.
-\indextext{character set|)}
-
 \rSec1[lex.pptoken]{Preprocessing tokens}

 \indextext{token!preprocessing|(}%
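
The literal-encoding paragraphs that this commit relocates (unchanged in wording) make a few guarantees that portable code can rely on. The following C++ sketch, which is not part of the commit, shows three of them: the null character is encoded as 0, every element of the basic literal character set is a single non-negative code unit, and the decimal digits have consecutive code unit values. The helper names `digit_value` and `wide_digit_value` are hypothetical.

```cpp
// Hypothetical helper: numeric value of a decimal digit character in the
// ordinary literal encoding. It is valid because each digit after '0' has a
// code unit value one greater than the previous digit.
constexpr int digit_value(char c) { return c - '0'; }

// The same guarantee applies to the wide literal encoding.
constexpr int wide_digit_value(wchar_t c) { return static_cast<int>(c - L'0'); }

static_assert(digit_value('0') == 0, "the digit 0 maps to zero");
static_assert(digit_value('9') == 9, "digits have consecutive code unit values");
static_assert(wide_digit_value(L'7') == 7, "wide digits are consecutive too");

// The null character is encoded as the value 0 in both encodings.
static_assert('\0' == 0 && L'\0' == 0, "null is encoded as 0");

// No other character shares the value 0, and elements of the basic literal
// character set are non-negative, so a basic character is strictly positive.
static_assert('A' > 0 && L'A' > 0, "basic characters have positive code units");
```

No comparable guarantee exists for letters: the standard text above leaves the rest of the ordinary and wide literal encodings implementation-defined, so an expression such as `c - 'a'` is not portable in the same way.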
