Editorial: refer to code points directly by name/number instead of using aliases #3310
base: main
54acd44 to ce3e176
@@ -588,7 +588,7 @@ <h1>Terminal Symbols</h1>
 <p>In contrast, in the syntactic grammar, a contiguous run of fixed-width code points is a single terminal symbol.</p>
 <p>Terminal symbols come in two other forms:</p>
 <ul>
-<li>In the lexical and RegExp grammars, Unicode code points without a conventional printed representation are instead shown in the form "<ABBREV>" where "ABBREV" is a mnemonic for the code point or set of code points. These forms are defined in <emu-xref href="#sec-unicode-format-control-characters" title></emu-xref>, <emu-xref href="#sec-white-space" title></emu-xref>, and <emu-xref href="#sec-line-terminators" title></emu-xref>.</li>
+<li>In the lexical and RegExp grammars, Unicode code points without a conventional printed representation are instead shown in the form "<U+0000 (NULL)>" where `0000` is 4 to 6 hexits representing the code point in hexadecimal notation and `NULL` is the code point name.</li>
This seems gratuitously divergent from Unicode conventions. Should we instead try to align?
Are you saying we should use small caps? As for the name, I chose to use one of the official aliases when I felt it was more appropriate/descriptive. I can explicitly state that it is the code point name or an alias if you prefer.
Yes, I think we should use small caps and avoid brackets except for sequences, e.g.
-<li>In the lexical and RegExp grammars, Unicode code points without a conventional printed representation are instead shown in the form "<U+0000 (NULL)>" where `0000` is 4 to 6 hexits representing the code point in hexadecimal notation and `NULL` is the code point name.</li>
+<li>In the lexical and RegExp grammars, Unicode code points without a conventional printed representation are instead shown in the form "U+0000 <small class="code-point-name">NULL</small>" where `0000` is 4 to 6 hexadecimal digits representing the code point and `NULL` is the code point name.</li>
or maybe ecmarkup support
-<li>In the lexical and RegExp grammars, Unicode code points without a conventional printed representation are instead shown in the form "<U+0000 (NULL)>" where `0000` is 4 to 6 hexits representing the code point in hexadecimal notation and `NULL` is the code point name.</li>
+<li>In the lexical and RegExp grammars, Unicode code points without a conventional printed representation are instead shown in the form "<code data-char-name="NULL">U+0000</code>" where `0000` is 4 to 6 hexadecimal digits representing the code point and `NULL` is the code point name.</li>
or even ecmarkdown
-<li>In the lexical and RegExp grammars, Unicode code points without a conventional printed representation are instead shown in the form "<U+0000 (NULL)>" where `0000` is 4 to 6 hexits representing the code point in hexadecimal notation and `NULL` is the code point name.</li>
+<li>In the lexical and RegExp grammars, Unicode code points without a conventional printed representation are instead shown in the form "U+0000 ^^NULL^^" where `0000` is 4 to 6 hexadecimal digits representing the code point and `NULL` is the code point name.</li>
I think it's fine to defer the small-caps names (with possible tooling support) to a follow-up.
Yeah, I think so. 👍
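For readers following the notation discussed in this thread, here is a minimal sketch of producing the "U+0000 (NULL)" text with 4 to 6 hexadecimal digits. The helper name `formatCodePoint` is hypothetical and not part of ecmarkup or the spec, and the surrounding angle brackets from the diff are left to the caller.

```javascript
// Hypothetical helper (not part of ecmarkup or ECMA-262): formats a code point
// as "U+XXXX", zero-padded to at least 4 uppercase hexadecimal digits, with an
// optional parenthesized name supplied by the caller.
function formatCodePoint(codePoint, name) {
  if (!Number.isInteger(codePoint) || codePoint < 0 || codePoint > 0x10FFFF) {
    throw new RangeError(`Not a Unicode code point: ${codePoint}`);
  }
  // Code points up to U+10FFFF need at most 6 hexadecimal digits.
  const hex = codePoint.toString(16).toUpperCase().padStart(4, "0");
  return name === undefined ? `U+${hex}` : `U+${hex} (${name})`;
}

console.log(formatCodePoint(0x0000, "NULL"));
console.log(formatCodePoint(0x0085, "NEXT LINE"));
console.log(formatCodePoint(0x10FFFF));
```

The name is passed in rather than looked up because JavaScript has no built-in access to Unicode character names.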
</emu-grammar>
<emu-note>
<p>Other than for some of the code points listed as explicit alternatives in |WhiteSpace|, |WhiteSpace| intentionally excludes <a href="https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B%5Cp%7BWhite_Space%7D%26%5Cp%7BGeneral_Category%21%3DSpace_Separator%7D%5D">all code points that have the Unicode “White_Space” property but which are not classified in general category “Space_Separator” (“Zs”)</a>.</p>
The link is good, but I still think an explicit mention of U+0085 (NEXT LINE) and probably also U+FEFF (ZERO WIDTH NO-BREAK SPACE) would be better. As observed in tc39/proposal-regexp-v-flag#37, the classification of these two code points is easy to overlook, and IMO it behooves the spec to highlight that.
Note also that a) https://util.unicode.org/UnicodeJsps is frequently unavailable, and in the recent past was offline for months, and b) even when it is available, there is no obvious indication that 7 of the 8 code points are included in ECMA-262 |WhiteSpace| or |LineTerminator| (and thus matched by regular expression pattern \s, which exactly covers the union of |WhiteSpace| and |LineTerminator|) and 1 in the middle is not.
sample output

Basic Latin: C0 controls
U+0009 | CHARACTER TABULATION; HORIZONTAL TABULATION; HT; TAB
U+000A | END OF LINE; EOL; LF; LINE FEED; NEW LINE; NL
U+000B | LINE TABULATION; VERTICAL TABULATION; VT
U+000C | FF; FORM FEED
U+000D | CARRIAGE RETURN; CR

Latin 1 Supplement: C1 controls
U+0085 | NEL; NEXT LINE

General Punctuation: Separators
U+2028 | LINE SEPARATOR
U+2029 | PARAGRAPH SEPARATOR
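The classification described in this comment can be checked directly in an engine. The following is a rough sketch, assuming a runtime that supports Unicode property escapes (`\p{White_Space}` with the `u` flag); the function name `classify` and the row field names are illustrative only.

```javascript
// The 8 code points from the sample output, plus U+FEFF, which is also
// mentioned in this comment.
const codePoints = [
  0x0009, 0x000A, 0x000B, 0x000C, 0x000D, 0x0085, 0x2028, 0x2029, 0xFEFF,
];

// Illustrative helper: reports whether a code point has the Unicode
// White_Space property and whether the RegExp class \s matches it.
function classify(codePoint) {
  const ch = String.fromCodePoint(codePoint);
  return {
    codePoint: "U+" + codePoint.toString(16).toUpperCase().padStart(4, "0"),
    hasWhiteSpaceProperty: /\p{White_Space}/u.test(ch),
    matchedByBackslashS: /\s/.test(ch),
  };
}

const rows = codePoints.map(classify);
for (const row of rows) {
  console.log(row.codePoint, row.hasWhiteSpaceProperty, row.matchedByBackslashS);
}
```

In a conforming engine, U+0085 should be the only row that has the White_Space property without being matched by `\s`, and U+FEFF the only row matched by `\s` without having the property.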
Everything else LGTM.
</th>
<th>
Code Unit Value
|SingleEscapeCharacter|
Note that a |SingleEscapeCharacter| is a single character and does not include the preceding backslash. So it's a mismatch to have |SingleEscapeCharacter| as the column head and then (e.g.) \b below it. (The status quo uses "Escape Sequence" as the column head, which is not a defined term. You'd have to go up to |DoubleStringCharacter| and |SingleStringCharacter| to get a nonterminal that actually includes the backslash.)
The simplest fix would be to delete the backslashes from the data cells (as in Table 61: ControlEscape Code Point Values), although that loses the visual cue that they're 'escape sequences'.
Alternatively, you could insert a backslash into the header cell, but that's a bit dodgy, since:
- `\` |SingleEscapeCharacter| doesn't occur in the grammar, and
- the prose associated with |SingleEscapeCharacter| wouldn't be quite right.
I'll just remove the backslash.
Actually, I think I'll replace it with a code point descriptor.
Well, that's valid, but I'm not sure it's an improvement (over just removing the backslashes). The definition of SingleEscapeCharacter is one of ' " \ b f n r t v, so it seems like the natural approach would be to use those characters rather than code point descriptors.
I'm fine with changing to just the single character if the other editors prefer.
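To make the table under discussion concrete, here is a small sketch that keys each row by the bare |SingleEscapeCharacter| (no backslash) and compares the code unit value with what an engine produces for the corresponding escape in a string literal. The variable names are illustrative and the layout is not taken from the PR.

```javascript
// Code unit values keyed by the single character that follows the backslash.
const singleEscapeCodeUnits = new Map([
  ["b", 0x0008],
  ["t", 0x0009],
  ["n", 0x000A],
  ["v", 0x000B],
  ["f", 0x000C],
  ["r", 0x000D],
  ['"', 0x0022],
  ["'", 0x0027],
  ["\\", 0x005C],
]);

// The same characters written as escape sequences in string literals, so the
// engine itself evaluates each escape.
const evaluatedEscapes = new Map([
  ["b", "\b"],
  ["t", "\t"],
  ["n", "\n"],
  ["v", "\v"],
  ["f", "\f"],
  ["r", "\r"],
  ['"', "\""],
  ["'", "\'"],
  ["\\", "\\"],
]);

// Collect any character whose evaluated escape differs from the table value.
const mismatches = [];
for (const [ch, expected] of singleEscapeCodeUnits) {
  const actual = evaluatedEscapes.get(ch).charCodeAt(0);
  if (actual !== expected) mismatches.push(ch);
}
console.log(mismatches.length === 0 ? "all rows agree" : mismatches);
```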
... because it's talking about a string literal in *Java*. See tc39#3310 (comment)
Fixes #2930.