
Tweak Some Unicode-Related Text #1103


Open · wants to merge 7 commits into base: draft-v8
2 changes: 1 addition & 1 deletion standard/expressions.md
@@ -1334,7 +1334,7 @@ An *interpolated_string_expression* consists of `$`, `$@`, or `@$`, immediately

Interpolated string expressions have two forms; regular (*interpolated_regular_string_expression*)
and verbatim (*interpolated_verbatim_string_expression*); which are lexically similar to, but differ semantically from, the two forms of string
literals ([§6.4.5.6](lexical-structure.md#6456-string-literals)).
literals ([§6.4.5.6](lexical-structure.md#6456-string-literals)). Both forms use UTF-16 encoding.
Contributor:

I don’t think this is correct. The interpolated string expression syntax in a code file leads to a System.IFormattable, System.FormattableString, or string instance at runtime, and that string instance is presented as UTF-16.

The syntax itself only exists in the code file and is therefore in whatever encoding the code file is in – UTF-8, ASCII, EBCDIC…

The same applies to string and character literals – they themselves can be in any encoding supported by the implementation for code files, while the values they produce at runtime are presented as UTF-16.

Offhand I’ve no alternative wording suggestion and so would just not make any change.
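For reference, a minimal sketch of the runtime shapes mentioned above (class and variable names are mine; the behaviour is assumed from the usual interpolated-string conversions):

```csharp
using System;

class InterpolationTargets
{
    static void Main()
    {
        int answer = 42;

        // With a string target, the interpolated string is formatted immediately.
        string s = $"The answer is {answer}";

        // With a FormattableString or IFormattable target, the compiler captures
        // the format string and the arguments for later, culture-aware formatting.
        FormattableString fs = $"The answer is {answer}";
        IFormattable f = $"The answer is {answer}";

        // Whichever target is chosen, the resulting text is exposed as UTF-16
        // via string and char (for example through s.Length and s[0]).
        Console.WriteLine($"{s} ({s.Length} UTF-16 code units)");
        Console.WriteLine(fs.ToString());
        Console.WriteLine(f.ToString(null, null));
    }
}
```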

Contributor:

The syntax itself only exists in the code file and is therefore in whatever encoding the code file is in – UTF-8, ASCII, EBCDIC…

I'm not sure I'd say that - I'd expect us to treat the input file as a sequence of Unicode characters.

The first step in 6.1 is "Transformation, which converts a file from a particular character repertoire and encoding scheme into a sequence of Unicode characters."

I would hope nothing later would need to know about the original encoding, so by the time we have anything like a "string literal" it should just be a sequence of Unicode characters.

However, I agree with the proposal to just revert this change.

Contributor:

Suggested change
literals ([§6.4.5.6](lexical-structure.md#6456-string-literals)). Both forms use UTF-16 encoding.
literals ([§6.4.5.6](lexical-structure.md#6456-string-literals)).


```ANTLR
interpolated_string_expression
    ...
```
12 changes: 6 additions & 6 deletions standard/lexical-structure.md
@@ -10,7 +10,7 @@ Conceptually speaking, a program is compiled using three steps:
1. Lexical analysis, which translates a stream of Unicode input characters into a stream of tokens.
1. Syntactic analysis, which translates the stream of tokens into executable code.

Conforming implementations shall accept Unicode compilation units encoded with the UTF-8 encoding form (as defined by the Unicode standard), and transform them into a sequence of Unicode characters. Implementations can choose to accept and transform additional character encoding schemes (such as UTF-16, UTF-32, or non-Unicode character mappings).
Apart from accepting UTF-8 encoded input (as required by [§5](conformance.md#5-conformance)), a conforming implementation can choose to accept and transform additional character encoding schemes (such as UTF-16, UTF-32, or non-Unicode character mappings).
Contributor:

Should the "can" here be "may"?


> *Note*: The handling of the Unicode NULL character (U+0000) is implementation-specific. It is strongly recommended that developers avoid using this character in their source code, for the sake of both portability and readability. When the character is required within a character or string literal, the escape sequences `\0` or `\u0000` may be used instead. *end note*
<!-- markdownlint-disable MD028 -->
@@ -351,7 +351,7 @@ token

### 6.4.2 Unicode character escape sequences

A Unicode escape sequence represents a Unicode code point. Unicode escape sequences are processed in identifiers ([§6.4.3](lexical-structure.md#643-identifiers)), character literals ([§6.4.5.5](lexical-structure.md#6455-character-literals)), regular string literals ([§6.4.5.6](lexical-structure.md#6456-string-literals)), and interpolated regular string expressions ([§12.8.3](expressions.md#1283-interpolated-string-expressions)). A Unicode escape sequence is not processed in any other location (for example, to form an operator, punctuator, or keyword).
A Unicode character escape sequence represents a Unicode code point. Unicode escape sequences are processed in identifiers ([§6.4.3](lexical-structure.md#643-identifiers)), character literals ([§6.4.5.5](lexical-structure.md#6455-character-literals)), regular string literals ([§6.4.5.6](lexical-structure.md#6456-string-literals)), and interpolated regular string expressions ([§12.8.3](expressions.md#1283-interpolated-string-expressions)). A Unicode escape sequence is not processed in any other location (for example, to form an operator, punctuator, or keyword).

```ANTLR
fragment Unicode_Escape_Sequence
    ...
;
```

A Unicode character escape sequence represents the single Unicode code point formed by the hexadecimal number following the “\u” or “\U” characters. Since C# uses a 16-bit encoding of Unicode code points in character and string values, a Unicode code point in the range `U+10000` to `U+10FFFF` is represented using two Unicode surrogate code units. Unicode code points above `U+FFFF` are not permitted in character literals. Unicode code points above `U+10FFFF` are invalid and are not supported.
A *Unicode_Escape_Sequence* represents the Unicode code point whose value is the hexadecimal number following the “\u” or “\U” characters. Since C# uses UTF-16 encoding in `char` and `string` values, a Unicode code point in the range `U+10000` to `U+10FFFF` is represented using two UTF-16 surrogate code units. Unicode code points above `U+FFFF` are not permitted in character literals. Unicode code points above `U+10FFFF` are invalid and are not supported.
Contributor:

Given the discussion over §8.2.5 I suspect this para will require some careful rewriting if it is agreed that implementations may use different (and multiple) storage models for strings, as long as they conform to the API presenting them as UTF-16.

Contributor:

I think I want to know more about where we're heading before saying either way. Let's discuss this in the meeting.

Contributor @Nigel-Ecma (Nov 20, 2024):

An attempt at wordsmithing that avoids referring to the internal format of string:

Suggested change
A *Unicode_Escape_Sequence* represents the Unicode code point whose value is the hexadecimal number following the “\u” or “\U” characters. Since C# uses UTF-16 encoding in `char` and `string` values, a Unicode code point in the range `U+10000` to `U+10FFFF` is represented using two UTF-16 surrogate code units. Unicode code points above `U+FFFF` are not permitted in character literals. Unicode code points above `U+10FFFF` are invalid and are not supported.
A *Unicode_Escape_Sequence* represents the Unicode code point whose value is the hexadecimal number following the “\u” or “\U” characters. Unicode code points in the range `U+10000` to `U+10FFFF` require two UTF-16 surrogate code units; as C# `char` values are represented using a single UTF-16 code unit ([§8.3.6](types.md#836-integral-types)) Unicode code points above `U+FFFF` are not permitted in character literals. Unicode code points above `U+10FFFF` are invalid and are not supported.


Multiple translations are not performed. For instance, the string literal `"\u005Cu005C"` is equivalent to `"\u005C"` rather than `"\"`.
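A quick illustration of the rule in the paragraph above (an illustrative sketch, not part of the proposed standard text; names are mine):

```csharp
class NoDoubleTranslation
{
    static void Main()
    {
        // \u005C is the escape for the backslash character. The escape is
        // processed exactly once, so the remaining "u005C" stays literal text.
        string once = "\u005Cu005C";      // the six characters \ u 0 0 5 C
        string backslash = "\u005C";      // the single character \

        System.Console.WriteLine(once.Length);       // 6
        System.Console.WriteLine(backslash.Length);  // 1
        System.Console.WriteLine(once == "\\u005C"); // True: not re-interpreted as "\"
    }
}
```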

@@ -805,7 +805,7 @@ The value of a real literal of type `float` or `double` is determined by using t

#### 6.4.5.5 Character literals

A character literal represents a single character, and consists of a character in quotes, as in `'a'`.
A character literal represents a single character as a UTF-16 code unit, and consists of a character or *Unicode_Escape_Sequence* in quotes, as in `'a'`, `'\u0061'`, or `'\U00000061'`.
Contributor:

Possibly repeat here that a Unicode_Escape_Sequence that represents a code point in the range U+10000 to U+10FFFF is not a valid character literal?
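A small sketch of the restriction under discussion (hypothetical snippet; the exact compiler diagnostics will vary):

```csharp
class CharLiteralLimit
{
    static void Main()
    {
        char ok = '\uFFFF';           // OK: still a single UTF-16 code unit
        // char bad = '\U0001F600';   // compile-time error: the code point is above U+FFFF
        string fine = "\U0001F600";   // OK in a string literal, stored as a surrogate pair

        System.Console.WriteLine($"{(int)ok:X4} {fine}");
    }
}
```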


```ANTLR
Character_Literal
    ...
```

@@ -850,7 +850,7 @@ fragment Hexadecimal_Escape_Sequence
>
> *end note*

A hexadecimal escape sequence represents a single Unicode UTF-16 code unit, with the value formed by the hexadecimal number following “`\x`”.
A hexadecimal escape sequence represents a UTF-16 code unit, with the value formed by the hexadecimal number following “`\x`”.

If the value represented by a character literal is greater than `U+FFFF`, a compile-time error occurs.
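For comparison, a short sketch of hexadecimal escape sequences in practice (names are illustrative, not from the standard):

```csharp
class HexEscapes
{
    static void Main()
    {
        // \x takes one to four hex digits and yields a single UTF-16 code unit.
        char a = '\x41';            // U+0041, 'A'
        char euro = '\x20AC';       // U+20AC, '€'
        string greeting = "\x48i";  // '4' and '8' are hex digits, 'i' is not,
                                    // so this is "Hi" (but \x is easy to misread)

        System.Console.WriteLine($"{a} {euro} {greeting}");
    }
}
```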

@@ -876,7 +876,7 @@ The type of a *Character_Literal* is `char`.

#### 6.4.5.6 String literals

C# supports two forms of string literals: ***regular string literals*** and ***verbatim string literals***. A regular string literal consists of zero or more characters enclosed in double quotes, as in `"hello"`, and can include both simple escape sequences (such as `\t` for the tab character), and hexadecimal and Unicode escape sequences.
C# supports two forms of string literals: ***regular string literals*** and ***verbatim string literals***. A regular string literal consists of zero or more characters enclosed in double quotes, as in `"hello"`, and can include both simple escape sequences (such as `\t` for the tab character), and hexadecimal and Unicode escape sequences. Both forms use UTF-16 encoding.
Contributor:

Did we miss this when considering interpolated string literals? It feels odd to not mention interpolated string literals anywhere within this section. (Probably not something to fix in this PR, but potentially worth a new issue. See what you think. Feel free to create one and assign it to me if you agree.)

Contributor:

@jskeet – Since when did C# have interpolated string literals? ;-)

That said, we should make sure that the text around interpolated string expressions is correct Unicode-wise relative to this PR.

(The rules used in the definition of interpolated string expressions refer to the same Simple_Escape_Sequence, Hexadecimal_Escape_Sequence and Unicode_Escape_Sequence rules used in the definitions of string and character literals. However, the clause for interpolated string expressions (§12.8.3) makes no reference to the escape sequences other than using these three rules in the grammar section.)
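A minimal sketch of that observation, assuming current compiler behaviour: the same escape sequences are processed in the text portions of a regular interpolated string, and not in the verbatim form (names are mine):

```csharp
class EscapesInInterpolation
{
    static void Main()
    {
        string name = "world";

        // Simple, hexadecimal and Unicode escape sequences are all processed in
        // the text portions of a regular interpolated string, just as in a
        // regular string literal.
        string s1 = $"hello\t{name}";       // simple escape \t (tab)
        string s2 = $"hello {name}\x21";    // hexadecimal escape \x21 -> '!'
        string s3 = $"\u0048ello {name}";   // Unicode escape \u0048 -> 'H'

        // In the verbatim form, escape sequences are not processed:
        string s4 = $@"hello\t{name}";      // contains a literal backslash and 't'

        System.Console.WriteLine($"{s1}|{s2}|{s3}|{s4}");
    }
}
```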

Contributor:

Hmm... I've always regarded interpolated strings as a form of string literal. Looks like I was wrong - 12.8.3 explicitly says:

Interpolated string expressions have two forms; regular (interpolated_regular_string_expression) and verbatim (interpolated_verbatim_string_expression); which are lexically similar to, but differ semantically from, the two forms of string literals (§6.4.5.6).

Contributor:

Literal discussion aside, I think we should revert this change for the same reason as in expressions.md.

Contributor:

If we agree:

Suggested change
C# supports two forms of string literals: ***regular string literals*** and ***verbatim string literals***. A regular string literal consists of zero or more characters enclosed in double quotes, as in `"hello"`, and can include both simple escape sequences (such as `\t` for the tab character), and hexadecimal and Unicode escape sequences. Both forms use UTF-16 encoding.
C# supports two forms of string literals: ***regular string literals*** and ***verbatim string literals***. A regular string literal consists of zero or more characters enclosed in double quotes, as in `"hello"`, and can include both simple escape sequences (such as `\t` for the tab character), and hexadecimal and Unicode escape sequences.


A verbatim string literal consists of an `@` character followed by a double-quote character, zero or more characters, and a closing double-quote character.
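As a reminder of how the two forms differ in practice (an illustrative sketch only, not proposed standard text):

```csharp
class StringLiteralForms
{
    static void Main()
    {
        // Regular string literal: escape sequences are processed.
        string regular = "C:\\temp\\notes.txt";   // \\ is an escaped backslash

        // Verbatim string literal: backslashes are taken literally;
        // only "" is special (it produces one double-quote character).
        string verbatim = @"C:\temp\notes.txt";

        System.Console.WriteLine(regular == verbatim);   // True
        System.Console.WriteLine(@"say ""hi""");         // prints: say "hi"
    }
}
```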

4 changes: 2 additions & 2 deletions standard/types.md
@@ -107,7 +107,7 @@ The `dynamic` type is further described in [§8.7](types.md#87-the-dynamic-type)

### 8.2.5 The string type

The `string` type is a sealed class type that inherits directly from `object`. Instances of the `string` class represent Unicode character strings.
The `string` type is a sealed class type that inherits directly from `object`. Instances of the `string` class represent a sequence of UTF-16 code units.
Contributor:

Following the other suggestions, maybe we should revert this one as well:

Suggested change
The `string` type is a sealed class type that inherits directly from `object`. Instances of the `string` class represent a sequence of UTF-16 code units.
The `string` type is a sealed class type that inherits directly from `object`. Instances of the `string` class represent Unicode character strings.

Contributor:

Not sure about this one. Even if they're stored in some other form internally, strings still represent a sequence of UTF-16 code units, right? (For example, `string x = "\U0001F600";` will always report a length of 2, with `x[0]` being equal to `\uD83D` and `x[1]` being equal to `\uDE00`, right?)
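A small sketch confirming the observable behaviour described here (assuming the standard surrogate-pair encoding of U+1F600; the variable name follows the comment above):

```csharp
class SurrogateObservation
{
    static void Main()
    {
        string x = "\U0001F600";                    // U+1F600, 😀

        System.Console.WriteLine(x.Length);         // 2: two UTF-16 code units
        System.Console.WriteLine(x[0] == '\uD83D'); // True: high surrogate
        System.Console.WriteLine(x[1] == '\uDE00'); // True: low surrogate
    }
}
```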


Values of the `string` type can be written as string literals ([§6.4.5.6](lexical-structure.md#6456-string-literals)).

@@ -311,7 +311,7 @@ C# supports nine integral types: `sbyte`, `byte`, `short`, `ushort`, `int`, `uin
- The `uint` type represents unsigned 32-bit integers with values from `0` to `4294967295`, inclusive.
- The `long` type represents signed 64-bit integers with values from `-9223372036854775808` to `9223372036854775807`, inclusive.
- The `ulong` type represents unsigned 64-bit integers with values from `0` to `18446744073709551615`, inclusive.
- The `char` type represents unsigned 16-bit integers with values from `0` to `65535`, inclusive. The set of possible values for the `char` type corresponds to the Unicode character set.
- The `char` type represents unsigned 16-bit integers with values from `0` to `65535`, inclusive, as a UTF-16 code unit.
> *Note*: Although `char` has the same representation as `ushort`, not all operations permitted on one type are permitted on the other. *end note*

All signed integral types are represented using two’s complement format.