From 1ada208d93ab6df1f10125b76deb3973490c2b23 Mon Sep 17 00:00:00 2001
From: Rex Jaeschke
Date: Sat, 4 May 2024 11:28:49 -0400
Subject: [PATCH 1/7] tweak Unicode-related text

---
 standard/lexical-structure.md | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/standard/lexical-structure.md b/standard/lexical-structure.md
index 621071779..92dfff6a5 100644
--- a/standard/lexical-structure.md
+++ b/standard/lexical-structure.md
@@ -10,7 +10,7 @@ Conceptually speaking, a program is compiled using three steps:
 1. Lexical analysis, which translates a stream of Unicode input characters into a stream of tokens.
 1. Syntactic analysis, which translates the stream of tokens into executable code.
 
-Conforming implementations shall accept Unicode compilation units encoded with the UTF-8 encoding form (as defined by the Unicode standard), and transform them into a sequence of Unicode characters. Implementations can choose to accept and transform additional character encoding schemes (such as UTF-16, UTF-32, or non-Unicode character mappings).
+Apart from accepting UTF-8 encoded input (as required by [§5](conformance.md#5-conformance)), a conforming implementation can choose to accept and transform additional character encoding schemes (such as UTF-16, UTF-32, or non-Unicode character mappings).
 
 > *Note*: The handling of the Unicode NULL character (U+0000) is implementation-specific. It is strongly recommended that developers avoid using this character in their source code, for the sake of both portability and readability. When the character is required within a character or string literal, the escape sequences `\0` or `\u0000` may be used instead. *end note*
 
@@ -351,7 +351,7 @@ token
 
 ### 6.4.2 Unicode character escape sequences
 
-A Unicode escape sequence represents a Unicode code point. Unicode escape sequences are processed in identifiers ([§6.4.3](lexical-structure.md#643-identifiers)), character literals ([§6.4.5.5](lexical-structure.md#6455-character-literals)), regular string literals ([§6.4.5.6](lexical-structure.md#6456-string-literals)), and interpolated regular string expressions ([§12.8.3](expressions.md#1283-interpolated-string-expressions)). A Unicode escape sequence is not processed in any other location (for example, to form an operator, punctuator, or keyword).
+A Unicode character escape sequence represents a Unicode code point. Unicode escape sequences are processed in identifiers ([§6.4.3](lexical-structure.md#643-identifiers)), character literals ([§6.4.5.5](lexical-structure.md#6455-character-literals)), regular string literals ([§6.4.5.6](lexical-structure.md#6456-string-literals)), and interpolated regular string expressions ([§12.8.3](expressions.md#1283-interpolated-string-expressions)). A Unicode escape sequence is not processed in any other location (for example, to form an operator, punctuator, or keyword).
 
 ```ANTLR
 fragment Unicode_Escape_Sequence
@@ -361,7 +361,7 @@ fragment Unicode_Escape_Sequence
     ;
 ```
 
-A Unicode character escape sequence represents the single Unicode code point formed by the hexadecimal number following the “\u” or “\U” characters. Since C# uses a 16-bit encoding of Unicode code points in character and string values, a Unicode code point in the range `U+10000` to `U+10FFFF` is represented using two Unicode surrogate code units. Unicode code points above `U+FFFF` are not permitted in character literals. Unicode code points above `U+10FFFF` are invalid and are not supported.
+A *Unicode_Escape_Sequence* represents the Unicode code point whose value is the hexadecimal number following the “\u” or “\U” characters. Since C# uses UTF-16 encoding in `char` and `string` values, a Unicode code point in the range `U+10000` to `U+10FFFF` is represented using two UTF-16 surrogate code units. Unicode code points above `U+FFFF` are not permitted in character literals. Unicode code points above `U+10FFFF` are invalid and are not supported.
 
 Multiple translations are not performed. For instance, the string literal `"\u005Cu005C"` is equivalent to `"\u005C"` rather than `"\"`.
 
@@ -805,7 +805,7 @@ The value of a real literal of type `float` or `double` is determined by using t
 
 #### 6.4.5.5 Character literals
 
-A character literal represents a single character, and consists of a character in quotes, as in `'a'`.
+A character literal represents a single character as a UTF-16 code unit, and consists of a character or *Unicode_Escape_Sequence* in quotes, as in `'a'`, `'\u0061'`, or `'\U00000061'`.
 
 ```ANTLR
 Character_Literal
@@ -850,7 +850,7 @@ fragment Hexadecimal_Escape_Sequence
 >
 > *end note*
 
-A hexadecimal escape sequence represents a single Unicode UTF-16 code unit, with the value formed by the hexadecimal number following “`\x`”.
+A hexadecimal escape sequence represents a UTF-16 code unit, with the value formed by the hexadecimal number following “`\x`”.
 
 If the value represented by a character literal is greater than `U+FFFF`, a compile-time error occurs.
 
@@ -876,7 +876,7 @@ The type of a *Character_Literal* is `char`.
 
 #### 6.4.5.6 String literals
 
-C# supports two forms of string literals: ***regular string literals*** and ***verbatim string literals***. A regular string literal consists of zero or more characters enclosed in double quotes, as in `"hello"`, and can include both simple escape sequences (such as `\t` for the tab character), and hexadecimal and Unicode escape sequences.
+C# supports two forms of string literals: ***regular string literals*** and ***verbatim string literals***. A regular string literal consists of zero or more characters enclosed in double quotes, as in `"hello"`, and can include both simple escape sequences (such as `\t` for the tab character), and hexadecimal and Unicode escape sequences. Both forms use UTF-16 encoding.
 
 A verbatim string literal consists of an `@` character followed by a double-quote character, zero or more characters, and a closing double-quote character.
 

From 31be46d5292257345a470601321d7685c7f8467c Mon Sep 17 00:00:00 2001
From: Rex Jaeschke
Date: Sat, 4 May 2024 11:35:13 -0400
Subject: [PATCH 2/7] tweak Unicode-related text

---
 standard/types.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/standard/types.md b/standard/types.md
index 52c8d7740..897000533 100644
--- a/standard/types.md
+++ b/standard/types.md
@@ -107,7 +107,7 @@ The `dynamic` type is further described in [§8.7](types.md#87-the-dynamic-type)
 
 ### 8.2.5 The string type
 
-The `string` type is a sealed class type that inherits directly from `object`. Instances of the `string` class represent Unicode character strings.
+The `string` type is a sealed class type that inherits directly from `object`. Instances of the `string` class represent a sequence of UTF-16 code units, whose endianness is implementation-defined.
 
 Values of the `string` type can be written as string literals ([§6.4.5.6](lexical-structure.md#6456-string-literals)).
 
@@ -311,7 +311,7 @@ C# supports nine integral types: `sbyte`, `byte`, `short`, `ushort`, `int`, `uin
 - The `uint` type represents unsigned 32-bit integers with values from `0` to `4294967295`, inclusive.
 - The `long` type represents signed 64-bit integers with values from `-9223372036854775808` to `9223372036854775807`, inclusive.
 - The `ulong` type represents unsigned 64-bit integers with values from `0` to `18446744073709551615`, inclusive.
-- The `char` type represents unsigned 16-bit integers with values from `0` to `65535`, inclusive. The set of possible values for the `char` type corresponds to the Unicode character set.
+- The `char` type represents unsigned 16-bit integers with values from `0` to `65535`, inclusive, as a UTF-16 code unit.
   > *Note*: Although `char` has the same representation as `ushort`, not all operations permitted on one type are permitted on the other. *end note*
 
 All signed integral types are represented using two’s complement format.

From 39725dddbaea12e888dc5e12f6bbd57186d9da11 Mon Sep 17 00:00:00 2001
From: Rex Jaeschke
Date: Sat, 4 May 2024 11:37:02 -0400
Subject: [PATCH 3/7] tweak Unicode-related text

---
 standard/expressions.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/standard/expressions.md b/standard/expressions.md
index d4502735c..3a1ad5661 100644
--- a/standard/expressions.md
+++ b/standard/expressions.md
@@ -1334,7 +1334,7 @@ An *interpolated_string_expression* consists of `$`, `$@`, or `@$`, immediately
 
 Interpolated string expressions have two forms; regular (*interpolated_regular_string_expression*)
 and verbatim (*interpolated_verbatim_string_expression*); which are lexically similar to, but differ semantically from, the two forms of string
-literals ([§6.4.5.6](lexical-structure.md#6456-string-literals)).
+literals ([§6.4.5.6](lexical-structure.md#6456-string-literals)). Both forms use UTF-16 encoding.
 
 ```ANTLR
 interpolated_string_expression

From 7f4aa4ba76aa87cda7dd23b4ac0f7705d2843ef8 Mon Sep 17 00:00:00 2001
From: Rex Jaeschke
Date: Sat, 4 May 2024 11:41:17 -0400
Subject: [PATCH 4/7] tweak Unicode-related text

---
 standard/portability-issues.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/standard/portability-issues.md b/standard/portability-issues.md
index 44a630cd1..7718536cd 100644
--- a/standard/portability-issues.md
+++ b/standard/portability-issues.md
@@ -28,6 +28,7 @@ A conforming implementation is required to document its choice of behavior in ea
 1. The maximum value allowed for `Decimal_Digit+` in `PP_Line_Indicator` ([§6.5.8](lexical-structure.md#658-line-directives)).
 1. The interpretation of the *input_characters* in the *pp_pragma-text* of a #pragma directive ([§6.5.9](lexical-structure.md#659-pragma-directives)).
 1. The values of any application parameters passed to `Main` by the host environment prior to application startup ([§7.1](basic-concepts.md#71-application-startup)).
+1. The endianness of UTF-16 code units in a UTF-16-encoded string literal or an instance of the class `string` ([§8.2.5](types.md#825-the-string-type)).
 1. The precise structure of the expression tree, as well as the exact process for creating it, when an anonymous function is converted to an expression-tree ([§10.7.3](conversions.md#1073-evaluation-of-lambda-expression-conversions-to-expression-tree-types)).
 1. The value returned when a stack allocation of size zero is made ([§12.8.21](expressions.md#12821-stack-allocation)).
 1. Whether a `System.ArithmeticException` (or a subclass thereof) is thrown or the overflow goes unreported with the resulting value being that of the left operand, when in an `unchecked` context and the left operand of an integer division is the maximum negative `int` or `long` value and the right operand is `–1` ([§12.10.3](expressions.md#12103-division-operator)).

From c2674d6470fbe631d9ed6b3a224cf3d87ef61c0c Mon Sep 17 00:00:00 2001
From: Rex Jaeschke
Date: Mon, 24 Jun 2024 16:25:11 -0400
Subject: [PATCH 5/7] TweakUnicodeStuff

From a608dec6714ec662f9a075687848d4e92fbf689a Mon Sep 17 00:00:00 2001
From: Rex Jaeschke
Date: Wed, 20 Nov 2024 07:46:55 -0500
Subject: [PATCH 6/7] remove mention of endianness

---
 standard/types.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/standard/types.md b/standard/types.md
index 897000533..e35328414 100644
--- a/standard/types.md
+++ b/standard/types.md
@@ -107,7 +107,7 @@ The `dynamic` type is further described in [§8.7](types.md#87-the-dynamic-type)
 
 ### 8.2.5 The string type
 
-The `string` type is a sealed class type that inherits directly from `object`. Instances of the `string` class represent a sequence of UTF-16 code units, whose endianness is implementation-defined.
+The `string` type is a sealed class type that inherits directly from `object`. Instances of the `string` class represent a sequence of UTF-16 code units.
 
 Values of the `string` type can be written as string literals ([§6.4.5.6](lexical-structure.md#6456-string-literals)).
 

From 074c17657e1ee44f1c4beb3b85dd127512169d62 Mon Sep 17 00:00:00 2001
From: Rex Jaeschke
Date: Wed, 20 Nov 2024 07:47:42 -0500
Subject: [PATCH 7/7] remove mention of endianness

---
 standard/portability-issues.md | 1 -
 1 file changed, 1 deletion(-)

diff --git a/standard/portability-issues.md b/standard/portability-issues.md
index 7718536cd..44a630cd1 100644
--- a/standard/portability-issues.md
+++ b/standard/portability-issues.md
@@ -28,7 +28,6 @@ A conforming implementation is required to document its choice of behavior in ea
 1. The maximum value allowed for `Decimal_Digit+` in `PP_Line_Indicator` ([§6.5.8](lexical-structure.md#658-line-directives)).
 1. The interpretation of the *input_characters* in the *pp_pragma-text* of a #pragma directive ([§6.5.9](lexical-structure.md#659-pragma-directives)).
 1. The values of any application parameters passed to `Main` by the host environment prior to application startup ([§7.1](basic-concepts.md#71-application-startup)).
-1. The endianness of UTF-16 code units in a UTF-16-encoded string literal or an instance of the class `string` ([§8.2.5](types.md#825-the-string-type)).
 1. The precise structure of the expression tree, as well as the exact process for creating it, when an anonymous function is converted to an expression-tree ([§10.7.3](conversions.md#1073-evaluation-of-lambda-expression-conversions-to-expression-tree-types)).
 1. The value returned when a stack allocation of size zero is made ([§12.8.21](expressions.md#12821-stack-allocation)).
 1. Whether a `System.ArithmeticException` (or a subclass thereof) is thrown or the overflow goes unreported with the resulting value being that of the left operand, when in an `unchecked` context and the left operand of an integer division is the maximum negative `int` or `long` value and the right operand is `–1` ([§12.10.3](expressions.md#12103-division-operator)).
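
Reviewer's illustration (not part of the patch series): the UTF-16 wording in the hunks for §6.4.2, §6.4.5.5, and §8.2.5 can be checked with a small C# sketch. It assumes a compiler that supports top-level statements (C# 9 or later). The variable names are hypothetical and chosen only for this example; none of them come from the standard's text.

```csharp
using System;

// Illustrative names only; nothing here is taken from the patch itself.
string a1 = "\u0061";      // code point U+0061 written with a 4-digit escape
string a2 = "\U00000061";  // the same code point written with an 8-digit escape
string a3 = "\x61";        // hexadecimal escape sequence for the same code unit
Console.WriteLine(a1 == "a" && a2 == "a" && a3 == "a");

// U+1F600 lies above U+FFFF, so the string holds two UTF-16 surrogate code units.
string grin = "\U0001F600";
Console.WriteLine(grin.Length);
Console.WriteLine(char.IsHighSurrogate(grin[0]) && char.IsLowSurrogate(grin[1]));
Console.WriteLine(((int)grin[0]).ToString("X4") + " " + ((int)grin[1]).ToString("X4"));
Console.WriteLine(char.ConvertToUtf32(grin, 0).ToString("X"));

// Escapes are translated only once: \u005C becomes a backslash, and the
// "u005C" that follows stays as ordinary text.
string once = "\u005Cu005C";
Console.WriteLine(once == @"\u005C");

// A character literal holds one UTF-16 code unit, so the line below is left
// commented out; per §6.4.5.5 a value above U+FFFF is a compile-time error.
// char c = '\U0001F600';
```

The surrogate values follow from the UTF-16 encoding form: subtracting `0x10000` from `0x1F600` leaves `0xF600`, whose upper ten bits give `0xD83D` and whose lower ten bits give `0xDE00`.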