Rust input is interpreted as a sequence of Unicode code points encoded in UTF-8.
+This chapter describes how a source file is interpreted as a sequence of tokens.
+
+See [Crates and source files] for a description of how programs are organised into files.
+
+## Source encoding
+
+Each source file is interpreted as a sequence of Unicode characters encoded in UTF-8.
+It is an error if the file is not valid UTF-8.
+
+## Byte order mark removal
+
+If the first character in the sequence is `U+FEFF` ([BYTE ORDER MARK]), it is removed.
+
+## CRLF normalization
+
+Each pair of characters `U+000D` (CR) immediately followed by `U+000A` (LF) is replaced by a single `U+000A` (LF).
+
+Other occurrences of the character `U+000D` (CR) are left in place (they are treated as [whitespace]).
+
+## Shebang removal
+
+If the remaining sequence begins with the characters `#!`, the characters up to and including the first `U+000A` (LF) are removed from the sequence.
+
+For example, the first line of the following file would be ignored:
+
+<!-- ignore: tests don't like shebang -->
+```rust,ignore
+#!/usr/bin/env rustx
+
+fn main() {
+    println!("Hello!");
+}
+```
+
+As an exception, if the `#!` characters are followed (ignoring intervening [comments] or [whitespace]) by a `[` token, nothing is removed.
+This prevents an [inner attribute] at the start of a source file being removed.
+
+> **Note**: The standard library [`include!`] macro applies byte order mark removal, CRLF normalization, and shebang removal to the file it reads. The [`include_str!`] and [`include_bytes!`] macros do not.
+
+## Tokenization
+
+The resulting sequence of characters is then converted into tokens as described in the remainder of this chapter.
+
+
+[`include!`]: ../std/macro.include.md
+[`include_bytes!`]: ../std/macro.include_bytes.md
+[`include_str!`]: ../std/macro.include_str.md
+[inner attribute]: attributes.md
+[BYTE ORDER MARK]: https://en.wikipedia.org/wiki/Byte_order_mark#UTF-8
+[comments]: comments.md
+[Crates and source files]: crates-and-source-files.md
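
As an illustration of the input-format rules added above, here is a minimal sketch of byte order mark removal, CRLF normalization, and shebang removal applied in that order. The names `preprocess` and `looks_like_inner_attribute` are hypothetical. The input is assumed to be valid UTF-8 already. The attribute check skips only whitespace, not comments, and uses Rust's `trim_start` as an approximation of the language's whitespace definition. Removing the whole remainder when a shebang line has no terminating LF is also an assumption of this sketch, not something the text specifies.

```rust
/// Hypothetical helper: applies the three preprocessing steps to source text
/// that is already known to be valid UTF-8. Illustrative only; this is not
/// the compiler's implementation.
fn preprocess(source: &str) -> String {
    // Byte order mark removal: drop a single leading U+FEFF if present.
    let source = source.strip_prefix('\u{FEFF}').unwrap_or(source);

    // CRLF normalization: every CR immediately followed by LF becomes one LF.
    // A CR that is not followed by LF is left in place.
    let normalized = source.replace("\r\n", "\n");

    // Shebang removal: if the text starts with `#!` and is not the start of
    // an inner attribute, drop everything up to and including the first LF.
    if normalized.starts_with("#!") && !looks_like_inner_attribute(&normalized[2..]) {
        match normalized.find('\n') {
            Some(pos) => normalized[pos + 1..].to_string(),
            // Assumption of this sketch: with no LF, the whole text is removed.
            None => String::new(),
        }
    } else {
        normalized
    }
}

/// Simplified check (an assumption of this sketch): skips whitespace only,
/// whereas the rule in the text also skips intervening comments.
fn looks_like_inner_attribute(after_hash_bang: &str) -> bool {
    after_hash_bang.trim_start().starts_with('[')
}
```

With this sketch, `preprocess("#!/usr/bin/env rustx\nfn main() {}\n")` drops the first line, while input beginning with `#![allow(unused)]` is returned unchanged.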
src/tokens.md (+30 -22)
@@ -37,6 +37,8 @@ Literals are tokens used in [literal expressions].

 [^nsets]: The number of `#`s on each side of the same literal must be equivalent.

+> **Note**: Character and string literal tokens never include the sequence of `U+000D` (CR) immediately followed by `U+000A` (LF): this pair would have been previously transformed into a single `U+000A` (LF).
+
 #### ASCII escapes

 |   | Name |
@@ -156,13 +158,10 @@ A _string literal_ is a sequence of any Unicode characters enclosed within two
 `U+0022` (double-quote) characters, with the exception of `U+0022` itself,
 which must be _escaped_ by a preceding `U+005C` character (`\`).

-Line-breaks are allowed in string literals.
-A line-break is either a newline (`U+000A`) or a pair of carriage return and newline (`U+000D`, `U+000A`).
-Both byte sequences are translated to `U+000A`.
-
+Line-breaks, represented by the character `U+000A` (LF), are allowed in string literals.
 When an unescaped `U+005C` character (`\`) occurs immediately before a line break, the line break does not appear in the string represented by the token.
 See [String continuation escapes] for details.
-
+The character `U+000D` (CR) may not appear in a string literal other than as part of such a string continuation escape.

 #### Character escapes

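
To make the line-break rules in this hunk concrete, the following sketch contrasts a string literal containing a literal line break with one that uses a string continuation escape. The function names are hypothetical and exist only for illustration. The second literal also relies on the continuation escape skipping the whitespace that follows the line break, which is described in the String continuation escapes section rather than in this hunk.

```rust
/// A string literal containing a literal line break. The line break is
/// represented in the string by a single U+000A (LF).
fn with_line_break() -> &'static str {
    "first
second"
}

/// A string literal with an unescaped backslash immediately before the line
/// break: neither the line break nor the indentation that follows it appears
/// in the resulting string.
fn with_continuation() -> &'static str {
    "first\
     second"
}
```

Comparing the two results shows the difference: the first string contains an LF between the words, the second contains nothing between them.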
@@ -198,10 +197,10 @@ following forms:

 Raw string literals do not process any escapes. They start with the character
 `U+0072` (`r`), followed by fewer than 256 of the character `U+0023` (`#`) and a
-`U+0022` (double-quote) character. The _raw string body_ can contain any sequence
-of Unicode characters and is terminated only by another `U+0022` (double-quote)
-character, followed by the same number of `U+0023` (`#`) characters that preceded
-the opening `U+0022` (double-quote) character.
+`U+0022` (double-quote) character.
+
+The _raw string body_ can contain any sequence of Unicode characters other than `U+000D` (CR).
+It is terminated only by another `U+0022` (double-quote) character, followed by the same number of `U+0023` (`#`) characters that preceded the opening `U+0022` (double-quote) character.

 All Unicode characters contained in the raw string body represent themselves,
 the characters `U+0022` (double-quote) (except when followed by at least as
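
The termination rule for raw string literals can be sketched with a few literals that use different numbers of `#` characters. The function names below are hypothetical.

```rust
/// No `#` characters: the backslash is not an escape, so the body is the
/// four characters `a`, `\`, `n`, `b`.
fn no_hashes() -> &'static str {
    r"a\nb"
}

/// One `#` on each side: a double-quote inside the body does not end the
/// literal unless it is followed by one `#`.
fn one_hash() -> &'static str {
    r#"say "hi""#
}

/// Two `#` on each side: the sequence `"#` inside the body does not end the
/// literal, because only one `#` follows that double-quote.
fn two_hashes() -> &'static str {
    r##"contains "# inside"##
}
```

Choosing more `#` characters than the longest `"#...` run in the body is what lets the body contain such sequences verbatim.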
@@ -259,6 +258,11 @@ the literal, it must be _escaped_ by a preceding `U+005C` (`\`) character.
 Alternatively, a byte string literal can be a _raw byte string literal_, defined
 below.

+Line-breaks, represented by the character `U+000A` (LF), are allowed in byte string literals.
+When an unescaped `U+005C` character (`\`) occurs immediately before a line break, the line break does not appear in the string represented by the token.
+See [String continuation escapes] for details.
+The character `U+000D` (CR) may not appear in a byte string literal other than as part of such a string continuation escape.
+
 Some additional _escapes_ are available in either byte or non-raw byte string
 literals. An escape starts with a `U+005C` (`\`) and continues with one of the
 > _any ASCII (i.e. 0x00 to 0x7F) except IsolatedCR_

 Raw byte string literals do not process any escapes. They start with the
 character `U+0062` (`b`), followed by `U+0072` (`r`), followed by fewer than 256
-of the character `U+0023` (`#`), and a `U+0022` (double-quote) character. The
-_raw string body_ can contain any sequence of ASCII characters and is terminated
-only by another `U+0022` (double-quote) character, followed by the same number of
-`U+0023` (`#`) characters that preceded the opening `U+0022` (double-quote)
-character. A raw byte string literal can not contain any non-ASCII byte.
+of the character `U+0023` (`#`), and a `U+0022` (double-quote) character.
+
+The _raw string body_ can contain any sequence of ASCII characters other than `U+000D` (CR).
+It is terminated only by another `U+0022` (double-quote) character, followed by the same number of `U+0023` (`#`) characters that preceded the opening `U+0022` (double-quote) character.
+A raw byte string literal can not contain any non-ASCII byte.

 All characters contained in the raw string body represent their ASCII encoding,
 the characters `U+0022` (double-quote) (except when followed by at least as
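
The difference between byte string literals and raw byte string literals can be sketched as follows. The function names are hypothetical, and the byte values in the comments are the ASCII encodings of the characters involved.

```rust
/// Non-raw byte string: `\n` is processed as an escape.
/// Bytes: 104 (`h`), 105 (`i`), 10 (LF).
fn escaped_bytes() -> &'static [u8] {
    b"hi\n"
}

/// Raw byte string: no escapes are processed, so the backslash and `n`
/// each represent their own ASCII encoding.
/// Bytes: 104 (`h`), 105 (`i`), 92 (`\`), 110 (`n`).
fn raw_bytes() -> &'static [u8] {
    br"hi\n"
}

/// Raw byte string with one `#` on each side, so the body may contain a
/// double-quote. Bytes: 97 (`a`), 34 (`"`), 98 (`b`).
fn raw_bytes_with_quote() -> &'static [u8] {
    br#"a"b"#
}
```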
@@ -339,6 +343,11 @@ C strings are implicitly terminated by byte `0x00`, so the C string literal
 literal `b"\x00"`. Other than the implicit terminator, byte `0x00` is not
 permitted within a C string.

+Line-breaks, represented by the character `U+000A` (LF), are allowed in C string literals.
+When an unescaped `U+005C` character (`\`) occurs immediately before a line break, the line break does not appear in the string represented by the token.
+See [String continuation escapes] for details.
+The character `U+000D` (CR) may not appear in a C string literal other than as part of such a string continuation escape.
+
 Some additional _escapes_ are available in non-raw C string literals. An escape
 starts with a `U+005C` (`\`) and continues with one of the following forms:

@@ -381,11 +390,10 @@ c"\xC3\xA6";

 Raw C string literals do not process any escapes. They start with the
 character `U+0063` (`c`), followed by `U+0072` (`r`), followed by fewer than 256
-of the character `U+0023` (`#`), and a `U+0022` (double-quote) character. The
-_raw C string body_ can contain any sequence of Unicode characters (other than
-`U+0000`) and is terminated only by another `U+0022` (double-quote) character,
-followed by the same number of `U+0023` (`#`) characters that preceded the
-opening `U+0022` (double-quote) character.
+of the character `U+0023` (`#`), and a `U+0022` (double-quote) character.
+
+The _raw C string body_ can contain any sequence of Unicode characters other than `U+0000` (NUL) and `U+000D` (CR).
+It is terminated only by another `U+0022` (double-quote) character, followed by the same number of `U+0023` (`#`) characters that preceded the opening `U+0022` (double-quote) character.

 All characters contained in the raw C string body represent themselves in UTF-8
 encoding. The characters `U+0022` (double-quote) (except when followed by at