Commit 5afb503

Merge pull request #1459 from mattheww/2024-01_input_format
Input format
2 parents 5440070 + 8ba3c49 commit 5afb503

4 files changed

+90
-65
lines changed

src/comments.md

+4-3
@@ -30,7 +30,7 @@
3030
>    | INNER_BLOCK_DOC
3131
>
3232
> _IsolatedCR_ :\
33-
>    _A `\r` not followed by a `\n`_
33+
>    \\r
3434
3535
## Non-doc comments
3636

@@ -53,8 +53,9 @@ that follows. That is, they are equivalent to writing `#![doc="..."]` around
5353
the body of the comment. `//!` comments are usually used to document
5454
modules that occupy a source file.
5555

56-
Isolated CRs (`\r`), i.e. not followed by LF (`\n`), are not allowed in doc
57-
comments.
56+
The character `U+000D` (CR) is not allowed in doc comments.
57+
58+
> **Note**: The sequence `U+000D` (CR) immediately followed by `U+000A` (LF) would have been previously transformed into a single `U+000A` (LF).
5859
5960
## Examples
6061

src/crates-and-source-files.md

+3-39
@@ -2,16 +2,9 @@
22

33
> **<sup>Syntax</sup>**\
44
> _Crate_ :\
5-
> &nbsp;&nbsp; UTF8BOM<sup>?</sup>\
6-
> &nbsp;&nbsp; SHEBANG<sup>?</sup>\
75
> &nbsp;&nbsp; [_InnerAttribute_]<sup>\*</sup>\
86
> &nbsp;&nbsp; [_Item_]<sup>\*</sup>
97
10-
> **<sup>Lexer</sup>**\
11-
> UTF8BOM : `\uFEFF`\
12-
> SHEBANG : `#!` \~`\n`<sup>\+</sup>[](#shebang)
13-
14-
158
> Note: Although Rust, like any other language, can be implemented by an
169
> interpreter as well as a compiler, the only existing implementation is a
1710
> compiler, and the language has always been designed to be compiled. For these
@@ -53,6 +46,8 @@ that apply to the containing module, most of which influence the behavior of
5346
the compiler. The anonymous crate module can have additional attributes that
5447
apply to the crate as a whole.
5548

49+
> **Note**: The file's contents may be preceded by a [shebang].
50+
5651
```rust
5752
// Specify the crate name.
5853
#![crate_name = "projx"]
@@ -65,34 +60,6 @@ apply to the crate as a whole.
6560
#![warn(non_camel_case_types)]
6661
```
6762

68-
## Byte order mark
69-
70-
The optional [_UTF8 byte order mark_] (UTF8BOM production) indicates that the
71-
file is encoded in UTF8. It can only occur at the beginning of the file and
72-
is ignored by the compiler.
73-
74-
## Shebang
75-
76-
A source file can have a [_shebang_] (SHEBANG production), which indicates
77-
to the operating system what program to use to execute this file. It serves
78-
essentially to treat the source file as an executable script. The shebang
79-
can only occur at the beginning of the file (but after the optional
80-
_UTF8BOM_). It is ignored by the compiler. For example:
81-
82-
<!-- ignore: tests don't like shebang -->
83-
```rust,ignore
84-
#!/usr/bin/env rustx
85-
86-
fn main() {
87-
println!("Hello!");
88-
}
89-
```
90-
91-
A restriction is imposed on the shebang syntax to avoid confusion with an
92-
[attribute]. The `#!` characters must not be followed by a `[` token, ignoring
93-
intervening [comments] or [whitespace]. If this restriction fails, then it is
94-
not treated as a shebang, but instead as the start of an attribute.
95-
9663
## Preludes and `no_std`
9764

9865
This section has been moved to the [Preludes chapter](names/preludes.md).
@@ -161,20 +128,17 @@ or `_` (U+005F) characters.
161128
[_InnerAttribute_]: attributes.md
162129
[_Item_]: items.md
163130
[_MetaNameValueStr_]: attributes.md#meta-item-attribute-syntax
164-
[_shebang_]: https://en.wikipedia.org/wiki/Shebang_(Unix)
165-
[_utf8 byte order mark_]: https://en.wikipedia.org/wiki/Byte_order_mark#UTF-8
166131
[`ExitCode`]: ../std/process/struct.ExitCode.html
167132
[`Infallible`]: ../std/convert/enum.Infallible.html
168133
[`Termination`]: ../std/process/trait.Termination.html
169134
[attribute]: attributes.md
170135
[attributes]: attributes.md
171-
[comments]: comments.md
172136
[function]: items/functions.md
173137
[module]: items/modules.md
174138
[module path]: paths.md
139+
[shebang]: input-format.md#shebang-removal
175140
[trait or lifetime bounds]: trait-bounds.md
176141
[where clauses]: items/generics.md#where-clauses
177-
[whitespace]: whitespace.md
178142

179143
<script>
180144
(function() {

src/input-format.md

+53-1
@@ -1,3 +1,55 @@
11
# Input format
22

3-
Rust input is interpreted as a sequence of Unicode code points encoded in UTF-8.
3+
This chapter describes how a source file is interpreted as a sequence of tokens.
4+
5+
See [Crates and source files] for a description of how programs are organised into files.
6+
7+
## Source encoding
8+
9+
Each source file is interpreted as a sequence of Unicode characters encoded in UTF-8.
10+
It is an error if the file is not valid UTF-8.
11+
12+
## Byte order mark removal
13+
14+
If the first character in the sequence is `U+FEFF` ([BYTE ORDER MARK]), it is removed.
15+
16+
## CRLF normalization
17+
18+
Each pair of characters `U+000D` (CR) immediately followed by `U+000A` (LF) is replaced by a single `U+000A` (LF).
19+
20+
Other occurrences of the character `U+000D` (CR) are left in place (they are treated as [whitespace]).
21+
22+
## Shebang removal
23+
24+
If the remaining sequence begins with the characters `#!`, the characters up to and including the first `U+000A` (LF) are removed from the sequence.
25+
26+
For example, the first line of the following file would be ignored:
27+
28+
<!-- ignore: tests don't like shebang -->
29+
```rust,ignore
30+
#!/usr/bin/env rustx
31+
32+
fn main() {
33+
println!("Hello!");
34+
}
35+
```
36+
37+
As an exception, if the `#!` characters are followed (ignoring intervening [comments] or [whitespace]) by a `[` token, nothing is removed.
38+
This prevents an [inner attribute] at the start of a source file being removed.
39+
40+
> **Note**: The standard library [`include!`] macro applies byte order mark removal, CRLF normalization, and shebang removal to the file it reads. The [`include_str!`] and [`include_bytes!`] macros do not.
41+
42+
## Tokenization
43+
44+
The resulting sequence of characters is then converted into tokens as described in the remainder of this chapter.
45+
46+
47+
[`include!`]: ../std/macro.include.md
48+
[`include_bytes!`]: ../std/macro.include_bytes.md
49+
[`include_str!`]: ../std/macro.include_str.md
50+
[inner attribute]: attributes.md
51+
[BYTE ORDER MARK]: https://en.wikipedia.org/wiki/Byte_order_mark#UTF-8
52+
[comments]: comments.md
53+
[Crates and source files]: crates-and-source-files.md
54+
[_shebang_]: https://en.wikipedia.org/wiki/Shebang_(Unix)
55+
[whitespace]: whitespace.md
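
Reviewer's sketch (not part of the commit): the steps that the new `src/input-format.md` describes (source encoding, byte order mark removal, CRLF normalization, shebang removal) could be modelled roughly as below. The function names `preprocess` and `strip_shebang` are hypothetical and are not taken from the compiler. Two simplifications are assumed: the check for a following `[` skips whitespace only (not comments) and uses `str::trim_start` as an approximation of the language's whitespace definition, and a shebang line with no terminating LF is removed entirely, a case the text above does not spell out.

```rust
/// Hypothetical model of the character-level steps, applied in the order
/// the chapter lists them. Returns `None` if the bytes are not valid UTF-8.
fn preprocess(bytes: &[u8]) -> Option<String> {
    // Source encoding: it is an error if the file is not valid UTF-8.
    let text = std::str::from_utf8(bytes).ok()?;

    // Byte order mark removal: drop U+FEFF only if it is the first character.
    let text = text.strip_prefix('\u{FEFF}').unwrap_or(text);

    // CRLF normalization: each CR LF pair becomes a single LF.
    // Any other CR is left in place.
    let text = text.replace("\r\n", "\n");

    // Shebang removal runs on what remains.
    Some(strip_shebang(&text).to_string())
}

/// Hypothetical model of shebang removal, including the `#![` exception.
fn strip_shebang(text: &str) -> &str {
    let Some(rest) = text.strip_prefix("#!") else {
        return text;
    };
    // Exception: `#!` followed by `[` is the start of an inner attribute.
    // Simplification: only whitespace is skipped here, not comments.
    if rest.trim_start().starts_with('[') {
        return text;
    }
    match text.find('\n') {
        // Remove everything up to and including the first LF.
        Some(newline) => &text[newline + 1..],
        // Assumption: with no LF at all, the whole input is the shebang line.
        None => "",
    }
}
```

With this model, a file that starts with a BOM, then `#!/usr/bin/env rustx` and a CRLF line ending, is reduced to the text beginning at `fn main`, while a file that starts with `#![allow(unused)]` is left unchanged.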

src/tokens.md

+30-22
@@ -37,6 +37,8 @@ Literals are tokens used in [literal expressions].
3737

3838
[^nsets]: The number of `#`s on each side of the same literal must be equivalent.
3939

40+
> **Note**: Character and string literal tokens never include the sequence of `U+000D` (CR) immediately followed by `U+000A` (LF): this pair would have been previously transformed into a single `U+000A` (LF).
41+
4042
#### ASCII escapes
4143

4244
| | Name |
@@ -156,13 +158,10 @@ A _string literal_ is a sequence of any Unicode characters enclosed within two
156158
`U+0022` (double-quote) characters, with the exception of `U+0022` itself,
157159
which must be _escaped_ by a preceding `U+005C` character (`\`).
158160

159-
Line-breaks are allowed in string literals.
160-
A line-break is either a newline (`U+000A`) or a pair of carriage return and newline (`U+000D`, `U+000A`).
161-
Both byte sequences are translated to `U+000A`.
162-
161+
Line-breaks, represented by the character `U+000A` (LF), are allowed in string literals.
163162
When an unescaped `U+005C` character (`\`) occurs immediately before a line break, the line break does not appear in the string represented by the token.
164163
See [String continuation escapes] for details.
165-
164+
The character `U+000D` (CR) may not appear in a string literal other than as part of such a string continuation escape.
166165

167166
#### Character escapes
168167

@@ -198,10 +197,10 @@ following forms:
198197
199198
Raw string literals do not process any escapes. They start with the character
200199
`U+0072` (`r`), followed by fewer than 256 of the character `U+0023` (`#`) and a
201-
`U+0022` (double-quote) character. The _raw string body_ can contain any sequence
202-
of Unicode characters and is terminated only by another `U+0022` (double-quote)
203-
character, followed by the same number of `U+0023` (`#`) characters that preceded
204-
the opening `U+0022` (double-quote) character.
200+
`U+0022` (double-quote) character.
201+
202+
The _raw string body_ can contain any sequence of Unicode characters other than `U+000D` (CR).
203+
It is terminated only by another `U+0022` (double-quote) character, followed by the same number of `U+0023` (`#`) characters that preceded the opening `U+0022` (double-quote) character.
205204

206205
All Unicode characters contained in the raw string body represent themselves,
207206
the characters `U+0022` (double-quote) (except when followed by at least as
@@ -259,6 +258,11 @@ the literal, it must be _escaped_ by a preceding `U+005C` (`\`) character.
259258
Alternatively, a byte string literal can be a _raw byte string literal_, defined
260259
below.
261260

261+
Line-breaks, represented by the character `U+000A` (LF), are allowed in byte string literals.
262+
When an unescaped `U+005C` character (`\`) occurs immediately before a line break, the line break does not appear in the string represented by the token.
263+
See [String continuation escapes] for details.
264+
The character `U+000D` (CR) may not appear in a byte string literal other than as part of such a string continuation escape.
265+
262266
Some additional _escapes_ are available in either byte or non-raw byte string
263267
literals. An escape starts with a `U+005C` (`\`) and continues with one of the
264268
following forms:
@@ -281,19 +285,19 @@ following forms:
281285
> &nbsp;&nbsp; `br` RAW_BYTE_STRING_CONTENT SUFFIX<sup>?</sup>
282286
>
283287
> RAW_BYTE_STRING_CONTENT :\
284-
> &nbsp;&nbsp; &nbsp;&nbsp; `"` ASCII<sup>* (non-greedy)</sup> `"`\
288+
> &nbsp;&nbsp; &nbsp;&nbsp; `"` ASCII_FOR_RAW<sup>* (non-greedy)</sup> `"`\
285289
> &nbsp;&nbsp; | `#` RAW_BYTE_STRING_CONTENT `#`
286290
>
287-
> ASCII :\
288-
> &nbsp;&nbsp; _any ASCII (i.e. 0x00 to 0x7F)_
291+
> ASCII_FOR_RAW :\
292+
> &nbsp;&nbsp; _any ASCII (i.e. 0x00 to 0x7F) except IsolatedCR_
289293
290294
Raw byte string literals do not process any escapes. They start with the
291295
character `U+0062` (`b`), followed by `U+0072` (`r`), followed by fewer than 256
292-
of the character `U+0023` (`#`), and a `U+0022` (double-quote) character. The
293-
_raw string body_ can contain any sequence of ASCII characters and is terminated
294-
only by another `U+0022` (double-quote) character, followed by the same number of
295-
`U+0023` (`#`) characters that preceded the opening `U+0022` (double-quote)
296-
character. A raw byte string literal can not contain any non-ASCII byte.
296+
of the character `U+0023` (`#`), and a `U+0022` (double-quote) character.
297+
298+
The _raw string body_ can contain any sequence of ASCII characters other than `U+000D` (CR).
299+
It is terminated only by another `U+0022` (double-quote) character, followed by the same number of `U+0023` (`#`) characters that preceded the opening `U+0022` (double-quote) character.
300+
A raw byte string literal can not contain any non-ASCII byte.
297301

298302
All characters contained in the raw string body represent their ASCII encoding,
299303
the characters `U+0022` (double-quote) (except when followed by at least as
@@ -339,6 +343,11 @@ C strings are implicitly terminated by byte `0x00`, so the C string literal
339343
literal `b"\x00"`. Other than the implicit terminator, byte `0x00` is not
340344
permitted within a C string.
341345

346+
Line-breaks, represented by the character `U+000A` (LF), are allowed in C string literals.
347+
When an unescaped `U+005C` character (`\`) occurs immediately before a line break, the line break does not appear in the string represented by the token.
348+
See [String continuation escapes] for details.
349+
The character `U+000D` (CR) may not appear in a C string literal other than as part of such a string continuation escape.
350+
342351
Some additional _escapes_ are available in non-raw C string literals. An escape
343352
starts with a `U+005C` (`\`) and continues with one of the following forms:
344353

@@ -381,11 +390,10 @@ c"\xC3\xA6";
381390
382391
Raw C string literals do not process any escapes. They start with the
383392
character `U+0063` (`c`), followed by `U+0072` (`r`), followed by fewer than 256
384-
of the character `U+0023` (`#`), and a `U+0022` (double-quote) character. The
385-
_raw C string body_ can contain any sequence of Unicode characters (other than
386-
`U+0000`) and is terminated only by another `U+0022` (double-quote) character,
387-
followed by the same number of `U+0023` (`#`) characters that preceded the
388-
opening `U+0022` (double-quote) character.
393+
of the character `U+0023` (`#`), and a `U+0022` (double-quote) character.
394+
395+
The _raw C string body_ can contain any sequence of Unicode characters other than `U+0000` (NUL) and `U+000D` (CR).
396+
It is terminated only by another `U+0022` (double-quote) character, followed by the same number of `U+0023` (`#`) characters that preceded the opening `U+0022` (double-quote) character.
389397

390398
All characters contained in the raw C string body represent themselves in UTF-8
391399
encoding. The characters `U+0022` (double-quote) (except when followed by at
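
Reviewer's sketch (not part of the commit): the effect of the `src/tokens.md` wording on string literals can be illustrated as below. The function names are hypothetical. Because CRLF normalization happens before tokenization, the results should be the same whether this source is saved with LF or CRLF line endings.

```rust
/// A line break inside a string literal is the single character U+000A (LF).
fn with_line_break() -> &'static str {
    "first
second"
}

/// An unescaped backslash immediately before the line break starts a string
/// continuation escape: the line break (and the whitespace that follows it)
/// does not appear in the resulting string.
fn with_continuation() -> &'static str {
    "first\
     second"
}

/// Raw string literals process no escapes, but a line break in the body
/// still represents itself as U+000A (LF).
fn raw_with_line_break() -> &'static str {
    r#"first
second"#
}
```

Here `with_line_break()` and `raw_with_line_break()` both yield `"first\nsecond"`, and `with_continuation()` yields `"firstsecond"`.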
