Change lexer to treat 'e' after number as suffix unless it is followed by a valid exponent. #79912

richard-uk1 · 2020-12-10T18:55:39Z

This fixes #67544. There will be some regressions in diagnostics that need fixing before merge, but before I sink time into this I want feedback on whether the patch might be accepted.

It will fail CI because some compile-fail messages have changed, but I would still like feedback.

I expect if it does get merged, it will be after considerable discussion.

Also requires tests before merge.

Change lexer to treat 'e' after number as part of a suffix unless it is followed by a valid exponent.

rust-highfive · 2020-12-10T18:55:42Z

r? @lcnr

(rust-highfive has picked a reviewer for you, use r? to override)

lcnr · 2020-12-10T19:03:03Z

r? @petrochenkov maybe, they are more knowledgeable about this.

this also requires t-lang signoff

richard-uk1 · 2020-12-11T13:14:20Z

It would be nice to move the exponent parsing into the parser, so all tokens in proc macros get a number followed by a suffix, but this would be a conflation of concerns.

This isn't really ready for review so I'll close and re-open once it's further along.

TODOs for me

the token `1e+` (`1`, `e`, `+`, <non-digit>) should be the integer `1` with the suffix `e`, a `+` symbol, and then the lexer should start parsing another token.

petrochenkov · 2020-12-11T13:28:59Z

@derekdreery

> It would be nice to move the exponent parsing into the parser

That's pretty much the plan, see the discussion in #71322.
It's also a part of a larger plan on lexer & parser librarification and code sharing with rust-analyzer on which @matklad started working sometime around #76170, but stopped quickly after that. It would be great if someone could pick up that work.

matklad · 2020-12-11T13:45:05Z

Yeah, sorry about overpromising on that front -- life happened. I am intending to pick that work up eventually, no promises as to when this'll actually happen this time :)

richard-uk1 · 2020-12-11T18:02:29Z

@petrochenkov @matklad cool, I will have a look. It would probably be better to work from some existing code, so I can get a feel for how to write stuff in line with the rest of the code.

richard-uk1 · 2020-12-12T11:22:26Z

I'm going to write a little plan here for people to comment on:

The lexer will be unaware of exponential notation. 1e6 will be lexed as {integer} 1 with suffix e6. 1e-2 will be lexed as {integer} 1 with suffix e followed by - followed by int 2. 1.2e3 will lex as {float} 1.2 with suffix e3 and so on.
An early stage in the parser will match and combine valid exponentials (e.g. convert 3e2 into {float} 300, convert {float} 3.2 with suffix e, +, {int} 1 into {float} 32), but leave any invalid exponentials untouched (e.g. leave 3e as {int} 3 with suffix e).
This is what proc macros get - rather than failing at the lexer stage if an e does not lead to a valid exponential, proc macros will get the unparsed tokens. If an e does match a valid exponential, then the proc macro gets a parsed float as is currently the case, to maintain backwards compatibility and to strike a balance between convenience and flexibility (parsing exponents is more convenient but less flexible, in any case we have to do this to be backwards compat).
I believe there is also an issue with a token like 1.2ff - rather than getting the token the proc_macro will get an error about expected f32 or f64. I need to check if this is actually the case, as it's based on a vague memory ATM. I should also see if this kind of thing affects u{8, 16, ...} for integers.
Ensure that diagnostics are the same or improved. Specifically, if a token like 1e, or tokens like [1e, +, something] are found, emit a "did you mean to use an exponent....".

EDIT

TODO: think about pathological cases like 1e+2ee (which should be parsed as {float} 100 with suffix ee), 1e+1e+1 would be {float} 10 with suffix e, +, {integer} 1.

Concern: How to handle 1e + 2e, since this would end up parsed as {float} 100 with suffix e, which is probably not intended. How to handle this correctly? Because whitespace matters, this is a problem that comes from moving the task out of the lexer.
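One possible way to picture the combining stage, and how spacing information could keep `1e + 2` apart from `1e+2`, is the sketch below. Everything in it is hypothetical: the `Tok` type, the `glued` flag, and the `combine_exponents` and `try_exponent` helpers are illustration names, not rustc APIs, and trailing suffixes after the exponent (such as `1e+2ee`) are deliberately left unhandled.

```rust
// Hypothetical token model for illustration only (not rustc's types).
#[derive(Debug, Clone, PartialEq)]
enum Tok {
    /// A number plus its suffix, e.g. `3` with suffix `e2`, or `1` with suffix `e`.
    Num { text: String, suffix: String },
    /// A `+` or `-`; `glued` is true when no whitespace touches it on either side.
    Sign { ch: char, glued: bool },
    /// A combined exponential literal.
    Float(f64),
}

/// True if `s` is a non-empty run of ASCII digits.
fn all_digits(s: &str) -> bool {
    !s.is_empty() && s.bytes().all(|b| b.is_ascii_digit())
}

/// Tries to read one exponential starting at `toks[0]`.
/// Returns the value and how many tokens were consumed.
fn try_exponent(toks: &[Tok]) -> Option<(f64, usize)> {
    let (text, suffix) = match toks.first()? {
        Tok::Num { text, suffix } => (text, suffix),
        _ => return None,
    };
    let exp = suffix.strip_prefix('e')?;
    // Single-token form: `3` with suffix `e2`.
    if all_digits(exp) {
        let value = format!("{}e{}", text, exp).parse::<f64>().ok()?;
        return Some((value, 1));
    }
    if !exp.is_empty() {
        return None; // e.g. suffix `eggs`: leave untouched
    }
    // Three-token form: `1` with suffix `e`, a glued sign, then plain digits.
    match (toks.get(1)?, toks.get(2)?) {
        (Tok::Sign { ch, glued }, Tok::Num { text: digits, suffix: rest })
            if *glued && rest.is_empty() && all_digits(digits) =>
        {
            let value = format!("{}e{}{}", text, ch, digits).parse::<f64>().ok()?;
            Some((value, 3))
        }
        _ => None,
    }
}

/// Combines valid exponentials and leaves everything else as it was.
fn combine_exponents(toks: &[Tok]) -> Vec<Tok> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < toks.len() {
        if let Some((value, used)) = try_exponent(&toks[i..]) {
            out.push(Tok::Float(value));
            i += used;
        } else {
            out.push(toks[i].clone());
            i += 1;
        }
    }
    out
}
```

Under these assumptions, the tokens for `1e+2` collapse into a single float, while the tokens for `1e + 2` (sign not glued) and a lone `3e` pass through unchanged.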

petrochenkov · 2020-12-12T19:31:22Z

With "fine-grained" tokens lexer will not produce float tokens at all, only integer tokens (possibly suffixed) and punctuation.
So 1.2e3 is [int(1), punct(.), int(2e3)] (with spacing information kept).
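As a rough illustration of that token shape, here is a toy lexer that only knows suffixed integers and single-character punctuation. The `Tok` type, the `glued_to_next` field, and `lex_fine_grained` are hypothetical names for this sketch, not the real `rustc_lexer` interface, and anything that is not a digit run or whitespace is simply treated as punctuation.

```rust
// Hypothetical fine-grained token model (illustration only).
#[derive(Debug, PartialEq, Eq)]
enum Tok {
    /// Digits (underscores allowed) plus whatever identifier-like text follows.
    Int { digits: String, suffix: String },
    /// One punctuation character; `glued_to_next` records that no whitespace follows it.
    Punct { ch: char, glued_to_next: bool },
}

fn lex_fine_grained(src: &str) -> Vec<Tok> {
    let chars: Vec<char> = src.chars().collect();
    let mut out = Vec::new();
    let mut i = 0;
    while i < chars.len() {
        let c = chars[i];
        if c.is_whitespace() {
            i += 1;
        } else if c.is_ascii_digit() {
            // Integer part: digits and underscores.
            let start = i;
            while i < chars.len() && (chars[i].is_ascii_digit() || chars[i] == '_') {
                i += 1;
            }
            let digits: String = chars[start..i].iter().collect();
            // Suffix: any trailing alphanumerics or underscores, so `2e3` is `2` + `e3`.
            let suffix_start = i;
            while i < chars.len() && (chars[i].is_ascii_alphanumeric() || chars[i] == '_') {
                i += 1;
            }
            let suffix: String = chars[suffix_start..i].iter().collect();
            out.push(Tok::Int { digits, suffix });
        } else {
            // Everything else is a one-character punct in this toy model.
            let glued_to_next = i + 1 < chars.len() && !chars[i + 1].is_whitespace();
            out.push(Tok::Punct { ch: c, glued_to_next });
            i += 1;
        }
    }
    out
}
```

With this sketch, `1.2e3` comes out as an integer `1`, a `.` glued to what follows, and an integer `2` with suffix `e3`; no float token is ever produced.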

petrochenkov · 2020-12-12T20:07:31Z

I think right now we can change behavior for cases returning rustc_lexer::LiteralKind::Float { empty_exponent: true } from the lexer, because they are currently unconditionally reported as errors, so it will be strictly a language extension rather than a change.

For this, fn is_next_exponent should behave identically to the existing fn eat_float_exponent.
So instead of introducing is_next_exponent, eat_float_exponent could return a three-variant enum instead of bool, something like

enum {
    GoodFloatExponentIndeed, // e123, e+123, e-123
    BadFloatExponent, // e+, e-
    JustASuffixStartingWithE, // e, e___
}

(Note that eat_decimal_digits eats underscores in addition to digits, so we need to be careful with cases with no digits, for example.)
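To make the three cases concrete, here is a standalone sketch of such a classifier over a plain `&str`. It is not the real `eat_float_exponent`, which works on the lexer's cursor; the enum name `ExponentKind` and the function `classify_exponent` are made up for illustration. It follows the note above by skipping underscores but requiring at least one actual digit.

```rust
// Hypothetical name: the enum in the comment above is unnamed.
#[derive(Debug, PartialEq, Eq)]
enum ExponentKind {
    GoodFloatExponentIndeed,  // e123, e+123, e-123
    BadFloatExponent,         // e+, e-
    JustASuffixStartingWithE, // e, e___
}

/// Classifies the text following a number's digits.
/// Returns `None` if the text does not start with `e` or `E`.
fn classify_exponent(rest: &str) -> Option<ExponentKind> {
    let mut chars = rest.chars().peekable();
    match chars.next() {
        Some('e') | Some('E') => {}
        _ => return None,
    }
    // Optional sign directly after the `e`.
    let has_sign = matches!(chars.peek().copied(), Some('+') | Some('-'));
    if has_sign {
        chars.next();
    }
    // Underscores are consumed like digits, but only real digits count.
    let mut saw_digit = false;
    while let Some(c) = chars.peek().copied() {
        match c {
            '_' => {}
            '0'..='9' => saw_digit = true,
            _ => break,
        }
        chars.next();
    }
    Some(if saw_digit {
        ExponentKind::GoodFloatExponentIndeed
    } else if has_sign {
        ExponentKind::BadFloatExponent
    } else {
        ExponentKind::JustASuffixStartingWithE
    })
}
```

In this sketch `e___` and `eggs` both land in `JustASuffixStartingWithE`, `e+` and `e-_` are `BadFloatExponent`, and `e_5` counts as a good exponent because a digit eventually appears.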

richard-uk1 · 2020-12-13T14:51:25Z

@petrochenkov @matklad is there any documentation on how you want the lexer/parser to look after the update? If there is, I could help work towards it.

petrochenkov · 2020-12-27T12:32:01Z

@derekdreery
I think @matklad made a tracking issue for this (or described the plan in one of the PRs/issues), but I can't find it right now.
The minimal changes to e suffix (#79912 (comment)) are not blocked on any big plans though.

richard-uk1 · 2020-12-27T16:18:28Z

@petrochenkov Ok that's what I'll make this PR 😄

petrochenkov · 2021-02-23T10:16:13Z

Closing due to inactivity.

richard-uk1 · 2021-02-23T16:50:51Z

If I pick this up again I'll make a new PR.

…ustc_session, r=<try>

move some invalid exponent detection into rustc_session

This PR moves part of the exponent checks from `rustc_lexer`/`rustc_parser` into `rustc_session`. This change does not affect which programs are accepted by the compiler, or the diagnostics that are reported, with one main exception: floats or ints with suffixes beginning with `e` are rejected *after* the token stream is passed to proc macros, rather than being rejected by the parser as was previously the case.

This gives proc macro authors more consistent access to numeric literals: currently a proc macro could interpret `1m` or `30s` but not `7eggs` or `3em`. After this change all are handled the same. The lexer will still reject input if it contains `e` followed by a number, `+`/`-`, or `_` if they are not followed by a valid integer literal (number + `_`), but this doesn't affect macro authors who just want to access alpha suffixes.

This PR is a continuation of rust-lang#79912. It is also solving exactly the same problem as [rust-lang#111628](rust-lang#111628).

Exponents that contain arbitrarily long underscore suffixes are handled without read-ahead by tracking the exponent start in case of an invalid exponent, so the suffix start is correct. This is very much an edge case (the user would have to write something like `1e_______________23`) but nevertheless it is handled correctly.

Also adds tests for various edge cases and improves diagnostics marginally.

r: `@petrochenkov`, since they reviewed rust-lang#79912.

Change lexer to treat 'e' after number as suffix

3ee510a

rust-highfive assigned lcnr Dec 10, 2020

rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Dec 10, 2020

jonas-schievink added needs-fcp This change is insta-stable, so needs a completed FCP to proceed. T-lang Relevant to the language team, which will review and decide on the PR/issue. labels Dec 10, 2020

rust-highfive assigned petrochenkov and unassigned lcnr Dec 10, 2020

petrochenkov mentioned this pull request Dec 12, 2020

Properly capture trailing 'unglued' token #79978

Merged

richard-uk1 marked this pull request as draft December 27, 2020 16:18

petrochenkov closed this Feb 23, 2021

richard-uk1 mentioned this pull request Oct 13, 2024

lexer: Treat more floats with empty exponent as valid tokens #131656

Open

richard-uk1 deleted the lexer_number_accept_trailing_e branch February 23, 2025 16:45
