|
| 1 | +# Parser generator |
| 2 | + |
| 3 | +Other parser generators for the web were not written here and, more importantly, they lack a set of features we really need: |
| 4 | + |
| 5 | +- **Type-safety**: the API of the generated parser should be typed without `any`. |
| 6 | +- **AST from grammar**: converting untyped trees to an AST by hand is unsafe and boring. |
| 7 | +- <sup>TBD</sup> **CST**: a pretty-printer has to keep comments `/**/`, underscores in numbers `1_234`, and other details that are not represented anywhere in the AST. |
| 8 | +- **Named lexemes**: good error messages shouldn't report an identifier as "a-z, A-Z, 0-9, or _". |
| 9 | +- <sup>TBD</sup> **Error recovery**: a parser for a programming language should report more than one error at a time. |
| 10 | +- <sup>TBD</sup> **Incremental**: reparsing shouldn't take time proportional to the size of the file. |
| 11 | +- **High-order rules `A<B>`**: duplicated code increases the chance of making a mistake, and high-order rules are required to avoid duplication. |
| 12 | +- <sup>TBD</sup> **No stack overflow on large expressions**: deeply nested constructs might otherwise lead to a stack overflow. |
| 13 | +- **Space skipping**: manually annotating a grammar with spaces is error-prone and boring. |
| 14 | + |
| 15 | +## Comparison to peggy |
| 16 | + |
| 17 | +`pgen` mostly follows the grammar of [peggy](https://peggyjs.org/documentation.html#grammar-syntax-and-semantics), with a few notable differences: |
| 18 | + |
| 19 | +- Capitalized rules `Foo = ...` create AST nodes with `{ $: 'Foo' }`. |
| 20 | +- Rules have to end with semicolon `;`. |
| 21 | +- Inline semantic actions `{ return 42; }` are not supported. We can't infer the types of the AST when it is built by inlined JavaScript code, because JS is untyped. |
| 22 | +- High-order rules `A<B> = ...` were added. |
| 23 | +- Space skipping was added. It uses the `space` rule. |
| 24 | +- Lexification operator `#` was added. |
| 25 | +- Character classes do not support modifiers `[a-z]i`. |
| 26 | + |
| 27 | +## Syntax reference |
| 28 | + |
| 29 | +- Non-AST rule definition `rule = ...;` |
| 30 | +- AST rule definition `Rule = ...;`. Returns an object `{ $: 'Rule', loc: Loc }`, with the rest of the fields defined by named clauses on the right-hand side. |
| 31 | +- Display name override for error messages `Id "identifier" = ...;` |
| 32 | +- High-order rule definition `inter<A, B> = ...;` and call `inter<expression, ",">` |
| 33 | +- Left-biased choice `"A" / "B"`. Matches the first clause that succeeds. |
| 34 | +- Sequence `foo bar baz`. All clauses must match, in order. |
| 35 | +- Named clauses `"if" "(" expr:expression ")" stmts:statements`. The sequence operator generates an object, and named clauses become its fields `{ expr: ..., stmts: ... }`. |
| 36 | +- Picked clause `"if" "(" @expression ")"`. The sequence operator returns only the value of the picked clause. |
| 37 | +- Single clause sequence `a = b`. Works as `a = @b`. |
| 38 | +- Negative lookahead `!x`. Fails if `x` matches. Doesn't consume input. |
| 39 | +- Positive lookahead `&x`. Passes if `x` matches. Doesn't consume input. |
| 40 | +- Stringification `$x`. Ignores the AST computed by `x` and returns the string that `x` matched. |
| 41 | +- Lexification `#x`. Does not skip spaces inside of `x`. If `x` calls some other rules, doesn't skip spaces there either. |
| 42 | +- Repeat `x*`. |
| 43 | +- Repeat at least once `x+`. |
| 44 | +- Optional `x?`. |
| 45 | +- String `"abc"`. |
| 46 | +- Character class `[a-z_]`. Supports ranges `a-z`. Supports negation `[^a-z]`. |
| 47 | + |
| 48 | +## Implicit syntax |
| 49 | + |
| 50 | +- Spaces are skipped after every terminal: `"string"`, `[a-z]`. |
| 51 | +- Spaces are skipped after the lexification operator `#x`. |
| 52 | +- Spaces are not skipped inside the lexification operator `#x`. |
| 53 | +- Spaces are skipped once at the start of the input, before any other parsing happens. |
| 54 | +- If the whole input was not consumed, an error is emitted. |
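
The implicit rules above can be illustrated with a small hand-written sketch in TypeScript. This is not the code `pgen` generates: `State`, `Parser`, `skipSpaces`, `str`, `seq`, `lex`, and `parseAll` are hypothetical names chosen for illustration, and treating spaces, tabs, and newlines as the `space` rule is an assumption.

```typescript
// Hypothetical sketch of the implicit space-skipping rules; not pgen's actual output.
type State = { input: string; pos: number; skip: boolean };
type Parser<T> = (s: State) => T | undefined; // undefined means "no match"

// Assumption: the `space` rule matches spaces, tabs and newlines.
const skipSpaces = (s: State): void => {
  while (s.pos < s.input.length && " \t\r\n".includes(s.input[s.pos])) s.pos++;
};

// Terminal: match a literal, then skip spaces unless inside a lexified region.
const str = (text: string): Parser<string> => (s) => {
  if (!s.input.startsWith(text, s.pos)) return undefined;
  s.pos += text.length;
  if (s.skip) skipSpaces(s);
  return text;
};

// Sequence: every clause must match in order; rewind on failure.
const seq = <T>(...ps: Parser<T>[]): Parser<T[]> => (s) => {
  const start = s.pos;
  const out: T[] = [];
  for (const p of ps) {
    const r = p(s);
    if (r === undefined) {
      s.pos = start;
      return undefined;
    }
    out.push(r);
  }
  return out;
};

// Lexification `#x`: no skipping inside `x` (including rules it calls),
// then a single skip after it, unless an outer `#` is still active.
const lex = <T>(p: Parser<T>): Parser<T> => (s) => {
  const outer = s.skip;
  s.skip = false;
  const r = p(s);
  s.skip = outer;
  if (r !== undefined && outer) skipSpaces(s);
  return r;
};

// Entry point: skip leading spaces, then require the whole input to be consumed.
const parseAll = <T>(p: Parser<T>, input: string): T => {
  const s: State = { input, pos: 0, skip: true };
  skipSpaces(s);
  const r = p(s);
  if (r === undefined || s.pos !== input.length) {
    throw new Error(`Parse error at offset ${s.pos}`);
  }
  return r;
};
```

Under these assumptions, `parseAll(seq(str("a"), str("b")), "  a   b  ")` succeeds because spaces are skipped at the start and after each terminal, while `parseAll(lex(seq(str("a"), str("b"))), "a b")` throws because no spaces are skipped inside the lexified sequence.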