| 1 | +--- |
| 2 | +title: Indentation-sensitive languages |
| 3 | +weight: 300 |
| 4 | +--- |
| 5 | + |
| 6 | +Some programming languages (such as Python, Haskell, and YAML) use indentation to denote nesting, as opposed to special non-whitespace tokens (such as `{` and `}` in C++/JavaScript). |
| 7 | +This can be difficult to express in the EBNF notation used for defining a language grammar in Langium, which is context-free. |
| 8 | +To support such languages, you can use synthetic tokens in the grammar and then redefine them in a custom token builder. |
| 9 | + |
| 10 | +Starting with Langium 3.2.0, such a token builder (and an accompanying lexer) is provided and can be plugged into your language. |
| 11 | +They work by modifying the token types generated for your indentation terminals so that they use a custom matcher function instead of a regular expression. The matcher has access to more context, which allows it to store state and detect _changes_ in indentation levels. |
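The idea of a matcher that keeps state can be illustrated without Langium at all. The following is a minimal, self-contained sketch, not Langium's actual implementation: the function name `indentationChanges` is hypothetical, and it assumes that indentation is measured as the number of leading tabs or spaces on each non-blank line.

```typescript
type IndentToken = 'INDENT' | 'DEDENT';

/**
 * Returns the INDENT/DEDENT tokens that a stateful matcher would emit for
 * `source`, making one decision per non-blank line.
 */
function indentationChanges(source: string): IndentToken[] {
    const result: IndentToken[] = [];
    // State that a plain regular expression cannot keep: the open indentation levels.
    const levels: number[] = [0];

    for (const line of source.split(/\r?\n/)) {
        if (/^[\t ]*$/.test(line)) {
            continue; // blank lines never change the indentation level
        }
        const leading = /^[\t ]*/.exec(line);
        const width = leading ? leading[0].length : 0;
        const current = levels[levels.length - 1] ?? 0;

        if (width > current) {
            // Deeper than before: open a new level.
            levels.push(width);
            result.push('INDENT');
        } else {
            // Close every level that is deeper than the current line.
            while (levels.length > 1 && (levels[levels.length - 1] ?? 0) > width) {
                levels.pop();
                result.push('DEDENT');
            }
        }
    }
    // Close whatever is still open at the end of the input.
    while (levels.length > 1) {
        levels.pop();
        result.push('DEDENT');
    }
    return result;
}

const pythonLike = 'if true:\n    return false\nelse:\n    if true:\n        return true';
console.log(indentationChanges(pythonLike).join(' '));
// INDENT DEDENT INDENT INDENT DEDENT DEDENT
```

Note that a token is produced only when the level changes, which is exactly the information a regular expression looking at a single line cannot provide.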
| 12 | + |
| 13 | +## Configuring the token builder and lexer |
| 14 | + |
| 15 | +To use the indentation tokens in your grammar, you first have to import and register the [`IndentationAwareTokenBuilder`](https://eclipse-langium.github.io/langium/classes/langium.IndentationAwareTokenBuilder.html) |
| 16 | +and [`IndentationAwareLexer`](https://eclipse-langium.github.io/langium/classes/langium.IndentationAwareLexer.html) |
| 17 | +services in your module as follows: |
| 18 | + |
| 19 | +```ts |
| 20 | +import { IndentationAwareTokenBuilder, IndentationAwareLexer } from 'langium'; |
| 21 | + |
| 22 | +export const HelloWorldModule: Module<HelloWorldServices, PartialLangiumServices & HelloWorldAddedServices> = { |
| 23 | +    // ... |
| 24 | +    parser: { |
| 25 | +        TokenBuilder: () => new IndentationAwareTokenBuilder(), |
| 26 | +        Lexer: (services) => new IndentationAwareLexer(services), |
| 27 | +        // ... |
| 28 | +    }, |
| 29 | +}; |
| 30 | +``` |
| 31 | + |
| 32 | +The `IndentationAwareTokenBuilder` constructor optionally accepts an object defining the names of the tokens you used to denote indentation and whitespace in your `.langium` grammar file, as well as a list of delimiter tokens inside of which indentation should be ignored. It defaults to: |
| 33 | +```ts |
| 34 | +{ |
| 35 | +    indentTokenName: 'INDENT', |
| 36 | +    dedentTokenName: 'DEDENT', |
| 37 | +    whitespaceTokenName: 'WS', |
| 38 | +    ignoreIndentationDelimiters: [], |
| 39 | +} |
| 40 | +``` |
| 41 | + |
| 42 | +### Ignoring indentation between specific tokens |
| 43 | + |
| 44 | +Sometimes, it is necessary to ignore any indentation token inside some expressions, such as with tuples and lists in Python. For example, in the following statement: |
| 45 | + |
| 46 | +```py |
| 47 | +x = [ |
| 48 | +    1, |
| 49 | +    2 |
| 50 | +] |
| 51 | +``` |
| 52 | + |
| 53 | +any indentation between `[` and `]` should be ignored. |
| 54 | + |
| 55 | +To achieve similar behavior with the `IndentationAwareTokenBuilder`, the `ignoreIndentationDelimiters` option can be used. |
| 56 | +It accepts a list of pairs of token names (terminal or keyword) and turns off indentation token detection between each pair. |
| 57 | + |
| 58 | +For example, if you construct the `IndentationAwareTokenBuilder` with the following options: |
| 59 | + |
| 60 | +```ts |
| 61 | +new IndentationAwareTokenBuilder({ |
| 62 | +    ignoreIndentationDelimiters: [ |
| 63 | +        ['[', ']'], |
| 64 | +        ['(', ')'], |
| 65 | +    ], |
| 66 | +}) |
| 67 | +``` |
| 68 | + |
| 69 | +then no indentation tokens will be emitted between either of those pairs of tokens. |
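One way to picture this behavior is a stack of pending closing delimiters: indentation detection is active only while the stack is empty. The sketch below is an illustration under that assumption and is not taken from Langium's source; `indentationActive` is a hypothetical helper that works on a plain list of token names.

```typescript
/**
 * For each token, reports whether indentation detection would be active
 * right after it, given the configured delimiter pairs.
 */
function indentationActive(tokenNames: string[], pairs: Array<[string, string]>): boolean[] {
    // Closing delimiters we are still waiting for, innermost last.
    const pendingClosers: string[] = [];

    return tokenNames.map(tokenName => {
        const innermost = pendingClosers[pendingClosers.length - 1];
        if (innermost !== undefined && innermost === tokenName) {
            // The innermost open pair is closed by this token.
            pendingClosers.pop();
        } else {
            const pair = pairs.find(([open]) => open === tokenName);
            if (pair) {
                // An opening delimiter: remember which token closes it.
                pendingClosers.push(pair[1]);
            }
        }
        return pendingClosers.length === 0;
    });
}

const listStatement = ['x', '=', '[', '1', ',', '2', ']', 'y'];
console.log(indentationActive(listStatement, [['[', ']'], ['(', ')']]));
// [ true, true, false, false, false, false, true, true ]
```

Using a stack rather than a single flag means nested pairs such as `[(1, 2), 3]` keep detection switched off until the outermost delimiter is closed.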
| 70 | + |
| 71 | +### Configuration options type safety |
| 72 | + |
| 73 | +The `IndentationAwareTokenBuilder` supports generic type parameters to improve type-safety and IntelliSense of its options. |
| 74 | +This helps detect when a token name has been mistyped or changed in the grammar. |
| 75 | +The first generic parameter corresponds to the names of terminal tokens, while the second one corresponds to the names of keyword tokens. |
| 76 | +Both parameters are optional and can be imported from `./generated/ast.js` and used as such: |
| 77 | + |
| 78 | +```ts |
| 79 | +import { MyLanguageTerminalNames, MyLanguageKeywordNames } from './generated/ast.js'; |
| 80 | +import { IndentationAwareTokenBuilder, IndentationAwareLexer } from 'langium'; |
| 81 | + |
| 82 | +// ... |
| 83 | +export const HelloWorldModule: Module<HelloWorldServices, PartialLangiumServices & HelloWorldAddedServices> = { |
| 84 | +    parser: { |
| 85 | +        TokenBuilder: () => new IndentationAwareTokenBuilder<MyLanguageTerminalNames, MyLanguageKeywordNames>({ |
| 86 | +            ignoreIndentationDelimiters: [ |
| 87 | +                ['L_BRAC', 'R_BARC'], // <-- This typo will now cause a TypeScript error |
| 88 | +            ] |
| 89 | +        }), |
| 90 | +        Lexer: (services) => new IndentationAwareLexer(services), |
| 91 | +    }, |
| 92 | +}; |
| 93 | +``` |
| 94 | + |
| 95 | +## Writing the grammar |
| 96 | + |
| 97 | +In your `.langium` file, you have to define terminals with the same names you passed to `IndentationAwareTokenBuilder` (or the defaults shown above if you did not override them). |
| 98 | +For example, let's define the grammar for a simple version of Python with support for only `if` and `return` statements, and only booleans as expressions: |
| 99 | + |
| 100 | +```langium |
| 101 | +grammar PythonIf |
| 102 | + |
| 103 | +entry Statement: If | Return; |
| 104 | + |
| 105 | +If: |
| 106 | +    'if' condition=BOOLEAN ':' |
| 107 | +    INDENT thenBlock+=Statement+ |
| 108 | +    DEDENT |
| 109 | +    ('else' ':' |
| 110 | +        INDENT elseBlock+=Statement+ |
| 111 | +        DEDENT)?; |
| 112 | + |
| 113 | +Return: 'return' value=BOOLEAN; |
| 114 | + |
| 115 | +terminal BOOLEAN returns boolean: /true|false/; |
| 116 | +terminal INDENT: 'synthetic:indent'; |
| 117 | +terminal DEDENT: 'synthetic:dedent'; |
| 118 | +hidden terminal WS: /[\t ]+/; |
| 119 | +hidden terminal NL: /[\r\n]+/; |
| 120 | +``` |
| 121 | + |
| 122 | +The important terminals here are `INDENT`, `DEDENT`, and `WS`. |
| 123 | +`INDENT` and `DEDENT` are used to delimit a nested block, similar to `{` and `}` (respectively) in C-like languages. |
| 124 | +Note that `INDENT` indicates an **increase** in indentation, not just the existence of leading whitespace, which is why in the example above we used it only at the beginning of the block, not before every `Statement`. |
| 125 | +Additionally, whitespace is split into two terminals, `[\t ]+` and `[\r\n]+`, instead of a single `\s+`, because `\s+` would match the new line character as well as any indentation after it. To ensure correct behavior, the token builder modifies the pattern of the `whitespaceTokenName` token to be `[\t ]+`, so a separate hidden token for new lines needs to be explicitly defined. |
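The difference between the patterns can be checked with plain JavaScript regular expressions. The snippet below is a small illustration that does not use Langium; the helper `matchAt` is hypothetical and uses the sticky flag to match only at a given offset, similar to how a lexer matches at its current position.

```typescript
const snippet = 'if true:\n    return false';
const newlineOffset = snippet.indexOf('\n');

/** Matches `pattern` exactly at `offset`, or returns null if it does not match there. */
function matchAt(pattern: RegExp, text: string, offset: number): string | null {
    const sticky = new RegExp(pattern.source, 'y');
    sticky.lastIndex = offset;
    const match = sticky.exec(text);
    return match ? match[0] : null;
}

// `\s+` swallows the new line together with the indentation that follows it.
console.log(JSON.stringify(matchAt(/\s+/, snippet, newlineOffset)));          // "\n    "
// With separate patterns, the new line and the indentation are matched independently.
console.log(JSON.stringify(matchAt(/[\r\n]+/, snippet, newlineOffset)));      // "\n"
console.log(JSON.stringify(matchAt(/[\t ]+/, snippet, newlineOffset + 1)));   // "    "
```

Because the indentation stays available as its own match at the start of the line, a matcher placed there can compare it with the previous level.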
| 126 | + |
| 127 | +The content you choose for these 3 terminals doesn't matter since it will be overridden by `IndentationAwareTokenBuilder` anyway. However, you might still want to choose tokens that don't overlap with other terminals for easier use in the playground. |
| 128 | + |
| 129 | +With the default configuration and the grammar above, for the following code sample: |
| 130 | + |
| 131 | +```py |
| 132 | +if true: |
| 133 | +    return false |
| 134 | +else: |
| 135 | +    if true: |
| 136 | +        return true |
| 137 | +``` |
| 138 | + |
| 139 | +the lexer will output the following sequence of tokens (hidden `WS` and `NL` tokens omitted): `if`, `BOOLEAN`, `:`, `INDENT`, `return`, `BOOLEAN`, `DEDENT`, `else`, `:`, `INDENT`, `if`, `BOOLEAN`, `:`, `INDENT`, `return`, `BOOLEAN`, `DEDENT`, `DEDENT`. |
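Since `INDENT` and `DEDENT` play the role of `{` and `}`, a token sequence like the one above can be read as a brace-delimited program. The following sketch is only an illustration: `toBraces` and `isBalanced` are hypothetical helpers, and the token list is written out by hand (including the `:` keyword tokens) rather than produced by a Langium lexer.

```typescript
/** Renders a token sequence with INDENT/DEDENT replaced by braces. */
function toBraces(tokenNames: string[]): string {
    return tokenNames
        .map(name => (name === 'INDENT' ? '{' : name === 'DEDENT' ? '}' : name))
        .join(' ');
}

/** Checks that every INDENT is closed by a later DEDENT. */
function isBalanced(tokenNames: string[]): boolean {
    let depth = 0;
    for (const name of tokenNames) {
        if (name === 'INDENT') depth++;
        if (name === 'DEDENT') depth--;
        if (depth < 0) return false; // a DEDENT without a matching INDENT
    }
    return depth === 0;
}

const lexed = [
    'if', 'BOOLEAN', ':', 'INDENT', 'return', 'BOOLEAN', 'DEDENT',
    'else', ':', 'INDENT', 'if', 'BOOLEAN', ':', 'INDENT', 'return', 'BOOLEAN', 'DEDENT', 'DEDENT',
];
console.log(toBraces(lexed));
// if BOOLEAN : { return BOOLEAN } else : { if BOOLEAN : { return BOOLEAN } }
console.log(isBalanced(lexed)); // true
```

The two trailing `DEDENT` tokens close the inner `if` block and the `else` block, which is why the sequence stays balanced even though the input ends while still indented.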