Commit a0ce655

Recipe: indentation-sensitive languages (#246)
1 parent f4baac0 commit a0ce655

1 file changed

+139
-0
lines changed

@@ -0,0 +1,139 @@
1+
---
2+
title: Indentation-sensitive languages
3+
weight: 300
4+
---
5+
6+
Some programming languages (such as Python, Haskell, and YAML) use indentation to denote nesting, as opposed to special non-whitespace tokens (such as `{` and `}` in C++/JavaScript).
7+
This can be difficult to express in Langium, because the EBNF-based notation it uses for defining a language grammar is context-free.
8+
To work around this, you can declare synthetic tokens in the grammar and then redefine them in a custom token builder.
9+
10+
Starting with Langium 3.2.0, such a token builder (and an accompanying lexer) is provided for easy plugging into your language.
11+
They work by modifying the underlying token types generated for your indentation terminals to use a custom matcher function instead of a pattern. Unlike a simple regular expression, this function has access to more context, which allows it to store state and detect _changes_ in indentation levels.
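
To make the idea of "storing state and detecting changes" concrete, here is a minimal, standalone sketch of the classic indentation-stack technique. It does not use Langium's API and is not necessarily how `IndentationAwareTokenBuilder` is implemented; the function name `indentationTokens` and the `IndentToken` type are hypothetical, and the input is assumed to be the leading-whitespace width of each non-blank line.

```ts
// Hypothetical token kinds for this sketch (not Langium types).
type IndentToken = 'INDENT' | 'DEDENT';

/**
 * For each line width, returns the indentation tokens that precede that line.
 * A final extra entry holds the DEDENTs that close all blocks still open at end of input.
 */
function indentationTokens(widths: number[]): IndentToken[][] {
    const stack: number[] = [0]; // state: the indentation levels currently open
    const result: IndentToken[][] = [];
    for (const width of widths) {
        const tokens: IndentToken[] = [];
        if (width > stack[stack.length - 1]) {
            // Indentation increased: open one new block.
            stack.push(width);
            tokens.push('INDENT');
        } else {
            // Indentation decreased: close every block deeper than this line.
            // (A real lexer would also report a width that matches no open level.)
            while (width < stack[stack.length - 1]) {
                stack.pop();
                tokens.push('DEDENT');
            }
        }
        result.push(tokens);
    }
    // End of input: close whatever is still open.
    const closing: IndentToken[] = [];
    while (stack.length > 1) {
        stack.pop();
        closing.push('DEDENT');
    }
    result.push(closing);
    return result;
}

// Lines indented by 0, 4, 0, 4 and 8 columns:
const tokensPerLine = indentationTokens([0, 4, 0, 4, 8]);
// → [[], ['INDENT'], ['DEDENT'], ['INDENT'], ['INDENT'], ['DEDENT', 'DEDENT']]
```

Note that a line with unchanged indentation produces no token at all, which is why a stateless regular expression is not enough.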
12+
13+
## Configuring the token builder and lexer
14+
15+
To be able to use the indentation tokens in your grammar, you first have to import and register the [`IndentationAwareTokenBuilder`](https://eclipse-langium.github.io/langium/classes/langium.IndentationAwareTokenBuilder.html)
16+
and [`IndentationAwareLexer`](https://eclipse-langium.github.io/langium/classes/langium.IndentationAwareLexer.html)
17+
services in your module as such:
18+
19+
```ts
20+
import { IndentationAwareTokenBuilder, IndentationAwareLexer } from 'langium';
21+
22+
export const HelloWorldModule: Module<HelloWorldServices, PartialLangiumServices & HelloWorldAddedServices> = {
23+
    // ...
24+
    parser: {
25+
        TokenBuilder: () => new IndentationAwareTokenBuilder(),
26+
        Lexer: (services) => new IndentationAwareLexer(services),
27+
        // ...
28+
    },
29+
};
30+
```
31+
32+
The `IndentationAwareTokenBuilder` constructor optionally accepts an object defining the names of the tokens you used to denote indentation and whitespace in your `.langium` grammar file, as well as a list of delimiter tokens inside of which indentation should be ignored. It defaults to:
33+
```ts
34+
{
35+
    indentTokenName: 'INDENT',
36+
    dedentTokenName: 'DEDENT',
37+
    whitespaceTokenName: 'WS',
38+
    ignoreIndentationDelimiters: [],
39+
}
40+
```
41+
42+
### Ignoring indentation between specific tokens
43+
44+
Sometimes, it is necessary to ignore any indentation token inside some expressions, such as with tuples and lists in Python. For example, in the following statement:
45+
46+
```py
47+
x = [
48+
    1,
49+
    2
50+
]
51+
```
52+
53+
any indentation between `[` and `]` should be ignored.
54+
55+
To achieve similar behavior with the `IndentationAwareTokenBuilder`, the `ignoreIndentationDelimiters` option can be used.
56+
It accepts a list of pairs of token names (terminal or keyword) and turns off indentation token detection between each pair.
57+
58+
For example, if you construct the `IndentationAwareTokenBuilder` with the following options:
59+
60+
```ts
61+
new IndentationAwareTokenBuilder({
62+
    ignoreIndentationDelimiters: [
63+
        ['[', ']'],
64+
        ['(', ')'],
65+
    ],
66+
})
67+
```
68+
69+
then no indentation tokens will be emitted between either of those pairs of tokens.
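
One way to picture this behavior is a stack of pending closing delimiters: indentation detection is active only while the stack is empty. The following is a standalone sketch under that assumption, not Langium's actual implementation; `indentationActive` and the plain string tokens are hypothetical simplifications.

```ts
// Pairs of [opening token, closing token], shaped like `ignoreIndentationDelimiters`.
type DelimiterPairs = [string, string][];

/**
 * For each token, reports whether indentation detection would be active
 * right after that token has been consumed.
 */
function indentationActive(tokens: string[], delimiters: DelimiterPairs): boolean[] {
    const pendingClosers: string[] = []; // closing delimiters we are still waiting for
    return tokens.map(token => {
        const pair = delimiters.find(([open]) => open === token);
        if (pair) {
            // Entering a delimited region: remember which token ends it.
            pendingClosers.push(pair[1]);
        } else if (pendingClosers[pendingClosers.length - 1] === token) {
            // Leaving the innermost delimited region.
            pendingClosers.pop();
        }
        return pendingClosers.length === 0;
    });
}

const active = indentationActive(
    ['x', '=', '[', '1', ',', '2', ']'],
    [['[', ']'], ['(', ')']],
);
// → [true, true, false, false, false, false, true]
```

Using a stack rather than a single flag keeps nested pairs such as `[(1, 2), 3]` working, since detection resumes only after the outermost delimiter is closed.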
70+
71+
### Configuration options type safety
72+
73+
The `IndentationAwareTokenBuilder` supports generic type parameters to improve type-safety and IntelliSense of its options.
74+
This helps detect when a token name has been mistyped or changed in the grammar.
75+
The first generic parameter corresponds to the names of terminal tokens, while the second one corresponds to the names of keyword tokens.
76+
Both parameters are optional. The types to pass for them can be imported from `./generated/ast.js` and used as follows:
77+
78+
```ts
79+
import { MyLanguageTerminalNames, MyLanguageKeywordNames } from './generated/ast.js';
80+
import { IndentationAwareTokenBuilder, IndentationAwareLexer } from 'langium';
81+
82+
// ...
83+
export const HelloWorldModule: Module<HelloWorldServices, PartialLangiumServices & HelloWorldAddedServices> = {
84+
    parser: {
85+
        TokenBuilder: () => new IndentationAwareTokenBuilder<MyLanguageTerminalNames, MyLanguageKeywordNames>({
86+
            ignoreIndentationDelimiters: [
87+
                ['L_BRAC', 'R_BARC'], // <-- This typo will now cause a TypeScript error
88+
            ]
89+
        }),
90+
        Lexer: (services) => new IndentationAwareLexer(services),
91+
    },
92+
};
93+
```
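
If you are curious why the generic parameters catch such typos, the mechanism can be reproduced without Langium. The sketch below assumes the generated types are unions of string literals; `TerminalNames`, `KeywordNames`, `Options`, and `checkedOptions` are hypothetical stand-ins, and the real option type in Langium may be shaped differently.

```ts
// Hypothetical stand-ins for the types exported from `./generated/ast.js`.
type TerminalNames = 'INDENT' | 'DEDENT' | 'WS' | 'L_BRAC' | 'R_BRAC';
type KeywordNames = 'if' | 'else' | 'return' | ':';

// Simplified options shape: each delimiter must be a known terminal or keyword name.
interface Options<T extends string, K extends string> {
    ignoreIndentationDelimiters: [T | K, T | K][];
}

// Identity function that only exists to apply the type check.
function checkedOptions<T extends string, K extends string>(options: Options<T, K>): Options<T, K> {
    return options;
}

// Compiles: both names are members of the unions above.
const validOptions = checkedOptions<TerminalNames, KeywordNames>({
    ignoreIndentationDelimiters: [['L_BRAC', 'R_BRAC']],
});

// Without the directive below, the compiler would reject the misspelled name.
const mistypedOptions = checkedOptions<TerminalNames, KeywordNames>({
    // @ts-expect-error 'R_BARC' is not a known terminal or keyword name
    ignoreIndentationDelimiters: [['L_BRAC', 'R_BARC']],
});
```

Because the check happens at compile time, renaming a terminal in the grammar and regenerating the AST is enough to surface every stale name in the configuration.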
94+
95+
## Writing the grammar
96+
97+
In your `.langium` grammar file, you have to define terminals with the same names you passed to `IndentationAwareTokenBuilder` (or the defaults shown above if you did not override them).
98+
For example, let's define the grammar for a simple version of Python with support for only `if` and `return` statements, and only booleans as expressions:
99+
100+
```langium
101+
grammar PythonIf
102+
103+
entry Statement: If | Return;
104+
105+
If:
106+
    'if' condition=BOOLEAN ':'
107+
    INDENT thenBlock+=Statement+
108+
    DEDENT
109+
    ('else' ':'
110+
        INDENT elseBlock+=Statement+
111+
        DEDENT)?;
112+
113+
Return: 'return' value=BOOLEAN;
114+
115+
terminal BOOLEAN returns boolean: /true|false/;
116+
terminal INDENT: 'synthetic:indent';
117+
terminal DEDENT: 'synthetic:dedent';
118+
hidden terminal WS: /[\t ]+/;
119+
hidden terminal NL: /[\r\n]+/;
120+
```
121+
122+
The important terminals here are `INDENT`, `DEDENT`, and `WS`.
123+
`INDENT` and `DEDENT` are used to delimit a nested block, similar to `{` and `}` (respectively) in C-like languages.
124+
Note that `INDENT` indicates an **increase** in indentation, not just the existence of leading whitespace, which is why in the example above we used it only at the beginning of the block, not before every `Statement`.
125+
Additionally, whitespace has to be split into two terminals, `[\t ]+` for `WS` and `[\r\n]+` for new lines, instead of a single `\s+`, because `\s+` would match the new line character as well as any indentation after it. To ensure correct behavior, the token builder sets the pattern of the `whitespaceTokenName` token to `[\t ]+`, so a separate hidden terminal for new lines needs to be defined explicitly.
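
The difference between the two whitespace patterns is easy to check with plain JavaScript regular expressions. This sketch uses a hypothetical helper, `matchAt`, that matches a pattern at a fixed offset, roughly the way a lexer tries a terminal at its current position.

```ts
const source = 'if true:\n    return false';
const afterColon = source.indexOf(':') + 1; // offset of the line break

// Sticky regexes only match exactly at `lastIndex`.
function matchAt(pattern: RegExp, text: string, offset: number): string | undefined {
    const sticky = new RegExp(pattern.source, 'y');
    sticky.lastIndex = offset;
    return sticky.exec(text)?.[0];
}

// `\s+` swallows the line break AND the indentation that follows it,
// so nothing would be left for the indentation matcher to inspect.
const greedy = matchAt(/\s+/, source, afterColon);

// Split patterns: the new line is consumed on its own...
const newLine = matchAt(/[\r\n]+/, source, afterColon);
// ...and the leading spaces of the next line remain available.
const leadingSpaces = matchAt(/[\t ]+/, source, afterColon + 1);

console.log(JSON.stringify([greedy, newLine, leadingSpaces]));
// prints ["\n    ","\n","    "]
```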
126+
127+
The content you choose for these 3 terminals doesn't matter, since it will be overridden by `IndentationAwareTokenBuilder` anyway. However, you might still want to choose tokens that don't overlap with other terminals for easier use in the playground.
128+
129+
With the default configuration and the grammar above, for the following code sample:
130+
131+
```py
132+
if true:
133+
    return false
134+
else:
135+
    if true:
136+
        return true
137+
```
138+
139+
the lexer will output the following sequence of tokens: `if`, `BOOLEAN`, `:`, `INDENT`, `return`, `BOOLEAN`, `DEDENT`, `else`, `:`, `INDENT`, `if`, `BOOLEAN`, `:`, `INDENT`, `return`, `BOOLEAN`, `DEDENT`, `DEDENT`.
