-`Developing major modes with tree-sitter` (From the Emacs 29+ Manual, `C-h i`, search for `tree-sitter`)

 In short:
-Tree-sitter is a tool that generates parser libraries for programming languages, and provides an API for interacting with those parsers.
-The generated parsers can create syntax trees from source code text.
-The nodes of those trees are defined by the grammar.
-Emacs can use these generated parsers to provide major modes with things like syntax highlighting, indentation, navigation, structural editing, and many other things.
+
+- Tree-sitter is a tool that generates parser libraries for programming languages, and provides an API for interacting with those parsers.
+- The generated parsers can create syntax trees from source code text.
+- The nodes of those trees are defined by the grammar.
+- Emacs can use these generated parsers to provide major modes with things like syntax highlighting, indentation, navigation, structural editing, and many other things.

 ## Important Definitions

-- Parser: A dynamic library compiled from C source code that is generated by the tree-sitter tool. A parser reads source code for a particular language and produces a syntax tree.
-- Grammar: The rules that define how a parser will create the syntax tree for a language. The grammar is written in javascript. Tree-sitter tooling consumes the grammar as input and outputs C source (which can be compiled into a parser)
+- Parser: A dynamic library compiled from C source code that is generated by the Tree-sitter tool. A parser reads source code for a particular language and produces a syntax tree.
+- Grammar: The rules that define how a parser will create the syntax tree for a language. The grammar is written in JavaScript. Tree-sitter tooling consumes the grammar as input and outputs C source (which can be compiled into a parser)
 - Syntax Tree: a tree data structure composed of syntax nodes that represents some source code text.
-    - Concrete Syntax Tree: Syntax trees that contain nodes for every token in the source code, including things like brackets and parentheses. Tree-sitter creates Concrete Syntax Trees.
-    - Abstract Syntax Tree: A syntax tree with less important details removed. An AST may contain a node for a list, but not individual parentheses. Tree-sitter does not create Abstract Syntax Trees.
+  - Concrete Syntax Tree: Syntax trees that contain nodes for every token in the source code, including things like brackets and parentheses. Tree-sitter creates Concrete Syntax Trees.
+  - Abstract Syntax Tree: A syntax tree with less important details removed. An AST may contain a node for a list, but not individual parentheses. Tree-sitter does not create Abstract Syntax Trees.
 - Syntax Node: A node in a syntax tree. It represents some subset of a source code text. Each node has a type, defined by the grammar used to produce it. Some common node types represent language constructs like strings, integers, operators.
-    - Named Syntax Node: A node that can be identified by a name given to it in the tree-sitter Grammar. In clojure-ts-mode, `list_lit` is a named node for lists.
-    - Anonymous Syntax Node: A node that cannot be identified by a name. In the Grammar these are identified by simple strings, not by complex Grammar rules. In clojure-ts-mode, `"("` and `")"` are anonymous nodes.
+  - Named Syntax Node: A node that can be identified by a name given to it in the Tree-sitter Grammar. In clojure-ts-mode, `list_lit` is a named node for lists.
+  - Anonymous Syntax Node: A node that cannot be identified by a name. In the Grammar these are identified by simple strings, not by complex Grammar rules. In clojure-ts-mode, `"("` and `")"` are anonymous nodes.
 - Font Locking: What Emacs calls "Syntax Highlighting".

 ## tree-sitter-clojure

-Clojure-ts-mode uses the tree-sitter-clojure grammar, which can be found at https://github.com/sogaiu/tree-sitter-clojure
-The clojure-ts-mode grammar provides very basic, low-level nodes that try to match clojure's very light syntax.
+Clojure-ts-mode uses the tree-sitter-clojure grammar, which can be found at <https://github.com/sogaiu/tree-sitter-clojure>
+The clojure-ts-mode grammar provides very basic, low-level nodes that try to match Clojure's very light syntax.

 There are nodes to represent:
+
 - Symbols (sym_lit)
-    - Contain (sym_ns) and (sym_name) nodes
+  - Contain (sym_ns) and (sym_name) nodes
 - Keywords (kwd_lit)
-    - Contain (kwd_ns) and (kwd_name) nodes
+  - Contain (kwd_ns) and (kwd_name) nodes
 - Strings (str_lit)
 - Chars (char_lit)
 - Nil (nil_lit)
 - Booleans (bool_lit)
 - Numbers (num_lit)
 - Comments (comment, dis_expr)
-    - dis_expr is the `#_` discard expression
+  - dis_expr is the `#_` discard expression
 - Lists (list_lit)
 - Vectors (vec_lit)
 - Maps (map_lit)
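
As a rough illustration of the node vocabulary above, the following Python sketch hand-builds a toy concrete syntax tree for the form `(foo :bar)` and separates named nodes from anonymous ones. The `Node` class and the `named_types` helper are hypothetical names invented for this sketch. They are not part of Tree-sitter, Emacs, or clojure-ts-mode, and the tree shape is an assumption based on the node names listed above, not real parser output.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A toy syntax node (hypothetical; not the real Tree-sitter node API)."""
    type: str    # grammar rule name such as "list_lit", or a literal token such as "("
    named: bool  # True for nodes named by a grammar rule, False for literal tokens
    children: List["Node"] = field(default_factory=list)


def named_types(node: Node) -> List[str]:
    """Collect the types of all named nodes in depth-first order."""
    found = [node.type] if node.named else []
    for child in node.children:
        found.extend(named_types(child))
    return found


# Assumed tree for `(foo :bar)`: the parentheses are kept as anonymous nodes,
# which is what makes this a concrete (rather than abstract) syntax tree.
tree = Node("list_lit", True, [
    Node("(", False),
    Node("sym_lit", True, [Node("sym_name", True)]),
    Node("kwd_lit", True, [Node("kwd_name", True)]),
    Node(")", False),
])

print(named_types(tree))
# prints ['list_lit', 'sym_lit', 'sym_name', 'kwd_lit', 'kwd_name']
```

Dropping the two anonymous children would give something closer to an abstract syntax tree, which is the distinction the definitions draw.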
@@ -61,7 +64,7 @@ will produce a parse tree like so

 ```
 (vec_lit
-  meta: (meta_lit
+  meta: (meta_lit
     value: (kwd_lit name: (kwd_name)))
   value: (num_lit))
 ```
@@ -70,12 +73,12 @@ The best place to learn more about the tree-sitter-clojure grammar is to read th

 ### Clojure Syntax, not Clojure Semantics

-An important observation that anyone familiar with popular tree-sitter grammars may have picked up on is that there are no nodes representing things like functions, macros, types, and other semantic concepts.
-Representing the semantics of Clojure in a tree-sitter grammar is much more difficult than it is for traditional languages, which do not use macros as heavily as Clojure and other Lisps.
-Understanding what an expression represents in Clojure source code requires macro-expansion of that source code.
-Macro-expansion requires a runtime, and tree-sitter does not have access to a Clojure runtime and will never have access to a Clojure runtime.
-Additionally, tree-sitter never looks back on what it has parsed, only forward, considering what is directly ahead of it. So even if it could identify a macro like `myspecialdef` it would forget about it as soon as it moved past the declaring `defmacro` node.
-Another way to think about this: tree-sitter is designed to be fast and good enough for tooling to implement syntax highlighting, indentation, and other editing conveniences. It is not meant for interpretation and execution.
+An important observation that anyone familiar with popular Tree-sitter grammars may have picked up on is that there are no nodes representing things like functions, macros, types, and other semantic concepts.
+Representing the semantics of Clojure in a Tree-sitter grammar is much more difficult than it is for traditional languages, which do not use macros as heavily as Clojure and other Lisps.
+Understanding what an expression represents in Clojure source code requires macro-expansion of that source code.
+Macro-expansion requires a runtime, and Tree-sitter does not have access to a Clojure runtime and will never have access to a Clojure runtime.
+Additionally, Tree-sitter never looks back on what it has parsed, only forward, considering what is directly ahead of it. So even if it could identify a macro like `myspecialdef` it would forget about it as soon as it moved past the declaring `defmacro` node.
+Another way to think about this: Tree-sitter is designed to be fast and good enough for tooling to implement syntax highlighting, indentation, and other editing conveniences. It is not meant for interpretation and execution.

 #### Example 1: False Negative Function Classification

@@ -88,9 +91,8 @@ Consider the following macro
 (defn2 dog [] "bark")
 ```

-
 This macro lets the caller define a function, but a hypothetical tree-sitter-clojure semantic grammar might just see a function call where a variable `dog` is passed as an argument.
-How should tree-sitter know that `dog` should be highlighted like a function? It would have to evaluate the `defn2` macro to understand that.
+How should Tree-sitter know that `dog` should be highlighted like a function? It would have to evaluate the `defn2` macro to understand that.

 #### Example 2: False Positive Function Classification

@@ -105,13 +107,13 @@ How should tree-sitter know that `dog` should be highlighted like function? It w

 evaluates to 1, and the following

-```
+```clojure
 (foo)
 ```

 evaluates to 1.

-How is tree-sitter supposed to understand that the `(defn foo [] 2)` in the expression `(no-defn (defn foo [] 2))` is not a function declaration? It would have to evaluate the `no-defn` macro.
+How is Tree-sitter supposed to understand that the `(defn foo [] 2)` in the expression `(no-defn (defn foo [] 2))` is not a function declaration? It would have to evaluate the `no-defn` macro.

 #### Syntax and Semantics: Conclusions

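
The two examples can be reproduced with a purely syntactic check. The Python sketch below is a toy, not how Tree-sitter or clojure-ts-mode works: `read_forms` is a hypothetical reader for a tiny subset of Clojure syntax, and `syntactic_defns` is a hypothetical classifier that reports every list shaped like `(defn name ...)` without expanding any macros. It assumes, as the examples describe, that `defn2` defines a function and that `no-defn` does not let its inner `defn` take effect.

```python
import re

# Strings, brackets, or runs of other non-whitespace characters.
TOKEN = re.compile(r'"[^"]*"|[()\[\]]|[^\s()\[\]"]+')


def read_forms(text):
    """Parse a tiny subset of Clojure: lists become Python lists, vectors tuples."""
    tokens = TOKEN.findall(text)
    pos = 0

    def read():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok in ("(", "["):
            close = ")" if tok == "(" else "]"
            items = []
            while tokens[pos] != close:
                items.append(read())
            pos += 1  # consume the closing bracket
            return items if tok == "(" else tuple(items)
        return tok

    forms = []
    while pos < len(tokens):
        forms.append(read())
    return forms


def syntactic_defns(form):
    """Yield names from lists that merely look like `(defn name ...)`."""
    if (isinstance(form, list) and len(form) >= 2
            and form[0] == "defn" and isinstance(form[1], str)):
        yield form[1]
    if isinstance(form, (list, tuple)):
        for child in form:
            yield from syntactic_defns(child)


source = '(defn2 dog [] "bark") (no-defn (defn foo [] 2))'
names = [name for form in read_forms(source) for name in syntactic_defns(form)]
print(names)  # prints ['foo']
```

The check misses `dog` (a false negative) and reports `foo` (a false positive), which is the pair of failures the examples attribute to any grammar that tries to encode semantics without macro-expansion.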
@@ -122,17 +124,27 @@ Instead, it is up to the emacs-lisp code and other consumers of the tree-sitter-

 The decision for tree-sitter-clojure to consider only syntax, and not semantics, has pros and cons.
 Some of the (non-exhaustive) upsides:
+
 - No semantic false positives or negatives in the parse tree.
 - Simple grammar to maintain with fewer nodes and rules
 - Small, fast grammar (with a small set of grammar rules, tree-sitter-clojure has one of the smallest binaries and fastest grammars in widespread use)
 - Stability: the grammar changes infrequently and is very stable for downstream consumers

-And the primary downside: Semantics must be (re)-implemented in tools that consume the grammar. While this results in more work for tooling authors, the tools that use the grammar are easier to change than the grammar itself. The inaccurate nature of statically interpreting Clojure semantics means that not every decision made for the grammar would meet the needs of the various grammar consumers. This would lead to bugs and feature requests. Nearly all changes to the grammar will result in some sort of breakages to its consumers, so changes are best avoided once the grammar has stabilized. Therefore avoiding these semantic interpretations in the grammar is one of the best ways to minimize changes in the grammar.
+And the primary downside: Semantics must be (re)-implemented in tools that
+consume the grammar. While this results in more work for tooling authors, the
+tools that use the grammar are easier to change than the grammar itself. The
+inaccurate nature of statically interpreting Clojure semantics means that not
+every decision made for the grammar would meet the needs of the various grammar
+consumers. This would lead to bugs and feature requests. Nearly all changes to
+the grammar will result in some sort of breakages to its consumers, so changes
+are best avoided once the grammar has stabilized. Therefore avoiding these
+semantic interpretations in the grammar is one of the best ways to minimize