Rewrite the grammar #333

amaanq · 2024-04-30T19:42:04Z

Reduces state count (tbh, I had it down to ~7-8k, but a few grievances grew it back to nearly what it was before, ~12k; further optimizations can come once I dig in more later)
Fully implements C# 12 and some of 13, including primary constructors, ref readonly parameters, and aliasing any type.
Fixes parsing bugs in interpolated strings that use more than one $ by delegating them to the external scanner, which keeps state as a stack of the current interpolation's info (yes, nested interpolations are handled nicely; no, it was not fun at times to implement)
Adds new bindings for other languages that upstream supports, and tidies up manifests
CI is more robust using new upstream workflows
Improved queries
Updated tests and removed duplicate/excessive tests. There's no need for one test per line of code/tree-sitter rule; it just adds unnecessary higher-level nodes when several of these tests can be combined into one that tests them all together
Rewrote grammar to be more in tune with the tree-sitter style, including getting rid of useless precedences and rules (hidden or not), inlining where appropriate, supertyping relevant rules, unhiding certain rules/supertypes, etc.
Improved preproc calls by having content be nested inside these nodes, but this does break the case where it's in the middle of an if statement for example, similar to C. I think the tradeoff is worth it, only ~80 files out of 8k+ files fail because of this, and we get much nicer parse trees in the common case, which, e.g., improves code folding. Another added benefit is we can use preproc rules in spots where only that rule itself is valid, and not everything. e.g. a preproc containing expressions is valid inside expressions, but this won't be done at the top level.
Fuzz the scanner for obvious reasons
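
To make the interpolation stack idea concrete, here is a minimal JavaScript sketch. It is not the scanner from this PR (that one is written in C and is not shown in this thread); `countRun`, `findHoles`, and the frame fields are hypothetical names. It assumes hole expressions contain no braces or plain string literals of their own, and it ignores verbatim strings.

```javascript
// Hypothetical sketch: the PR's real scanner is written in C and is not shown in this thread.
// Count how many times `ch` repeats starting at index `i`.
function countRun(src, i, ch) {
  let n = 0;
  while (src[i + n] === ch) n++;
  return n;
}

// Return the text of every interpolation hole, innermost first, with its nesting depth.
function findHoles(src) {
  const stack = []; // one frame per interpolated string that is currently open
  const holes = [];
  let i = 0;
  while (i < src.length) {
    const top = stack[stack.length - 1];
    if (top === undefined || top.inHole) {
      // Code context: either top level or inside a hole.
      const dollars = countRun(src, i, "$");
      const quotes = dollars > 0 ? countRun(src, i + dollars, '"') : 0;
      if (quotes > 0) {
        // Three or more quotes open a raw string; otherwise treat it as a regular one.
        const delimiter = quotes >= 3 ? quotes : 1;
        stack.push({ dollars, delimiter, inHole: false, start: -1 });
        i += dollars + delimiter;
      } else if (top !== undefined && countRun(src, i, "}") >= top.dollars) {
        // As many `}` as there were `$` closes the current hole.
        holes.push({ depth: stack.length, text: src.slice(top.start, i) });
        top.inHole = false;
        i += top.dollars;
      } else {
        i++;
      }
      continue;
    }
    // String-text context.
    const open = countRun(src, i, "{");
    if (top.delimiter === 1 && open >= 2) {
      i += 2; // `{{` is an escaped brace in a regular interpolated string
    } else if (open >= top.dollars) {
      i += open; // the last `dollars` braces open a hole
      top.inHole = true;
      top.start = i;
    } else if (open > 0) {
      i += open; // too few braces: literal text in a multi-`$` raw string
    } else if (countRun(src, i, '"') >= top.delimiter) {
      i += top.delimiter; // closing delimiter: this string is done
      stack.pop();
    } else if (top.delimiter === 1 && src[i] === "\\") {
      i += 2; // skip an escape sequence such as \"
    } else {
      i++;
    }
  }
  return holes;
}

console.log(findHoles('$$"""{a} {{b}}"""')); // only `b` is a hole; `{a}` is literal text
```

The point of the stack is that each frame remembers its own `$` count, so a nested `$"..."` inside a `$$"""..."""` hole can use a different brace rule than its parent.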

I will update the tokens for publishing the new version later today (needs pypi).

Closes Error on conditional compilation preprocessor directive #189
Closes Improve interpolated raw string literal parsing #283 (todo - quotes)
Closes Unhide _preprocessor_call #310
Closes Parser errors on malformed C# code different in playground #319
Closes Using statement with a function call with parameters is not parsed as invocation, instead a variable declaration #326
Closes Release 0.21 compatible version #331
Closes Support primary constructor #332

tamasvajk

Thank you for this fix/improvement/contribution. This is going to be awesome.

I find this PR rather difficult to review. I think the first commit could also contain the test output changes. That would make it easier to see if anything has been broken.

I think there are a couple of bugs that are (re)introduced with this refactoring.

Some of the contextual keywords no longer seem to work. For example, parameter p in void m(scoped p) { } doesn't have a type. Probably scoped became a modifier.
var o = (Int32)(1); became an invocation with a parenthesised identifier, instead of a cast. See: Cast incorrectly parsed as invocation #175
var greetings = new (string head, string tail)[3]; is not an array creation expression in the equals clause. See: Array of named tuples incorrectly parsed as element_access_expression instead of array_creation_expression #166

There are a lot of changes that I don't understand. Can you explain what's the benefit of changing these:

some field names have been removed or changed.
null_literal is gone, so throw null is now a throw_expression without any child.
equals_value_clause has been removed from the grammar.
type_parameter_constraints_clause has been reworked. Previously we had a list of clauses with target and constraint; now we have a flat list of constraints, in the form of identifier, type_parameter_constraint+, identifier, type_parameter_constraint+. Isn't this more difficult to handle for consumers of the output tree?
with_initializer_expression is gone, which means with_expression contains a flat list of children. This way the tree seems to be more difficult to process.
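
As a rough illustration of the extra work a flat child list pushes onto consumers, the following JavaScript sketch regroups identifier, type_parameter_constraint+ runs back into clauses. The node objects are hand-built stand-ins with only `type` and `text`, and `groupConstraintClauses` is a hypothetical helper, not part of any tree-sitter binding.

```javascript
// Hypothetical consumer-side helper: rebuild (target, constraints) clauses from a flat child list.
function groupConstraintClauses(children) {
  const clauses = [];
  for (const child of children) {
    if (child.type === "identifier") {
      // Each identifier starts a new clause for that type parameter.
      clauses.push({ target: child.text, constraints: [] });
    } else if (child.type === "type_parameter_constraint") {
      if (clauses.length === 0) throw new Error("constraint before any target");
      clauses[clauses.length - 1].constraints.push(child.text);
    }
    // Anything else (for example `where` or `:` tokens) is ignored.
  }
  return clauses;
}

// Stand-in for the children of `where T : class, new() where U : IDisposable`.
const flatChildren = [
  { type: "identifier", text: "T" },
  { type: "type_parameter_constraint", text: "class" },
  { type: "type_parameter_constraint", text: "new()" },
  { type: "identifier", text: "U" },
  { type: "type_parameter_constraint", text: "IDisposable" },
];

console.log(groupConstraintClauses(flatChildren));
```

With nested clause nodes, this grouping comes for free from the tree shape; with a flat list, every consumer has to reimplement it.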

amaanq · 2024-05-01T14:12:21Z

Wow, those two dynamic precedence issues caught me off guard. I noticed some places where prec.dynamic was used unnecessarily, and I thought those two were as well. That was really interesting, and after a little digging it makes sense. This is probably one of the few grammars that is very sensitive to dynamic precedence changes. Thanks for the context, I've added them back.
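
For readers unfamiliar with prec.dynamic: when tree-sitter keeps two interpretations of the same text alive (for instance a cast versus an invocation for `(Int32)(1)`), the branch with the higher dynamic precedence wins. The fragment below is a sketch only. The rule bodies and the precedence value are illustrative and not copied from this PR, and the `seq`, `field`, and `prec` definitions are stand-ins so the snippet loads under plain Node; in a real grammar.js they are provided by the tree-sitter CLI.

```javascript
// Stand-ins for tree-sitter's grammar DSL so this fragment loads under plain Node.
const seq = (...members) => ({ type: "SEQ", members });
const field = (name, content) => ({ type: "FIELD", name, content });
const prec = {
  dynamic: (value, content) => ({ type: "PREC_DYNAMIC", value, content }),
};

// Hypothetical rule bodies; names and the precedence value are illustrative only.
const rules = {
  // Prefer the cast reading when both readings survive to the end of the expression.
  cast_expression: ($) =>
    prec.dynamic(
      1,
      seq("(", field("type", $.type), ")", field("value", $._expression)),
    ),
  invocation_expression: ($) =>
    seq(field("function", $._expression), field("arguments", $.argument_list)),
};

// Resolve `$.name` references to symbol nodes so the rule functions can be evaluated here.
const $ = new Proxy({}, { get: (_, name) => ({ type: "SYMBOL", name }) });
console.log(JSON.stringify(rules.cast_expression($), null, 2));
```

Removing a wrapper like this does not make the grammar fail to generate; it silently changes which parse is picked, which is why regressions of this kind tend to show up only in corpus tests.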

Others:

Field names: a lot of the fields were unnecessary. However, if you think some of the ones I removed were necessary, let me know and I can add them back.
null_literal: I added this back. It's generally preferable not to assign a node to a single literal, since it just "absorbs" that literal and you can't directly query for the string literal then, but for consistency with all the other literals it makes sense.
equals_value_clause: I don't see the point in this, unless it resolved ambiguities/reduced state count, but I don't think it did.
type_parameter_constraints_clause: I remember this initially reduced the state count a lot when I was starting my rewrite, but now I notice it doesn't, so yeah, I can just have it as it was before.
with_initializer_expression: Added back. A field called initializer could also work, though, if we just want to tag/distinguish the identifier.

damieng · 2024-05-01T18:12:43Z

While I appreciate you've put a lot of effort in, I'm also disappointed that such a massive PR has landed with zero up-front discussion.

The extra plumbing around packaging and bringing in-line with other grammars is much appreciated but some of these changes would have been a lot easier to review with a bit of thought up-front - e.g. not re-ordering grammar.js where unnecessary and keeping the corpus files in their old location so we can diff (rename/move breaks after a certain amount of change).

Are there regressions with existing parsing? I can see a large list of exclusions on the CI but it's unclear if these are new.

amaanq · 2024-05-01T18:19:21Z

Well I wasn't aware I had to discuss making large improvements beforehand, nor have I ever done so in any other upstream grammar I maintain.

The moving of tests to a different dir is necessary, top-level corpus dirs are unsupported upstream now and must be in test/corpus.

The exclusion list is either a parse error where it genuinely was an error (beforehand as well) or it has funky usage of preproc ifs, e.g. in the middle of an if/else statement. I have (imo) improved them such that their contents are children of the preproc_if node, much like how C does it, however, this requires that the contents inside be somewhat correctly formed (e.g. a regular statement/expression, wherever the preproc_if is applicable). I think this is a reasonable tradeoff for much more navigable trees.
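
To illustrate what "somewhat correctly formed" contents means here, the JavaScript sketch below checks whether each branch of an #if block is brace-balanced on its own. This is only a rough proxy invented for illustration (the grammar does real parsing, not brace counting), `branchesAreSelfContained` is a hypothetical name, and nested #if blocks are not handled.

```javascript
// Heuristic sketch: can each #if/#elif/#else branch stand on its own, judged by brace balance?
function branchesAreSelfContained(source) {
  const branches = [];
  let current = null;
  for (const line of source.split("\n")) {
    const trimmed = line.trim();
    if (/^#if\b/.test(trimmed) || (current && /^#(elif|else)\b/.test(trimmed))) {
      current = []; // start collecting a new branch
      branches.push(current);
    } else if (/^#endif\b/.test(trimmed)) {
      current = null;
    } else if (current) {
      current.push(line);
    }
  }
  return branches.every((lines) => {
    let depth = 0;
    for (const ch of lines.join("\n")) {
      if (ch === "{") depth++;
      else if (ch === "}" && --depth < 0) return false;
    }
    return depth === 0;
  });
}

// A whole statement inside the directive: fits under a preproc_if node.
const nestable = "#if DEBUG\nLog();\n#endif";
// The directive splits an if statement: neither branch is complete by itself.
const splitStatement = "#if A\nif (x) {\n#else\nif (y) {\n#endif\n  Foo();\n}";

console.log(branchesAreSelfContained(nestable), branchesAreSelfContained(splitStatement));
```

Files of the second kind are the ones that end up on the exclusion list, since their branches cannot become well-formed children of a single node.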

Tamás politely and graciously pointed out a couple of mistakes regarding some funky dynamic precedence, which I was a little taken aback by as I explained earlier, but those are now fixed.

amaanq · 2024-05-01T20:33:12Z

@damieng @tamasvajk the diff for tests should be much more readable now

maxbrunsfeld · 2024-05-01T21:18:25Z

test/corpus/classes.txt

@@ -83,119 +89,114 @@ file class A {}
  (class_declaration
    (modifier)
    name: (identifier)
-    body: (declaration_list))
+    (declaration_list))


IMO the body field was fine, but I don't know if there was some drawback to it. Curious if you were seeing significant code size increases due to that field.

There are a few of these removed in places that @tamasvajk added (not just body but all sorts of fields). Are they used by GH Semantic?

I think the fields are mostly used by GH CodeQL. @hvitved knows best how much we need them.

Having field names in general certainly makes life easier for tree-sitter-based QL extractors. Field names are used to generate methods in CodeQL, e.g. this field in the Ruby grammar gives rise to this QL predicate, which is nicer to use than something like getChild(0).
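
The difference between field-based and positional access can be sketched with a hand-built tree. The nodes below are plain JavaScript objects standing in for parser output, and `makeNode` is a hypothetical helper; the `childForFieldName` method is modelled on the field lookup that the tree-sitter JavaScript bindings offer.

```javascript
// Hand-built stand-in for tree-sitter nodes; a real tree would come from a parser binding.
function makeNode(type, entries = []) {
  return {
    type,
    children: entries.map((e) => e.node),
    // Find a child by the field name it was tagged with, or null if there is none.
    childForFieldName(fieldName) {
      const hit = entries.find((e) => e.field === fieldName);
      return hit ? hit.node : null;
    },
  };
}

const nameChild = () => ({ node: makeNode("identifier"), field: "name" });
const bodyChild = () => ({ node: makeNode("declaration_list"), field: "body" });

// `file class A {}` has a modifier child; `class B {}` does not.
const withModifier = makeNode("class_declaration", [
  { node: makeNode("modifier") },
  nameChild(),
  bodyChild(),
]);
const withoutModifier = makeNode("class_declaration", [nameChild(), bodyChild()]);

// Positional access depends on how many optional children came first...
console.log(withModifier.children[2].type); // declaration_list
console.log(withoutModifier.children[2]); // undefined
// ...while the field lookup gives the same answer for both shapes.
console.log(withoutModifier.childForFieldName("body").type); // declaration_list
```

Dropping the body field from the grammar would leave consumers with only the positional form, which is the fragile one.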

maxbrunsfeld · 2024-05-01T21:30:05Z

Hope you're able to reduce the state count. In general, a great way to do that is to avoid long sequences with many variations.

In other words, turn this:

seq(
  optional('foo'),
  optional('bar'),
  optional('baz'),
  optional('quux'),
  // ...
)

into this:

seq(
  repeat(choice('foo', 'bar', 'baz', 'quux')),
  // ...
)

It seems like there may be opportunities to do some of that in this grammar - there may be unnecessarily-specific sequences that could be modeled more generically.
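
One consequence of the generic form is worth spelling out: it accepts more inputs than the strict sequence. The JavaScript sketch below models only the accepted token sequences with regular expressions (it says nothing about the actual parse table), using the same placeholder words as above. The names `strict`, `generic`, and `accepts` are made up for the example.

```javascript
const words = ["foo", "bar", "baz", "quux"];

// seq(optional('foo'), optional('bar'), ...): each word at most once, in a fixed order.
const strict = new RegExp("^" + words.map((w) => `(?:${w} )?`).join("") + "$");
// repeat(choice('foo', 'bar', ...)): any of the words, any number of times, in any order.
const generic = new RegExp("^(?:(?:" + words.join("|") + ") )*$");

// Give every token a trailing space so the input lines up with the patterns above.
const accepts = (pattern, tokens) =>
  pattern.test(tokens.map((t) => t + " ").join(""));

console.log(accepts(strict, ["foo", "baz"]), accepts(generic, ["foo", "baz"])); // true true
console.log(accepts(strict, ["baz", "foo"]), accepts(generic, ["baz", "foo"])); // false true
console.log(accepts(strict, ["foo", "foo"]), accepts(generic, ["foo", "foo"])); // false true
```

So the trade is a more permissive grammar (out-of-order or repeated items parse without error) in exchange for a simpler rule, which is usually acceptable for a parser aimed at highlighting and navigation rather than validation.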

damieng · 2024-05-01T22:37:47Z

test/corpus/expressions.txt

-                  (block
-                    (return_statement
-                      (identifier))))))))
+              (anonymous_method_expression


Why is it we're losing "static" as a (modifier) here but we're keeping static as a modifier on the lambda equivalent on line 1262?

I can add that back for consistency, but what about static/unsafe in using_directive? I also (personally) think it's ok to hide modifiers, and just expose the string literals in the tree instead, but I don't have much of a preference.

damieng · 2024-05-03T12:30:59Z

This is shaping up great! If we can get attribute_argument back in then I'm good on the Roslyn alignment/back compat from the naming side.

I think we just need to check/adjust the changes and removals of fields to make sure Semantic doesn't break and we should be good to go.

dcreager · 2024-05-03T13:03:01Z

I think we just need to check/adjust the changes and removals of fields to make sure Semantic doesn't break and we should be good to go.

Thanks for checking in, 👍 from our side! We're pinned to the current release so we won't silently upgrade, and we're using the syntax highlighting queries directly from this repo, which I see are updated as part of this PR. We have augmented the tagging queries with the ability to build up scoped names, which we'll have to update as part of bumping to the new version containing these changes. But the changes to tags.scm in this PR look manageable, so I don't consider that a blocker.

amaanq · 2024-05-03T22:43:15Z

thanks for the review/feedback @maxbrunsfeld @damieng @tamasvajk @hvitved!

hvitved · 2024-05-12T18:58:25Z

Improved preproc calls by having content be nested inside these nodes, but this does break the case where it's in the middle of an if statement for example, similar to C. I think the tradeoff is worth it, only ~80 files out of 8k+ files fail because of this, and we get much nicer parse trees in the common case, which, e.g., improves code folding. Another added benefit is we can use preproc rules in spots where only that rule itself is valid, and not everything. e.g. a preproc containing expressions is valid inside expressions, but this won't be done at the top level.

Is there any way, using extras as before this PR, to still tolerate #ifs that happen inside expressions or statements? For example, this file appears to have more parse errors now than before.

tamasvajk reviewed May 1, 2024

tamasvajk mentioned this pull request May 1, 2024: Improve interpolated raw string literal parsing #283 (Closed)

amaanq force-pushed the rewrite branch 2 times, most recently from a2418d5 to ec9cd66, May 1, 2024 14:16

hvitved reviewed May 1, 2024: grammar.js (outdated, resolved)

amaanq force-pushed the rewrite branch from ec9cd66 to 8149c64, May 1, 2024 17:26

damieng reviewed May 1, 2024: Cargo.toml (resolved)

amaanq force-pushed the rewrite branch from 8149c64 to 8cacd96, May 1, 2024 20:14

amaanq mentioned this pull request May 1, 2024: Cleanup #334 (Merged)

amaanq force-pushed the rewrite branch 4 times, most recently from 2d1c659 to 74ae67c, May 1, 2024 20:33

amaanq force-pushed the rewrite branch 2 times, most recently from f290108 to eb18043, May 1, 2024 20:58

maxbrunsfeld reviewed May 1, 2024

damieng reviewed May 1, 2024: test/corpus/attributes.txt (outdated, resolved)

damieng reviewed May 1, 2024: test/corpus/contextual-keywords.txt (outdated, resolved)

damieng reviewed May 1, 2024: test/corpus/expressions.txt (resolved)

damieng reviewed May 1, 2024: test/corpus/expressions.txt (outdated, resolved)

damieng reviewed May 1, 2024: test/corpus/enums.txt (outdated, resolved)

damieng reviewed May 1, 2024: test/corpus/interfaces.txt (outdated, resolved)

damieng reviewed May 1, 2024: test/corpus/preprocessor.txt (outdated, resolved)

amaanq force-pushed the rewrite branch from eb18043 to 67e226f, May 2, 2024 00:12

amaanq force-pushed the rewrite branch from 7f2128e to 21315e8, May 2, 2024 19:00

damieng reviewed May 3, 2024: test/corpus/interfaces.txt (resolved)

damieng reviewed May 3, 2024: test/corpus/type-methods.txt (resolved)

damieng reviewed May 3, 2024: test/corpus/attributes.txt (outdated, resolved)

damieng reviewed May 3, 2024: test/corpus/preprocessor.txt (resolved)

damieng reviewed May 3, 2024: test/corpus/type-fields.txt (resolved)

hvitved reviewed May 3, 2024: grammar.js (outdated, resolved)
amaanq added 6 commits May 3, 2024 16:53

feat!: rewrite the grammar to support missing features & reduce state count (058edeb)
feat: update queries (1d33eab)
test: update tests (645cde7)
ci: use upstream workflows for testing (c709f0b)
build: update bindings and manifests (1ddc4fa)
0.21.0 (4eb2019)

amaanq force-pushed the rewrite branch from 21315e8 to 4eb2019, May 3, 2024 20:53

amaanq requested a review from damieng May 3, 2024 20:56

damieng approved these changes May 3, 2024

amaanq merged commit 437e89c into master May 3, 2024 (5 checks passed)

amaanq deleted the rewrite branch May 3, 2024 22:42

hvitved mentioned this pull request May 6, 2024: Include literal content in the parse tree #335 (Merged)

aryx mentioned this pull request Jul 2, 2024: Update to a more recent tree-sitter-c-sharp before the big refactor semgrep/ocaml-tree-sitter-semgrep#492 (Merged, 1 task)

kritzcreek mentioned this pull request Aug 14, 2024: chore: Updates tree-sitter version sourcegraph/sourcegraph-public-snapshot#64403 (Closed, 12 tasks)
