Rewrite the grammar #333

Merged
merged 6 commits into from
May 3, 2024

Conversation

amaanq
Member

@amaanq amaanq commented Apr 30, 2024

  • Reduces state count (tbh, I had it down to ~7-8k, but a few grievances grew it back to nearly what it was before, ~12k; further optimizations can come once I dig in more later)
  • Fully implements C# 12 and some of 13, including primary constructors, ref readonly parameters, and aliasing any type.
  • Fixes parsing bugs in interpolated strings that use more than one $ by delegating them to the external scanner, which keeps a stack of the current interpolation's info for statefulness (yes, nested interpolations are handled nicely, no, it was not fun at times to implement)
  • Adds new bindings for other languages that upstream supports, tidied up manifests
  • CI is more robust using new upstream workflows
  • Improved queries
  • Updated tests and removed duplicate/excessive tests. There's no need for one test per line of code/tree-sitter rule; it just adds unnecessary higher-level nodes when several of these tests can be combined into one that tests them all together
  • Rewrote grammar to be more in tune with the tree-sitter style, including getting rid of useless precedences and rules (hidden or not), inlining where appropriate, supertyping relevant rules, unhiding certain rules/supertypes, etc.
  • Improved preproc calls by having content be nested inside these nodes, but this does break the case where it's in the middle of an if statement for example, similar to C. I think the tradeoff is worth it, only ~80 files out of 8k+ files fail because of this, and we get much nicer parse trees in the common case, which, e.g., improves code folding. Another added benefit is we can use preproc rules in spots where only that rule itself is valid, and not everything. e.g. a preproc containing expressions is valid inside expressions, but this won't be done at the top level.
  • Fuzz the scanner for obvious reasons
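
To make the interpolated-string bullet concrete, here is a minimal sketch of what "a stack of the current interpolation's info" could look like. It is written in JavaScript for readability and is not the scanner from this PR; `InterpolationStack`, `opensInterpolation`, and the frame fields are hypothetical names, and the brace rule is a simplification of the real C# rules.

```javascript
// Hypothetical model: one frame per interpolated string currently being scanned.
class InterpolationStack {
  constructor() {
    this.frames = [];
  }
  // dollarCount: number of `$` in the string's prefix; isRaw: a `"""` raw string.
  push(dollarCount, isRaw) {
    this.frames.push({ dollarCount, isRaw });
  }
  pop() {
    return this.frames.pop();
  }
  top() {
    return this.frames[this.frames.length - 1];
  }
}

// Simplified rule (assumption) for whether a run of `{` opens an interpolation.
function opensInterpolation(frame, braceRun) {
  if (frame.isRaw) {
    // Raw strings: at least `dollarCount` braces are needed; shorter runs are text.
    return braceRun >= frame.dollarCount;
  }
  // Regular strings: `{{` is an escaped brace, so only an odd run leaves an opener.
  return braceRun % 2 === 1;
}

const stack = new InterpolationStack();
stack.push(2, true); // outer string: $$"""..."""
const outerSingle = opensInterpolation(stack.top(), 1); // `{` is plain text here
const outerDouble = opensInterpolation(stack.top(), 2); // `{{` opens an interpolation
stack.push(1, false); // nested string: $"..." inside the outer interpolation
const innerDouble = opensInterpolation(stack.top(), 2); // `{{` is an escape here
stack.pop(); // nested string closed; the outer frame is current again
```

The point of the stack is that the same `{{` means different things depending on which string is innermost, so the answer has to come from the top frame rather than from a single global flag.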

I will update the tokens for publishing the new version later today (needs pypi).

Collaborator

@tamasvajk tamasvajk left a comment

Thank you for this fix/improvement/contribution. This is going to be awesome.

I find this PR rather difficult to review. I think the first commit could also contain the test output changes. That would make it easier to see if anything has been broken.

I think there are a couple of bugs that are (re)introduced with this refactoring.

There are a lot of changes that I don't understand. Can you explain what's the benefit of changing these:

  • some field names have been removed or changed.
  • null_literal is gone, so throw null is now a throw_expression without any child.
  • equals_value_clause has been removed from the grammar.
  • type_parameter_constraints_clause has been reworked. Previously we had a list of clauses with target and constraint; now we have a flat list of constraints, in the form of identifier, type_parameter_constraint+, identifier, type_parameter_constraint+. Isn't this more difficult to handle for consumers of the output tree?
  • with_initializer_expression is gone, which means with_expression contains a flat list of children. This way the tree seems to be more difficult to process.
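
As an illustration of the extra work a flat child list pushes onto consumers, the sketch below regroups a flat `identifier, type_parameter_constraint+` sequence back into clauses. The node objects are hypothetical plain objects, not tree-sitter nodes, and `groupConstraints` is a made-up helper name.

```javascript
// Hypothetical flat child list for: where T : class, new() where U : struct
const flatChildren = [
  { type: 'identifier', text: 'T' },
  { type: 'type_parameter_constraint', text: 'class' },
  { type: 'type_parameter_constraint', text: 'new()' },
  { type: 'identifier', text: 'U' },
  { type: 'type_parameter_constraint', text: 'struct' },
];

// Rebuild the target/constraints grouping that a nested clause node would give for free.
function groupConstraints(children) {
  const clauses = [];
  for (const child of children) {
    if (child.type === 'identifier') {
      // Each identifier starts a new clause.
      clauses.push({ target: child.text, constraints: [] });
    } else if (child.type === 'type_parameter_constraint') {
      if (clauses.length === 0) throw new Error('constraint without a target');
      clauses[clauses.length - 1].constraints.push(child.text);
    }
  }
  return clauses;
}

const clauses = groupConstraints(flatChildren);
```

With one node per clause, a consumer would read the target and its constraints directly from that node instead of tracking position in a sibling list.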

@amaanq
Member Author

amaanq commented May 1, 2024

Wow, those two dynamic precedence issues caught me off guard. I noticed some places where prec.dynamic was used unnecessarily, and I thought those two were as well. That was really interesting, and after a little digging, it makes sense. This is probably one of the few grammars that is very sensitive to dynamic precedence changes. Thanks for the context, I've added them back.

Others:

  • A lot of the fields were unnecessary; however, if you think some of the ones I removed were necessary, let me know and I can add them back
  • I added this back, it's generally preferable to not assign a node to a single literal since it just "absorbs" that literal and you can't directly query for the string literal then, but for consistency with all the other literals it makes sense
  • I don't see the point in this, unless it resolved ambiguities/reduced state count but I don't think it did
  • I remember this initially reduced the state count a lot when I was starting my rewrite, but now I noticed it doesn't, so yeah I can just have it as it was before
  • Added back, though a field called initializer could work if we just want to tag/distinguish the identifier

@amaanq amaanq force-pushed the rewrite branch 2 times, most recently from a2418d5 to ec9cd66 Compare May 1, 2024 14:16
@damieng
Collaborator

damieng commented May 1, 2024

While I appreciate you've put a lot of effort in, I'm also disappointed such a massive PR has landed with zero up-front discussion.

The extra plumbing around packaging and bringing this in line with other grammars is much appreciated, but some of these changes would have been a lot easier to review with a bit of thought up-front - e.g. not re-ordering grammar.js where unnecessary and keeping the corpus files in their old location so we can diff (rename/move breaks after a certain amount of change).

Are there regressions with existing parsing? I can see a large list of exclusions on the CI but it's unclear if these are new.

@amaanq
Member Author

amaanq commented May 1, 2024

Well I wasn't aware I had to discuss making large improvements beforehand, nor have I ever done so in any other upstream grammar I maintain.

The moving of tests to a different dir is necessary, top-level corpus dirs are unsupported upstream now and must be in test/corpus.

The exclusion list is either a parse error where it genuinely was an error (beforehand as well) or it has funky usage of preproc ifs, e.g. in the middle of an if/else statement. I have (imo) improved them such that their contents are children of the preproc_if node, much like how C does it, however, this requires that the contents inside be somewhat correctly formed (e.g. a regular statement/expression, wherever the preproc_if is applicable). I think this is a reasonable tradeoff for much more navigable trees.
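
A rough way to picture the "somewhat correctly formed" requirement: if each preprocessor branch has to become a subtree of its own, then at minimum its braces must balance without help from code outside the branch. The sketch below checks only that necessary condition; it is a toy heuristic, not how the grammar decides anything, it ignores nested `#if` and braces inside strings or comments, and both function names are hypothetical.

```javascript
// True when every `{` in the text is closed by a later `}` in the same text.
function isBalanced(text) {
  let depth = 0;
  for (const ch of text) {
    if (ch === '{') depth++;
    if (ch === '}') {
      depth--;
      if (depth < 0) return false;
    }
  }
  return depth === 0;
}

// Split a snippet into its #if/#elif/#else branches and check each one in isolation.
function bracesBalancePerBranch(source) {
  const branches = [];
  let current = null;
  for (const line of source.split('\n')) {
    const trimmed = line.trim();
    if (trimmed.startsWith('#if')) {
      current = [];
    } else if (trimmed.startsWith('#elif') || trimmed.startsWith('#else')) {
      if (current !== null) branches.push(current.join('\n'));
      current = [];
    } else if (trimmed.startsWith('#endif')) {
      if (current !== null) branches.push(current.join('\n'));
      current = null;
    } else if (current !== null) {
      current.push(line);
    }
  }
  return branches.every(isBalanced);
}

// Each branch is a complete statement, so it can sit inside a preproc_if node.
const wholeStatements = ['#if DEBUG', 'Log("x");', '#else', 'Trace("x");', '#endif'].join('\n');

// Each branch opens a block that is closed after #endif: the "funky" case.
const splitIfStatement = ['#if DEBUG', 'if (a) {', '#else', 'if (b) {', '#endif', '  Run();', '}'].join('\n');

const wholeOk = bracesBalancePerBranch(wholeStatements);
const splitOk = bracesBalancePerBranch(splitIfStatement);
```

Here `wholeOk` is `true` and `splitOk` is `false`, which mirrors the kind of file that ends up on the exclusion list.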

Tamás politely and graciously pointed out a couple of mistakes regarding some funky dynamic precedence, which I was a little taken aback by as I explained earlier, but those are now fixed.

@amaanq amaanq mentioned this pull request May 1, 2024
@amaanq amaanq force-pushed the rewrite branch 4 times, most recently from 2d1c659 to 74ae67c Compare May 1, 2024 20:33
@amaanq
Member Author

amaanq commented May 1, 2024

@damieng @tamasvajk the diff for tests should be much more readable now

@amaanq amaanq force-pushed the rewrite branch 2 times, most recently from f290108 to eb18043 Compare May 1, 2024 20:58
@@ -83,119 +89,114 @@ file class A {}
 (class_declaration
   (modifier)
   name: (identifier)
-  body: (declaration_list))
+  (declaration_list))
Contributor

IMO the body field was fine, but I don't know if there was some drawback to it. Curious if you were seeing significant code size increases due to that field.

Collaborator

@damieng damieng May 1, 2024

There are a few of these removed in places that @tamasvajk added (not just body but all sorts of fields) - are they used by GH Semantic?

Collaborator

I think the fields are mostly used by GH CodeQL. @hvitved knows best how much we need them.

Contributor

Having field names in general certainly makes life easier for tree-sitter-based QL extractors. Field names are used to generate methods in CodeQL, e.g. this field in the Ruby grammar gives rise to this QL predicate, which is nicer to use than something like getChild(0).
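
A small sketch of why named fields are more robust than positional access for tree consumers. The nodes here are hypothetical plain objects and `childByField` is a made-up helper; none of this is the tree-sitter or CodeQL API.

```javascript
// Build a toy class_declaration with a variable number of leading modifiers.
function makeClass(modifierCount) {
  const children = [];
  for (let i = 0; i < modifierCount; i++) children.push({ type: 'modifier' });
  // Array.prototype.push returns the new length, so length - 1 is the child's index.
  const nameIndex = children.push({ type: 'identifier' }) - 1;
  const bodyIndex = children.push({ type: 'declaration_list' }) - 1;
  return {
    type: 'class_declaration',
    children,
    fields: { name: nameIndex, body: bodyIndex },
  };
}

// Look a child up by field name instead of by position.
function childByField(node, fieldName) {
  const index = node.fields[fieldName];
  return index === undefined ? null : node.children[index];
}

const oneModifier = makeClass(1); // e.g. `file class A {}`
const twoModifiers = makeClass(2); // e.g. `public static class A {}`
```

With one modifier, `children[2]` happens to be the body; with two modifiers the same index points at the name, while `childByField(node, 'body')` keeps returning the `declaration_list` in both cases.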

@maxbrunsfeld
Contributor

Hope you're able to reduce the state count. In general, a great way to do that is to avoid long sequences with many variations.

In other words, turn this:

seq(
  optional('foo'),
  optional('bar'),
  optional('baz'),
  optional('quux'),
  // ...
)

into this:

seq(
  repeat(choice('foo', 'bar', 'baz', 'quux')),
  // ...
)

It seems like there may be opportunities to do some of that in this grammar - there may be unnecessarily-specific sequences that could be modeled more generically.
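
One consequence of the more generic form is that it accepts a superset of the inputs the specific form accepts, so ordering and duplicate checks move out of the grammar. The sketch below models the two shapes as plain token matchers to show the difference; it does not use the tree-sitter DSL, and `matchesOptionalSeq` and `matchesRepeatChoice` are hypothetical names.

```javascript
const KEYWORDS = ['foo', 'bar', 'baz', 'quux'];

// Models seq(optional('foo'), optional('bar'), ...): each keyword at most once, in a fixed order.
function matchesOptionalSeq(tokens) {
  let next = 0;
  for (const token of tokens) {
    // Only keywords at or after the current position are still allowed.
    const index = KEYWORDS.indexOf(token, next);
    if (index === -1) return false;
    next = index + 1;
  }
  return true;
}

// Models repeat(choice('foo', 'bar', 'baz', 'quux')): any keyword, any order, any count.
function matchesRepeatChoice(tokens) {
  return tokens.every((token) => KEYWORDS.includes(token));
}

const inOrder = ['foo', 'baz'];
const reordered = ['baz', 'foo'];
const duplicated = ['foo', 'foo'];
```

Both matchers accept `inOrder`, but only `matchesRepeatChoice` accepts `reordered` and `duplicated`. That looseness is the price of the smaller state count, and it is usually acceptable for a parser whose job is to produce a tree rather than to reject invalid programs.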

(block
(return_statement
(identifier))))))))
(anonymous_method_expression
Collaborator

Why is it we're losing "static" as a (modifier) here but we're keeping static as a modifier on the lambda equivalent on line 1262?

Member Author

@amaanq amaanq May 1, 2024

I can add that back for consistency, but what about static/unsafe in using_directive? I also (personally) think it's ok to hide modifiers, and just expose the string literals in the tree instead, but I don't have much of a preference.

@damieng
Collaborator

damieng commented May 3, 2024

This is shaping up great! If we can get attribute_argument back in then I'm good on the Roslyn alignment/back compat from the naming side.

I think we just need to check/adjust the changes and removals of fields to make sure Semantic doesn't break and we should be good to go.

@dcreager
Contributor

dcreager commented May 3, 2024

I think we just need to check/adjust the changes and removals of fields to make sure Semantic doesn't break and we should be good to go.

Thanks for checking in, 👍 from our side! We're pinned to the current release so we won't silently upgrade, and we're using the syntax highlighting queries directly from this repo, which I see are updated as part of this PR. We have augmented the tagging queries with the ability to build up scoped names, which we'll have to update as part of bumping to the new version containing these changes. But the changes to tags.scm in this PR look manageable, so I don't consider that a blocker.

@amaanq amaanq requested a review from damieng May 3, 2024 20:56
@amaanq amaanq merged commit 437e89c into master May 3, 2024
5 checks passed
@amaanq amaanq deleted the rewrite branch May 3, 2024 22:42
@amaanq
Member Author

amaanq commented May 3, 2024

thanks for the review/feedback @maxbrunsfeld @damieng @tamasvajk @hvitved!

@hvitved
Contributor

hvitved commented May 12, 2024

  • Improved preproc calls by having content be nested inside these nodes, but this does break the case where it's in the middle of an if statement for example, similar to C. I think the tradeoff is worth it, only ~80 files out of 8k+ files fail because of this, and we get much nicer parse trees in the common case, which, e.g., improves code folding. Another added benefit is we can use preproc rules in spots where only that rule itself is valid, and not everything. e.g. a preproc containing expressions is valid inside expressions, but this won't be done at the top level.

Is there any way, using extras as before this PR, to still tolerate #ifs that happen inside expressions or statements? For example, this file appears to have more parse errors now than before.
