Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Some precompiled grammars aren't showing correct highlighting #918

Open
slevithan opened this issue Feb 4, 2025 · 5 comments
Open

Some precompiled grammars aren't showing correct highlighting #918

slevithan opened this issue Feb 4, 2025 · 5 comments
Labels
bug Something isn't working engine-js

Comments

@slevithan
Copy link
Collaborator

slevithan commented Feb 4, 2025

Some but not all JS raw (precompiled) grammars lead to incorrect highlighting. Examples where this is happening include python, html, perl, and yaml.

Compare the following two screenshots for highlighting Shiki's Python sample:

Using the WASM or JS engine (correct)

Image

Using the JS Raw engine with the precompiled grammar (incorrect)

Image

This gives the same broken result for Python with all the versions I tested (Shiki 2.3.1, 2.3.0, 2.2.0, and 2.0.3).

@slevithan slevithan added bug Something isn't working engine-js labels Feb 4, 2025
@slevithan slevithan changed the title Some grammars supported by the JS engine aren't showing correct highlighting Some precompiled grammars aren't showing correct highlighting Feb 5, 2025
@slevithan slevithan reopened this Feb 5, 2025
@antfu
Copy link
Member

antfu commented Feb 5, 2025

That's indeed weird. Can you reproduce that in the tests as well?

@slevithan
Copy link
Collaborator Author

slevithan commented Feb 5, 2025

Can you reproduce that in the tests as well?

Yes.

I just updated engine-javascript/test/raw.test.ts to additionally import @shikijs/langs-precompiled/python and then added the following test:

  expect(
    shiki.codeToHtml('"""test"""', { lang: 'python', theme: 'vitesse-light' }),
  )
    .toMatchInlineSnapshot(`"<pre class="shiki vitesse-light" style="background-color:#ffffff;color:#393a34" tabindex="0"><code><span class="line"><span style="color:#B5695977">"""</span><span style="color:#B56959">test</span><span style="color:#B5695977">"""</span></span></code></pre>"`)

That inline snapshot is from the non-raw JS engine.

The test failed with this error:

Expected: ""<pre class="shiki vitesse-light" style="background-color:#ffffff;color:#393a34" tabindex="0"><code><span class="line"><span style="color:#B5695977">"""</span><span style="color:#B56959">test</span><span style="color:#B5695977">"""</span></span></code></pre>""

Received: ""<pre class="shiki vitesse-light" style="background-color:#ffffff;color:#393a34" tabindex="0"><code><span class="line"><span style="color:#B5695977">"""</span><span style="color:#B56959">test"""</span></span></code></pre>""

Just an idea: Is it possible that some regexes (perhaps due to some character in their source) are not being serialized accurately into the precompiled output?

@slevithan
Copy link
Collaborator Author

slevithan commented Feb 6, 2025

By looking at the Python grammar's begin/end regexes for docstrings (which is where the problem starts in the screenshot and the test above), I think I've realized the cause of the problem.

vscode-textmate has a hacky system for replacing things that look like numbered backreferences in end/while patterns with the text actually matched by that capture number in the start pattern. In this way, an end/while pattern can refer back to something matched in its start pattern.

Since AFAIK the patterns don't go through this replacement step (that uses dynamic subpatterns matched at runtime) for precopmpiled grammars, they don't work correctly.

Alas, this means precompiled end/while regexes need some fundamental change. Either they shouldn't get precompiled at all, or maybe the JS Raw engine could manually reproduce the runtime replacement that vscode-textmate does.

It would fix the problem if we stopped precompiling any end/while patterns that include a match of JS regex /\\\d/ (which is a deeply flawed check for numbered backrefs for several reasons, but it's what vscode-textmate looks for). But if we wanted to avoid that, there are multiple additional problems.

Behavior for backreferences to nonparticipating capturing groups (NPCGs)

There's a difference between Oniguruma and JavaScript behavior for backreferences to NPCGs.

  • In JavaScript, due to a regex engine design mistake that's been there since the beginning (ES3), any backreference to a capture that hasn't (yet) participated in the match matches the empty string.
  • In every other modern regex flavor (including Oniguruma), such backreferences fail to match, causing the regex engine to backtrack or fail.

More context:

  • TextMate grammars, through the feature of sharing backreferences across patterns (IMO this was a really bad idea), invites TM grammar authors to write end/while patterns that are invalid Oniguruma regexes (with backreferences that refer back to nothing and would be an error, or that reference a capture in their own pattern that will not be respected).
  • Oniguruma-To-ES enables this TM system to work via its rules.allowOrphanBackrefs option, which not only suppresses the validation for numbered backreferences to captures that don't exist, but also adds the necessary number of additional captures needed at the end of the regex to make it so that the backreference isn't an error in JS (when using flag u/v).

So far, no problem. But because of the Oniguruma/JS difference in NPCG handling, backreferences to captures that can be determined at compile time to not be able to participate at that point in the regex are converted to (?!), which never matches. This is so that it follows the Oniguruma behavior. Actually there are multiple potential outcomes depending on how a backref appears:

  • It's a validation error when the referenced capturing group in the original pattern is to the right of the backreference (but this error is suppressed for numbered backrefs by allowOrphanBackrefs).
  • Duplicate group names (and subroutines) to the right of the backreference are not included in the backreference's multiplexing. (Backref multiplexing is an Oniguruma-specific feature where backrefs to duplicate group names rematch any of the values matched by groups of that name.)
  • The backreference fails to match (or leaves out of multiplexing) ancestor groups and groups in preceding alternation paths.

It's not important to grok all of this. At the end of the day, it just means that Oniguruma-To-ES is amazingly able to match Oniguruma's behavior for NPCGs in native JS regexes. But it's relevant because it means that the numbered backref that vscode-textmate will look for (in its hacky match-time pattern replacement) might not be present in the generated pattern.

A real example is the (\1) that the Python grammar uses as its end pattern for docstrings. To match Oniguruma behavior in JS, this gets converted to ((?!)). Now there's no '\\1' to be replaced by vscode-textmate. This is because there's an existing capturing group 1 in this pattern which hasn't yet finished participating when we see the \1 backreference to it.

It wouldn't help to add a property like _rawSource to all generated regexes and then have vscode-textmate (or some inlined code in the JS Raw engine) do a replacement on that, because the raw source is an Oniguruma pattern so we wouldn't be able to use it anyway in a precompiled grammar that doesn't do runtime transpilation.

I could update the allowOrphanBackrefs option to also not convert NPCG backrefs to (?!), and then hope for the best: that no real-world grammars rely on accurate reproduction of this Oniguruma behavior (which is probably true, since only deep regex experts would intentionally rely on it -- most people aren't even aware of the difference). I think this change would be fine to make, and we'd know right away via the JS engine report whether this causes a problem for any of the Shiki language samples. With this change to the allowOrphanBackrefs option, all of the numbered backreferences would remain in place, so could be targeted by the search and replace that TextMate engines do. And then the updated pattern with merged backrefs could just be passed to RegExp, without needing runtime transpilation.

That much I'd be comfortable with, but then there's another issue.

Searching in the generated JS pattern rather than the original Oniguruma pattern

Oniguruma-To-ES uses numbered backreferences for generated source even when the original pattern used named backreferences. This could lead to TextMate's backref-merging system replacing backreferences that it shouldn't (since vscode-textmate only looks for numbered backrefs, not named). And this isn't fully avoidable due to a variety of complexities (e.g. duplicate group names, backref multiplexing, recursion). Things might still end up okay for all of the current samples/grammars (I'm not sure), but we'd be introducing a potential bug vector, due to doing backref replacements on generated JS regex source rather than the original Oniguruma pattern.

I have ideas on how this bug vector could be avoided, but it would introduce several unfortunate layers of complexity in Oniguruma-To-ES and the JS Raw engine.


The only alternative to all of this seems to be runtime transpilation of end/while patterns that contain numbered backrefs, which would lose a meaningful chunk (but not all) of the benefit of precompilation for some (but not all) grammars. That's because it would require importing all of oniguruma-to-es rather than just its EmulatedRegExp class for the grammars that this affects.

But yeah, right now a lot of the precompiled grammars (more than a third) are broken because of this deep flaw (sometimes severely broken). To see the scale of the problem--i.e., the number of grammars relying on this unfortunate backref-merging feature--you can change allowOrphanBackrefs to false in engine-compile.js and then run the JS engine compat report. The number of supported grammars drops from 217 currently to 137. And this is possibly underestimating, since in cases where there are enough capturing groups in the end/while pattern it won't lead to an error.

What do you think? I'd be happy to update the Oniguruma-To-ES option I mentioned if you want to try going down that path and avoid runtime transpilation, but I'd rely on you for updates to the precompiled grammar system (doing the runtime replacement on generated regex source and passing that to RegExp).

Prior to hearing your thoughts/ideas, my recommendation would be to bite the bullet and do runtime transpilation for the small number of affected regexes in the 80+ precompiled grammars that are affected. (I'd also rely on you for this.)

...Or to not provide precompiled grammars for the 80+ that are currently broken. (Not my preference, but that might be needed anyway in the short term if will take longer to implement a solution.)

What do you think?

@RedCMD
Copy link

RedCMD commented Feb 6, 2025

both vscode-textmate and TextMate 2.0 physically replace the backreferences in end/while with the text captured in begin
they do both escape most of regex's meta characters
but there are some differences
vscode-textmate TextMate 2.0 bug report

vscode-textmate:
replaces any backreference in the form \\[0-9]+
including \\00009999999
this does mean that \\0 (and \\0000000) matches the entire begin text and not the octal escape null
any backreferences that do not capture text or do not exist (in the begin regex only!!) are simply replaced with nothing (not (?!))

TextMate 2.0:
only replaces backreferences \\[0-9], backreferences 0 to 9 (10 total)
\0 backreferences the begin while \00 is a backreference \0 and literal 0
https://github.com/textmate/textmate/blob/master/Frameworks/parse/src/parse.cc#L136

begin = 'abc',
end = '\12'

the backference \12 matches the text 2
if you change abc to ab(c), then it matches c2
while vscode matches nothing. cause capture group 12 doesn't exist


you can then do things like this

"begin": "[0-9]+",
"end": "a{\\0}"

which will match any number then proceed to match that many a's sometime after it

I've used this in my YAML grammar for numeric indented block-scalars

"begin": "(?>(\\|)|(>))(?<chomp>[+-])?+([1-9])(?(<chomp>)|\\g<chomp>)?+",
"while": "\\G(?> {\\4}| *+($|[^#]))"

or switching between atomic and non-capturing

"begin": "([:>])",
"end": "(?\\1abc)"

@slevithan
Copy link
Collaborator Author

slevithan commented Feb 6, 2025

Thanks—it's definitely helpful to know the precise details. However, just so we're clear for other people reading along, none of that is directly related to the what I posted. E.g., my mention of replacing some NPCG backrefs with (?!) during transpilation was not related to anything vscode-textmate is doing.

both vscode-textmate and TextMate 2.0 physically replace the backreferences in end/while with the text captured in begin they do both escape most of regex's meta characters but there are some differences vscode-textmate TextMate 2.0 bug report

Funny enough, I recognize the list of chars vscode-textmate is escaping. It comes (directly or indirectly) from my XRegExp library, which has used exactly that list going back more than 15 years. It's the relevant list for XRegExp's regex flavor, but not precisely appropriate for native JS RegExp or Oniguruma.

Aside: Your vscode-textmate bug report spells out the \s whitespace chars as space, tab, and line feed. But since it's a JS \s in vscode-textmate's source, it matches much more than that. JS's \s matches everything in \p{White_Space}, plus U+FEFF, minus U+0085 (which is different from Oniguruma's \s in those latter two details).

"begin": "[0-9]+",
"end": "a{\\0}"
"begin": "(?>(\\|)|(>))(?<chomp>[+-])?+([1-9])(?(<chomp>)|\\g<chomp>)?+",
"while": "\\G(?> {\\4}| *+($|[^#]))"
"begin": "([:>])",
"end": "(?\\1abc)"

Oh my god, that's creative but cursed. 😆

IMO it's a bad idea to rely on using this in ways that wouldn't work if the search for backrefs got smarter. These examples are cool, but are (intentionally) subverting the intention of undocumented (and poorly designed) behavior that could change. You're also preventing these patterns from ever being evaluated/validated as regexes, without first going through TM-specific runtime mangling.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working engine-js
Projects
None yet
Development

No branches or pull requests

3 participants