Some precompiled grammars aren't showing correct highlighting #918
Comments
That's indeed weird. Can you reproduce that in the tests as well?
Yes. I just updated
expect(
  shiki.codeToHtml('"""test"""', { lang: 'python', theme: 'vitesse-light' }),
)
  .toMatchInlineSnapshot(`"<pre class="shiki vitesse-light" style="background-color:#ffffff;color:#393a34" tabindex="0"><code><span class="line"><span style="color:#B5695977">"""</span><span style="color:#B56959">test</span><span style="color:#B5695977">"""</span></span></code></pre>"`)
That inline snapshot is from the non-raw JS engine. The test failed with this error:
Just an idea: Is it possible that some regexes (perhaps due to some character in their source) are not being serialized accurately into the precompiled output?
By looking at the Python grammar's
Since AFAIK the patterns don't go through this replacement step (that uses dynamic subpatterns matched at runtime) for precompiled grammars, they don't work correctly. Alas, this means precompiled
It would fix the problem if we stopped precompiling any
Behavior for backreferences to nonparticipating capturing groups (NPCGs)
There's a difference between Oniguruma and JavaScript behavior for backreferences to NPCGs.
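To make the NPCG difference concrete, here is a minimal sketch of the JavaScript side only. The pattern and variable names are made up for illustration and are not taken from any Shiki grammar. In JavaScript, a backreference to a group that did not participate matches the empty string; Oniguruma instead fails to match at that point.

```typescript
// Group 1 participates only when the input starts with "a".
const optionalCapture = /^(?:(a)|b)\1$/;

// "b": group 1 did not participate. JavaScript treats \1 as an empty
// match, so the overall match succeeds. Oniguruma would reject this input.
const jsAcceptsB: boolean = optionalCapture.test("b");

// "aa": group 1 captured "a", so \1 has to match a second "a".
const jsAcceptsAA: boolean = optionalCapture.test("aa");

// "a": group 1 captured "a", but nothing is left for \1 to match.
const jsAcceptsA: boolean = optionalCapture.test("a");

console.log({ jsAcceptsB, jsAcceptsAA, jsAcceptsA });
```

The first case is the one that matters here: a pattern copied verbatim from Oniguruma to JavaScript would start accepting input that Oniguruma rejects.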
More context:
So far, no problem. But because of the Oniguruma/JS difference in NPCG handling, backreferences to captures that can be determined at compile time to not be able to participate at that point in the regex are converted to
It's not important to grok all of this. At the end of the day, it just means that Oniguruma-To-ES is amazingly able to match Oniguruma's behavior for NPCGs in native JS regexes. But it's relevant because it means that the numbered backref that
A real example is the
It wouldn't help to add a property like
I could update the
That much I'd be comfortable with, but then there's another issue.
Searching in the generated JS pattern rather than the original Oniguruma pattern
Oniguruma-To-ES uses numbered backreferences for generated source even when the original pattern used named backreferences. This could lead to TextMate's backref-merging system replacing backreferences that it shouldn't (since
I have ideas on how this bug vector could be avoided, but it would introduce several unfortunate layers of complexity in Oniguruma-To-ES and the JS Raw engine.
The only alternative to all of this seems to be runtime transpilation of
But yeah, right now a lot of the precompiled grammars (more than a third) are broken because of this deep flaw (sometimes severely broken). To see the scale of the problem--i.e., the number of grammars relying on this unfortunate backref-merging feature--you can change
What do you think? I'd be happy to update the Oniguruma-To-ES option I mentioned if you want to try going down that path and avoid runtime transpilation, but I'd rely on you for updates to the precompiled grammar system (doing the runtime replacement on generated regex source and passing that to
Prior to hearing your thoughts/ideas, my recommendation would be to bite the bullet and do runtime transpilation for the small number of affected regexes in the 80+ precompiled grammars that are affected. (I'd also rely on you for this.)
...Or to not provide precompiled grammars for the 80+ that are currently broken. (Not my preference, but that might be needed anyway in the short term if it will take longer to implement a solution.)
What do you think?
both
TextMate 2.0:
begin = 'abc',
end = '\12'
the backreference
you can then do things like this
"begin": "[0-9]+",
"end": "a{\\0}"
which will match any number then proceed to match that many
I've used this in my YAML grammar for numeric indented block-scalars
"begin": "(?>(\\|)|(>))(?<chomp>[+-])?+([1-9])(?(<chomp>)|\\g<chomp>)?+",
"while": "\\G(?> {\\4}| *+($|[^#]))"
or switching between atomic and non-capturing
"begin": "([:>])",
"end": "(?\\1abc)"
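For readers following along, the begin/end substitution these examples rely on can be sketched roughly as below. This is a simplified illustration, not the actual code of TextMate, vscode-textmate, or Shiki: `escapeRegex` and `resolveEndPattern` are hypothetical names, and both the escaped character set and the naive `\N` search are assumptions.

```typescript
// Escape common regex metacharacters so captured text is matched literally.
// (Assumed character set; real engines may escape a different list.)
function escapeRegex(text: string): string {
  return text.replace(/[\\^$.*+?()[\]{}|]/g, "\\$&");
}

// Naively replace every \N in the end-pattern source with the escaped text
// that capture N matched in the begin pattern (\0 is the whole match).
function resolveEndPattern(endSource: string, beginMatch: RegExpExecArray): string {
  return endSource.replace(/\\(\d+)/g, (_whole: string, digits: string) =>
    escapeRegex(beginMatch[Number(digits)] ?? ""),
  );
}

// The quantifier trick: begin "[0-9]+" matched "3", so "a{\0}" becomes "a{3}".
const countMatch = /[0-9]+/.exec("3aaa")!;
const countedEnd = resolveEndPattern("a{\\0}", countMatch);

// An ordinary case: the captured "*" is escaped before being spliced in.
const markerMatch = /([*_])/.exec("*bold*")!;
const markerEnd = resolveEndPattern("\\1", markerMatch);

console.log(countedEnd, markerEnd);
```

Because the replacement is a plain text search over the pattern source, it only works when the `\N` tokens in that source still mean what the grammar author intended. A precompiled pattern whose backreferences were renumbered or rewritten during transpilation no longer satisfies that.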
Thanks, it's definitely helpful to know the precise details. However, just so we're clear for other people reading along, none of that is directly related to what I posted. E.g., my mention of replacing some NPCG backrefs with
Funny enough, I recognize the list of chars Aside: Your
Oh my god, that's creative but cursed. 😆 IMO it's a bad idea to rely on using this in ways that wouldn't work if the search for backrefs got smarter. These examples are cool, but are (intentionally) subverting the intention of undocumented (and poorly designed) behavior that could change. You're also preventing these patterns from ever being evaluated/validated as regexes, without first going through TM-specific runtime mangling.
Some but not all JS raw (precompiled) grammars lead to incorrect highlighting. Examples where this is happening include python, html, perl, and yaml.
Compare the following two screenshots for highlighting Shiki's Python sample:
Using the WASM or JS engine (correct)
Using the JS Raw engine with the precompiled grammar (incorrect)
This gives the same broken result for Python with all the versions I tested (Shiki 2.3.1, 2.3.0, 2.2.0, and 2.0.3).