-
Notifications
You must be signed in to change notification settings - Fork 11.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use string_view::find() to search for tokenization to speed up #12706
Conversation
…ry looking up beyond the fragment range
Could you give some sample numbers to get a feeling about the perf impact of this change? |
@KiruyaMomochi shared the issue to me to ask for help with the information ' In my test on WSL2 Ubuntu 24.04 with gcc 13.3.0 @ AMD 5950x, building with only To reproduce it:
I also tried changing |
My bad, it works like a charm. I didn't recompile after applying the patch. Also, @LostRuins from LostRuins#1453. |
// check if match is within bounds of offset <-> length | ||
if (match + text.length() > raw_text_base_offset + raw_text_base_length) break; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not following why this check is no longer needed.
If someone can confirm the correctness - feel free to merge. Otherwise I'll take a detailed look later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's checking if the match exceeds the length and drop the match if it does, as I have pointed out, it's doing this only because it searches for a match in [offset, end_of_string)
instead of the expected range [offset, offset + length)
, so this extra check is to guarantee the behavior is same as searching within the expected range.
now we use string_view::find()
to search the expected range directly hence we dont need this extra check.
The old implementation using
std::string::find()
consumes a lot of time if the string is very long, because it searches beyond the fragment range, looking up for tokens after the fragment end (and later drop the match result if it is beyond the range).O(n) wasted.
Furthermore, a long string produces more fragments and for each fragment, then a token is searched again and again in the fragments, i.e. It searches within
[offset1, end) [offset2, end) [offset3, end) ...
where offsetn is the nth of the fragments.So actually O(n^2) wasted.
Hope this pic helps understand the issue more easily.

By limiting the search area (esp. the end) using
std::string_view::find()
, we can avoid such unnecessary looking up.This PR contains the minimal code change to solve the performance pitfall, to at least make it work.
Maybe someone should refactor the whole fragment stuff with
std::string_view
for code readability in the future.