Misc. bug: tokenizer_st_partition: Slow tokenization time on 200+ turns of conversation with Gemma 3 #12724
Name and Version
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
Test code, Other (Please specify in the next section)
Command line
Problem description
Original issue: LostRuins#1453
Using a Gemma 3 model, I found that llama.cpp spends a lot of time tokenizing conversations of 200+ turns. Profiling results show it is mostly because of `std::string::find()` in `tokenizer_st_partition()`.
Steps to reproduce
With `llama-tokenize` I can confirm it does happen in the tokenization phase, specifically when parsing special tokens.
Possible Cause
`tokenizer_st_partition()` partitions the original prompt by each special token. If a previous special token has already broken the prompt into multiple parts, the next special token is searched for in each of these new parts one by one. For most models there are not many special tokens, so the process is fast. However, Gemma 3 has a lot of `<unusedXXX>` special tokens that come after common special tokens like `<start_of_turn>`, so we end up calling `std::string::find()` many more times (see the sketch below).
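To make the cost pattern concrete, here is a minimal sketch of that partitioning loop. It is only an illustration of the behavior described above, not the actual llama.cpp code: every special token rescans every fragment produced so far, so the number of `find()` calls grows with both the token count and the fragment count.

```cpp
#include <string>
#include <vector>

// Simplified, hypothetical version of the partitioning loop: for every special
// token, every fragment produced so far is scanned again with find().
static std::vector<std::string> partition_by_special_tokens(
        const std::string & prompt,
        const std::vector<std::string> & special_tokens) {
    std::vector<std::string> fragments = { prompt };

    for (const std::string & token : special_tokens) {
        std::vector<std::string> next;
        for (const std::string & frag : fragments) {
            size_t pos = 0;
            while (true) {
                // one std::string::find() per fragment per special token
                const size_t hit = frag.find(token, pos);
                if (hit == std::string::npos) {
                    next.push_back(frag.substr(pos)); // trailing raw text
                    break;
                }
                next.push_back(frag.substr(pos, hit - pos)); // raw text before the token
                next.push_back(token);                       // the special token itself
                pos = hit + token.size();
            }
        }
        fragments = std::move(next);
    }
    return fragments;
}
```

By the time the `<unusedXXX>` tokens are processed, a 200+ turn conversation has already been split into many fragments by tokens like `<start_of_turn>`, and each unused token scans all of them.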
Hacking around the issue
By patching the compare function to always sort tokens starting with `<unused` before all other special tokens, I can speed up the tokenization time to 0.29 s.
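For reference, this is the kind of ordering change I mean. The function and comparator below are illustrative only, not a patch against the real llama.cpp sort:

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Illustrative only: process the rarely-occurring <unusedXXX> tokens first,
// while the prompt is still a single fragment, so each of them costs a single
// find() instead of one find() per fragment created by earlier tokens.
static void sort_special_tokens(std::vector<std::string> & special_tokens) {
    std::stable_sort(special_tokens.begin(), special_tokens.end(),
        [](const std::string & a, const std::string & b) {
            const bool a_unused = a.rfind("<unused", 0) == 0; // starts_with "<unused"
            const bool b_unused = b.rfind("<unused", 0) == 0;
            return a_unused && !b_unused; // <unused...> sorts before everything else
        });
}
```

Because the `<unused...>` tokens rarely occur in real prompts, handling them while the prompt is still a single fragment keeps their cost at one `find()` each.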
Possible solution
To reduce the total number of `std::string::find()` executions, maybe we can sort these tokens by their number of occurrences in the raw text? Or what about searching for all of these tokens in the raw text, marking their locations, and finishing the partition process in a single iteration (see the sketch below)?
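A rough sketch of that second idea, assuming we can collect all matches up front and then cut the prompt once. Again, this is illustrative rather than existing llama.cpp code, and it glosses over how overlapping matches should be resolved:

```cpp
#include <algorithm>
#include <string>
#include <vector>

struct token_match {
    size_t pos; // byte offset in the raw text
    size_t len; // length of the matched special token
};

// Hypothetical single-pass partitioning: locate every occurrence of every
// special token first, then cut the raw text once, instead of re-scanning a
// growing list of fragments for each token.
static std::vector<std::string> partition_single_pass(
        const std::string & raw,
        const std::vector<std::string> & special_tokens) {
    std::vector<token_match> matches;
    for (const std::string & token : special_tokens) {
        for (size_t pos = raw.find(token); pos != std::string::npos;
             pos = raw.find(token, pos + token.size())) {
            matches.push_back({ pos, token.size() });
        }
    }

    // cut in positional order; overlapping matches simply lose here, a real
    // implementation would need a policy such as "longest match wins"
    std::sort(matches.begin(), matches.end(),
              [](const token_match & a, const token_match & b) { return a.pos < b.pos; });

    std::vector<std::string> fragments;
    size_t cur = 0;
    for (const token_match & m : matches) {
        if (m.pos < cur) {
            continue; // overlaps a token that was already consumed
        }
        if (m.pos > cur) {
            fragments.push_back(raw.substr(cur, m.pos - cur)); // raw text chunk
        }
        fragments.push_back(raw.substr(m.pos, m.len)); // the special token
        cur = m.pos + m.len;
    }
    if (cur < raw.size()) {
        fragments.push_back(raw.substr(cur)); // trailing raw text
    }
    return fragments;
}
```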
However, this is not the root cause of why these `find`s are so slow. I thought #12706 could solve the problem, but after trying it, it does not. It now seems that PR #12706 is the root cause.
First Bad Commit
No response
Relevant log output