Description
Name and Version
$ ./build-nixos-debug/bin/llama-cli --version
version: 5023 (10395195)
built with gcc (GCC) 13.3.0 for x86_64-unknown-linux-gnu
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
Test code, Other (Please specify in the next section)
Command line
llama-tokenize --model google_gemma-3-27b-it-Q6_K.gguf --file prompt.txt
Problem description
Original issue: LostRuins#1453
Using a Gemma 3 model, I found that llama.cpp spends a lot of time tokenizing conversations of 200+ turns. Profiling shows that most of the time is spent in std::string::find() inside tokenizer_st_partition().
Steps to reproduce
With llama-tokenize
I can confirm it does happen in the tokenization phase, specifically when parsing special tokens:
# 200 turns, parse special vs no parse special
$ printf '<start_of_turn>user\nhello<end_of_turn>\n%.0s' {1..200} > prompt.txt
$ time llama-tokenize --model google_gemma-3-27b-it-Q6_K.gguf --file prompt.txt
3.08user 0.22system 0:03.54elapsed 93%CPU (0avgtext+0avgdata 267304maxresident)
$ time llama-tokenize --model google_gemma-3-27b-it-Q6_K.gguf --file prompt.txt --no-parse-special
0.27user 0.28system 0:00.81elapsed 68%CPU (0avgtext+0avgdata 267436maxresident)
# 400 turns, parse special vs no parse special
$ printf '<start_of_turn>user\nhello<end_of_turn>\n%.0s' {1..400} > prompt.txt
$ time llama-tokenize --model google_gemma-3-27b-it-Q6_K.gguf --file prompt.txt
11.53user 0.29system 0:12.08elapsed 97%CPU (0avgtext+0avgdata 266932maxresident)
$ time llama-tokenize --model google_gemma-3-27b-it-Q6_K.gguf --file prompt.txt --no-parse-special
0.27user 0.31system 0:00.89elapsed 65%CPU (0avgtext+0avgdata 267108maxresident)
Possible Cause
tokenizer_st_partition() partitions the original prompt by each special token. If a previous special token has already broken the prompt into multiple fragments, the next special token is searched for in each of these fragments one by one.
For most models there are not many special tokens, so the process is fast. However, Gemma 3 has a lot of <unusedXXX> special tokens that come after common special tokens like <start_of_turn> in the special-token list. As a result, std::string::find() is called far more often.
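For illustration, here is a simplified sketch of that access pattern (hypothetical code, not the actual tokenizer_st_partition() implementation; the function name and data layout are made up). Every special token triggers a find() on every fragment left over from the previous tokens, so the number of find() calls grows roughly with (number of special tokens) x (number of fragments):

// Simplified sketch of the partitioning behavior (hypothetical, not llama.cpp code).
#include <string>
#include <vector>

std::vector<std::string> partition_by_special(
        const std::string &prompt,
        const std::vector<std::string> &special_tokens) {
    std::vector<std::string> fragments{prompt};
    for (const std::string &tok : special_tokens) {      // hundreds of tokens for Gemma 3
        std::vector<std::string> next;
        for (const std::string &frag : fragments) {      // fragments keep multiplying
            size_t start = 0;
            size_t pos;
            while ((pos = frag.find(tok, start)) != std::string::npos) { // the hot find()
                next.push_back(frag.substr(start, pos - start));  // raw text before the token
                next.push_back(tok);                              // the special token itself
                start = pos + tok.size();
            }
            next.push_back(frag.substr(start));                   // remaining raw text
        }
        fragments = std::move(next);
    }
    return fragments;
}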
Hacking around the issue
By patching the compare function to always sort tokens starting with <unused before all other special tokens, I can bring the tokenization time down to 0.29 s. A rough sketch of the reordering is below.
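A minimal sketch of what that reordering could look like (hypothetical helper, not the actual patch): the many <unusedXXX> tokens are searched first, while the prompt is still a single large fragment, before frequent tokens like <start_of_turn> split it into many small pieces.

// Hypothetical comparator sketch; the longest-first tiebreaker is only illustrative.
#include <algorithm>
#include <string>
#include <vector>

static bool starts_with_unused(const std::string &text) {
    return text.rfind("<unused", 0) == 0;   // true if text begins with "<unused"
}

void order_special_tokens(std::vector<std::string> &special_tokens) {
    std::stable_sort(special_tokens.begin(), special_tokens.end(),
        [](const std::string &a, const std::string &b) {
            const bool ua = starts_with_unused(a);
            const bool ub = starts_with_unused(b);
            if (ua != ub) {
                return ua;                   // <unused...> tokens come first
            }
            return a.size() > b.size();      // otherwise longest-first (illustrative tiebreaker)
        });
}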
Possible solution
To reduce the total number of std::string::find() calls, maybe we could sort these tokens by their number of occurrences in the raw text? Or we could search for all of these tokens in the raw text, mark their locations, and finish the partitioning in a single pass; a sketch of this second idea follows.
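A hedged sketch of the single-pass idea (hypothetical code, not something that exists in llama.cpp): first record every special-token occurrence, then cut the text in one left-to-right walk instead of re-scanning every fragment for every token.

// Hypothetical single-pass partitioning sketch (C++17).
#include <map>
#include <string>
#include <vector>

std::vector<std::string> partition_single_pass(
        const std::string &prompt,
        const std::vector<std::string> &special_tokens) {
    // Map from match position to matched token; on ties, keep the longest token.
    std::map<size_t, std::string> matches;
    for (const std::string &tok : special_tokens) {
        for (size_t pos = prompt.find(tok); pos != std::string::npos;
             pos = prompt.find(tok, pos + tok.size())) {
            auto it = matches.find(pos);
            if (it == matches.end() || tok.size() > it->second.size()) {
                matches[pos] = tok;
            }
        }
    }

    // Single left-to-right pass over the recorded matches.
    std::vector<std::string> fragments;
    size_t cur = 0;
    for (const auto &[pos, tok] : matches) {
        if (pos < cur) {
            continue;                          // overlaps an already-consumed match
        }
        if (pos > cur) {
            fragments.push_back(prompt.substr(cur, pos - cur));  // raw text
        }
        fragments.push_back(tok);              // the special token
        cur = pos + tok.size();
    }
    if (cur < prompt.size()) {
        fragments.push_back(prompt.substr(cur));
    }
    return fragments;
}

This still calls find() once per token over the whole prompt, but it avoids the per-fragment re-scanning; the match collection could also be replaced by a multi-pattern search (e.g. Aho-Corasick) if needed.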
However, this is not the root cause of why these find() calls are so slow. I thought #12706 could solve the problem, but after trying it, it does not. It seems PR #12706 does not address the root cause.
First Bad Commit
No response