
Misc. bug: tokenizer_st_partition: Slow tokenization time on 200+ turns of conversation with Gemma 3 #12724


Closed
KiruyaMomochi opened this issue Apr 2, 2025 · 1 comment


KiruyaMomochi commented Apr 2, 2025

Name and Version

$ ./build-nixos-debug/bin/llama-cli --version
version: 5023 (10395195)
built with gcc (GCC) 13.3.0 for x86_64-unknown-linux-gnu

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

Test code, Other (Please specify in the next section)

Command line

llama-tokenize --model google_gemma-3-27b-it-Q6_K.gguf --file prompt.txt

Problem description

Original issue: LostRuins#1453

Using the Gemma 3 model, I found that llama.cpp spends a lot of time on conversations of 200+ turns. Profiling shows this is mostly due to std::string::find() in tokenizer_st_partition().

Steps to reproduce

With llama-tokenize I can confirm that it does happen in the tokenization phase, specifically when parsing special tokens:

# 200 turns, parse special vs no parse special
$ printf '<start_of_turn>user\nhello<end_of_turn>\n%.0s' {1..200} > prompt.txt

$ time llama-tokenize --model google_gemma-3-27b-it-Q6_K.gguf --file prompt.txt
3.08user 0.22system 0:03.54elapsed 93%CPU (0avgtext+0avgdata 267304maxresident)
$ time llama-tokenize --model google_gemma-3-27b-it-Q6_K.gguf --file prompt.txt --no-parse-special
0.27user 0.28system 0:00.81elapsed 68%CPU (0avgtext+0avgdata 267436maxresident)

# 400 turns, parse special vs no parse special
$ printf '<start_of_turn>user\nhello<end_of_turn>\n%.0s' {1..400} > prompt.txt

$ time llama-tokenize --model google_gemma-3-27b-it-Q6_K.gguf --file prompt.txt
11.53user 0.29system 0:12.08elapsed 97%CPU (0avgtext+0avgdata 266932maxresident)
$ time llama-tokenize --model google_gemma-3-27b-it-Q6_K.gguf --file prompt.txt --no-parse-special
0.27user 0.31system 0:00.89elapsed 65%CPU (0avgtext+0avgdata 267108maxresident)

Possible Cause

tokenizer_st_partition() partitions the original prompt by each special token. If a previous special token has broken the prompt into multiple fragments, the next special token is searched for in each of those new fragments one by one.

For most models there aren't many special tokens, so the process is fast. However, Gemma 3 has a lot of <unusedXXX> special tokens, which come after common special tokens like <start_of_turn>, so std::string::find() is called far more often. A simplified sketch of the cost follows.
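
To make the cost concrete, here is a minimal, self-contained C++ sketch of this partitioning strategy (my own simplification, not the actual llama.cpp code). Every special token is searched for in every fragment left over from the previous tokens, so the number of std::string::find() calls grows roughly with (number of special tokens) × (number of fragments):

#include <iostream>
#include <iterator>
#include <list>
#include <string>
#include <vector>

// Split `prompt` on each special token in turn. Fragments produced by one
// token are re-scanned for every later token, which is where the time goes.
std::list<std::string> partition(const std::string & prompt,
                                 const std::vector<std::string> & specials) {
    std::list<std::string> fragments{prompt};
    for (const std::string & sp : specials) {
        for (auto it = fragments.begin(); it != fragments.end(); ++it) {
            size_t pos;
            while ((pos = it->find(sp)) != std::string::npos) {
                std::string rest = it->substr(pos + sp.size());
                it->resize(pos);                            // keep the text before the match
                it = fragments.insert(std::next(it), sp);   // the special token itself
                it = fragments.insert(std::next(it), rest); // keep scanning the tail
            }
        }
    }
    return fragments;
}

int main() {
    std::string prompt;
    for (int i = 0; i < 200; ++i) {
        prompt += "<start_of_turn>user\nhello<end_of_turn>\n"; // same shape as the repro
    }
    // Toy vocabulary: the two real turn markers plus stand-ins for Gemma 3's
    // long tail of <unusedXXX> tokens, which each re-scan every fragment
    // produced so far.
    std::vector<std::string> specials = {"<start_of_turn>", "<end_of_turn>"};
    for (int i = 0; i < 100; ++i) {
        specials.push_back("<unused" + std::to_string(i) + ">");
    }
    std::cout << partition(prompt, specials).size() << " fragments\n";
}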

Hacking around the issue

By patching the comparison function to always sort tokens starting with <unused before all other special tokens, I was able to bring the tokenization time down to 0.29s.
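
For illustration, the reordering could look like the following (a hypothetical sketch over a plain std::vector<std::string>, not the actual patch; llama.cpp sorts its special-token cache by descending text length, and the tiebreak below is my assumption). The effect is that each <unusedXXX> token scans the prompt while it is still a single fragment, instead of scanning the thousands of small fragments produced by <start_of_turn> and <end_of_turn>:

#include <algorithm>
#include <string>
#include <vector>

// True if a special token's text begins with "<unused".
static bool is_unused(const std::string & s) {
    return s.rfind("<unused", 0) == 0;
}

// Move all <unusedXXX> tokens to the front; keep longest-first order otherwise.
void reorder_specials(std::vector<std::string> & specials) {
    std::stable_sort(specials.begin(), specials.end(),
        [](const std::string & a, const std::string & b) {
            if (is_unused(a) != is_unused(b)) {
                return is_unused(a);    // <unused...> sorts before everything else
            }
            return a.size() > b.size(); // otherwise longer tokens first
        });
}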

Possible solution

To reduce the total number of std::string::find() calls, maybe we can sort these tokens by their number of occurrences in the raw text? Or what about searching for all of these tokens in the raw text, marking their locations, and finishing the partition in a single pass (sketched below)?
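
Here is a rough sketch of that second idea, as my own illustration rather than a proposed patch: it still pays one scan of the raw text per special token, but the partitioning itself becomes a single left-to-right sweep instead of repeated splitting:

#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Record every occurrence of every special token, then cut the prompt once.
std::vector<std::string> partition_single_pass(
        const std::string & prompt,
        const std::vector<std::string> & specials) {
    std::map<size_t, size_t> hits; // match start -> match length
    for (const std::string & sp : specials) {
        for (size_t pos = prompt.find(sp); pos != std::string::npos;
             pos = prompt.find(sp, pos + 1)) {
            auto [it, inserted] = hits.try_emplace(pos, sp.size());
            if (!inserted && sp.size() > it->second) {
                it->second = sp.size(); // prefer the longest token at this position
            }
        }
    }
    std::vector<std::string> out;
    size_t cur = 0;
    for (const auto & [pos, len] : hits) {  // std::map iterates in position order
        if (pos < cur) continue;            // starts inside an already-consumed match
        if (pos > cur) out.push_back(prompt.substr(cur, pos - cur));
        out.push_back(prompt.substr(pos, len)); // the special token itself
        cur = pos + len;
    }
    if (cur < prompt.size()) out.push_back(prompt.substr(cur));
    return out;
}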

However, the token ordering is not the root cause of why these finds are so slow. I thought #12706 could solve the problem, but when I tried it, it did not, so it seemed PR #12706 did not address the root cause either.

First Bad Commit

No response

Relevant log output

KiruyaMomochi (Author) commented

The PR #12706 works really well. I'm sorry about earlier; I forgot to rebuild after applying the patch.

Repeating <end_of_turn>\n 500 times:

before #12706: 4.66s
after #12706: 0.24s
