Description
Name and Version
$ ./build-nixos-debug/bin/llama-cli --version
version: 5023 (10395195)
built with gcc (GCC) 13.3.0 for x86_64-unknown-linux-gnu
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
Test code, Other (Please specify in the next section)
Command line
llama-tokenize --model google_gemma-3-27b-it-Q6_K.gguf --file prompt.txt
Problem description
Original issue: LostRuins#1453
Using a Gemma 3 model, I found that llama.cpp spends a lot of time tokenizing conversations of 200+ turns. Profiling shows that most of the time is spent in std::string::find() inside tokenizer_st_partition().
Steps to reproduce
With llama-tokenize
I can confirm it does happen in the tokenization phase, specifically when parsing special tokens:
# 200 turns, parse special vs no parse special
$ printf '<start_of_turn>user\nhello<end_of_turn>\n%.0s' {1..200} > prompt.txt
$ time llama-tokenize --model google_gemma-3-27b-it-Q6_K.gguf --file prompt.txt
3.08user 0.22system 0:03.54elapsed 93%CPU (0avgtext+0avgdata 267304maxresident)
$ time llama-tokenize --model google_gemma-3-27b-it-Q6_K.gguf --file prompt.txt --no-parse-special
0.27user 0.28system 0:00.81elapsed 68%CPU (0avgtext+0avgdata 267436maxresident)
# 400 turns, parse special vs no parse special
$ printf '<start_of_turn>user\nhello<end_of_turn>\n%.0s' {1..400} > prompt.txt
$ time llama-tokenize --model google_gemma-3-27b-it-Q6_K.gguf --file prompt.txt
11.53user 0.29system 0:12.08elapsed 97%CPU (0avgtext+0avgdata 266932maxresident)
$ time llama-tokenize --model google_gemma-3-27b-it-Q6_K.gguf --file prompt.txt --no-parse-special
0.27user 0.31system 0:00.89elapsed 65%CPU (0avgtext+0avgdata 267108maxresident)
Possible Cause
tokenizer_st_partition() partitions the original prompt by each special token. If a previous special token has already broken the prompt into multiple fragments, the next special token is searched for in each of these fragments one by one.
For most models there are not many special tokens, so the process is fast. However, Gemma 3 has a lot of <unusedXXX> special tokens that come after common special tokens like <start_of_turn> in the special-token list. As a result, std::string::find() is called far more often.
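For illustration, here is a simplified sketch of that access pattern (hypothetical code, not the actual tokenizer_st_partition() implementation; the function name and data layout are made up). Every special token triggers a find() on every fragment left over from the previous tokens, so the number of find() calls grows roughly with (number of special tokens) x (number of fragments):

// Simplified sketch of the partitioning behavior (hypothetical, not llama.cpp code).
#include <string>
#include <vector>

std::vector<std::string> partition_by_special(
        const std::string &prompt,
        const std::vector<std::string> &special_tokens) {
    std::vector<std::string> fragments{prompt};
    for (const std::string &tok : special_tokens) {      // hundreds of tokens for Gemma 3
        std::vector<std::string> next;
        for (const std::string &frag : fragments) {      // fragments keep multiplying
            size_t start = 0;
            size_t pos;
            while ((pos = frag.find(tok, start)) != std::string::npos) { // the hot find()
                next.push_back(frag.substr(start, pos - start));  // raw text before the token
                next.push_back(tok);                              // the special token itself
                start = pos + tok.size();
            }
            next.push_back(frag.substr(start));                   // remaining raw text
        }
        fragments = std::move(next);
    }
    return fragments;
}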
Hacking around the issue
By patching the compare function to always sort tokens starting with <unused before all other special tokens, I can bring the tokenization time down to 0.29 s. A rough sketch of the reordering is below.
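A minimal sketch of what that reordering could look like (hypothetical helper, not the actual patch): the many <unusedXXX> tokens are searched first, while the prompt is still a single large fragment, before frequent tokens like <start_of_turn> split it into many small pieces.

// Hypothetical comparator sketch; the longest-first tiebreaker is only illustrative.
#include <algorithm>
#include <string>
#include <vector>

static bool starts_with_unused(const std::string &text) {
    return text.rfind("<unused", 0) == 0;   // true if text begins with "<unused"
}

void order_special_tokens(std::vector<std::string> &special_tokens) {
    std::stable_sort(special_tokens.begin(), special_tokens.end(),
        [](const std::string &a, const std::string &b) {
            const bool ua = starts_with_unused(a);
            const bool ub = starts_with_unused(b);
            if (ua != ub) {
                return ua;                   // <unused...> tokens come first
            }
            return a.size() > b.size();      // otherwise longest-first (illustrative tiebreaker)
        });
}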
Possible solution
To reduce the total number of std::string::find() calls, maybe we could sort these tokens by their number of occurrences in the raw text? Or we could search for all of these tokens in the raw text, mark their locations, and finish the partitioning in a single pass; a sketch of this second idea follows.
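A hedged sketch of the single-pass idea (hypothetical code, not something that exists in llama.cpp): first record every special-token occurrence, then cut the text in one left-to-right walk instead of re-scanning every fragment for every token.

// Hypothetical single-pass partitioning sketch (C++17).
#include <map>
#include <string>
#include <vector>

std::vector<std::string> partition_single_pass(
        const std::string &prompt,
        const std::vector<std::string> &special_tokens) {
    // Map from match position to matched token; on ties, keep the longest token.
    std::map<size_t, std::string> matches;
    for (const std::string &tok : special_tokens) {
        for (size_t pos = prompt.find(tok); pos != std::string::npos;
             pos = prompt.find(tok, pos + tok.size())) {
            auto it = matches.find(pos);
            if (it == matches.end() || tok.size() > it->second.size()) {
                matches[pos] = tok;
            }
        }
    }

    // Single left-to-right pass over the recorded matches.
    std::vector<std::string> fragments;
    size_t cur = 0;
    for (const auto &[pos, tok] : matches) {
        if (pos < cur) {
            continue;                          // overlaps an already-consumed match
        }
        if (pos > cur) {
            fragments.push_back(prompt.substr(cur, pos - cur));  // raw text
        }
        fragments.push_back(tok);              // the special token
        cur = pos + tok.size();
    }
    if (cur < prompt.size()) {
        fragments.push_back(prompt.substr(cur));
    }
    return fragments;
}

This still calls find() once per token over the whole prompt, but it avoids the per-fragment re-scanning; the match collection could also be replaced by a multi-pattern search (e.g. Aho-Corasick) if needed.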
However, this is not the root cause of why these find() calls are so slow. I thought #12706 could solve the problem, but after trying it, it does not. It seems PR #12706 does not address the root cause.
First Bad Commit
No response