llama-bench: enhance benchmark with improved token throughput measurements #12874

Conversation

@thevishalagarwal commented Apr 10, 2025

This PR adds separate measurements for prompt processing and token generation throughput in llama-bench. The changes allow for more detailed performance analysis by separately tracking and reporting:

  • Prompt processing throughput (prompt t/s)
  • Token generation throughput (gen t/s)

The current implementation of the t/s throughput metric is incorrect when the -pg flag is specified: it uses the formula (n_prompt+n_gen)/e2e_time, which does not accurately represent throughput and leads to misleading interpretations.
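
As a rough illustration of the difference (a minimal sketch with made-up timings, not the actual llama-bench code):

```python
# Hypothetical timings for one pp512+tg128 iteration; values are illustrative only
# and chosen to roughly match the example outputs below.
n_prompt, n_gen = 512, 128
t_prompt, t_gen = 0.013, 0.258  # seconds spent in prompt processing / token generation

combined_tps = (n_prompt + n_gen) / (t_prompt + t_gen)  # current master: ~2360 t/s
prompt_tps = n_prompt / t_prompt                        # ~39400 t/s
gen_tps = n_gen / t_gen                                 # ~496 t/s

print(f"combined {combined_tps:.0f} | prompt {prompt_tps:.0f} | gen {gen_tps:.0f} t/s")
```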

Benefits

  • More accurate and granular performance metrics
  • Better visibility into prompt processing vs token generation performance

Old output

> .\llama-bench.exe -m C:\drive\models\gguf\Qwen2.5-0.5B-Instruct-Q4_K_M.gguf -pg 512,128 -pg 1000,200
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4080, compute capability 8.9, VMM: yes
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| qwen2 1B Q4_K - Medium         | 373.71 MiB |   494.03 M | CUDA       |  99 |         pp512 |    39230.93 ± 650.08 |
| qwen2 1B Q4_K - Medium         | 373.71 MiB |   494.03 M | CUDA       |  99 |         tg128 |       496.01 ± 17.34 |
| qwen2 1B Q4_K - Medium         | 373.71 MiB |   494.03 M | CUDA       |  99 |   pp512+tg128 |       2292.94 ± 4.15 |
| qwen2 1B Q4_K - Medium         | 373.71 MiB |   494.03 M | CUDA       |  99 |  pp1000+tg200 |      2644.50 ± 14.50 |

New output

> .\llama-bench.exe -m C:\drive\models\gguf\Qwen2.5-0.5B-Instruct-Q4_K_M.gguf -pg 512,128 -pg 1000,200
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4080, compute capability 8.9, VMM: yes
| model                          |     params | backend    | ngl |          test |         prompt t/s |         gen t/s |
| ------------------------------ | ---------: | ---------- | --: | ------------: | -----------------: | --------------: |
| qwen2 1B Q4_K - Medium         |   494.03 M | CUDA       |  99 |         pp512 |  38748.35 ± 423.86 |     0.00 ± 0.00 |
| qwen2 1B Q4_K - Medium         |   494.03 M | CUDA       |  99 |         tg128 |        0.00 ± 0.00 |  491.16 ± 10.18 |
| qwen2 1B Q4_K - Medium         |   494.03 M | CUDA       |  99 |   pp512+tg128 | 38858.78 ± 2789.38 |   467.41 ± 9.05 |
| qwen2 1B Q4_K - Medium         |   494.03 M | CUDA       |  99 |  pp1000+tg200 |  38071.20 ± 956.56 |   462.01 ± 5.57 |

@thevishalagarwal changed the title from "Enhance llama-bench with improved token throughput measurements" to "llama-bench: enhance benchmark with improved token throughput measurements" on Apr 10, 2025

@JohannesGaessler (Collaborator) left a comment

My personal opinion is that a rate of tokens over both prompt processing and token generation is not a useful metric. This is because you are calculating the average of two clearly different phases of execution. I think a better metric would be just the total runtime of the test. Related discussion: #7199. In any case, I think the way the information is presented with this PR is an improvement over master, and I would still be willing to review and merge it unless someone else objects.

Other considerations:

  • With these changes the documentation in the README file has become outdated; please update it prior to merging.
  • The line width of the default prints is becoming too long, I think. I would be fine with dropping the model size and number of parameters.
  • I assume this PR will have broken scripts/compare_llama_bench.py. It would be nice if this were fixed, but I'm also fine with doing the fix myself.

@0cc4m (Collaborator) commented Apr 13, 2025

I agree that there is room for improvement here. I have never used the pp+tg tests because the output didn't give me any useful information, so I would like that to change. The way this is handled by other applications is by calculating the combined pp+tg number as the total test time divided by the number of tokens generated. This gives you a useful metric of how fast you can generate tokens in back-to-back requests with a specific prompt size to process each time.
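
Concretely (a minimal numeric sketch with made-up timings, not measured data):

```python
# "Back-to-back request" throughput as described above: total time for one
# pp512+tg128 request, normalized to the generated tokens only (illustrative numbers).
n_gen = 128
t_total = 0.271  # seconds for prompt processing + generation combined

back_to_back_tps = n_gen / t_total  # ~472 generated t/s per request cycle
```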

I don't think we should extend the table with separate pp and tg t/s counts, since the default tests keep them on separate rows anyways. That would only make sense if we wanted to default to pp+tg tests (which can also be discussed).

@JohannesGaessler (Collaborator)

The way this is handled by other applications is by calculating the combined pp+tg number as the total test time divided by the number of tokens generated. This gives you a useful metric of how fast you can generate tokens in back-to-back requests with a specific prompt size to process each time.

I disagree, that is in my view not a useful metric for comparison because the value that the rate is normalized to doesn't make sense.

I don't think we should extend the table with separate pp and tg t/s counts, since the default tests keep them on separate rows anyways. That would only make sense if we wanted to default to pp+tg tests (which can also be discussed).

What I think would be useful as a default for a table is generating some amount of tokens on an empty context and then the same amount of tokens with a non-empty context. From that you can roughly estimate both the maximum speed and how that speed declines with more context.

What I think would be best, but also high-effort, would be to first record the prompt processing and generation evaluation times in a differential way. Then, in a second step, you could fit a polynomial to the runtime as a function of context size and plot the results. A t/s value as a function of context size can be obtained by transforming the y axis.
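
A rough sketch of that second step, assuming the differential per-token times had already been recorded (hypothetical data, numpy assumed):

```python
import numpy as np

# Hypothetical per-token generation times (ms) recorded at different context depths.
depth = np.array([0, 1024, 2048, 4096, 8192])
ms_per_token = np.array([6.1, 6.5, 7.0, 8.2, 10.9])

# Fit a low-order polynomial to runtime per token as a function of context size ...
fit = np.poly1d(np.polyfit(depth, ms_per_token, deg=2))

# ... then transform the y axis to get t/s as a function of context size.
for d in (0, 2048, 4096, 8192):
    print(f"depth {d:5d}: ~{1000.0 / fit(d):.1f} t/s")
```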

@thevishalagarwal (Author) commented Apr 14, 2025

My personal opinion is that a rate of tokens over both prompt processing and token generation is not a useful metric. This is because you are calculating the average of two clearly different phases of execution. I think a better metric would be just the total runtime of the test.

Thanks for the review. I agree that e2e t/s is not a very useful metric. Separate pp and tg metrics are more useful since these are two distinct phases. Total runtime is also not very helpful IMO, since it varies with the prompt length and the number of tokens generated, and it doesn't give much insight into the performance of either the prompt or the generation phase.

Instead of total runtime, a better metric is time to first token (TTFT). This is an alternative to pp t/s. We can use TTFT if no one has any objection.
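
For a fixed prompt length the two carry the same information; as a quick illustration using the pp512 number from the output above:

```python
# TTFT is essentially the prompt length divided by the prompt-processing rate
# (ignoring the first decode step); prompt_tps taken from the pp512 row above.
n_prompt = 512
prompt_tps = 38748.35
ttft_ms = 1000.0 * n_prompt / prompt_tps  # ~13.2 ms time to first token
```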

IMO, the separate pp and tg tests don't make sense either. However, we should keep pp+tg tests as the default (if others agree). This is also consistent with other LLM-related libraries.

My final recommendation would be:

  • use TTFT and tg t/s as metrics
  • remove separate pp and tg tests as default. Prefer pp+tg tests

@JohannesGaessler (Collaborator)

use TTFT and tg t/s as metrics

No, I think for pp and tg on their own it makes more sense to provide t/s instead of the runtime; I only think it doesn't make sense to provide a t/s value for a mixture of pp and tg.

@thevishalagarwal (Author)

Separate metrics for pp/tg and pp+tg tests are confusing, and I don't think we should do that.

@0cc4m (Collaborator) commented Apr 15, 2025

The way this is handled by other applications is by calculating the combined pp+tg number as the total test time divided by the number of tokens generated. This gives you a useful metric of how fast you can generate tokens in back-to-back requests with a specific prompt size to process each time.

I disagree, that is in my view not a useful metric for comparison because the value that the rate is normalized to doesn't make sense.

Why? Tokens generated is the metric that the user cares about. Sure, it's less relevant than splitting up the metrics, but it is not useless.

I agree with a text generation test for empty and full context to get min and max expected speeds. A graph is even better, but would take too long to measure to make it the default.

@JohannesGaessler (Collaborator)

Why? Tokens generated is the metric that the user cares about. Sure, it's less relevant than splitting up the metrics, but it is not useless.

The relevant metrics for a good user experience, as I see them, are a low latency until the first token is generated and a high rate of tokens during generation. But because the initial latency scales with the length of the prompt, it makes more sense to instead provide a rate at which prompt tokens are processed. On a fundamental level, if more metrics are to be added they need to be justified in some way, either by providing useful information on their own or by facilitating comparisons. I don't see a situation where a rate of tokens relative to the runtime of pp + tg is ever useful information in isolation. And for comparisons of pp + tg runs, the total runtime is a better metric because lower/higher values correspond more directly to better/worse performance.

@thevishalagarwal (Author)

Updated the PR with prompt and gen t/s only, and also updated the README.


@JohannesGaessler (Collaborator) left a comment

I think the PR in its current state is good and a straight upgrade over master. To take better advantage of the new functionality I would suggest changing the default mode to -r 3 -p 0 -n 32 -pg 4096,32. In my testing, ~4096 context is enough to get good sensitivity to the slowdown of a filled context while remaining sufficiently precise. One concern could be that if pp is slow these defaults would take comparatively longer, but for the backends that I work with I don't think this will be an issue.

Example output:

johannesg@johannes-romed82t-00 ~/Projects/llama.cpp                                                                                                                  [11:43:35]
> $ ./bench --model models/opt/${model_name}-${quantization}.gguf -fa 0,1 -r 3 -p 0 -n 32 -pg 4096,32                                                         [±fa6cb8aec ●(✹)]
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
| model                          |     params | backend    | ngl | fa |          test |         prompt t/s |         gen t/s |
| ------------------------------ | ---------: | ---------- | --: | -: | ------------: | -----------------: | --------------: |
| llama 8B Q4_0                  |     8.03 B | CUDA       |  99 |  0 |          tg32 |        0.00 ± 0.00 |   161.30 ± 0.08 |
| llama 8B Q4_0                  |     8.03 B | CUDA       |  99 |  0 |   pp4096+tg32 |     7954.03 ± 4.12 |   136.15 ± 0.31 |
| llama 8B Q4_0                  |     8.03 B | CUDA       |  99 |  1 |          tg32 |        0.00 ± 0.00 |   163.38 ± 0.03 |
| llama 8B Q4_0                  |     8.03 B | CUDA       |  99 |  1 |   pp4096+tg32 |    12723.48 ± 1.46 |   145.96 ± 0.14 |

build: fa6cb8aec (5100)

@thevishalagarwal (Author)

Updated the default params


@JohannesGaessler (Collaborator) left a comment

Please update the README to reflect the change in defaults; it's enough to just modify the commands that produced the outputs where applicable. From my end, I think the changes are otherwise good to merge.


@JohannesGaessler (Collaborator) left a comment

Actually, also change what is being listed as defaults in the README.

@thevishalagarwal (Author)

Thanks for the review @JohannesGaessler

Comment on lines +114 to +121
| model | params | backend | ngl | n_batch | test | prompt t/s | gen t/s |
| --- | ---: | --- | --: | --: | ---: | ---: | ---: |
| llama 1B Q4_K - Medium | 1.24 B | CUDA | 99 | 128 | pp1024 | 17125.18 ± 731.13 | 0.00 ± 0.00 |
| llama 1B Q4_K - Medium | 1.24 B | CUDA | 99 | 128 | pp4096+tg32 | 12139.39 ± 446.63 | 378.76 ± 8.18 |
| llama 1B Q4_K - Medium | 1.24 B | CUDA | 99 | 256 | pp1024 | 24112.17 ± 161.18 | 0.00 ± 0.00 |
| llama 1B Q4_K - Medium | 1.24 B | CUDA | 99 | 256 | pp4096+tg32 | 14508.80 ± 53.00 | 386.58 ± 0.42 |
| llama 1B Q4_K - Medium | 1.24 B | CUDA | 99 | 512 | pp1024 | 25534.56 ± 368.03 | 0.00 ± 0.00 |
| llama 1B Q4_K - Medium | 1.24 B | CUDA | 99 | 512 | pp4096+tg32 | 15388.41 ± 13.06 | 386.30 ± 0.53 |
| llama 1B Q4_K - Medium | 1.24 B | CUDA | 99 | 1024 | pp1024 | 25654.61 ± 772.86 | 0.00 ± 0.00 |
| llama 1B Q4_K - Medium | 1.24 B | CUDA | 99 | 1024 | pp4096+tg32 | 15487.92 ± 8.59 | 385.20 ± 0.50 |

Collaborator

This is missing the point of the example; do the test with ./llama-bench -n 0 -pg 0,0 -p 1024 -b 128,256,512,1024 instead.

Comment on lines +131 to +148
| model | params | backend | ngl | threads | test | prompt t/s | gen t/s |
| --- | ---: | --- | --: | --: | ---: | ---: | ---: |
| llama 1B Q4_K - Medium | 1.24 B | CUDA | 99 | 1 | pp64 | 9229.99 ± 1897.41 | 0.00 ± 0.00 |
| llama 1B Q4_K - Medium | 1.24 B | CUDA | 99 | 1 | tg16 | 0.00 ± 0.00 | 444.33 ± 25.11 |
| llama 1B Q4_K - Medium | 1.24 B | CUDA | 99 | 1 | pp4096+tg32 | 15357.53 ± 27.52 | 373.90 ± 7.03 |
| llama 1B Q4_K - Medium | 1.24 B | CUDA | 99 | 2 | pp64 | 10799.57 ± 33.90 | 0.00 ± 0.00 |
| llama 1B Q4_K - Medium | 1.24 B | CUDA | 99 | 2 | tg16 | 0.00 ± 0.00 | 461.43 ± 10.99 |
| llama 1B Q4_K - Medium | 1.24 B | CUDA | 99 | 2 | pp4096+tg32 | 15371.18 ± 57.24 | 372.59 ± 4.02 |
| llama 1B Q4_K - Medium | 1.24 B | CUDA | 99 | 4 | pp64 | 11033.35 ± 177.05 | 0.00 ± 0.00 |
| llama 1B Q4_K - Medium | 1.24 B | CUDA | 99 | 4 | tg16 | 0.00 ± 0.00 | 448.57 ± 8.66 |
| llama 1B Q4_K - Medium | 1.24 B | CUDA | 99 | 4 | pp4096+tg32 | 15371.12 ± 43.70 | 376.71 ± 0.93 |
| llama 1B Q4_K - Medium | 1.24 B | CUDA | 99 | 8 | pp64 | 11206.45 ± 187.47 | 0.00 ± 0.00 |
| llama 1B Q4_K - Medium | 1.24 B | CUDA | 99 | 8 | tg16 | 0.00 ± 0.00 | 457.99 ± 6.92 |
| llama 1B Q4_K - Medium | 1.24 B | CUDA | 99 | 8 | pp4096+tg32 | 15022.14 ± 161.68 | 369.76 ± 4.71 |
| llama 1B Q4_K - Medium | 1.24 B | CUDA | 99 | 16 | pp64 | 10397.19 ± 304.08 | 0.00 ± 0.00 |
| llama 1B Q4_K - Medium | 1.24 B | CUDA | 99 | 16 | tg16 | 0.00 ± 0.00 | 457.53 ± 7.06 |
| llama 1B Q4_K - Medium | 1.24 B | CUDA | 99 | 16 | pp4096+tg32 | 15434.32 ± 158.08 | 372.00 ± 3.34 |
| llama 1B Q4_K - Medium | 1.24 B | CUDA | 99 | 32 | pp64 | 10588.34 ± 1043.71 | 0.00 ± 0.00 |
| llama 1B Q4_K - Medium | 1.24 B | CUDA | 99 | 32 | tg16 | 0.00 ± 0.00 | 468.10 ± 9.16 |
| llama 1B Q4_K - Medium | 1.24 B | CUDA | 99 | 32 | pp4096+tg32 | 15544.54 ± 4.30 | 374.14 ± 7.18 |

Collaborator

I didn't notice this before, but this example is supposed to show the effect of varying the number of threads. This does not have an effect with CUDA; please replace these numbers with a benchmark of the CPU backend. Since the CPU backend is comparatively slow, I'd recommend doing the test with -pg 0,0 and adding that to the listed command.

@slaren (Member) commented Apr 18, 2025

The output of llama-bench is already too wide and I really don't like adding a column that is most of the time going to be wasted space. Instead, I propose adding a parameter that specifies at what depth in the context the test should be run. Include the value of this parameter in the test column rather than adding a new one. For example:

$ llama-bench -d 1024

| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |  pp512 @ 1024 |      6082.24 ± 16.44 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |  tg128 @ 1024 |        170.76 ± 0.48 |
$ llama-bench -d 0,1024

| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |         pp512 |      7082.24 ± 16.44 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |         tg128 |        270.76 ± 0.48 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |  pp512 @ 1024 |      6082.24 ± 16.44 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |  tg128 @ 1024 |        170.76 ± 0.48 |

@thevishalagarwal (Author)

One tradeoff I can think of with the depth parameter is that we would need two different runs for the prompt and generation phase metrics, and overall more tests to run.

Most people only care about pp+tg test cases: both the prompt processing and the token generation rate for different prompt lengths. Prefill context is not so relevant as far as benchmarking is concerned.

E.g. for prompt lengths of 100 and 1000 and a generation length of 200:
With -d depth, we will need separate runs for the prompt and generation phase metrics.

# prompt processing 
.\llama-bench.exe -m E:\models\gguf\Llama-3.2-1B-Instruct-Q4_K_M.gguf -n 0 -p 100,1000                
  Device 0: NVIDIA GeForce RTX 4080, compute capability 8.9, VMM: yes
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| llama 1B Q4_K - Medium         | 762.81 MiB |     1.24 B | CUDA       |  99 |         pp100 |    14337.02 ± 222.57 |
| llama 1B Q4_K - Medium         | 762.81 MiB |     1.24 B | CUDA       |  99 |        pp1000 |   25439.49 ± 1285.17 |

# token generation
.\llama-bench.exe -m E:\models\gguf\Llama-3.2-1B-Instruct-Q4_K_M.gguf -n 200 -d 100,1000 -p 0
  Device 0: NVIDIA GeForce RTX 4080, compute capability 8.9, VMM: yes
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| llama 1B Q4_K - Medium         | 762.81 MiB |     1.24 B | CUDA       |  99 |  tg200 @ d100 |        455.32 ± 4.60 |
| llama 1B Q4_K - Medium         | 762.81 MiB |     1.24 B | CUDA       |  99 | tg200 @ d1000 |        434.44 ± 2.80 |

In a single run with the depth param, it will generate additional tests that are not relevant. In this case, 5 out of 9 tests are irrelevant:

.\llama-bench.exe -m E:\models\gguf\Llama-3.2-1B-Instruct-Q4_K_M.gguf -n 200 -d 0,100,1000 -p 100,1000
  Device 0: NVIDIA GeForce RTX 4080, compute capability 8.9, VMM: yes
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| llama 1B Q4_K - Medium         | 762.81 MiB |     1.24 B | CUDA       |  99 |         pp100 |    14071.16 ± 139.59 |
| llama 1B Q4_K - Medium         | 762.81 MiB |     1.24 B | CUDA       |  99 |        pp1000 |   24894.03 ± 1039.21 |
| llama 1B Q4_K - Medium         | 762.81 MiB |     1.24 B | CUDA       |  99 |         tg200 |        454.72 ± 8.31 |
| llama 1B Q4_K - Medium         | 762.81 MiB |     1.24 B | CUDA       |  99 |  pp100 @ d100 |   13278.06 ± 1616.27 |
| llama 1B Q4_K - Medium         | 762.81 MiB |     1.24 B | CUDA       |  99 | pp1000 @ d100 |    22437.71 ± 172.04 |
| llama 1B Q4_K - Medium         | 762.81 MiB |     1.24 B | CUDA       |  99 |  tg200 @ d100 |        456.36 ± 1.90 |
| llama 1B Q4_K - Medium         | 762.81 MiB |     1.24 B | CUDA       |  99 | pp100 @ d1000 |    12221.52 ± 211.34 |
| llama 1B Q4_K - Medium         | 762.81 MiB |     1.24 B | CUDA       |  99 | pp1000 @ d1000 |    16480.51 ± 117.88 |
| llama 1B Q4_K - Medium         | 762.81 MiB |     1.24 B | CUDA       |  99 | tg200 @ d1000 |        432.85 ± 3.44 |

On the other hand, the same can be achieved in just 2 tests with the proposed changes:

.\llama-bench.exe -m E:\models\gguf\Llama-3.2-1B-Instruct-Q4_K_M.gguf -p 0 -n 0 -pg 100,200 -pg 1000,200
  Device 0: NVIDIA GeForce RTX 4080, compute capability 8.9, VMM: yes
| model                          |     params | backend    | ngl |          test |         prompt t/s |         gen t/s |
| ------------------------------ | ---------: | ---------- | --: | ------------: | -----------------: | --------------: |
| llama 1B Q4_K - Medium         |     1.24 B | CUDA       |  99 |   pp100+tg200 |  14994.38 ± 537.18 |   458.38 ± 4.59 |
| llama 1B Q4_K - Medium         |     1.24 B | CUDA       |  99 |  pp1000+tg200 |  25695.93 ± 551.37 |   434.26 ± 0.88 |

@JohannesGaessler (Collaborator)

Prompt processing is (on the hardware I'm working with) very fast, so I don't particularly care whether or not that part is done multiple times. And since the contents of the KV cache are not relevant for llama-bench anyway, you could also just fill it with random data instead of evaluating the model.

@thevishalagarwal (Author)

It's not about how long the tests take to run; it's about avoiding unnecessary or irrelevant tests and extra steps that only serve to confuse an average user. The primary goal of benchmarking should be to provide meaningful, understandable metrics. Other frameworks (TRT-LLM, ORT GenAI) and most people only care about pp+tg tests. I think we should make it easier for people to run these benchmarks and make them easily comprehensible and comparable.

@JohannesGaessler (Collaborator)

I don't think this is a large issue. If you want to use the markdown tables you can just run llama-bench multiple times and create a combined table by copy-pasting only the body for subsequent runs. And if you export the data as CSV or SQL it basically doesn't make a difference.
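
For instance, a minimal sketch of stitching two runs exported with -o csv into one table (pandas assumed, file names are placeholders):

```python
import pandas as pd

# Combine two separate llama-bench runs exported with `-o csv` into one table.
runs = [pd.read_csv(name) for name in ("run_pp.csv", "run_tg.csv")]
combined = pd.concat(runs, ignore_index=True)
print(combined.to_markdown(index=False))
```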

@JohannesGaessler (Collaborator)

We could maybe also keep the -pg CLI argument but make it so that it maps to some combinations of -p, -n, and -d that are not just the products of all provided values.
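
To make that concrete, one hypothetical expansion (not existing llama-bench behaviour) could turn each prompt/generation pair into a pp test at depth 0 plus a tg test at the corresponding depth, instead of the full cross product:

```python
# Hypothetical mapping of prompt/generation pairs to (n_prompt, n_gen, depth) test
# cases, rather than the cross product of all -p, -n and -d values.
def expand_pairs(pairs):
    tests = []
    for n_prompt, n_gen in pairs:
        tests.append({"n_prompt": n_prompt, "n_gen": 0, "depth": 0})          # pp<N>
        tests.append({"n_prompt": 0, "n_gen": n_gen, "depth": n_prompt})      # tg<M> @ d<N>
    return tests

print(expand_pairs([(512, 128), (4096, 32)]))
```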

@thevishalagarwal (Author)

Created a separate PR to add the depth arg. We can close this one.
