Dual Epyc Genoa/Turin token generation performance bottleneck #11733
Part 1 - The Problem
I have temporary access to a dual-CPU Epyc Turin system. I did some initial performance tests with llama.cpp running on a single CPU:
For CPU 0:
For CPU 1:
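For reference, a minimal sketch of how such single-socket runs can be pinned with numactl (the model path, thread count, and node indices are placeholders and assume one NUMA node per CPU):

```bash
# Hypothetical single-socket runs; adjust node indices to your NPS setting.
numactl --cpunodebind=0 --membind=0 ./llama-bench -m model.gguf -t 32   # CPU 0
numactl --cpunodebind=1 --membind=1 ./llama-bench -m model.gguf -t 32   # CPU 1
```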
Unfortunately, when I run llama.cpp on both CPUs at once with `--numa distribute`, the prompt processing performance doubles, while the token generation performance stays at the same level as with a single CPU (actually it's even a bit worse):

Part 2 - The Workaround
I did some more tests and found something weird. If I run llama-bench with prompt processing and token generation tests with `--numa distribute` on a dual-CPU system, the result is:

But when I dropped caches and ran ONLY the generation test, it magically became faster:
So my current hypothesis is that the placement of tensors in memory resulting from prompt processing is for some reason sub-optimal for token generation. This is definitely something to investigate further.
But loading the model during token generation instead of prompt processing can be a viable workaround to the problem: if running a generation benchmark results in an optimal placement of tensors in memory, then simply run it first and you are done. The generation performance stays high after this even when running the combined benchmark:
I described this workaround in #11744 so that people can try it.
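As a concrete illustration of the workaround, here is a hedged sketch of the command sequence (the model path and token counts are placeholders; it assumes that `-p 0` makes llama-bench skip the prompt-processing test):

```bash
# Drop the page cache so the model weights are read from disk again:
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
# Run a token-generation-only benchmark first, so the weights are first
# touched by generation rather than by prompt processing:
./llama-bench -m model.gguf --numa distribute -p 0 -n 128
# A subsequent combined benchmark then keeps the fast generation rate:
./llama-bench -m model.gguf --numa distribute -p 512 -n 128
```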
Part 3 - The Cause
I did some more investigation on what causes this, and it seems to be related to `GGML_USE_LLAMAFILE` and the `llamafile_sgemm()` calls. If the model weights are loaded with these calls, the token generation performance is reduced. Example:

When I disable `GGML_USE_LLAMAFILE`, the token generation rate is not reduced (but prompt processing is much slower, since `llamafile_sgemm()` gives it a huge performance boost):

When I use the trick described in #11744 (with `GGML_USE_LLAMAFILE` enabled), it's possible to keep both the prompt processing and the token generation rate fast:
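For anyone who wants to reproduce the comparison, here is a hedged sketch of building without the llamafile/tinyBLAS sgemm path (assuming the CMake option is still named `GGML_LLAMAFILE`):

```bash
cmake -B build -DGGML_LLAMAFILE=OFF   # assumed option name; disables the llamafile_sgemm() path
cmake --build build --config Release -j
```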
Regarding the exact cause, it's a large number of remote NUMA node memory accesses during token generation if the model weights were loaded with `llamafile_sgemm()` calls. Measured with `numatop` during "slow" generation:

while during "fast" generation we have:
It seems that `llamafile_sgemm()` places the model weights in disk cache memory in such a way that a large number of remote NUMA node memory accesses is needed when using the weights during token generation.

Part 4 - The Solution
The simplest solution for this problem would be to warm up the model with token generation instead of prompt processing, so that `llamafile_sgemm()` calls are not used to load the model weights. I tested it by commenting out the EOS token in the creation of the warm-up batch (so that there's only a single token in this batch), and it seems to work. I verified it by running llama-cli and then llama-bench to measure the token generation rate. With a single token in the warm-up batch I have:

When there are two tokens (BOS and EOS) I have:
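A hedged sketch of the measurement sequence described above (the model path, prompt, and token counts are placeholders):

```bash
# Load the model via llama-cli so that its warm-up decode touches the weights first:
./llama-cli -m model.gguf --numa distribute -p "Hello" -n 16
# Then measure the token generation rate:
./llama-bench -m model.gguf --numa distribute -p 0 -n 128
```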
A disadvantage of this simple workaround is that disabling the warm-up with a command-line option would still cause a reduction of the token generation performance.
A proper fix for this problem would be a NUMA-aware matrix multiplication implementation which:
Another possible solution is an implementation of Megatron-LM-style tensor parallelism. In this case, each NUMA node would use only its associated part of the model weights and would keep them in local memory.