Dual Epyc Genoa/Turin token generation performance bottleneck #11733
Part 1 - The Problem
I have temporary access to a dual-CPU Epyc Turin system. I did some initial performance tests with llama.cpp running on a single CPU:
For CPU 0:
For CPU 1:
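For reference, a minimal sketch of how such single-socket runs can be pinned with numactl (the model path, thread count, and node indices are placeholders and assume one NUMA node per CPU):

```bash
# Hypothetical single-socket runs; adjust node indices to your NPS setting.
numactl --cpunodebind=0 --membind=0 ./llama-bench -m model.gguf -t 32   # CPU 0
numactl --cpunodebind=1 --membind=1 ./llama-bench -m model.gguf -t 32   # CPU 1
```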
Unfortunately, when I run llama.cpp on both CPUs at once with `--numa distribute`, the prompt processing performance doubles, while the token generation performance stays at the same level as with a single CPU (actually it's even a bit worse):

Part 2 - The Workaround
I did some more tests and found something weird. If I run llama-bench with prompt processing and token generation tests with `--numa distribute` on a dual-CPU system, the result is:

But when I dropped caches and ran ONLY the generation test, it magically became faster:
So my current hypothesis is that the placement of tensors in memory resulting from prompt processing is for some reason sub-optimal for token generation. This is definitely something to investigate further.
But loading the model during token generation instead of prompt processing can be a viable workaround to the problem: if running a generation benchmark results in an optimal placement of tensors in memory, then simply run it first and you are done. The generation performance stays high after this even when running the combined benchmark:
I described this workaround in #11744 so that people can try it.
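As a concrete illustration of the workaround, here is a hedged sketch of the command sequence (the model path and token counts are placeholders; it assumes that `-p 0` makes llama-bench skip the prompt-processing test):

```bash
# Drop the page cache so the model weights are read from disk again:
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
# Run a token-generation-only benchmark first, so the weights are first
# touched by generation rather than by prompt processing:
./llama-bench -m model.gguf --numa distribute -p 0 -n 128
# A subsequent combined benchmark then keeps the fast generation rate:
./llama-bench -m model.gguf --numa distribute -p 512 -n 128
```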
Part 3 - The Cause
I did some more investigation on what causes this, and it seems to be related to `GGML_USE_LLAMAFILE` and the `llamafile_sgemm()` calls. If the model weights are loaded with these calls, the token generation performance is reduced. Example:

When I disable `GGML_USE_LLAMAFILE`, the token generation rate is not reduced (but prompt processing is much slower, since `llamafile_sgemm()` gives it a huge performance boost):

When I use the trick described in #11744 (with `GGML_USE_LLAMAFILE` enabled), it's possible to keep both the prompt processing and the token generation rate fast:
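For anyone who wants to reproduce the comparison, here is a hedged sketch of building without the llamafile/tinyBLAS sgemm path (assuming the CMake option is still named `GGML_LLAMAFILE`):

```bash
cmake -B build -DGGML_LLAMAFILE=OFF   # assumed option name; disables the llamafile_sgemm() path
cmake --build build --config Release -j
```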
Regarding the exact cause, it's a large number of remote NUMA node memory accesses during token generation if the model weights were loaded with `llamafile_sgemm()` calls. Measured with `numatop` during "slow" generation:

while during "fast" generation we have:
It seems that `llamafile_sgemm()` places the model weights in disk cache memory in such a way that a large number of remote NUMA node memory accesses is needed when using the weights during token generation.

Part 4 - The Solution
The simplest solution for this problem would be to warm up the model with token generation instead of prompt processing, so that `llamafile_sgemm()` calls are not used to load the model weights. I tested it by commenting out the EOS token in the creation of the warm-up batch (so that there's only a single token in this batch), and it seems to work. I verified it by running llama-cli and then llama-bench to measure the token generation rate. With a single token in the warm-up batch I have:

When there are two tokens (BOS and EOS) I have:
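A hedged sketch of the measurement sequence described above (the model path, prompt, and token counts are placeholders):

```bash
# Load the model via llama-cli so that its warm-up decode touches the weights first:
./llama-cli -m model.gguf --numa distribute -p "Hello" -n 16
# Then measure the token generation rate:
./llama-bench -m model.gguf --numa distribute -p 0 -n 128
```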
A disadvantage of this simple workaround is that disabling the warm-up with a command-line option would still cause a reduction of the token generation performance.
A proper fix for this problem would be a NUMA-aware matrix multiplication implementation which:
Another possible solution is an implementation of Megatron-LM-style tensor parallelism. In this case, each NUMA node would use only its associated part of the model weights and would keep them in local memory.