vulkan: enable coopmat2 FA gqa and split_k optimizations more often #12931

Open · wants to merge 1 commit into master

Conversation

jeffbolznv (Collaborator)

The grouped query attention optimization doesn't require a power-of-two ratio; the only thing relying on it was the modulo operation, which was written as a bitwise &.
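To make the constraint concrete, here is a minimal standalone sketch (`kv_group` is a hypothetical name, not the actual shader code): the `x & (n - 1)` idiom only matches `x % n` when `n` is a power of two, so switching to a plain `%` lifts the power-of-two requirement.

```cpp
#include <cassert>
#include <cstdint>

// Sketch of why the bitwise-AND form needed a power-of-two gqa_ratio:
// x & (n - 1) equals x % n only when n is a power of two.
uint32_t kv_group(uint32_t row, uint32_t gqa_ratio) {
    // old fast path: return row & (gqa_ratio - 1);  // power-of-two only
    return row % gqa_ratio; // correct for any gqa_ratio
}

int main() {
    assert(kv_group(7, 8) == 7u);   // power of two: both forms agree (7 & 7 == 7)
    assert(kv_group(7, 6) == 1u);   // 7 % 6 == 1
    assert((7u & (6u - 1u)) == 5u); // the old & form would give 5 here, which is wrong
    return 0;
}
```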

split_k need not depend on gqa_ratio: enable it any time there's only one workgroup in the X dimension. The shader gets the split index from the x coordinate, and multiple workgroups in the X dimension (pre-split) indicate a larger FA operation that wouldn't need splitting.
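A hedged host-side sketch of that decision (`choose_split_k`, `workgroups_x`, `shader_core_count`, and the cap of 16 are illustrative names and values, not the actual ggml-vulkan code):

```cpp
#include <algorithm>
#include <cstdint>

// Illustrative heuristic, not the real ggml-vulkan source: split along K
// only when the pre-split dispatch has a single workgroup in X. More than
// one workgroup in X means a larger FA op with enough parallelism already.
uint32_t choose_split_k(uint32_t workgroups_x, uint32_t shader_core_count) {
    if (workgroups_x > 1) {
        return 1; // no split needed
    }
    const uint32_t max_split_k = 16; // assumed cap, for illustration only
    return std::min(std::max(shader_core_count, 1u), max_split_k);
}
```

The dispatch would then scale the X dimension by the chosen split_k, and each workgroup reads its split index back out of the x coordinate.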

Perf results:

before:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -m C:\models\second-state\StarCoder2-7B-GGUF\starcoder2-7b-Q4_0.gguf -m C:\models\Qwen2-VL-7B-Instruct-IQ4_NL.gguf -m C:\models\Phi-3-mini-4k-instruct-q4.gguf -fa 1 -p 0 -n 8192 --repetitions 1
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 4070 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ------------: | -------------------: |
| starcoder2 7B Q4_0             |   3.76 GiB |     7.17 B | Vulkan     |  99 |  1 |        tg8192 |         55.98 ± 0.00 |
| qwen2vl 7B IQ4_NL - 4.5 bpw    |   4.13 GiB |     7.62 B | Vulkan     |  99 |  1 |        tg8192 |         57.88 ± 0.00 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |        tg8192 |         68.98 ± 0.00 |

after:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -m C:\models\second-state\StarCoder2-7B-GGUF\starcoder2-7b-Q4_0.gguf -m C:\models\Qwen2-VL-7B-Instruct-IQ4_NL.gguf -m C:\models\Phi-3-mini-4k-instruct-q4.gguf -fa 1 -p 0 -n 8192 --repetitions 1
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 4070 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ------------: | -------------------: |
| starcoder2 7B Q4_0             |   3.76 GiB |     7.17 B | Vulkan     |  99 |  1 |        tg8192 |         74.72 ± 0.00 |
| qwen2vl 7B IQ4_NL - 4.5 bpw    |   4.13 GiB |     7.62 B | Vulkan     |  99 |  1 |        tg8192 |         74.87 ± 0.00 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |        tg8192 |         77.90 ± 0.00 |

(This qwen model seems to be broken at ToT (tip of tree), even with the CUDA backend, but the speedup is probably realistic.)

@jeffbolznv jeffbolznv requested a review from 0cc4m April 13, 2025 16:22
@github-actions github-actions bot added labels: testing (Everything test related), Vulkan (Issues specific to the Vulkan backend), ggml (changes relating to the ggml tensor library for machine learning) on Apr 13, 2025