
vulkan: Use unclamped loads for flash attention mask #12720

Merged (1 commit, Apr 6, 2025)

Conversation

jeffbolznv (Collaborator)
This is stacked on #12627.

nem1 must be a multiple of GGML_KQ_MASK_PAD, and GGML_KQ_MASK_PAD is a multiple of the number of rows in the matrix. The KV dim is a multiple of the number of columns for the aligned shader. Together these guarantee that the mask loads never run past the padded tensor, so the bounds-clamped loads can be replaced with unclamped ones.
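As a rough illustration of the invariants being relied on (not the actual shader-dispatch code), the divisibility chain can be expressed as host-side checks. `Br` and `Bc` are hypothetical names for the flash-attention tile's row and column dimensions:

```cpp
// Sketch only: the alignment invariants that make unclamped mask loads safe,
// assuming hypothetical tile dimensions Br (rows) and Bc (columns).
#include <cassert>
#include <cstdint>

void check_mask_bounds(int64_t nem1, int64_t kv_dim, int64_t Br, int64_t Bc,
                       int64_t kq_mask_pad /* GGML_KQ_MASK_PAD */) {
    // The mask's row count (nem1) is padded up to GGML_KQ_MASK_PAD...
    assert(nem1 % kq_mask_pad == 0);
    // ...and GGML_KQ_MASK_PAD is a multiple of the tile's row count,
    // so every row a tile touches exists in the (padded) mask.
    assert(kq_mask_pad % Br == 0);
    // For the aligned shader, the KV dimension is a multiple of the tile's
    // column count, so column accesses also stay in bounds.
    assert(kv_dim % Bc == 0);
    // Under these invariants, per-element clamping of the mask loads is
    // redundant and can be dropped for a small speedup.
}
```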

before:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -m C:\models\meta-llama-3-8b-instruct.Q4_K_M.gguf -fa 1 -p 16384 -n 0 --repetitions 1
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 4070 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     |  99 |  1 |       pp16384 |       2197.17 ± 0.00 |

after:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -m C:\models\meta-llama-3-8b-instruct.Q4_K_M.gguf -fa 1 -p 16384 -n 0 --repetitions 1
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 4070 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     |  99 |  1 |       pp16384 |       2338.80 ± 0.00 |

@jeffbolznv jeffbolznv requested a review from 0cc4m April 2, 2025 15:16
@github-actions github-actions bot added the testing, Vulkan, and ggml labels Apr 2, 2025
@0cc4m 0cc4m merged commit 80b717d into ggml-org:master Apr 6, 2025
48 checks passed