ggml-cuda flash-attn MMA picker has no Blackwell (sm_120) entry — silently uses Ampere config
Summary
ggml_cuda_fattn_mma_get_config() (in src/ggml-cuda/fattn-mma-f16.cuh)
checks ampere_mma_available(cc) first. That predicate returns true for
any cc >= GGML_CUDA_CC_AMPERE (= 800), so Blackwell GPUs
(sm_120: RTX 5090 / 5080 / mobile 5060…) are silently matched by the
Ampere branch and receive the config tuned for sm_80. This happens even
though blackwell_mma_available() already exists in
src/ggml-cuda/common.cuh:334.
Empirically (RTX 5090 + CUDA Toolkit 12.8 + chatterbox.cpp
text-to-speech, Turbo Q4_0, 232-token prompt), this leaves
FLASH_ATTN_EXT as the single largest remaining perf gap to ggml-vulkan:
~67 ms / utterance, or 49.6 % of the total CUDA ↔ Vulkan gap at
chatterbox-style shapes (DKQ=64, DV=64, ncols=64).
The picker today
// src/ggml-cuda/fattn-mma-f16.cuh:171-186
static __host__ fattn_mma_config ggml_cuda_fattn_mma_get_config(
        const int DKQ, const int DV, const int ncols, const int cc) {
    if (ampere_mma_available(cc)) {
        return ggml_cuda_fattn_mma_get_config_ampere(DKQ, DV, ncols); // <-- Blackwell hits here
    }
    if (turing_mma_available(cc)) {
        return ggml_cuda_fattn_mma_get_config_turing(DKQ, DV, ncols);
    }
    if (amd_mfma_available(cc)) {
        return ggml_cuda_fattn_mma_get_config_cdna(DKQ, DV, ncols);
    }
    if (amd_wmma_available(cc)) {
        return ggml_cuda_fattn_mma_get_config_rdna(DKQ, DV, ncols);
    }
    GGML_ASSERT(volta_mma_available(cc));
    return ggml_cuda_fattn_mma_get_config_volta(DKQ, DV, ncols);
}
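The ordering problem can be reproduced in a few lines of host-only C++. This is a minimal sketch, not the ggml code: the predicates below are simplified stand-ins that only model the compute-capability thresholds (the real ones in common.cuh may also check vendor and compiled architectures), and the values 750 and 1200 for Turing and Blackwell are assumptions chosen to match sm_75 and sm_120.

```cpp
#include <cassert>
#include <string>

// Simplified stand-ins for the availability predicates (assumption: only the
// cc thresholds are modeled; 800 is the value quoted in this issue).
constexpr int CC_TURING    = 750;   // assumed
constexpr int CC_AMPERE    = 800;
constexpr int CC_BLACKWELL = 1200;  // assumed, matches sm_120

constexpr bool turing_available(int cc)    { return cc >= CC_TURING; }
constexpr bool ampere_available(int cc)    { return cc >= CC_AMPERE; }
constexpr bool blackwell_available(int cc) { return cc >= CC_BLACKWELL; }

// Mirrors the order of checks in the picker today: Ampere is tested first,
// and its predicate is open-ended, so every newer NVIDIA arch matches it.
inline std::string pick_today(int cc) {
    if (ampere_available(cc)) return "ampere";
    if (turing_available(cc)) return "turing";
    return "volta";
}

// Usage: a Blackwell cc satisfies the Blackwell predicate, yet the picker
// never asks that question and answers with the Ampere table.
inline void demo_dispatch_order() {
    assert(blackwell_available(1200));
    assert(pick_today(1200) == "ampere");
    assert(pick_today(750)  == "turing");
}
```

Because the Ampere predicate is a lower bound with no upper bound, any architecture-specific branch for a newer GPU has to be tested before it.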
For our chatterbox shape (DKQ=64, DV=64, ncols=64) the Ampere config
returns:
GGML_CUDA_FATTN_MMA_CONFIG_CASE( 64, 64, 64, 128, 2, 64, 32, 32, 32, 2, true);
// arguments, in order:
//   DKQ=64, DV=64, ncols=64,
//   nthreads=128, occupancy=2, nbatch_fa=64,
//   nbatch_K2=32, nbatch_V2=32, nbatch_combine=32,
//   nstages=2, Q_in_reg=true
i.e. nthreads=128, occupancy=2, nbatch_fa=64, nbatch_K2/V2/combine=32, nstages=2, Q_in_reg=true. This was tuned for sm_80 (Ampere).
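To make the nthreads and occupancy values more concrete, the sketch below derives the launch shape they imply. It assumes that occupancy is the target number of co-resident blocks per SM, and it uses per-SM limits taken from NVIDIA's published compute-capability tables rather than from this issue (64 Ki 32-bit registers per SM, at most 255 registers per thread). Register allocation granularity is ignored, so the result is an upper bound only.

```cpp
#include <cassert>

// Assumed hardware limits (not taken from the issue text).
constexpr int REGS_PER_SM         = 64 * 1024;  // 32-bit registers per SM
constexpr int MAX_REGS_PER_THREAD = 255;
constexpr int WARP_SIZE           = 32;

struct launch_shape {
    int warps_per_block;
    int resident_threads;  // threads on one SM when `occupancy` blocks are co-resident
    int regs_per_thread;   // upper bound on registers available to each thread
};

// Derives the launch shape implied by (nthreads, occupancy).
constexpr launch_shape derive(int nthreads, int occupancy) {
    return {
        nthreads / WARP_SIZE,
        nthreads * occupancy,
        REGS_PER_SM / (nthreads * occupancy) < MAX_REGS_PER_THREAD
            ? REGS_PER_SM / (nthreads * occupancy)
            : MAX_REGS_PER_THREAD,
    };
}

// Usage: the Ampere entry for (DKQ=64, DV=64, ncols=64).
inline void demo_launch_shape() {
    constexpr launch_shape s = derive(128, 2);
    assert(s.warps_per_block  == 4);
    assert(s.resident_threads == 256);
    assert(s.regs_per_thread  == 255);  // 65536 / 256 = 256, capped at 255
}
```

Under these assumptions the current entry keeps only 8 warps resident per SM, so raising nthreads or occupancy trades registers per thread for more resident warps. Which side of that trade wins on sm_120 is exactly what a tuning pass would need to measure.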
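Blackwell (sm_120) SMs are not simply larger Ampere SMs: the register
file is the same 64 K 32-bit registers (256 KB) per SM, but the
shared-memory budget, tensor-core generation, and cache hierarchy all
differ from sm_80. The existing config is therefore likely leaving perf
on the table through some combination of nthreads, occupancy, and
nbatch_K2/nbatch_V2 that does not suit the new hardware.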
The TODO comments at lines 112 (// TODO tune specifically for Volta)
and 129 (// TODO tune specifically for RDNA) confirm that tuning
per-architecture entries is a recognised upstream task. The pattern an
eventual ggml_cuda_fattn_mma_get_config_blackwell should follow is
already established by the existing per-arch functions.
Evidence: cross-backend op-bucket profile
Captured by running the same prompt + seed through chatterbox.cpp's
CUDA and Vulkan back-ends with their respective per-op timing loggers
(GGML_VK_PERF_LOGGER=1 upstream; GGML_CUDA_PERF_LOGGER=1 shipped in
#1465). Aggregated across a 232-token utterance:
| Op bucket | CUDA µs | Vulkan µs | C/V | gap µs |
|---|---:|---:|---:|---:|
| MUL_MAT_VEC q4_0 | 186 603 | 0 | inf | +186 603 |
| FLASH_ATTN_EXT | 143 926 | 70 270 | 2.05× | +73 656 |
| ADD | 95 909 | 67 120 | 1.43× | +28 789 |
| MUL_MAT_ADD q4_0 (fused) | 0 | 78 537 | n/a | −78 537 |
| MUL_MAT_ADD_ADD q4_0 | 0 | 69 906 | n/a | −69 906 |
| MUL_MAT f32 | 63 598 | 40 580 | 1.57× | +23 018 |
| Total | 698 657 | 541 600 | 1.29× | +157 057 |
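For anyone re-deriving the C/V and gap columns from their own logger output, the arithmetic is sketched below. The struct and function names are hypothetical helpers for illustration and are not part of either perf logger.

```cpp
#include <cassert>
#include <cmath>

// One aggregated row of per-op timings, in microseconds.
struct op_bucket {
    const char * name;
    long cuda_us;
    long vulkan_us;
};

// Absolute gap: positive means CUDA is slower than Vulkan for this bucket.
inline long gap_us(const op_bucket & b) {
    return b.cuda_us - b.vulkan_us;
}

// CUDA/Vulkan time ratio; reported as infinity when Vulkan spent no time
// in the bucket (for example because the op was fused into another one).
inline double ratio(const op_bucket & b) {
    if (b.vulkan_us == 0) {
        return INFINITY;
    }
    return static_cast<double>(b.cuda_us) / static_cast<double>(b.vulkan_us);
}

// Usage with the FLASH_ATTN_EXT row from the profile.
inline void demo_bucket() {
    const op_bucket fa = {"FLASH_ATTN_EXT", 143926, 70270};
    assert(gap_us(fa) == 73656);
    assert(std::fabs(ratio(fa) - 2.05) < 0.01);
}
```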
After #1465 lands the 3-op fusion, the MUL_MAT_VEC q4_0 / MUL_MAT_ADD*
rows collapse together; the top remaining gap becomes FLASH_ATTN_EXT
(~67 ms / utterance). Our variant sweep
(scripts/bench-fattn-variants.sh) already showed that this is not a
kernel-selection issue: TILE is 4 % slower, WMMA falls back on
Blackwell (no compiled SASS), and VEC falls back per shape. MMA_F16 is
the right kernel; the config fed to it is the problem.
Reproduction
Anyone with a Blackwell GPU + #1465 (or just a manual cherry-pick of
the perf logger commit) can confirm:
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON
cmake --build build --target test-backend-ops -j
# pick a chatterbox-like flash-attn shape (DKQ=64, DV=64, ncols=64)
GGML_CUDA_PERF_LOGGER=1 \
./build/bin/test-backend-ops perf -o FLASH_ATTN_EXT -b CUDA0 \
2>&1 | grep "FLASH_ATTN_EXT"
The reported time on Blackwell will be ~2× the equivalent Vulkan
shader on the same hardware. Maintainers with multiple Blackwell
SKUs (5090 / 5080 / mobile 5060) can A/B candidate
ggml_cuda_fattn_mma_get_config_blackwell configs without a rebuild
between variants by extending the picker to read a small environment
variable (or by editing-and-rebuilding for the specific shape they
care about).
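One possible shape for such an environment-variable hook is sketched below. Everything in it is an assumption made for illustration: the variable name GGML_CUDA_FATTN_MMA_OVERRIDE does not exist in ggml, the struct is a hypothetical subset of the tunables, and the comma-separated format is arbitrary. If the kernels consume these values at compile time, a runtime override could only select among variants that were already instantiated in the build.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>

// Hypothetical subset of the tunables; not the real fattn_mma_config.
struct fattn_override {
    int nthreads;
    int occupancy;
    int nbatch_fa;
    int nstages;
};

// Parses "nthreads,occupancy,nbatch_fa,nstages".
// Returns false and leaves `out` untouched on malformed or implausible input.
inline bool parse_override(const char * s, fattn_override & out) {
    if (s == nullptr) {
        return false;
    }
    fattn_override tmp{};
    char trailing = 0;
    // The final %c only matches if there is junk after the fourth number.
    const int n = std::sscanf(s, "%d,%d,%d,%d%c",
        &tmp.nthreads, &tmp.occupancy, &tmp.nbatch_fa, &tmp.nstages, &trailing);
    if (n != 4) {
        return false;
    }
    // Basic sanity checks: whole warps only, all values positive.
    if (tmp.nthreads <= 0 || tmp.nthreads % 32 != 0 ||
        tmp.occupancy <= 0 || tmp.nbatch_fa <= 0 || tmp.nstages <= 0) {
        return false;
    }
    out = tmp;
    return true;
}

// Reads the override from a hypothetical environment variable.
inline bool read_override_from_env(fattn_override & out) {
    return parse_override(std::getenv("GGML_CUDA_FATTN_MMA_OVERRIDE"), out);
}
```

A picker could call read_override_from_env() once, cache the result, and fall back to the compiled-in table whenever it returns false, which keeps the default path unchanged for everyone who does not set the variable.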
Why I'm not just sending a patch
Empirical tuning for a Blackwell config requires either:
- NVIDIA Nsight Compute hardware counters to characterise whether the
kernel is compute-bound or memory-bound and to identify which of
nthreads, occupancy, nbatch_fa, nbatch_K2/nbatch_V2, nstages, Q_in_reg
is the actual bottleneck. This is blocked on my host without root-level
NVreg_RestrictProfilingToAdminUsers=0.
- A multi-day parameter sweep on multiple Blackwell SKUs: a
single-hardware tuning result risks regressing on other Blackwells
with a different configuration (laptop 5060 vs desktop 5090 differ
meaningfully in SM count, memory bandwidth, and L2 cache size).
Both are best done by upstream maintainers with access to multiple
Blackwell SKUs and ncu. My contribution here is the diagnosis and the
diagnostic infrastructure (#1465 ships GGML_CUDA_PERF_LOGGER precisely
so this kind of A/B is fast and grep-friendly).
Suggested resolution
A future PR would:
1. Add ggml_cuda_fattn_mma_get_config_blackwell() in
fattn-mma-f16.cuh, initially as a thin wrapper around
..._get_config_ampere() to make the gap explicit and provide the
right edit point for incremental tuning, with case-by-case
overrides added as data emerges.
2. Add a blackwell_mma_available(cc) branch at the top of
ggml_cuda_fattn_mma_get_config(...) (before the Ampere check)
so Blackwell hits the new config function.
3. Tune the (DKQ=64, DV=64, ncols=64) case first (chatterbox-style
prompt-phase shape) since that shows the largest gap; expand to
other shapes as Blackwell perf data accrues.
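The first two steps could look roughly like the host-only sketch below. It is a simplified model rather than a patch: fattn_cfg is a hypothetical trimmed-down struct, the Ampere stand-in only knows the single (64, 64, 64) entry quoted in this issue, the cc thresholds replace the real availability predicates, and the from_blackwell_table field exists only so the example can show which function answered.

```cpp
#include <cassert>

// Hypothetical, trimmed-down config; the real fattn_mma_config has more fields.
struct fattn_cfg {
    int  nthreads;
    int  occupancy;
    int  nbatch_fa;
    int  nstages;
    bool from_blackwell_table;  // illustration only
};

constexpr int CC_AMPERE    = 800;
constexpr int CC_BLACKWELL = 1200;  // assumed, matches sm_120

// Stand-in for ..._get_config_ampere(); only the quoted case is filled in.
inline fattn_cfg get_config_ampere(int DKQ, int DV, int ncols) {
    assert(DKQ == 64 && DV == 64 && ncols == 64);
    return {128, 2, 64, 2, false};
}

// Step 1: thin wrapper. Per-shape Blackwell overrides would be added here,
// ahead of the Ampere fallback, as tuning data becomes available.
inline fattn_cfg get_config_blackwell(int DKQ, int DV, int ncols) {
    fattn_cfg cfg = get_config_ampere(DKQ, DV, ncols);
    cfg.from_blackwell_table = true;
    return cfg;
}

// Step 2: test the newer architecture before the older, open-ended predicate.
inline fattn_cfg get_config(int DKQ, int DV, int ncols, int cc) {
    if (cc >= CC_BLACKWELL) {
        return get_config_blackwell(DKQ, DV, ncols);
    }
    assert(cc >= CC_AMPERE);  // older architectures are omitted from this sketch
    return get_config_ampere(DKQ, DV, ncols);
}
```

With this structure the numbers returned for sm_120 are unchanged until an override is added, so the scaffolding is behaviour-neutral while still giving later tuning patches a single, obvious place to land.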
Happy to send steps 1 and 2 as a no-op scaffolding PR if maintainers
want the structure landed early so per-shape tuning can come in as
small, reviewable patches later.
Otherwise, dropping this here as a finding from QVAC-17873 /
chatterbox.cpp work, adjacent to #1465.
Hardware tested: Linux 6.8, Ryzen 9 9950X3D, RTX 5090 32 GB,
NVIDIA driver 590.48.01, CUDA Toolkit 12.8, sm_120 native SASS.
Workload: chatterbox.cpp text-to-speech, Turbo Q4_0 GGUF,
232-token prompt, autoregressive decode + S3Gen + HiFT vocoder.
Observation date: 2026-04-27.