ggml-cuda: flash-attn MMA picker has no Blackwell (sm_120) entry — silently uses Ampere config

# `ggml-cuda` flash-attn MMA picker has no Blackwell (sm_120) entry — silently uses Ampere config

## Summary

`ggml_cuda_fattn_mma_get_config()` (in
[`src/ggml-cuda/fattn-mma-f16.cuh`](https://github.com/ggml-org/ggml/blob/master/src/ggml-cuda/fattn-mma-f16.cuh#L171-L186))
checks `ampere_mma_available(cc)` first. That predicate returns `true`
for *any* `cc >= GGML_CUDA_CC_AMPERE` (= 800), so Blackwell GPUs
(sm_120: RTX 5090 / 5080 / mobile 5060, …) take the first branch and
silently get the **Ampere config tuned for sm_80**.  This happens even
though `blackwell_mma_available()` already exists in
[`src/ggml-cuda/common.cuh:334`](https://github.com/ggml-org/ggml/blob/master/src/ggml-cuda/common.cuh#L334).

Empirically (RTX 5090 + CUDA Toolkit 12.8 + chatterbox.cpp
text-to-speech, Turbo Q4_0, 232-token prompt), this leaves
`FLASH_ATTN_EXT` as the single largest remaining perf gap to
ggml-vulkan: **~67 ms / utterance, or 49.6 % of the total CUDA ↔ Vulkan
gap** at chatterbox-style shapes (`DKQ=64, DV=64, ncols=64`).

## The picker today

```cpp
// src/ggml-cuda/fattn-mma-f16.cuh:171-186
static __host__ fattn_mma_config ggml_cuda_fattn_mma_get_config(
        const int DKQ, const int DV, const int ncols, const int cc) {
    if (ampere_mma_available(cc)) {
        return ggml_cuda_fattn_mma_get_config_ampere(DKQ, DV, ncols);   // <-- Blackwell hits here
    }
    if (turing_mma_available(cc)) {
        return ggml_cuda_fattn_mma_get_config_turing(DKQ, DV, ncols);
    }
    if (amd_mfma_available(cc)) {
        return ggml_cuda_fattn_mma_get_config_cdna(DKQ, DV, ncols);
    }
    if (amd_wmma_available(cc)) {
        return ggml_cuda_fattn_mma_get_config_rdna(DKQ, DV, ncols);
    }
    GGML_ASSERT(volta_mma_available(cc));
    return ggml_cuda_fattn_mma_get_config_volta(DKQ, DV, ncols);
}
```
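
To make the branch order concrete, here is a minimal, self-contained sketch of the NVIDIA part of the picker. The `*_sketch` predicates and `CC_*_SKETCH` constants are simplified stand-ins assumed for illustration (plain threshold tests, with `cc = 100 * major + 10 * minor`); the real predicates in `common.cuh` may check more than this, e.g. vendor and compiled architectures.

```cpp
// Simplified stand-ins, NOT the real ggml definitions.
// Assumes ggml's encoding cc = 100 * major + 10 * minor (sm_120 -> 1200).
constexpr int CC_TURING_SKETCH = 750;
constexpr int CC_AMPERE_SKETCH = 800;

enum class picked_arch { volta, turing, ampere };

constexpr bool ampere_mma_available_sketch(int cc) { return cc >= CC_AMPERE_SKETCH; }
constexpr bool turing_mma_available_sketch(int cc) { return cc >= CC_TURING_SKETCH; }

// Same branch order as the picker quoted above (AMD branches omitted).
constexpr picked_arch pick_arch_today(int cc) {
    if (ampere_mma_available_sketch(cc)) { return picked_arch::ampere; }
    if (turing_mma_available_sketch(cc)) { return picked_arch::turing; }
    return picked_arch::volta;
}

// sm_120 satisfies the Ampere threshold, so no later branch is ever reached.
static_assert(pick_arch_today(1200) == picked_arch::ampere,
              "Blackwell takes the Ampere branch");
```

Because the Ampere test is a lower bound only, every newer NVIDIA architecture lands in the same branch unless a more specific check is placed before it.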

For our chatterbox shape (`DKQ=64, DV=64, ncols=64`) the Ampere config
returns:

```cpp
// Argument order: DKQ, DV, ncols, nthreads, occupancy, nbatch_fa,
//                 nbatch_K2, nbatch_V2, nbatch_combine, nstages, Q_in_reg
GGML_CUDA_FATTN_MMA_CONFIG_CASE( 64,  64, 64, 128, 2,  64,  32,  32,  32, 2, true);
```

i.e. `nthreads=128, occupancy=2, nbatch_fa=64, nbatch_K2/V2/combine=32,
nstages=2, Q_in_reg=true`.  This was tuned for sm_80 (Ampere).
Blackwell (sm_120) SMs have the same 64 K × 32-bit (256 KB) register
file as sm_80 but, as far as I can tell from the CUDA
compute-capability tables, different shared-memory capacity and
resident-warp limits, plus newer tensor cores.  The existing config is
therefore likely leaving perf on the table in some combination of
`nthreads`, `occupancy`, and `nbatch_K2` / `nbatch_V2`.
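
As a rough illustration of how `nthreads` and `occupancy` interact with the register file, the sketch below computes the per-thread register budget implied by a `(nthreads, occupancy)` pair. It assumes an SM with 64 K 32-bit registers (256 KB), a 255-register per-thread cap, and that the kernel's launch bounds are driven by these two parameters; it ignores register allocation granularity, so treat the results as upper bounds. All names are hypothetical.

```cpp
#include <algorithm>

// Assumed hardware limits (not read from the device).
constexpr int kRegsPerSM        = 64 * 1024; // 32-bit registers per SM
constexpr int kMaxRegsPerThread = 255;       // per-thread architectural cap

// Upper bound on registers per thread if `occupancy` blocks of
// `nthreads` threads must be resident on one SM at the same time.
constexpr int reg_budget_per_thread(int nthreads, int occupancy) {
    return std::min(kMaxRegsPerThread, kRegsPerSM / (nthreads * occupancy));
}
```

Under these assumptions, `reg_budget_per_thread(128, 2)` is capped at 255, so the current `(64, 64, 64)` config does not constrain per-thread registers at all, whereas `reg_budget_per_thread(128, 4)` would halve the budget to 128.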

The TODO comments at lines [112](https://github.com/ggml-org/ggml/blob/master/src/ggml-cuda/fattn-mma-f16.cuh#L112)
(`// TODO tune specifically for Volta`) and
[129](https://github.com/ggml-org/ggml/blob/master/src/ggml-cuda/fattn-mma-f16.cuh#L129)
(`// TODO tune specifically for RDNA`) show that per-arch tuning is a
recognised upstream task.  An eventual
`ggml_cuda_fattn_mma_get_config_blackwell` can follow the pattern
already established by the existing per-arch functions (`_ampere`,
`_turing`, `_cdna`, `_rdna`, `_volta`).

## Evidence: cross-backend op-bucket profile

Captured by running the same prompt + seed through
chatterbox.cpp's CUDA and Vulkan back-ends with their respective
per-op timing loggers (`GGML_VK_PERF_LOGGER=1` upstream;
`GGML_CUDA_PERF_LOGGER=1` shipped in
[#1465](https://github.com/ggml-org/ggml/pull/1465)).  Aggregated
across a 232-token utterance:

| Op bucket                  | CUDA µs   | Vulkan µs |  C/V   |  gap µs   |
|----------------------------|----------:|----------:|-------:|----------:|
| `MUL_MAT_VEC q4_0`         | 186 603   |       0   | inf    |  +186 603 |
| **`FLASH_ATTN_EXT`**       | **143 926**| **70 270** | **2.05×** | **+73 656** |
| `ADD`                      |  95 909   |  67 120   | 1.43×  |   +28 789 |
| `MUL_MAT_ADD q4_0` (fused) |       0   |  78 537   | —      |   −78 537 |
| `MUL_MAT_ADD_ADD q4_0`     |       0   |  69 906   | —      |   −69 906 |
| `MUL_MAT f32`              |  63 598   |  40 580   | 1.57×  |   +23 018 |
| **Total**                  | 698 657   | 541 600   | 1.29×  |  +157 057 |
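
For anyone re-deriving the `gap µs` column or a bucket's share of the total gap from their own logs, the arithmetic is sketched below using the two relevant rows of the table. The struct and function names are hypothetical helpers, not part of ggml.

```cpp
// One row of the op-bucket table (times in microseconds).
struct op_bucket {
    const char * name;
    long         cuda_us;
    long         vulkan_us;
};

// "gap µs" column: how much longer CUDA spends than Vulkan.
constexpr long gap_us(const op_bucket & b) {
    return b.cuda_us - b.vulkan_us;
}

// Fraction of the total CUDA-vs-Vulkan gap attributable to one bucket.
constexpr double share_of_total_gap(const op_bucket & b, const op_bucket & total) {
    return static_cast<double>(gap_us(b)) / static_cast<double>(gap_us(total));
}

constexpr op_bucket kFlashAttn = {"FLASH_ATTN_EXT", 143926, 70270};
constexpr op_bucket kTotal     = {"Total",          698657, 541600};
```

For these rows, `gap_us(kFlashAttn)` is 73 656 µs and `share_of_total_gap(kFlashAttn, kTotal)` is roughly 0.469 (the ~67 ms figure quoted elsewhere in this issue is not derived from this table).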

After [#1465](https://github.com/ggml-org/ggml/pull/1465) lands the
3-op fusion, the `MUL_MAT_VEC q4_0` / `MUL_MAT_ADD*` rows collapse
together and the top remaining gap becomes `FLASH_ATTN_EXT` (~67 ms /
utterance).  Our variant sweep
([`scripts/bench-fattn-variants.sh`](https://github.com/Zbig9000/ggml/blob/pr3-cuda-perf-logger/...))
already showed that this is **not a kernel-selection issue**: TILE is
4 % slower, WMMA falls back on Blackwell (no compiled SASS), and VEC
falls back per-shape.  MMA_F16 is the right kernel; the **config fed to
it** is the problem.

## Reproduction

Anyone with a Blackwell GPU + #1465 (or just a manual cherry-pick of
the perf logger commit) can confirm:

```bash
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON
cmake --build build --target test-backend-ops -j

# runs every FLASH_ATTN_EXT perf case; look for the shape closest to chatterbox (DKQ=64, DV=64, ncols=64)
GGML_CUDA_PERF_LOGGER=1 \
  ./build/bin/test-backend-ops perf -o FLASH_ATTN_EXT -b CUDA0 \
  2>&1 | grep "FLASH_ATTN_EXT"
```

The reported time on Blackwell should be ~2× that of the equivalent
Vulkan shader on the same hardware.  Maintainers with multiple
Blackwell SKUs (5090 / 5080 / mobile 5060) can A/B candidate
`ggml_cuda_fattn_mma_get_config_blackwell` configs either by extending
the picker to read a small environment variable (which avoids a rebuild
between variants for parameters that are read at launch time) or by
editing and rebuilding for the specific shape they care about.
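
A minimal sketch of the environment-variable route follows. The variable names (`FATTN_SKETCH_*`), the `fattn_override_sketch` struct, and both helpers are hypothetical and do not exist in ggml. It also assumes the overridden fields are consumed on the host at launch time; anything that feeds template parameters or launch bounds would still need a rebuild.

```cpp
#include <cstdlib>

// Returns the value of environment variable `name` as a positive int,
// or `fallback` if it is unset, empty, or not a clean positive integer.
inline int env_int_or(const char * name, int fallback) {
    const char * s = std::getenv(name);
    if (s == nullptr || *s == '\0') {
        return fallback;
    }
    char * end = nullptr;
    const long v = std::strtol(s, &end, 10);
    if (*end != '\0' || v <= 0 || v > (1L << 20)) {
        return fallback;
    }
    return static_cast<int>(v);
}

// Hypothetical subset of the config that is worth sweeping at run time.
struct fattn_override_sketch {
    int nbatch_fa;
    int nbatch_K2;
    int nbatch_V2;
};

// Starts from the compiled-in defaults and applies any overrides found
// in the environment; unset variables leave the default untouched.
inline fattn_override_sketch apply_env_overrides(fattn_override_sketch base) {
    base.nbatch_fa = env_int_or("FATTN_SKETCH_NBATCH_FA", base.nbatch_fa);
    base.nbatch_K2 = env_int_or("FATTN_SKETCH_NBATCH_K2", base.nbatch_K2);
    base.nbatch_V2 = env_int_or("FATTN_SKETCH_NBATCH_V2", base.nbatch_V2);
    return base;
}
```

Invalid values fall back to the compiled-in default rather than aborting, so a typo in a sweep script degrades to the baseline measurement instead of a crash.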

## Why I'm not just sending a patch

Empirical tuning for a Blackwell config requires either:

1. **NVIDIA Nsight Compute hardware counters** to characterise whether
   the kernel is compute- or memory-bound and to identify which of
   `nthreads`, `occupancy`, `nbatch_fa`, `nbatch_K2` / `nbatch_V2`,
   `nstages`, `Q_in_reg` is the actual bottleneck.  This is blocked on
   my host: I lack the root access needed to set the driver option
   `NVreg_RestrictProfilingToAdminUsers=0`.
2. **A multi-day parameter sweep on multiple Blackwell SKUs**: a
   single-hardware tuning result risks regressing on other Blackwells
   with a different configuration (laptop 5060 vs desktop 5090 differ
   meaningfully in SM count, memory bandwidth, and L2 cache size).

Both are best done by upstream maintainers with access to multiple
Blackwell SKUs and `ncu`.  My contribution here is the diagnosis
and the diagnostic infrastructure ([#1465](https://github.com/ggml-org/ggml/pull/1465)
ships `GGML_CUDA_PERF_LOGGER` precisely so this kind of A/B is fast
and grep-friendly).

## Suggested resolution

A future PR would:

1. Add `ggml_cuda_fattn_mma_get_config_blackwell()` in
   `fattn-mma-f16.cuh` — initially a thin wrapper around
   `..._get_config_ampere()` to make the gap explicit and provide the
   right edit point for incremental tuning, with case-by-case
   overrides added as data emerges.
2. Add a `blackwell_mma_available(cc)` branch at the **top** of
   `ggml_cuda_fattn_mma_get_config(...)` (before the Ampere check)
   so Blackwell hits the new config function.
3. Tune the `(DKQ=64, DV=64, ncols=64)` case first (chatterbox-style
   prompt-phase shape) since that shows the largest gap; expand to
   other shapes as Blackwell perf data accrues.

Happy to send steps 1 and 2 as a no-op scaffolding PR if maintainers
want the structure landed early so per-shape tuning can come in as
small, reviewable patches later.
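
For reference, the scaffolding in steps 1 and 2 could look roughly like the sketch below. Everything with a `_sketch` suffix is a simplified stand-in assumed for illustration: the predicates are plain threshold tests, only the `(64, 64, 64)` Ampere case quoted above is modelled, and older architectures are omitted.

```cpp
enum class config_source { ampere, blackwell };

struct fattn_mma_config_sketch {
    int  nthreads, occupancy, nbatch_fa, nbatch_K2, nbatch_V2, nbatch_combine, nstages;
    bool Q_in_reg;
    config_source source; // which per-arch function produced this config
};

// Stand-in predicates, assuming cc = 100 * major + 10 * minor.
constexpr bool ampere_mma_available_sketch(int cc)    { return cc >= 800;  }
constexpr bool blackwell_mma_available_sketch(int cc) { return cc >= 1200; }

// Only the (64, 64, 64) case from this issue is modelled; shape args are ignored.
constexpr fattn_mma_config_sketch get_config_ampere_sketch(int /*DKQ*/, int /*DV*/, int /*ncols*/) {
    return {128, 2, 64, 32, 32, 32, 2, true, config_source::ampere};
}

// Step 1: thin wrapper around the Ampere config, i.e. no behaviour change yet.
constexpr fattn_mma_config_sketch get_config_blackwell_sketch(int DKQ, int DV, int ncols) {
    fattn_mma_config_sketch cfg = get_config_ampere_sketch(DKQ, DV, ncols);
    cfg.source = config_source::blackwell;
    // Per-shape overrides would be added here once measured.
    return cfg;
}

// Step 2: test for Blackwell BEFORE Ampere, since the Ampere check also matches sm_120.
constexpr fattn_mma_config_sketch get_config_sketch(int DKQ, int DV, int ncols, int cc) {
    if (blackwell_mma_available_sketch(cc)) {
        return get_config_blackwell_sketch(DKQ, DV, ncols);
    }
    return get_config_ampere_sketch(DKQ, DV, ncols);
}
```

With this ordering sm_120 reaches the Blackwell function while sm_80 keeps its existing path, and both still return identical parameter values until an override is added.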

Otherwise, dropping this here as a finding from QVAC-17873 /
chatterbox.cpp work, adjacent to #1465.

---

**Hardware tested**: Linux 6.8, Ryzen 9 9950X3D, RTX 5090 32 GB,
NVIDIA driver 590.48.01, CUDA Toolkit 12.8, sm_120 native SASS.
**Workload**: `chatterbox.cpp` text-to-speech, Turbo Q4_0 GGUF,
232-token prompt, autoregressive decode + S3Gen + HiFT vocoder.
**Observation date**: 2026-04-27.
**Issue**: #1466.

