ggml-cuda: flash-attn MMA picker has no Blackwell (sm_120) entry — silently uses Ampere config #1466

@Zbig9000

Summary

ggml_cuda_fattn_mma_get_config() (in src/ggml-cuda/fattn-mma-f16.cuh)
checks ampere_mma_available(cc) first. That predicate returns true for
any cc >= GGML_CUDA_CC_AMPERE (= 800), so Blackwell GPUs
(sm_120: RTX 5090 / 5080 / mobile 5060…) silently take the Ampere
branch and get the config tuned for sm_80. This happens even though
blackwell_mma_available() already exists in
src/ggml-cuda/common.cuh:334.

Empirically (RTX 5090 + CUDA Toolkit 12.8 + chatterbox.cpp
text-to-speech, Turbo Q4_0, 232-token prompt), this is the single
largest remaining FLASH_ATTN_EXT perf gap to ggml-vulkan: ~67 ms /
utterance, or 49.6 % of the total CUDA ↔ Vulkan gap at
chatterbox-style shapes (DKQ=64, DV=64, ncols=64).

The picker today

// src/ggml-cuda/fattn-mma-f16.cuh:171-186
static __host__ fattn_mma_config ggml_cuda_fattn_mma_get_config(
        const int DKQ, const int DV, const int ncols, const int cc) {
    if (ampere_mma_available(cc)) {
        return ggml_cuda_fattn_mma_get_config_ampere(DKQ, DV, ncols);   // <-- Blackwell hits here
    }
    if (turing_mma_available(cc)) {
        return ggml_cuda_fattn_mma_get_config_turing(DKQ, DV, ncols);
    }
    if (amd_mfma_available(cc)) {
        return ggml_cuda_fattn_mma_get_config_cdna(DKQ, DV, ncols);
    }
    if (amd_wmma_available(cc)) {
        return ggml_cuda_fattn_mma_get_config_rdna(DKQ, DV, ncols);
    }
    GGML_ASSERT(volta_mma_available(cc));
    return ggml_cuda_fattn_mma_get_config_volta(DKQ, DV, ncols);
}
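A minimal host-only sketch of why the branch order matters. The predicates below are simplified stand-ins, not the real ones from common.cuh: they ignore AMD devices and the compiled-arch checks, and every constant other than the 800 quoted above is an assumption made for illustration.

```cpp
#include <string>

// Compute capability encoded as major*100 + minor*10.
// 800 (Ampere) is quoted in this issue; the other values are assumptions.
constexpr int CC_TURING    = 750;
constexpr int CC_AMPERE    = 800;
constexpr int CC_BLACKWELL = 1200;  // sm_120

// Simplified stand-ins for the *_mma_available() predicates.
static bool ampere_mma_available(int cc)    { return cc >= CC_AMPERE; }
static bool turing_mma_available(int cc)    { return cc >= CC_TURING; }
static bool blackwell_mma_available(int cc) { return cc >= CC_BLACKWELL; }

// Mirrors the NVIDIA branches of the current picker, in the same order.
static std::string pick_current(int cc) {
    if (ampere_mma_available(cc)) return "ampere";
    if (turing_mma_available(cc)) return "turing";
    return "volta";
}

// Same picker with a Blackwell branch checked before the Ampere one.
static std::string pick_proposed(int cc) {
    if (blackwell_mma_available(cc)) return "blackwell";
    return pick_current(cc);
}
```

With these stand-ins, pick_current(1200) selects the Ampere config, while pick_proposed(1200) reaches the Blackwell branch and leaves every older architecture unchanged.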

For our chatterbox shape (DKQ=64, DV=64, ncols=64) the Ampere config
returns:

GGML_CUDA_FATTN_MMA_CONFIG_CASE( 64,  64, 64, 128, 2,  64,  32,  32,  32, 2, true);
// args: DKQ, DV, ncols, nthreads, occupancy, nbatch_fa, nbatch_K2, nbatch_V2, nbatch_combine, nstages, Q_in_reg

i.e. nthreads=128, occupancy=2, nbatch_fa=64, nbatch_K2/V2/combine=32,
nstages=2, Q_in_reg=true. This was tuned for sm_80 (Ampere).
Blackwell (sm_120) SMs have the same 256 KB register file (64K 32-bit
registers) as Ampere, but their per-SM shared-memory capacity,
resident-thread limits, and cache hierarchy differ from sm_80. The
existing config is almost certainly leaving perf on the table, likely
through a poor fit in some combination of nthreads, occupancy, and
nbatch_K2/V2.
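As a rough way to reason about the nthreads / occupancy trade-off, the sketch below computes the per-thread register budget a config implies. It assumes that occupancy means the number of resident blocks per SM requested from the compiler (an assumption, not confirmed by the code quoted here), that an SM has 64K 32-bit registers, and that a thread can use at most 255 of them. The function name is hypothetical.

```cpp
// Hardware limits assumed for this sketch (hold for sm_80 and sm_120).
constexpr int REGS_PER_SM         = 64 * 1024;  // 32-bit registers per SM
constexpr int MAX_REGS_PER_THREAD = 255;        // architectural per-thread cap

// Registers each thread can use if `occupancy` blocks of `nthreads` threads
// are to be resident on one SM at the same time.
static int reg_budget_per_thread(int nthreads, int occupancy) {
    const int budget = REGS_PER_SM / (nthreads * occupancy);
    return budget < MAX_REGS_PER_THREAD ? budget : MAX_REGS_PER_THREAD;
}
```

Under these assumptions the current (nthreads=128, occupancy=2) entry is already at the 255-register cap, so raising occupancy to 4 or nthreads to 256 would halve the budget to 128 registers per thread. Whether that trade pays off is exactly what profiling would need to show.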

The TODO comments at lines 112 (// TODO tune specifically for Volta)
and 129 (// TODO tune specifically for RDNA) confirm that tuning new
arch entries is a recognised upstream task. The pattern an eventual
ggml_cuda_fattn_mma_get_config_blackwell should follow is already
established by the existing per-arch functions.

Evidence: cross-backend op-bucket profile

Captured by running the same prompt + seed through chatterbox.cpp's
CUDA and Vulkan back-ends with their respective per-op timing loggers
(GGML_VK_PERF_LOGGER=1 upstream; GGML_CUDA_PERF_LOGGER=1 added in
#1465). Aggregated across a 232-token utterance:

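| Op bucket                | CUDA µs | Vulkan µs | C/V   | gap µs   |
|--------------------------|---------|-----------|-------|----------|
| MUL_MAT_VEC q4_0         | 186 603 | 0         | inf   | +186 603 |
| FLASH_ATTN_EXT           | 143 926 | 70 270    | 2.05× | +73 656  |
| ADD                      | 95 909  | 67 120    | 1.43× | +28 789  |
| MUL_MAT_ADD q4_0 (fused) | 0       | 78 537    |       | −78 537  |
| MUL_MAT_ADD_ADD q4_0     | 0       | 69 906    |       | −69 906  |
| MUL_MAT f32              | 63 598  | 40 580    | 1.57× | +23 018  |
| Total                    | 698 657 | 541 600   | 1.29× | +157 057 |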

After #1465 lands the 3-op fusion, the MUL_MAT_VEC q4_0 /
MUL_MAT_ADD* rows collapse together; the top remaining gap becomes
FLASH_ATTN_EXT (~67 ms / utterance). Our variant sweep
(scripts/bench-fattn-variants.sh) already showed that this is not a
kernel-selection issue: TILE is 4 % slower, WMMA falls back on
Blackwell (no compiled SASS), and VEC falls back per shape. MMA_F16 is
the right kernel; the config fed to it is the problem.

Reproduction

Anyone with a Blackwell GPU + #1465 (or just a manual cherry-pick of
the perf logger commit) can confirm:

cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON
cmake --build build --target test-backend-ops -j

# pick a chatterbox-like flash-attn shape (DKQ=64, DV=64, ncols=64)
GGML_CUDA_PERF_LOGGER=1 \
  ./build/bin/test-backend-ops perf -o FLASH_ATTN_EXT -b CUDA0 \
  2>&1 | grep "FLASH_ATTN_EXT"

The reported time on Blackwell will be ~2× the equivalent Vulkan
shader on the same hardware. Maintainers with multiple Blackwell
SKUs (5090 / 5080 / mobile 5060) can A/B candidate
ggml_cuda_fattn_mma_get_config_blackwell configs without a rebuild
between variants by extending the picker to read a small environment
variable (or by editing-and-rebuilding for the specific shape they
care about).
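One way the environment-variable route could look is sketched below. The helper name and the variable name in the usage note are hypothetical, and the approach only helps if the picker's values are consumed on the host at run time. Any parameter the kernels bake in at compile time would still need a rebuild.

```cpp
#include <cstdlib>

// Returns the integer value of environment variable `name`, or `fallback`
// if the variable is unset, empty, or not a complete base-10 integer.
static int env_int_or(const char * name, int fallback) {
    const char * s = std::getenv(name);
    if (s == nullptr || *s == '\0') {
        return fallback;
    }
    char * end = nullptr;
    const long v = std::strtol(s, &end, 10);
    // Reject trailing garbage such as "12x" instead of silently truncating.
    return (*end == '\0') ? static_cast<int>(v) : fallback;
}
```

A candidate Blackwell config could then apply something like cfg.occupancy = env_int_or("GGML_CUDA_FATTN_BW_OCCUPANCY", cfg.occupancy), keeping the tuned default whenever the variable is absent or malformed.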

Why I'm not just sending a patch

Empirical tuning for a Blackwell config requires either:

  1. NVIDIA Nsight Compute hardware counters to characterise the
    compute-vs-memory bound and identify which of nthreads,
    occupancy, nbatch_fa, nbatch_K2/V2, nstages, Q_in_reg is the
    actual bottleneck. This is blocked on my host: GPU performance
    counters are restricted to admin users, and I lack the root
    access needed to set NVreg_RestrictProfilingToAdminUsers=0.
  2. A multi-day parameter sweep on multiple Blackwell SKUs. A
    single-hardware tuning result risks regressing on other Blackwells
    with a different configuration (laptop 5060 vs desktop 5090 differ
    meaningfully in SM count, L2 cache size, and memory bandwidth).

Both are best done by upstream maintainers with access to multiple
Blackwell SKUs and ncu. My contribution here is the diagnosis and the
diagnostic infrastructure (#1465 ships GGML_CUDA_PERF_LOGGER
precisely so this kind of A/B is fast and grep-friendly).

Suggested resolution

A future PR would:

  1. Add ggml_cuda_fattn_mma_get_config_blackwell() in
    fattn-mma-f16.cuh — initially a thin wrapper around
    ..._get_config_ampere() to make the gap explicit and provide the
    right edit point for incremental tuning, with case-by-case
    overrides added as data emerges.
  2. Add a blackwell_mma_available(cc) branch at the top of
    ggml_cuda_fattn_mma_get_config(...) (before the Ampere check)
    so Blackwell hits the new config function.
  3. Tune the (DKQ=64, DV=64, ncols=64) case first (chatterbox-style
    prompt-phase shape) since that shows the largest gap; expand to
    other shapes as Blackwell perf data accrues.
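Steps 1 and 2 could start from something like the following host-only sketch. The struct is a hypothetical mirror of the fields named in this issue (the real fattn_mma_config may differ), the function names are shortened for illustration, and the Ampere stand-in only models the one case quoted above.

```cpp
// Hypothetical mirror of the config fields discussed in this issue.
struct fattn_mma_config {
    int  nthreads;
    int  occupancy;
    int  nbatch_fa;
    int  nbatch_K2;
    int  nbatch_V2;
    int  nbatch_combine;
    int  nstages;
    bool Q_in_reg;
};

// Stand-in for the existing Ampere table; only the chatterbox case is modelled.
static fattn_mma_config get_config_ampere(int DKQ, int DV, int ncols) {
    if (DKQ == 64 && DV == 64 && ncols == 64) {
        return {128, 2, 64, 32, 32, 32, 2, true};
    }
    return {};  // other shapes are not modelled in this sketch
}

// Proposed scaffolding: start from the Ampere values, then override per shape.
static fattn_mma_config get_config_blackwell(int DKQ, int DV, int ncols) {
    fattn_mma_config cfg = get_config_ampere(DKQ, DV, ncols);
    // Per-shape overrides would be added here once measured on sm_120.
    // None are set yet, so the result is identical to the Ampere config.
    return cfg;
}
```

Because the wrapper returns the Ampere values unchanged, landing it would be behaviour-neutral; each later tuning patch would then touch only the override section for one shape.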

Happy to send steps 1 and 2 as a no-op scaffolding PR if maintainers
want the structure landed early so per-shape tuning can come in as
small, reviewable patches later.

Otherwise, dropping this here as a finding from QVAC-17873 /
chatterbox.cpp work, adjacent to #1465.


Hardware tested: Linux 6.8, Ryzen 9 9950X3D, RTX 5090 32 GB,
NVIDIA driver 590.48.01, CUDA Toolkit 12.8, sm_120 native SASS.
Workload: chatterbox.cpp text-to-speech, Turbo Q4_0 GGUF,
232-token prompt, autoregressive decode + S3Gen + HiFT vocoder.
Observation date: 2026-04-27.
