
Conversation

@DrJesseGlass
Contributor

This work centers on implementing flash attention in the Qwen3 transformer.

Implementing flash attention for GPU was straightforward, but cpu_flash_attention had a fundamentally different API structure. So I shimmed the old CPU flash attention and moved it into a modular attention/cpu_flash module at candle-nn/src/attention/cpu_flash/standard.rs, with only two extremely minor changes (two functions changed from private to pub(crate)). This made it possible to cleanly implement causal attention for CPU, leveraging a loop-bound approach and dispatch logic. This saves ~14% peak memory compared to the prior cpu_flash method and gives a 3-4% speed improvement (see experiments below).
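
To illustrate the loop-bound idea: for causal attention, the key/value loop for query position `i` only needs to cover positions `0..=i`, so no mask tensor is ever materialized, and the online-softmax rescaling keeps per-row state at O(head_dim). Below is a minimal single-head sketch of that approach; the function name, arguments, and the flat row-major `[seq_len, head_dim]` layout are illustrative assumptions, not the actual candle-nn code.

```rust
// Minimal sketch of loop-bound causal attention for a single head.
// Assumes row-major q/k/v slices of shape [seq_len, head_dim] and a
// precomputed scale (typically 1/sqrt(head_dim)). Illustrative only.
fn causal_flash_attn_1head(
    q: &[f32],
    k: &[f32],
    v: &[f32],
    seq_len: usize,
    head_dim: usize,
    scale: f32,
) -> Vec<f32> {
    let mut out = vec![0.0f32; seq_len * head_dim];
    for i in 0..seq_len {
        let qi = &q[i * head_dim..(i + 1) * head_dim];
        // Online softmax state: running max, running denominator, accumulator.
        let mut m = f32::NEG_INFINITY;
        let mut l = 0.0f32;
        let mut acc = vec![0.0f32; head_dim];
        // Causality via the loop bound: keys j > i are never visited,
        // so no mask tensor is ever built.
        for j in 0..=i {
            let kj = &k[j * head_dim..(j + 1) * head_dim];
            let vj = &v[j * head_dim..(j + 1) * head_dim];
            let s = scale * qi.iter().zip(kj).map(|(a, b)| a * b).sum::<f32>();
            let m_new = m.max(s);
            let correction = (m - m_new).exp(); // rescale previous partial sums
            let p = (s - m_new).exp();
            l = l * correction + p;
            for d in 0..head_dim {
                acc[d] = acc[d] * correction + p * vj[d];
            }
            m = m_new;
        }
        for d in 0..head_dim {
            out[i * head_dim + d] = acc[d] / l;
        }
    }
    out
}
```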

This memory improvement also applies relative to standard CPU attention. Unfortunately, all existing CPU flash attention implementations are still slower than standard CPU attention. (Future work: improve this with fused kernels.)

Experiments were run on an NVIDIA DGX (CPU and GPU benchmarks) with Qwen3-0.6B; the prefill was the first ~1,500 words of Ulysses.

CPU Throughput

| Implementation | Throughput | Notes |
|---|---|---|
| Causal (new) | 3.54 t/s | Loop-bound, no mask tensor |
| Prior flash (mask) | 3.42 t/s | Requires mask tensor |
| Standard matmul | 3.69 t/s | Baseline |

Causal is ~3.5% faster than prior mask-based flash attention.

CPU Memory (Peak RSS)

| Implementation | Peak Memory | Δ vs Standard |
|---|---|---|
| Flash attention | 2.91 GB | -14% |
| Standard matmul | 3.37 GB | baseline |

Flash attention reduces peak memory by ~460 MB for the 1,685-token prefill by avoiding materialization of the full QKᵀ attention score matrix.

GPU Throughput

| Implementation | Throughput |
|---|---|
| GPU flash attention | 40.39 t/s |
| GPU standard matmul | 38.04 t/s |

Future Work

- Unified SDPA module: add scaled dot-product attention with dispatch across all backends (CPU flash, GPU flash, GPU matmul). This will simplify the Qwen3 transformer and other model implementations (see the sketch after this list).
- Improve causal performance: pursue fused SIMD kernels that handle GQA head broadcasting internally. A broadcast-matmul approach ran into the limitations described in #3253, but fused kernels are likely more performant.
- Explore interleaved KV-cached attention.
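
As a rough sketch of the unified SDPA idea, the entry point could branch on the device of the query tensor. The names below (`sdpa`, `cpu_flash_sdpa`, `gpu_flash_sdpa`, `matmul_sdpa`) are placeholders, not existing candle-nn APIs, and the backend functions are stubbed only to keep the sketch self-contained.

```rust
// Hypothetical sketch of a unified SDPA entry point with backend dispatch.
use candle::{Device, Result, Tensor};

/// Dispatch scaled dot-product attention to an appropriate backend.
pub fn sdpa(q: &Tensor, k: &Tensor, v: &Tensor, scale: f64, causal: bool) -> Result<Tensor> {
    match q.device() {
        // CPU: loop-bound causal flash path (this PR) or the standard CPU flash path.
        Device::Cpu => cpu_flash_sdpa(q, k, v, scale, causal),
        // CUDA: flash kernel when the feature is compiled in.
        Device::Cuda(_) if cfg!(feature = "flash-attn") => gpu_flash_sdpa(q, k, v, scale, causal),
        // Everything else falls back to the plain matmul implementation.
        _ => matmul_sdpa(q, k, v, scale, causal),
    }
}

// Placeholder backends, assumed to live elsewhere in the attention module.
fn cpu_flash_sdpa(_q: &Tensor, _k: &Tensor, _v: &Tensor, _s: f64, _c: bool) -> Result<Tensor> {
    unimplemented!("loop-bound CPU flash attention")
}
fn gpu_flash_sdpa(_q: &Tensor, _k: &Tensor, _v: &Tensor, _s: f64, _c: bool) -> Result<Tensor> {
    unimplemented!("GPU flash attention kernel")
}
fn matmul_sdpa(_q: &Tensor, _k: &Tensor, _v: &Tensor, _s: f64, _c: bool) -> Result<Tensor> {
    unimplemented!("standard matmul attention")
}
```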

@DrJesseGlass
Contributor Author

I'm also thinking of including rotary embeddings and other shared attention-related operations in this component.
