perf(qwen35): use in-place GDN state write in pure AR by dusterbloom · Pull Request #473 · Luce-Org/lucebox-hub

dusterbloom · 2026-06-30T20:44:20Z

Summary

Use the ggml in-place GDN final-state write from Lucebox ggml PR24 in the Qwen35 pure-AR path.

This keeps the call site strict: pure AR only, no chunked path, no tree parent ids, no capture or persist intermediate path, and it keeps main behavior from #465.

Dependency

Depends on Lucebox ggml PR24:
Luce-Org/lucebox-ggml#24

This PR intentionally remains draft until that ggml submodule commit is available from the upstream submodule remote.

Validation

cmake --build server/build-pr-gdn-inplace --target test_server_unit -j 8
server/build-pr-gdn-inplace/test_server_unit
Result: 2022 assertions, 0 failures

In build_delta_net_block, replace the two-op silu(z_4d) + mul(output, z_silu) pattern with ggml_swiglu_split(z_4d, output_n). swiglu_split(a, b) = silu(a) * b, so the result is mathematically identical. This removes 30 kernel launches per AR decode step (UNARY -30, MUL -30, GLU +30; graph 2744→2714). Token-identical output at temp=0 confirmed. Part 1 of the launch-count gap closure (30/162 = 18.5% of the gap).
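The equivalence claimed above can be checked on the CPU with a small sketch. This is not the ggml kernel or the code in build_delta_net_block: the helper names `gate_two_op` and `gate_fused` are hypothetical, and the sketch only illustrates that materializing silu(z) and then multiplying gives the same values as computing silu(z) * out in a single pass.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Reference SiLU: x * sigmoid(x), written as x / (1 + exp(-x)).
static float silu(float x) { return x / (1.0f + std::exp(-x)); }

// Two-pass form (hypothetical helper): materialize silu(z), then multiply.
// This mirrors the silu + mul pair that needs two kernel launches.
std::vector<float> gate_two_op(const std::vector<float>& z,
                               const std::vector<float>& out) {
    assert(z.size() == out.size());
    std::vector<float> z_silu(z.size());
    for (std::size_t i = 0; i < z.size(); ++i) z_silu[i] = silu(z[i]);
    std::vector<float> y(z.size());
    for (std::size_t i = 0; i < z.size(); ++i) y[i] = z_silu[i] * out[i];
    return y;
}

// Fused form (hypothetical helper): silu(a) * b in one pass, with no
// intermediate buffer, as a single fused op would compute it.
std::vector<float> gate_fused(const std::vector<float>& z,
                              const std::vector<float>& out) {
    assert(z.size() == out.size());
    std::vector<float> y(z.size());
    for (std::size_t i = 0; i < z.size(); ++i) y[i] = silu(z[i]) * out[i];
    return y;
}
```

Calling both helpers on the same inputs and comparing element by element is enough to see the agreement; the launch-count saving itself only shows up on the GPU graph.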

…re AR

FWHT gate: DFLASH_NO_WHT=1 disables the K/Q rotation that costs ~16 extra kernels/token (~3% decode). q4_0 KV doesn't need the outlier spreading on Ampere. Default off for TQ3_0 (WHT baked into quant).

Repeat_back skip: in pure AR (no tree/capture), the SSM kernel's broadcast handles the head mismatch natively, so the repeat_back copy per DeltaNet layer per step is skipped (~1% decode).

The DeltaNet conv_input concat (conv_states_r, qkv_T) hit the slow concat_f32_dim0 kernel (15 µs) because both inputs were non-contiguous (reshape + transpose). Wrapping them in ggml_cont routes the concat to the fast concat_cont path (3.3 µs), saving ~0.5 ms/token in the 30-layer serial DeltaNet recurrence.
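Why a transpose leaves a tensor non-contiguous can be sketched with shapes and byte strides alone. The `View2D` type and its helpers below are hypothetical and only mimic the idea of ggml's per-dimension element counts (`ne`) and byte strides (`nb`); they are not the ggml API. A transposed view swaps the strides without moving data, so the innermost stride no longer equals the element size, and a copy (what ggml_cont performs) is needed to get a packed layout back.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical 2-D float view: ne = elements per dimension,
// nb = byte stride per dimension (dimension 0 is innermost).
struct View2D {
    std::array<std::size_t, 2> ne;
    std::array<std::size_t, 2> nb;
};

// A freshly allocated, densely packed ne0 x ne1 float tensor.
View2D make_packed(std::size_t ne0, std::size_t ne1) {
    assert(ne0 > 0 && ne1 > 0);
    return View2D{{ne0, ne1}, {sizeof(float), ne0 * sizeof(float)}};
}

// Transpose as a view: swap shape and strides, leave the data in place.
View2D transpose(const View2D& v) {
    return View2D{{v.ne[1], v.ne[0]}, {v.nb[1], v.nb[0]}};
}

// Packed means: innermost stride is one element, and each row starts
// right after the previous one.
bool is_packed(const View2D& v) {
    return v.nb[0] == sizeof(float) && v.nb[1] == v.ne[0] * sizeof(float);
}
```

For example, `is_packed(make_packed(4, 3))` holds, while `is_packed(transpose(make_packed(4, 3)))` does not, which is the situation a fast contiguous-only kernel cannot handle directly.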

…n FA bucket

build_target_step rebuilds the 2744-node cgraph every decode step, costing ~0.38 ms/token and resetting ggml-cuda's CUDA-graph warmup counter. The only per-step topology change is win_len_padded = round_up(committed+1, 256), which only advances every 256 steps. Gate the rebuild on committed/256; within a bucket, reuse the cached graph and update only mutable inputs. DFLASH_AR_NO_REUSE=1 restores the per-step rebuild.

Measured: +6% decode tok/s (106→113) on Q3_K_XL, q4_0 KV, 16K ctx.
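The bucket arithmetic behind this gate can be sketched in isolation. The `GraphCache` struct below is a hypothetical stand-in, not the structure used in build_target_step; it only counts how often a rebuild would be triggered when the key is committed/256, and shows that this key changes exactly when round_up(committed+1, 256) does.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::int64_t kBucket = 256;

// Round n up to the next multiple of m (n >= 0, m > 0).
std::int64_t round_up(std::int64_t n, std::int64_t m) {
    return (n + m - 1) / m * m;
}

// Padded window length as given in the commit message.
std::int64_t win_len_padded(std::int64_t committed) {
    return round_up(committed + 1, kBucket);
}

// Hypothetical cache: remembers the bucket the current graph was built for.
struct GraphCache {
    std::int64_t bucket = -1;  // -1 means no graph has been built yet
    int rebuilds = 0;

    // Returns true if this step would rebuild the graph, false if the
    // cached graph can be reused and only its inputs need updating.
    bool step(std::int64_t committed) {
        assert(committed >= 0);
        const std::int64_t b = committed / kBucket;
        if (b == bucket) return false;
        bucket = b;
        ++rebuilds;
        return true;
    }
};
```

Stepping `committed` from 0 to 1023 through `GraphCache::step` triggers 4 rebuilds instead of 1024, one per 256-token bucket.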

Move ggml_cont from the concat call to the qkv_T transpose definition. This makes qkv_T contiguous once (1 cont per DeltaNet layer) instead of wrapping both concat inputs (2 conts per layer). conv_states_r is already contiguous (reshape of contiguous cache tensor). Net: -30 graph nodes (2704→2674), compute -24µs/step.

perf(qwen35): use in-place GDN state write in pure AR

235f704

dusterbloom mentioned this pull request Jun 30, 2026

fix(qwen35): route non-tree GDN capture through persist buffer (src[7]) #469

Merged

dusterbloom added 6 commits July 1, 2026 01:55

perf(qwen35): recover dense AR replay fast path

6309f28

dusterbloom mentioned this pull request Jul 1, 2026

perf(qwen35): recover dense AR replay fast path #476

Draft

dusterbloom wants to merge 7 commits into Luce-Org:main from dusterbloom:perf/qwen35-gdn-inplace-consumer
