perf(qwen35): recover dense AR replay fast path #476
Draft
dusterbloom wants to merge 6 commits into
In build_delta_net_block, replace the two-op silu(z_4d) + mul(output, z_silu) pattern with ggml_swiglu_split(z_4d, output_n). swiglu_split(a,b)=silu(a)*b, mathematically identical. Removes 30 kernel launches per AR decode step (UNARY-30, MUL-30, GLU+30; graph 2744->2714). Token-identical at temp=0 confirmed. Part 1 of launch-count gap closure (30/162 = 18.5% of gap). (cherry picked from commit cfc9ff7)
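The identity behind this change can be checked outside ggml. The following is a minimal standalone sketch, not the ggml kernels themselves: `gate_two_pass` and `gate_fused` are hypothetical names that mirror the old `silu` + `mul` pair and the fused `silu(a)*b` form on plain float vectors.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// SiLU activation: x * sigmoid(x), written as x / (1 + exp(-x)).
inline float silu(float x) {
    return x / (1.0f + std::exp(-x));
}

// Two-pass form (hypothetical helper): materialize silu(z) first,
// then multiply by the output in a second pass. Two "kernels".
std::vector<float> gate_two_pass(const std::vector<float>& z,
                                 const std::vector<float>& out) {
    std::vector<float> z_silu(z.size());
    for (std::size_t i = 0; i < z.size(); ++i) {
        z_silu[i] = silu(z[i]);
    }
    std::vector<float> res(z.size());
    for (std::size_t i = 0; i < z.size(); ++i) {
        res[i] = z_silu[i] * out[i];
    }
    return res;
}

// Fused form (hypothetical helper): silu(a) * b in a single pass,
// with no intermediate buffer. One "kernel".
std::vector<float> gate_fused(const std::vector<float>& z,
                              const std::vector<float>& out) {
    std::vector<float> res(z.size());
    for (std::size_t i = 0; i < z.size(); ++i) {
        res[i] = silu(z[i]) * out[i];
    }
    return res;
}
```

Both helpers assume `z` and `out` have the same length. The fused form performs the same arithmetic per element and only removes the intermediate tensor and the second pass, which is why the graph loses one node per DeltaNet layer.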
…re AR
FWHT gate: DFLASH_NO_WHT=1 disables the K/Q rotation that costs ~16 extra kernels/token (~3% decode). q4_0 KV doesn't need the outlier spreading on Ampere. Default off for TQ3_0 (WHT baked into quant).
Repeat_back skip: in pure AR (no tree/capture), the SSM kernel's broadcast handles the head mismatch natively: skip the repeat_back copy per DeltaNet layer per step (~1% decode). (cherry picked from commit 309b1ce)
The DeltaNet conv_input concat (conv_states_r, qkv_T) hit the slow concat_f32_dim0 kernel (15us) because both inputs were non-contiguous (reshape + transpose). Wrapping in ggml_cont routes to the fast concat_cont path (3.3us) — ~0.5ms/token saved in the 30-layer serial DeltaNet recurrence. (cherry picked from commit ca0ba8d)
…n FA bucket
build_target_step rebuilds the 2744-node cgraph every decode step, costing ~0.38ms/token and resetting ggml-cuda's CUDA-graph warmup counter. The only per-step topology change is win_len_padded = round_up(committed+1, 256), which only advances every 256 steps. Gate the rebuild on committed/256; within a bucket reuse the cached graph and update only mutable inputs. DFLASH_AR_NO_REUSE=1 restores the per-step rebuild. Measured: +6% decode tok/s (106→113) on Q3_K_XL, q4_0 KV, 16K ctx. (cherry picked from commit dbf4ed9)
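The bucket gate described above can be sketched roughly as follows. This is an illustration under assumptions, not the code in `build_target_step`: `CachedStepGraph`, `prepare_step`, and `kBucket` are hypothetical names, and the `force_rebuild` flag stands in for whatever `DFLASH_AR_NO_REUSE=1` toggles.

```cpp
#include <cstdint>

// Round n up to the next multiple of step (step > 0).
constexpr std::int64_t round_up(std::int64_t n, std::int64_t step) {
    return (n + step - 1) / step * step;
}

// Bucket width taken from the commit message (256 committed tokens).
constexpr std::int64_t kBucket = 256;

// Hypothetical stand-in for the cached decode graph; not the PR's real type.
struct CachedStepGraph {
    std::int64_t bucket = -1;        // committed / 256 at the last build, -1 = never built
    std::int64_t win_len_padded = 0; // the only topology-affecting size
    int rebuilds = 0;                // number of full graph rebuilds so far
};

// Returns true when the graph was rebuilt, false when the cached one is reused.
// force_rebuild models the per-step rebuild escape hatch (assumption).
inline bool prepare_step(CachedStepGraph& g, std::int64_t committed,
                         bool force_rebuild = false) {
    const std::int64_t bucket = committed / kBucket;
    if (!force_rebuild && bucket == g.bucket) {
        // Same bucket: win_len_padded is unchanged, so the topology is too.
        // Only the mutable inputs would be refreshed here.
        return false;
    }
    g.bucket = bucket;
    g.win_len_padded = round_up(committed + 1, kBucket);
    ++g.rebuilds;
    return true;
}
```

Gating on `committed / 256` works because `round_up(committed + 1, 256)` changes exactly when that quotient changes: for `committed` in 0..255 the padded length is 256, and it first becomes 512 at `committed == 256`.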
Move ggml_cont from the concat call to the qkv_T transpose definition. This makes qkv_T contiguous once (1 cont per DeltaNet layer) instead of wrapping both concat inputs (2 conts per layer). conv_states_r is already contiguous (reshape of contiguous cache tensor). Net: -30 graph nodes (2704→2674), compute -24µs/step. (cherry picked from commit e3db53f)
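Why the transpose needs a `cont` while the reshape does not can be illustrated with a toy stride model. This is a simplified sketch, not ggml's tensor type: `View2D` and its helpers are hypothetical, and strides are counted in elements rather than bytes.

```cpp
#include <array>
#include <cstddef>
#include <utility>

// Toy 2-D tensor view (hypothetical): ne = element counts, nb = strides in elements.
// ne[0] is the fastest-varying dimension.
struct View2D {
    std::array<std::size_t, 2> ne;
    std::array<std::size_t, 2> nb;
};

// Dense row-major layout: stride 1 along dim 0, ne0 along dim 1.
inline View2D make_contiguous(std::size_t ne0, std::size_t ne1) {
    return View2D{{ne0, ne1}, {1, ne0}};
}

// Transpose swaps sizes and strides but moves no data,
// so the result is generally no longer dense.
inline View2D transpose(View2D v) {
    std::swap(v.ne[0], v.ne[1]);
    std::swap(v.nb[0], v.nb[1]);
    return v;
}

// Simplified density check for this toy model.
inline bool is_contiguous(const View2D& v) {
    return v.nb[0] == 1 && v.nb[1] == v.ne[0];
}

// Models the effect of a cont: same shape, dense strides (implies one copy).
inline View2D cont(const View2D& v) {
    return make_contiguous(v.ne[0], v.ne[1]);
}
```

In this model a contiguous tensor stays contiguous under reshape, while its transpose does not until it is copied once. That matches the reasoning above: one `cont` at the `qkv_T` definition is enough for the concat to see two dense inputs.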
Summary
Extract the server-only dense AR replay micro-optimizations from the replay-win branch into a focused PR that does not depend on ggml PR24 or any submodule bump.
This keeps the in-place GDN/submodule work isolated in #473, while letting reviewers evaluate the independent Qwen35 AR decode fast path now.
What changed
… ggml_swiglu_split.
… q4_0 by default, while preserving explicit overrides:
DFLASH_FORCE_WHT=1 restores the old graph-level rotation.
DFLASH_NO_WHT disables it.
… qkv_T to reduce concat-cont overhead.
Scope and dependencies
… server/deps/llama.cpp submodule change.
… ggml_cuda_set_skip_props_check / props-check hook.
Why
Our best 6-turn dense replay result recovered a decode-rate win over llama.cpp, but #473 mixed those micro-opts with the ggml-dependent in-place GDN consumer. Splitting this lets the low-risk server-side perf work land independently and keeps the reviewer surface small.
Latest measured replay context from the winning stack:
… 6.900s, decode 5.394s, wall 12.294s, weighted decode 42.642 tok/s.
… 8.1-8.7s, decode about 5.28-5.40s, wall about 13.6-14.1s, weighted decode about 43.6 tok/s.
This PR carries the server-only part of that decode recovery. Prefill parity remains separate work.
Validation
Local focused worktree, upstream ggml submodule a317b0ea0fc9eb716a311976fed8dc0f301dc09f:
env CC=/usr/bin/gcc-11 CXX=/usr/bin/g++-11 CUDACXX=/usr/local/cuda-12.6/bin/nvcc LD_LIBRARY_PATH=/usr/local/cuda-12.6/lib64:/usr/lib/wsl/lib cmake -S server -B server/build-ar-replay-microopts-gcc11 -DCMAKE_BUILD_TYPE=Release -DCMAKE_CUDA_ARCHITECTURES=86 -DCMAKE_CUDA_HOST_COMPILER=/usr/bin/g++-11 -DGGML_CCACHE=OFF
env LD_LIBRARY_PATH=/usr/local/cuda-12.6/lib64:/usr/lib/wsl/lib cmake --build server/build-ar-replay-microopts-gcc11 --target dflash_server test_server_unit -j 8
env LD_LIBRARY_PATH=/usr/local/cuda-12.6/lib64:/usr/lib/wsl/lib server/build-ar-replay-microopts-gcc11/test_server_unit
2022 assertions, 0 failures
Notes:
/usr/bin/nvcc 12.0 + GCC 13 failed at CUDA compiler identification due to the known glibc _Float* incompatibility; CUDA 12.6 + GCC 11 configured and built successfully.