Skip to content

perf(cuda): port sample_logits chain to a fused GPU sampler kernel#478

Open
geometric[bot] wants to merge 17 commits into
Luce-Org:mainfrom
GeometricAGI:geometric_ai/sampling_kernels
Open

perf(cuda): port sample_logits chain to a fused GPU sampler kernel#478
geometric[bot] wants to merge 17 commits into
Luce-Org:mainfrom
GeometricAGI:geometric_ai/sampling_kernels

Conversation

@geometric

@geometric geometric Bot commented Jul 1, 2026

Copy link
Copy Markdown

Summary

  • Ports the CPU sample_logits chain (repetition/frequency/presence penalty, softmax(temp), top_p nucleus, multinomial draw) to CUDA so the Qwen3.5 AR decode loop can sample straight off the device logits tensor
    instead of paying a full vocab-wide (151,936-float) D2H copy every token.

  • Penalty application, the softmax reductions, and the draw are now one fused kernel (mode-selected: greedy / sample / emit-probs) using warp-shuffle block reductions, replacing what used to be a separate penalty kernel plus a shared-memory tree reduction.

  • top_k truncation stays on the CPU but we use partial_sort instead of a full sort, and pure top_p is
    GPU-assisted: the GPU computes penalties+softmax and hands back the probability vector for the CPU to truncate.

  • CPU-side top_p also changed independently of the GPU work: nucleus truncation no longer does a full O(vocab log vocab) sort. It now usesnucleus_cutoff, an std::nth_element-based recursive bisection that
    finds the cutoff index in O(vocab) total work regardless of where the nucleus lands.

  • The GPU path is on by default for CUDA builds; opt out with DFLASH_GPU_SAMPLE=0.

Impact

Kernel Level

DFLASH_SAMPLER_BENCH=1 ./test_gpu_sampler_cuda

[microbench] vocab=151936 iters=1000 (per call; >1.0x = GPU faster)
  greedy (temp=0)        CPU   133.87 us | GPU+H2D   132.36 us (1.01x) | GPU devptr    47.34 us (2.83x)
  temp=0.8 (full vocab)  CPU   233.28 us | GPU+H2D   232.35 us (1.00x) | GPU devptr   146.70 us (1.59x)
  temp=0.8 rep_pen=1.2   CPU   245.39 us | GPU+H2D   245.75 us (1.00x) | GPU devptr   167.37 us (1.47x)

End-to-End

Baseline

Compute on CPU

Run via: DFLASH_SAMP=0.8,1.0,0,1.1,42 DFLASH_GPU_SAMPLE=0 python -m server.scripts.bench_llm --bench HumanEval

Note: In the above command temp=0.8, top-p=1.0, top-k=0, rep_pen=1.1,r andom_seed=42

[bench] === SUMMARY ===
Task                AR    DFlash      AL   Speedup     Score
HumanEval        34.45     69.33    5.96     2.01x

New Kernel

Run via: DFLASH_SAMP=0.8,1.0,0,1.1,42 DFLASH_GPU_SAMPLE=1 python -m server.scripts.bench_llm --bench HumanEval
Results: An approx ~32% increase in tok/s with similar Acceptance Lengths.

[bench] === SUMMARY ===
Task                AR    DFlash      AL   Speedup     Score
HumanEval        34.45     92.30    6.06     2.68x

Implementation

  • server/src/common/sampler.cpp — added nucleus_cutoff (nth_element-based O(vocab) bisection, replaces a full sort for top_p) and draw_from_weights (deduplicates the final weighted CDF draw); wired the two GPU dispatch points (full-GPU for greedy/temp, GPU-assisted for pure top_p) into sample_logits.

  • server/src/common/geometric_sampler_cuda.cu/h — new/rewritten: single fused geometric_sample_kernel (mode-selected greedy / sample / emit-probs) doing penalty application, softmax reductions, and the multinomial draw in one launch, with warp-shuffle block reductions and a per-device pick_block_size.

  • server/src/qwen35/qwen35_backend.cpp — AR decode now calls geometric_sample_logits_cuda directly on the device logits tensor when the sampler config is GPU-supported, skipping the vocab-wide D2H copy the CPU chain otherwise needs.

  • server/CMakeLists.txt — added the DFLASH_GPU_SAMPLER build option (default ON) that compiles geometric_sampler_cuda.cu into dflash_common, and registered the test_gpu_sampler_cuda ctest target.

  • server/test/test_gpu_sampler_cuda.cpp — new correctness test: GPU vs CPU agreement for greedy, greedy+penalties, temperature-sample distribution, and top_k/top_p CPU-fallback signaling.

  • server/test/test_dflash.cpp — added --samp=temp,top_p,top_k,rep_pen,seed[,freq,pres] to the positional (non-daemon) harness so benchmarks can exercise the sampler chain.

  • server/scripts/bench_llm.py — added DFLASH_SAMP (forwards the sampler tail to test_dflash --samp=) and DFLASH_N_SAMPLE (overrides prompts-per-dataset) env vars.

  • README.md — documented GPU sampler coverage, runtime/build flags, and the benchmark table below.

Runtime Flags / Configuration

Default-on paths:

  • DFLASH_GPU_SAMPLE — on by default on CUDA builds; handles greedy and plain temperature/penalty sampling entirely on GPU, and assists pure top_p.
    Disable path:
  • DFLASH_GPU_SAMPLE=0 — opt out at runtime; every call falls back to the CPU chain.
  • -DDFLASH_GPU_SAMPLER=OFF (CMake option, default ON) — drop geometric_sampler_cuda.cu from the build entirely.
    Debug/profiling-only flags:
  • --samp=temp,top_p,top_k,rep_pen,seed[,freq,pres] (test_dflash positional harness) — exercise the sampler chain instead of greedy decode.
  • DFLASH_SAMP=temp,top_p,top_k,rep_pen,seed[,freq,pres] / DFLASH_N_SAMPLE=N (bench_llm.py) — forward the same sampler tail to every DFlash bench call, and override the per-dataset prompt count.
    top_k (with or without top_p) is intentionally never routed to the GPU
    — its CPU partial_sort cost scales with k, not vocab, and a GPU round
    trip (kernel launch + D2H copy) measured as a net regression, not just a
    non-win. This is a deliberate, measurement-driven exclusion, not a gap.

Notes

top_p support directly on the GPU kernel (rather than GPU-assisted) is
deliberately out of scope for this PR;

pramodith and others added 16 commits June 18, 2026 15:21
The draft top-K + logsumexp kernel launched only n_positions (~15) blocks,
leaving most of the GPU's SMs idle, and kept its per-thread top-K in a
data-dependent insertion index that the compiler spilled to local memory and
re-read on every vocab element. Both made the ~9 MB vocab scan run at a small
fraction of peak DRAM bandwidth.

Rework into a split-K two-pass design:
  - pass 1 (draft_topk_partial) splits each position's vocab scan across many
    blocks (2D grid n_positions x split) so all SMs stay busy;
  - pass 2 (draft_topk_combine) merges the per-split partials per position.
Template both kernels on K (compile-time) so the top-K stays register-resident
via a branchless unrolled bubble instead of spilling, and read logits as float4
(one coalesced 16-byte transaction per 4 logits) with a scalar fallback when a
row base is not 16-byte aligned (vocab % 4 != 0), preserving any-vocab
correctness. split is auto-tuned (env override DFLASH_TOPK_SPLIT).

Measured on an RTX 3090 (n=15, vocab=151936, K=8):
  - GPU kernel time: 392 us -> 36.3 us (30.6 partial + 5.75 combine), 10.8x
  - full call (kernel+sync+D2H): 0.407 ms -> 0.053 ms, 7.7x
Full-call speedup is 5.9-8.4x across n in {7,15,31,63}. Output is bit-for-bit
equivalent to the CPU reference (id_mismatches=0) across K in {1,2,4,8} and
both aligned and odd vocab; compute-sanitizer memcheck clean on both paths.

Adds bench_topk.cu, a standalone microbenchmark + CPU-reference correctness
harness (not wired into the build) used to profile and A/B this change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01T55hNb5cgyCwNYnNAE1hun
@geometric geometric Bot marked this pull request as ready for review July 1, 2026 16:48
@davide221

Copy link
Copy Markdown
Contributor

@pramodith @DeanoC it seems that the PR has some conflict, would be useful for us to be able to visualize them

@pramodith

Copy link
Copy Markdown
Contributor

@davide221 sorry the merge conflict was because a parallel worktree messed up the llama.cpp file. I cleaned it up.

@pramodith

Copy link
Copy Markdown
Contributor

New results after latest commit for coalescing the cuda reads and writes.

[microbench] vocab=151936 iters=1000 (per call; >1.0x = GPU faster)
  greedy (temp=0)        CPU   113.97 us | GPU+H2D   113.90 us (1.00x) | GPU devptr    31.01 us (3.68x)
  temp=0.8 (full vocab)  CPU   149.51 us | GPU+H2D   148.23 us (1.01x) | GPU devptr    65.32 us (2.29x)
  temp=0.8 rep_pen=1.2   CPU   161.96 us | GPU+H2D   161.86 us (1.00x) | GPU devptr    85.01 us (1.91x)

End-to-End

Baseline CPU only path:

[bench] === SUMMARY ===
Task                AR    DFlash      AL   Speedup     Score
HumanEval        34.18     73.37    6.30     2.15x   

GPU Path:

[bench] === SUMMARY ===
Task                AR    DFlash      AL   Speedup     Score
HumanEval        34.07     95.95    6.28     2.82x      

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants