perf(cuda): port sample_logits chain to a fused GPU sampler kernel by geometric[bot] · Pull Request #478 · Luce-Org/lucebox-hub

geometric · 2026-07-01T16:16:23Z

Summary

Ports the CPU sample_logits chain (repetition/frequency/presence penalty, softmax(temp), top_p nucleus, multinomial draw) to CUDA so the Qwen3.5 AR decode loop can sample straight off the device logits tensor
instead of paying a full vocab-wide (151,936-float) D2H copy every token.
Penalty application, the softmax reductions, and the draw are now one fused kernel (mode-selected: greedy / sample / emit-probs) using warp-shuffle block reductions, replacing what used to be a separate penalty kernel plus a shared-memory tree reduction.
top_k truncation stays on the CPU but we use partial_sort instead of a full sort, and pure top_p is
GPU-assisted: the GPU computes penalties+softmax and hands back the probability vector for the CPU to truncate.
CPU-side top_p also changed independently of the GPU work: nucleus truncation no longer does a full O(vocab log vocab) sort. It now uses nucleus_cutoff, a std::nth_element-based recursive bisection that
finds the cutoff index in O(vocab) total work regardless of where the nucleus lands.
The GPU path is on by default for CUDA builds; opt out with DFLASH_GPU_SAMPLE=0.
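
The nth_element bisection described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the PR's code: the name nucleus_cutoff_sketch, its signature, and the index-vector output are hypothetical. Each round partitions only the range that can still contain the cutoff, so the ranges shrink geometrically and the expected total work is linear in the vocabulary size.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper (not the PR's nucleus_cutoff signature).
// Returns k, the smallest number of highest-probability tokens whose total
// mass reaches top_p, and leaves those k token ids (unordered) in idx[0..k).
std::size_t nucleus_cutoff_sketch(const std::vector<float>& probs,
                                  float top_p,
                                  std::vector<int>& idx) {
    assert(!probs.empty());
    idx.resize(probs.size());
    std::iota(idx.begin(), idx.end(), 0);
    auto by_prob_desc = [&](int a, int b) { return probs[a] > probs[b]; };

    std::size_t lo = 0;          // idx[0..lo) is known to be inside the nucleus
    std::size_t hi = idx.size(); // the cutoff k lies in (lo, hi]
    double acc = 0.0;            // probability mass of idx[0..lo)
    while (hi - lo > 1) {
        const std::size_t mid = lo + (hi - lo) / 2;
        // Partition only the undecided range: the larger half moves to [lo, mid).
        std::nth_element(idx.begin() + lo, idx.begin() + mid, idx.begin() + hi,
                         by_prob_desc);
        double left = 0.0;
        for (std::size_t i = lo; i < mid; ++i) left += probs[idx[i]];
        if (acc + left >= top_p) {
            hi = mid;            // the top-mid tokens already reach top_p
        } else {
            acc += left;         // the top-mid tokens are all needed; look right
            lo = mid;
        }
    }
    return hi;
}
```

For probabilities {0.0625, 0.5, 0.125, 0.25, 0.0625} and top_p = 0.7, the sketch returns 2 and leaves token ids 1 and 3 in idx[0..2).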

Impact

Kernel Level

DFLASH_SAMPLER_BENCH=1 ./test_gpu_sampler_cuda

[microbench] vocab=151936 iters=1000 (per call; >1.0x = GPU faster)
  greedy (temp=0)        CPU   133.87 us | GPU+H2D   132.36 us (1.01x) | GPU devptr    47.34 us (2.83x)
  temp=0.8 (full vocab)  CPU   233.28 us | GPU+H2D   232.35 us (1.00x) | GPU devptr   146.70 us (1.59x)
  temp=0.8 rep_pen=1.2   CPU   245.39 us | GPU+H2D   245.75 us (1.00x) | GPU devptr   167.37 us (1.47x)

End-to-End

Baseline

Compute on CPU

Run via: DFLASH_SAMP=0.8,1.0,0,1.1,42 DFLASH_GPU_SAMPLE=0 python -m server.scripts.bench_llm --bench HumanEval

Note: in the above command, temp=0.8, top_p=1.0, top_k=0, rep_pen=1.1, random_seed=42.
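
For readers unfamiliar with the sampler tail, the field order temp,top_p,top_k,rep_pen,seed[,freq,pres] can be unpacked as in the sketch below. The struct, the function name, and the rule that the two optional fields appear together are assumptions made for illustration; the actual parser in test_dflash may differ.

```cpp
#include <cassert>
#include <cstdint>
#include <exception>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical container for the sampler tail; defaults are illustrative.
struct SampTailSketch {
    float temp = 0.0f;
    float top_p = 1.0f;
    int top_k = 0;
    float rep_pen = 1.0f;
    std::uint64_t seed = 0;
    float freq_pen = 0.0f;  // optional trailing field
    float pres_pen = 0.0f;  // optional trailing field
};

// Parses "temp,top_p,top_k,rep_pen,seed[,freq,pres]".
// Returns false if the field count is not 5 or 7 or a field is not numeric.
bool parse_samp_tail_sketch(const std::string& s, SampTailSketch* out) {
    assert(out != nullptr);
    std::vector<std::string> f;
    std::stringstream ss(s);
    for (std::string tok; std::getline(ss, tok, ',');) f.push_back(tok);
    if (f.size() != 5 && f.size() != 7) return false;
    try {
        SampTailSketch cfg;
        cfg.temp = std::stof(f[0]);
        cfg.top_p = std::stof(f[1]);
        cfg.top_k = std::stoi(f[2]);
        cfg.rep_pen = std::stof(f[3]);
        cfg.seed = std::stoull(f[4]);
        if (f.size() == 7) {
            cfg.freq_pen = std::stof(f[5]);
            cfg.pres_pen = std::stof(f[6]);
        }
        *out = cfg;  // commit only after every field parsed
    } catch (const std::exception&) {
        return false;  // std::stof / std::stoi throw on malformed input
    }
    return true;
}
```

With the string "0.8,1.0,0,1.1,42" this yields temp=0.8, top_p=1.0, top_k=0, rep_pen=1.1, and seed=42, with both optional penalties left at zero.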

[bench] === SUMMARY ===
Task                AR    DFlash      AL   Speedup     Score
HumanEval        34.45     69.33    5.96     2.01x

New Kernel

Run via: DFLASH_SAMP=0.8,1.0,0,1.1,42 DFLASH_GPU_SAMPLE=1 python -m server.scripts.bench_llm --bench HumanEval
Results: roughly a 33% increase in DFlash tok/s (69.33 to 92.30) with similar acceptance lengths (AL 5.96 vs 6.06).

[bench] === SUMMARY ===
Task                AR    DFlash      AL   Speedup     Score
HumanEval        34.45     92.30    6.06     2.68x

Implementation

server/src/common/sampler.cpp — added nucleus_cutoff (nth_element-based O(vocab) bisection, replaces a full sort for top_p) and draw_from_weights (deduplicates the final weighted CDF draw); wired the two GPU dispatch points (full-GPU for greedy/temp, GPU-assisted for pure top_p) into sample_logits.
server/src/common/geometric_sampler_cuda.cu/h — new/rewritten: single fused geometric_sample_kernel (mode-selected greedy / sample / emit-probs) doing penalty application, softmax reductions, and the multinomial draw in one launch, with warp-shuffle block reductions and a per-device pick_block_size.
server/src/qwen35/qwen35_backend.cpp — AR decode now calls geometric_sample_logits_cuda directly on the device logits tensor when the sampler config is GPU-supported, skipping the vocab-wide D2H copy the CPU chain otherwise needs.
server/CMakeLists.txt — added the DFLASH_GPU_SAMPLER build option (default ON) that compiles geometric_sampler_cuda.cu into dflash_common, and registered the test_gpu_sampler_cuda ctest target.
server/test/test_gpu_sampler_cuda.cpp — new correctness test: GPU vs CPU agreement for greedy, greedy+penalties, temperature-sample distribution, and top_k/top_p CPU-fallback signaling.
server/test/test_dflash.cpp — added --samp=temp,top_p,top_k,rep_pen,seed[,freq,pres] to the positional (non-daemon) harness so benchmarks can exercise the sampler chain.
server/scripts/bench_llm.py — added DFLASH_SAMP (forwards the sampler tail to test_dflash --samp=) and DFLASH_N_SAMPLE (overrides prompts-per-dataset) env vars.
README.md — documented GPU sampler coverage, runtime/build flags, and the benchmark table below.
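
As a reading aid, the three stages that the fused kernel performs in one launch (penalty application, softmax with temperature, multinomial draw) can be written as a plain CPU reference. This is a sketch, not the code in sampler.cpp or geometric_sampler_cuda.cu: all function names are hypothetical, and the penalty formulas are the commonly used conventions (divide positive logits and multiply negative logits by rep_pen; subtract freq times the occurrence count plus pres), which this PR description does not spell out.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <unordered_map>
#include <vector>

// ASSUMPTION: conventional penalty formulas, not confirmed by the PR text.
void apply_penalties_sketch(std::vector<float>& logits,
                            const std::vector<int>& history,
                            float rep_pen, float freq_pen, float pres_pen) {
    std::unordered_map<int, int> counts;
    for (int tok : history) ++counts[tok];
    for (const auto& kv : counts) {
        assert(kv.first >= 0 && static_cast<std::size_t>(kv.first) < logits.size());
        float& l = logits[static_cast<std::size_t>(kv.first)];
        l = (l > 0.0f) ? l / rep_pen : l * rep_pen;              // repetition
        l -= freq_pen * static_cast<float>(kv.second) + pres_pen; // freq + presence
    }
}

// Numerically stable softmax(logits / temp): subtract the max before exp.
std::vector<float> softmax_temp_sketch(const std::vector<float>& logits, float temp) {
    assert(!logits.empty() && temp > 0.0f);
    const float mx = *std::max_element(logits.begin(), logits.end());
    std::vector<float> w(logits.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < logits.size(); ++i) {
        w[i] = std::exp((logits[i] - mx) / temp);
        sum += w[i];
    }
    for (float& x : w) x = static_cast<float>(x / sum);
    return w;
}

// Inverse-CDF draw over (possibly unnormalized) weights; u is uniform in [0, 1).
int draw_from_weights_sketch(const std::vector<float>& weights, double u) {
    assert(!weights.empty());
    double total = 0.0;
    for (float w : weights) total += w;
    const double target = u * total;
    double cdf = 0.0;
    for (std::size_t i = 0; i < weights.size(); ++i) {
        cdf += weights[i];
        if (target < cdf) return static_cast<int>(i);
    }
    return static_cast<int>(weights.size()) - 1;  // guard against rounding
}
```

Taking the uniform variate u as a parameter keeps the draw deterministic for a given seed, which is what makes a GPU versus CPU agreement test practical.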

Runtime Flags / Configuration

Default-on paths:

DFLASH_GPU_SAMPLE — on by default on CUDA builds; handles greedy and plain temperature/penalty sampling entirely on GPU, and assists pure top_p.
Disable path:
DFLASH_GPU_SAMPLE=0 — opt out at runtime; every call falls back to the CPU chain.
-DDFLASH_GPU_SAMPLER=OFF (CMake option, default ON) — drop geometric_sampler_cuda.cu from the build entirely.
Debug/profiling-only flags:
--samp=temp,top_p,top_k,rep_pen,seed[,freq,pres] (test_dflash positional harness) — exercise the sampler chain instead of greedy decode.
DFLASH_SAMP=temp,top_p,top_k,rep_pen,seed[,freq,pres] / DFLASH_N_SAMPLE=N (bench_llm.py) — forward the same sampler tail to every DFlash bench call, and override the per-dataset prompt count.
top_k (with or without top_p) is intentionally never routed to the GPU: its CPU partial_sort cost scales with k, not vocab, and a GPU round trip (kernel launch + D2H copy) measured as a net regression, not just a non-win. This is a deliberate, measurement-driven exclusion, not a gap.
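
A minimal sketch of top_k selection with std::partial_sort is shown below, assuming a hypothetical helper name and an index-based output. std::partial_sort still visits every logit once, but it only maintains and orders a k-entry heap, so the sorting work stays small when k is small.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper: ids of the k largest logits, best first.
std::vector<int> top_k_ids_sketch(const std::vector<float>& logits, std::size_t k) {
    assert(k > 0);
    k = std::min(k, logits.size());
    std::vector<int> idx(logits.size());
    std::iota(idx.begin(), idx.end(), 0);
    // Only the first k positions end up sorted; the rest are left unordered.
    std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
                      [&](int a, int b) {
                          if (logits[a] != logits[b]) return logits[a] > logits[b];
                          return a < b;  // deterministic tie-break on token id
                      });
    idx.resize(k);
    return idx;
}
```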

Notes

top_p support directly on the GPU kernel (rather than GPU-assisted) is deliberately out of scope for this PR.

The draft top-K + logsumexp kernel launched only n_positions (~15) blocks, leaving most of the GPU's SMs idle, and kept its per-thread top-K in a data-dependent insertion index that the compiler spilled to local memory and re-read on every vocab element. Both made the ~9 MB vocab scan run at a small fraction of peak DRAM bandwidth.

Rework into a split-K two-pass design:
- pass 1 (draft_topk_partial) splits each position's vocab scan across many blocks (2D grid n_positions x split) so all SMs stay busy;
- pass 2 (draft_topk_combine) merges the per-split partials per position.

Template both kernels on K (compile-time) so the top-K stays register-resident via a branchless unrolled bubble instead of spilling, and read logits as float4 (one coalesced 16-byte transaction per 4 logits) with a scalar fallback when a row base is not 16-byte aligned (vocab % 4 != 0), preserving any-vocab correctness. split is auto-tuned (env override DFLASH_TOPK_SPLIT).

Measured on an RTX 3090 (n=15, vocab=151936, K=8):
- GPU kernel time: 392 us -> 36.3 us (30.6 partial + 5.75 combine), 10.8x
- full call (kernel+sync+D2H): 0.407 ms -> 0.053 ms, 7.7x

Full-call speedup is 5.9-8.4x across n in {7,15,31,63}. Output is bit-for-bit equivalent to the CPU reference (id_mismatches=0) across K in {1,2,4,8} and both aligned and odd vocab; compute-sanitizer memcheck clean on both paths.

Adds bench_topk.cu, a standalone microbenchmark + CPU-reference correctness harness (not wired into the build) used to profile and A/B this change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01T55hNb5cgyCwNYnNAE1hun
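
The split-K idea can be illustrated on the CPU: each split produces its own top-K over a slice of the vocabulary, and a second pass merges the partial lists. The sketch below is a host-side analogue written for illustration, not the draft_topk_partial / draft_topk_combine kernels; all names are hypothetical, and the tie-break on token id is an assumption. K is a template parameter, so the insertion loop has a compile-time trip count and uses selects rather than a data-dependent index, which is the property that lets a GPU compiler keep the list in registers. For scale, 15 positions x 151,936 logits x 4 bytes (assuming fp32 logits) is about 9.1 MB, matching the "~9 MB" scan mentioned above.

```cpp
#include <algorithm>
#include <cassert>
#include <limits>
#include <vector>

// Fixed-size top-K list, ordered best first by (value descending, id ascending).
template <int K>
struct TopK {
    float val[K];
    int id[K];
    TopK() {
        for (int i = 0; i < K; ++i) {
            val[i] = -std::numeric_limits<float>::infinity();
            id[i] = std::numeric_limits<int>::max();  // sentinel: empty slot
        }
    }
    // Bubble the candidate through all K slots with selects only.
    void insert(float v, int t) {
        for (int i = 0; i < K; ++i) {
            const bool better = (v > val[i]) || (v == val[i] && t < id[i]);
            const float keep_v = better ? v : val[i];
            const int keep_t = better ? t : id[i];
            v = better ? val[i] : v;  // the loser of the pair is carried onward
            t = better ? id[i] : t;
            val[i] = keep_v;
            id[i] = keep_t;
        }
    }
};

// Pass 1 analogue: top-K over one slice [begin, end) of the vocabulary.
template <int K>
TopK<K> topk_partial_sketch(const std::vector<float>& logits, int begin, int end) {
    TopK<K> best;
    for (int i = begin; i < end; ++i) best.insert(logits[i], i);
    return best;
}

// Pass 2 analogue: merge the per-split partial lists into the final top-K.
template <int K>
TopK<K> topk_split_sketch(const std::vector<float>& logits, int split) {
    assert(split > 0);
    const int vocab = static_cast<int>(logits.size());
    const int chunk = (vocab + split - 1) / split;  // ceil(vocab / split)
    TopK<K> merged;
    for (int s = 0; s < split; ++s) {
        const int begin = std::min(s * chunk, vocab);
        const int end = std::min(begin + chunk, vocab);
        const TopK<K> part = topk_partial_sketch<K>(logits, begin, end);
        for (int i = 0; i < K; ++i) merged.insert(part.val[i], part.id[i]);
    }
    return merged;
}
```

Because the ordering on (value, id) is strict, the merged result does not depend on how the vocabulary is split, which is the kind of property a CPU-reference comparison can check.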

davide221 · 2026-07-02T09:21:08Z

@pramodith @DeanoC it seems that the PR has some conflicts; it would be useful for us to be able to see them.

pramodith · 2026-07-02T10:07:34Z

@davide221 sorry, the merge conflict happened because a parallel worktree messed up the llama.cpp file. I cleaned it up.

pramodith · 2026-07-02T14:34:51Z

New results after the latest commit, which coalesces the CUDA reads and writes.

[microbench] vocab=151936 iters=1000 (per call; >1.0x = GPU faster)
  greedy (temp=0)        CPU   113.97 us | GPU+H2D   113.90 us (1.00x) | GPU devptr    31.01 us (3.68x)
  temp=0.8 (full vocab)  CPU   149.51 us | GPU+H2D   148.23 us (1.01x) | GPU devptr    65.32 us (2.29x)
  temp=0.8 rep_pen=1.2   CPU   161.96 us | GPU+H2D   161.86 us (1.00x) | GPU devptr    85.01 us (1.91x)

End-to-End

Baseline CPU only path:

[bench] === SUMMARY ===
Task                AR    DFlash      AL   Speedup     Score
HumanEval        34.18     73.37    6.30     2.15x

GPU Path:

[bench] === SUMMARY ===
Task                AR    DFlash      AL   Speedup     Score
HumanEval        34.07     95.95    6.28     2.82x

pramodith and others added 16 commits June 18, 2026 15:21

avoid redundant logits read

cda8468

topk kernels

252b47b

Merge remote-tracking branch 'origin/main' into pramodith/optimize_topk

ba08748

Merge branch 'Luce-Org:main' into pramodith/optimize_topk

ca40977

add test cases

3b2d87b

register test file

6119847

update comments in draft_topk

9407d91

temp + rep_penalty kernels

be9ba4d

Merge remote-tracking branch 'origin/main' into pramodith/sampling_kernels

62978e2

Merge and update Readme.md for sampling flags.

bb137cf

get rid of top-p support

6aa3c7c

more tests to dedicated file

2a1f617

have only one fused kernel that does softmax+penalty+draw

47d0bf4

add more tests and fix randm seeding

dac6319

Merge remote-tracking branch 'origin/main' into geometric_ai/sampling_kernels

e356664

geometric Bot marked this pull request as ready for review July 1, 2026 16:48

coalesce reads + writes.

ac2d441

davide221 mentioned this pull request Jul 2, 2026

perf(laguna): DSpark Markov head + spec-verify stack - 206 to 249 tok/s (RTX 3090) #482

Merged

cheese-cakee mentioned this pull request Jul 5, 2026

perf(hip): compile DFlash GPU draft top-K kernel for HIP #488

Draft

geometric[bot] wants to merge 17 commits into Luce-Org:main from GeometricAGI:geometric_ai/sampling_kernels
