Skip to content

Phase7 wrap#14

Merged
dmriding merged 56 commits into
mainfrom
phase7-wrap
Apr 17, 2026
Merged

Phase7 wrap#14
dmriding merged 56 commits into
mainfrom
phase7-wrap

Conversation

@dmriding

Copy link
Copy Markdown
Owner

No description provided.

dmriding added 30 commits April 15, 2026 03:03
Bridge sprint between quant milestones. Adds the warp-level reduction
primitives that close the single biggest DSL expressiveness gap after
matmul (softmax / layer-norm / RMS-norm / loss functions are now
writable in pure DSL without manual shfl.sync ladders), plus the
missing min variants at both warp and block level.

Ships on the phase7-rest branch with no independent crates.io release;
Phase 7 closes with an aggregate release after 7.2, 7.3, 7.4.

New DSL builtins (f32-only this sprint; integer and half-precision
variants are a clean follow-up):
 - warp_reduce_sum/max/min: per-warp reduction via butterfly shuffle
   tree, every lane in the warp gets the full result in 5 rounds.
 - block_reduce_min: completes the sum/max/min trio at block level.

Whole-warp-multiple compile-time guard: kernels with block sizes
whose total thread count is not a multiple of 32 (1D: 16; 2D: (4,4)
or (8,6)) fail to compile with a span pointing at the warp_reduce_*
call. Catches the partial-warp shfl.sync UB statically. Covered by
4 trybuild compile-fail fixtures.

Rustdoc documents the per-warp semantic clearly, the convergent-
control-flow requirement (data-dependent branches are UB the macro
cannot detect), and the identity-value padding pattern for ragged
boundaries.

Helper emit_warp_tree_reduce factored from the existing block_reduce
path, parameterized by shuffle variant (down for block_reduce lane-0-
only, bfly for warp_reduce all-lanes), combine op, and PTX type. Leaves
a clean seam for future i32/f16/bf16 reductions.

Pre-refactor TokenStream snapshot canary locks the D2 refactor to
byte-identical PTX; snapshot canary remains green.

Tests: 12 new #[ignore]d GPU tests in reduce_macro.rs pass bit-exact
on sm_89 including a 64-thread two-warp independence canary (warp 0
and warp 1 fed different patterns; asserts each warp gets its own
reduction and the lowering does not blend across warps). 4 new trybuild
compile-fail fixtures for the guard and the f32-type boundary.

Host tests 330 pass 199 ignored (up from 187 baseline = 12 new GPU
tests). No regression on showcases or matmul benches.
…80 regs (Rollback #1 next)

Full INT8 W8A16 tri-output kernel: 13 params, 2 shared decls (tile_x 2048B + tile_w 512B),
K-loop with 7-barrier Design-S cadence, 12 mma.sync m16n8k16.f16.f16.f32 per K-tile across
3 fragment-C grids, store-out epilogue with 24 packed b32 stores + 48 cvt.rn.f16.f32 + 48
mul.f32 (scalar scale fold per projection).

ptxas (sm_80 + sm_89): 0 spills, valid PTX, but reports 80 registers per thread. The
extra 16-40 regs over the D2.5 skeleton baseline (32-40) come from cooperative-loader
intermediate state, fragment-A reload regs per mma, and per-K-tile address-math
intermediates that the skeleton did not model.

80 > the 64-reg full-occupancy cliff. Plan Rollback #1 trigger fires - drop
MMAS_PER_WARP_N from 2 to 1, halving the per-block output tile to 64x16. That rollback
lands in the next commit.
…_80/89, 0 spills)

ptxas_verify gate now PASSED both targets:
  sm_80: 48 registers, 0 spills (16-reg headroom under the 64-reg cliff)
  sm_89: 48 registers, 0 spills (16-reg headroom under the 64-reg cliff)

Rollback #1 was authorized in the Sprint 7.3 plan as the explicit recovery for
"register pressure overshoot at D2.5 OR D3 full kernel". D3.4 hit 80 regs/0 spills
both targets at the original 64x32 output tile; halving MMAS_PER_WARP_N reclaims
24 frag_c regs/lane (3 grids x 4 frag_d regs x 2 fewer sub-tiles per warp) and
drops the post-rollback measurement to 48 regs/lane.

Per-block output tile: 64x32 -> 64x16. More blocks launch, each block lighter.
X-reuse economics still favor the fused tri-output design over 3x standalone
matmul_int8; the absolute-vs-relative bench tier in D8 will quantify the trade.

Implementation:
  - BN_BLOCK 32 -> 16, WARP_QUAD_N 16 -> 8, MMAS_PER_WARP_N 2 -> 1
  - TILE_W_BYTES kept at 512 (col-stride padded to 32) so cooperative pre-zero
    keeps its single-issue-per-thread shape; W loader gates writes on
    col_start < BN_BLOCK so only the first 16 cols receive real data
  - Test counts updated to derive from MMAS_PER_WARP_* rather than hard-code
    the 64x32 numbers (future-proof for further rollback iterations)
…7.3 MVS ship point)

Wires `kaio_ops::qkv_project_int8` (W8A16: f16 activations × i8 weights, scalar
per-projection scales) as the Sprint 7.3 minimum-viable-sprint deliverable.
13-param signature: device, x_f16, w_q/k/v_i8, scale_q/k/v, q/k/v_out_f16, m/n/k.
LaunchConfig grid (n.div_ceil(BN_BLOCK), m.div_ceil(BM_BLOCK), 1), block (32,4,1).

Rustdoc spells out W8A16-vs-W8A8 explicitly in the first paragraph (Codex round 2
discipline) so callers don't conflate this with matmul_int8's W8A8 contract;
points users needing W8A8 at three matmul_int8 calls + their own activation quant.

GPU launch smoke test added: zero inputs → zero outputs canary on canonical shape
(M=64, N=16, K=16). Verifies module loads + kernel launches + writes outputs
without driver error. Correctness vs reference is the D7 e2e suite's job.

Quality gates GREEN at the MVS ship point:
  - cargo fmt clean
  - cargo clippy -D warnings clean
  - 73 host tests pass (up from 72 with the launch smoke)
  - cargo doc clean (no broken intra-doc links)
  - ptxas_verify PASSED sm_80 + sm_89 (48 regs, 0 spills)

Continuing to D5 (INT4 contingent) - D2.5 + D3.4 stayed clean post-rollback.
…, 0 spills)

W4A16 tri-output INT4 module: f16 activations × 3 packed-INT4 weight tensors ×
3 group-scale tensors → 3 f16 outputs. Single kernel launch with Design-S
serial fusion across Q/K/V and group-scale reload cadence (group_size=128,
8 K-tiles per group).

Reuses matmul_int4 helpers via super::matmul_int4_kernel:: path
(emit_mw_load_tile_a_f16_64x16, emit_unpack_s4_x2_scale_to_f16_pair,
emit_fragment_b_int4_per_lane). Made the latter two pub(crate) for
cross-kernel sharing.

New helpers in qkv_project_int4_kernel.rs:
  - emit_mw_load_tile_w_packed_int4_2x16 (16-col packed-INT4 cooperative load)
  - emit_cooperative_load_group_scales_int4 (16-cell f16 scale load, 8 active threads)
  - emit_pre_zero_shared_tiles_qkv_int4 (3-tile pre-zero with predicated scales path)
  - emit_warp_quadrant_mma_int4_per_projection (2x1 mma sweep with INT4 dequant feed)
  - build_qkv_project_int4_module (full IR module - 13 params, 3 shared decls,
    K-loop with 7-barrier cadence + group-boundary scale reload)

Same Rollback #1 as INT8 path (BN_BLOCK=16, MMAS_PER_WARP_N=1) - applied
preemptively since INT4 has identical fragment-C live state plus extra
register pressure from group-scale + nibble-extract chains. 56 regs/thread
vs INT8's 48 (the 8-reg gap is the dequant + scale-reload state).

Quality gates GREEN:
  - cargo fmt clean
  - cargo clippy -D warnings clean
  - 83 host tests pass (up from 73 - added 11 D5 emit/module/canary tests)
  - ptxas_verify PASSED sm_80 + sm_89 (56 regs, 0 spills, 8-reg headroom under
    the 64-reg cliff)
  - disjoint-register canary (24 distinct mma destination regs across 3 grids)

D6 next: public qkv_project_int4 host API (contingent ship).
…7.3 contingent ship)

Wires `kaio_ops::qkv_project_int4` (W4A16: f16 activations × packed-INT4 weights
× f16 group scales) as Sprint 7.3's contingent second deliverable. 14-param
signature: device, x_f16, w_q/k/v_packed_u32, scales_q/k/v_f16, q/k/v_out_f16,
m/n/k, group_size. LaunchConfig grid (n.div_ceil(BN_BLOCK), m.div_ceil(BM_BLOCK), 1).

Rustdoc spells out the packing convention (8 nibbles per u32, K-contiguous,
col-major [K/8, N]) and group-scale layout ([K/group_size, N] row-major,
group_size=128 fixed). Cross-links to matmul_int4 since both share the convention.
Explicit "not a drop-in for AutoGPTQ/exllama/GGUF" disclaimer.

GPU launch smoke test: zero inputs → zero outputs canary on canonical shape
(M=64, N=16, K=128 = exactly one group). Verifies module loads + kernel launches.

Quality gates GREEN:
  - cargo fmt + clippy -D warnings clean
  - 84 host tests pass (+1 D6 smoke)
  - cargo doc clean
  - ptxas_verify PASSED sm_80 + sm_89 (56 regs, 0 spills) - already verified at D5

Both INT8 (D4 MVS) and INT4 (D6 contingent) public APIs now shipped on
phase7-rest. D7 (GPU e2e + Q/K/V differentiation canary) next.
…pass on sm_89)

Two new test files with CPU-reference round-trip correctness + canary tests:

kaio-ops/tests/qkv_project_int8_e2e.rs (7 tests):
  - 6 correctness tests across canonical shapes (smallest / one-block / multi-N /
    multi-M / larger-K / medium 256x128x256)
  - 1 Q/K/V differentiation canary (W_Q=+1, W_K=+2, W_V=+3, X=1, scale=1/K
    → outputs 1, 2, 3 per cell; catches fragment-C grid aliasing)

kaio-ops/tests/qkv_project_int4_e2e.rs (8 tests):
  - 3 group-boundary shapes (1 group / 2 groups / 8 groups) exercising the
    group_idx = k_tile / K_TILE_GROUP_RATIO cadence
  - 3 multi-block shapes (multi-N / multi-M / larger)
  - 1 sign-extend canary (all weights = -8, output should be -K*8 per cell;
    catches shr.s32 → shr.u32 collapse)
  - 1 Q/K/V differentiation canary

All 7 + 8 = 15 e2e tests PASSED on RTX 4090 sm_89. Tolerance gate
max_rel_err < 1e-3 (loose enough for f16-rounding-step difference between
kernel and reference; observed max_rel stays under 3e-4 in practice).

TWO BUGS CAUGHT + FIXED in D7:

1. INT8 W loader double-counted block_col
   The emit_mw_load_tile_w_int8_16x32 loader added block_col to its global
   offset computation, but w_block_base_global already carried the
   block_col byte shift from build_qkv_project_int8_module. Result: every
   non-zero N-block computed from WRONG global W addresses, corrupting all
   multi-N-block shapes. Single-N-block tests passed because block_col=0
   makes the double-count a no-op. Fix: use col_start (within-tile offset)
   for the global addr computation instead of col_global.

2. INT4 shared scales slot raced across projections
   build_qkv_project_int4_module originally reloaded scales only on group
   boundaries (every 8 K-tiles) but the single tile_scales slot was reused
   across Q/K/V per K-tile. After mma_V on K-tile 0, the slot held
   scales_V; on K-tile 1 (non-boundary) all three projections' mmas read
   scales_V instead of their own scales. The sign-extend + Q/K/V canaries
   passed because they use constant scales, but random-data correctness
   tests failed. Fix: reload scales every K-tile, unconditionally (32 B per
   projection per K-tile = 96 B/K-tile additional global bandwidth,
   negligible vs 384 B/K-tile for the W loads). Group-boundary-only
   optimization deferred to future bench-driven work with three separate
   scales slots.

Quality gates GREEN:
  - cargo fmt + clippy -D warnings clean (factored QkvOutTuple alias for
    e2e round_trip return types)
  - 84 host tests pass
  - 15 GPU e2e tests PASS on sm_89
  - ptxas_verify INT8: 48 regs, INT4: 56 regs, 0 spills both variants
  - Both kernels validated across sm_80 + sm_89 (W-loader fix did not
    change register count)
…iers)

Sprint 7.3 D8. Compares fused qkv_project_int4 vs 3x sequential matmul_int4
(clean W4A16 apples-to-apples baseline), and reports absolute TOPS for
qkv_project_int8 (W8A16 has no fair standalone baseline - matmul_int8 is W8A8).

Shape sweep: decode tier (M ∈ {1, 64}, K/N ∈ {2048, 4096}) + prefill tier
(M ∈ {512, 2048}, K/N = 4096). Plus 4 diagnostic shapes around the initial
decode_m64 K/N=2048 anomaly to confirm it was measurement noise, not a kernel
bug.

Stable results on RTX 4090 sm_89 (release mode):
  Decode tier (M ≤ 128):  fused ~3.0x baseline, consistent across shapes
  prefill_m512 (M=512):    1.19x (above ship threshold 1.15x)
  prefill_m2048 (M=2048):  0.85x (below 1.00x - fused loses 15% at largest prefill)

Ship-B call (maintainer 2026-04-16): ship 7.3 with honest framing
("fused wins big at decode, ties/loses at prefill - use 3x standalone for
prefill-heavy workloads"), plan Design S+1/2P optimization as Sprint 7.3.5
with full review cadence. Target for 7.3.5: recover prefill_m2048 from 0.85x
to ≥1.15x by overlapping W_{P+1} load with mma_P (barriers 7 → 4 per K-tile).

Outlier investigation (decode_m64 K/N=2048 reported 0.40x in first run):
NOT reproducible on re-run. Diagnostic sweep across 4 nearby shapes shows
consistent 3.0x throughout - confirmed transient measurement noise in the
first run (likely first-call CUDA context warmup bleeding through despite
5-iter warmup loop). No kernel bug.

Absolute TOPS numbers intentionally NOT highlighted - repeated back-to-back
bench runs cause thermal throttling on the 4090 that crushes absolute
throughput by 50-100x for all workloads including cuBLAS sgemm reference.
Ratios are throttle-invariant (fused + baseline throttle equally) so ratio
numbers remain the valid headline. Clean absolute-TOPS run deferred to D10
doc with cold-system methodology.
dmriding added 26 commits April 16, 2026 02:21
…ttention_tc)

End-to-end Phase 7 pipeline demo. GPTQ-lite quantizes random f32 weights into
KAIO's packed-INT4 + group-scale layout, runs fused qkv_project_int4 to produce
three f16 outputs, feeds them to attention_tc, and compares against an f16
reference path (3x matmul_tc → attention_tc).

Reports three quality metrics per the Sprint 7.3 plan D9:
  - cosine similarity (primary pass/fail, plan threshold 0.98 INT4)
  - max absolute error (worst-case row outlier)
  - mean relative error (aggregate quality)

Plus a projection-stage cosine for localization - distinguishes "kernel bug"
from "softmax amplified a small projection error" when the final number is
below threshold.

Showcase shape: SEQ=64, D_MODEL=128, D_HEAD=64, GROUP_SIZE=128 (decode-tier
shape where fusion wins cleanly per D8 bench).

Observed on RTX 4090 sm_89 with random f16 weights:
  - Projection Q cos_sim = 0.9975 (kernel working correctly)
  - Final attention cos_sim = 0.9655 (below 0.98 plan threshold)

README + in-example printout frame this honestly per the plan:
  "Random f16 weights are a worst case for group-scale quantization fidelity.
   Real trained LLM weights have much tighter group statistics and land
   measurably better than these synthetic numbers."

The kernel's job is correctly performing the quantized projection; softmax
amplifies any projection-stage error non-linearly in the final output. This
is expected behavior for random weights and will improve substantially on
real model weights where per-group magnitude variance is much lower.

Wired into `cargo xtask showcase qkvattn`.
…7.3.5 plan stub

Completes Sprint 7.3 documentation pass:

- docs/development/sprints/phase7/sprint_7_3.md - full decision log (D1-D10),
  bench table (all 9 shapes, no cherry-picking), architectural decisions
  (AD1-AD5), bug-catch trail (cvt.rn.f16.s8, INT8 W-loader double-count,
  INT4 scales-slot race), and the maintainer Ship-B decision.
- docs/development/sprints/phase7/PHASE_7_LOG.md - new index for Phase 7
  (Phase 6 has one, Phase 7 didn't); populated with 7.0 through 7.3 + the
  7.3.5 / 7.4 planning rows.
- docs/development/sprints/phase7/sprint_7_3_5.md - stub plan for Design
  S+1/2P (two W slots ping-pong, barriers 7→4). Success criterion pinned:
  prefill_m2048 from 0.85x to ≥1.15x on sm_89. Full reviewer cadence
  required (synchronization semantics change in inner loop); not a hotfix.
- CHANGELOG.md [Unreleased] block - Sprint 7.3 entry covering both public
  APIs, the shared store-out helper, the kaio-core cvt.rn fix, the
  showcase example, and the honest decode-wins-prefill-loses framing.
- README.md ops table - two new rows for qkv_project_int{4,8}.

Quality gates at D10 sprint-close:
  - 84 kaio-ops host tests pass
  - 15 GPU e2e tests pass on sm_89 (7 INT8 + 8 INT4)
  - ptxas_verify PASSED sm_80 + sm_89 for both kernels + skeleton + store_out
  - cargo doc --no-deps clean
  - cargo fmt + clippy -D warnings clean across workspace
…s / sm_89 40 regs / 0 spills)

Port qkv_skeleton.rs from 7.3 Design-S to 7.3.5 S+1/2P peak-pressure model:
- 2 tile_w shared slots with 64B bank-phase padding (stride 576B,
  non-multiple of 128 per Design invariant #4, Gemini R3-1)
- 3 unrolled K-iterations with frag_C back-edge state threaded
  across iterations (Opus R1b-2 / R3b-3 cross-iteration modelling
  requirement - forces ptxas to count frag_C at peak, not elide)
- Runtime ctaid.x-indexed slot pointers for ping-pong (ptxas can't
  constant-fold slot selection - Design invariant #3)
- Scales register-hoist modelled via runtime k_tile_group counter
  (6 regs total, Pre-review clarification #2)
- In-flight cooperative-load address registers kept live across
  the mma epoch to model the overlap window

D1 gate PASSED on both architectures with comfortable headroom:

sm_80: 32 registers, 0 spills, 3136 B smem, 404 B cmem[0]
sm_89: 40 registers, 0 spills, 3136 B smem, 404 B cmem[0]

cmem[0] = 404 B (kernel params in constant bank 0) confirms the R3b-2 case (a) upfront: kernel args arrive uniform-broadcast-
friendly by default, no GPR pressure from base pointers, escape
hatch has no handle to grab. SASS inspection for the escape hatch
is not needed at the skeleton stage.

Rollback #2 not triggered. Proceeding to D2 (INT8 S+1/2P body).

Quality gates: cargo fmt, cargo clippy -D warnings, cargo test
-p kaio-ops --lib (84 passed, 7 ignored). Shared budget 3136 B
matches plan (2048 tile_x + 512 slot0 + 64 pad + 512 slot1).
…9 56 regs / 0 spills)

Port qkv_project_int8_kernel build function from 7.3 Design-S to
7.3.5 S+1/2P. Design-S helper emit_warp_quadrant_mma_int8_per_projection
stays untouched (preserves 7.3 tests + callers); two new siblings
added for the S+1/2P path:

  emit_warp_quadrant_mma_int8_per_projection_hoisted
    Per-projection mma sweep that reads frag_A from caller-provided
    pre-hoisted registers instead of reloading from tile_x. Required
    so the three projection mmas share a single K-tile-start frag_A
    hoist, making Design invariant #1 (tile_x overwrite-after-hoist)
    expressible.

  hoist_frag_as_for_warp_quadrant
    Loads both m_stripe frag_As from tile_x at K-tile start. The
    caller is contractually required to fire B1 bar.sync AFTER this
    call and BEFORE any tile_x write.

  emit_pre_zero_shared_tiles_qkv_int8_sp_half_p
    3-slot pre-zero (tile_x + tile_w_slot0 + tile_w_slot1; the 64 B
    pad between slots is write-only padding, never pre-zeroed).

Shared layout: tile_x (2048 B) + tile_w_slot0 (512 B) + tile_w_pad
(64 B, Design invariant #4 bank-phase) + tile_w_slot1 (512 B) =
3136 B. Stride slot0 -> slot1 = 576 B, non-multiple of 128 (no
cross-warp SMEM bank-port contention during overlap).

K-loop body 4-barrier cadence (plan worked-example table):

  B1 (post-hoist)       INVARIANT #1 — tile_x safe to overwrite
  Epoch 1: load W_K(k)     -> NEXT slot; mma_Q uses CUR slot
  B2 (post-mma_Q)       CUR slot safe to overwrite
  Epoch 2: load W_V(k)     -> CUR slot; mma_K uses NEXT slot
  B3 (post-mma_K)
  Epoch 3: [overlap loads  -> W_Q(k+1) + X(k+1), skipped on final
                              K-tile via uniform @!p_not_last branch];
           mma_V uses CUR slot
  B4 (post-mma_V)       closes K-tile epoch

Runtime ping-pong slot selection via and.b32 + setp.eq + selp.u32
(slot_idx = k_tile & 1; p_slot0 = slot_idx == 0; cur/next chosen
via selp against r_tile_w_slot{0,1}). Overlap-skip predicate
p_not_last = k_tile + 1 < num_k_tiles; uniform branch, no
divergence.

Total barrier count for 1-K-tile module = 2 setup (pre-zero + B0)
+ 4 per K-tile = 6. 7.3 Design S was 1 + 7 = 8. Per K-tile savings
3 barriers; scales with K.

ptxas --verbose results (S+1/2P, archived R2-5 artifacts):

  sm_80: Used 64 registers, 0 bytes spill stores, 0 bytes spill loads
  sm_89: Used 56 registers, 0 bytes spill stores, 0 bytes spill loads

vs 7.3 Design-S INT8 post-Rollback-#1 baseline (48 regs on both
architectures): sm_89 +8 regs (hoisted frag_A 2-stripes + slot
selection + overlap-load addr math), sm_80 +16 regs. sm_80 at the
64-reg ceiling exactly; sm_89 has 8 reg headroom. Both architectures
clear the D2 gate. Escape hatch (Gemini R3-4) not needed — ptxas
already placed base pointers in cmem[0] per D1 skeleton evidence.

Test updates:
- build_qkv_project_int8_module_structure now asserts 4 shared decls
  (tile_x + slot0 + pad + slot1), new K-loop label
  K_LOOP_QKV_INT8_SP_HALF_P, and OVERLAP_SKIP_QKV_INT8_SP_HALF_P label.
- build_qkv_project_int8_module_emits_seven_barriers_per_k_tile
  renamed to _emits_four_barriers_per_k_tile_sp_half_p, expects 6
  total bar.sync (was 8).

Quality gates green: cargo fmt, cargo clippy -D warnings, cargo test
-p kaio-ops --lib (84 passed, 7 ignored). D3 (GPU e2e regression +
slot-mapping canary + determinism stress) comes next.
…s, 100x bit-exact)

Add two test classes for the S+1/2P kernel beyond the 7.3 regression
suite:

- Slot-mapping canary (2 shapes): deliberately-distinguishable
  weights per projection (W_Q=1, W_K=2, W_V=3, X=all-1, scale=1)
  so expected output is exactly {K, 2K, 3K} per element. Catches
  deterministic ping-pong mis-wiring that random-data tests can
  pass by coincidence. Two K-tile variants:
  - K=32 (2 K-tiles) exercises the steady -> final transition.
  - K=48 (3 K-tiles) exercises steady -> steady -> final, covering
    the back-edge transition that the 2-K-tile version doesn't hit.

- Determinism stress (2 shapes): runs the same input 100x back-to-
  back in release build and asserts bit-exact output across all
  runs (from run 1 - no warmup discards). Catches barrier-
  misplacement races that manifest as cross-launch nondeterminism
  under varying warp scheduling.
  - Short: M=256, N=512, K=1024 (64 K-tiles per launch).
  - Prefill: M=2048, N=512, K=4096 (256 K-tiles, the regime that
    motivated the S+1/2P rework).

Results on RTX 4090 sm_89:
- 7 existing e2e regression tests: pass (max_rel <= 4.88e-4, most
  bit-exact).
- Q/K/V differentiation canary: pass.
- Slot-mapping canary K=32: Q=32 K=64 V=96 exact.
- Slot-mapping canary K=48: Q=48 K=96 V=144 exact.
- Determinism short 100x: bit-exact.
- Determinism prefill 100x: bit-exact.

11/11 INT8 GPU e2e tests pass. No kernel code changed in D3 - only
tests added.

Diagnostic in ptxas_verify: also report allocation at
-maxrregcount 128 to distinguish natural register allocation from
a ptxas compression-under-cap. Confirmed sm_80 64 regs is a genuine
register-reuse compression below the 66-reg natural allocation
(zero spills at default), not a hidden-spill pathology. sm_89 at
56 regs at default vs 66 natural - same pattern, larger headroom.

Quality gates: cargo fmt, cargo clippy -D warnings, cargo test -p
kaio-ops --lib (84 passed, 7 ignored).
…G.yml

Sprint 7.3.5 closes with the asymmetric outcome the plan's 5-tier
table pre-declared: INT8 ships Design S+1/2P (correctness clean,
ptxas clean, abs TOPS baseline captured); INT4 S+1/2P was measured
at median prefill_m2048 = 1.05x across 3 runs, improved from Design
S's 0.85x but below the 1.15x plan-locked ship gate and the 1.10x
ship-narrow floor - landed in "Measured, not shipped". INT4 retained
at Design S on main; INT4 S+1/2P port archived on phase7-rest for
potential future cp.async contingency work.

Docs added / updated:
- docs/development/sprints/phase7/sprint_7_3_5.md (new) - terse
  post-delivery outline with full measured-bench table, 5
  architectural decisions, follow-up next steps.
- docs/development/sprints/phase7/phase7_master_plan.md - status
  header, section 4 (Quant+Attention integration) rewritten for
  7.3 / 7.3.5 outcome, Sprint Breakdown table rows added, dependency
  graph updated, Success Criterion 4 marked complete.
- docs/development/sprints/phase7/PHASE_7_LOG.md - 7.3.5 status row
  + ops-shipped table notes.
- CHANGELOG.md - Sprint 7.3.5 section under [Unreleased].
- README.md - INT8 row updated for S+1/2P; INT4 row explicitly notes
  Design S retained and reason.
- kaio-ops/src/qkv_project_int8_kernel.rs - module-level rustdoc
  refreshed (Tri-output design and Register budget sections) to
  reflect post-Rollback-#1 + Sprint 7.3.5 S+1/2P state.

Editorial polish pass across phase 1 / 4 / 6 / 7 sprint docs and
across kaio-ops / kaio test-file comments - tightening voice for
outward-facing readers. Technical rationale and measured results
unchanged.

.github/FUNDING.yml added for the Sponsor button - lands on main
via the phase7-ship merge.

Quality gates (cargo fmt --check, clippy --workspace --all-targets
-D warnings, doc --workspace --no-deps, test --workspace) all green.
ptxas_verify green on sm_80 + sm_89 for INT8 S+1/2P (64 / 56 regs,
0 spills) and INT4 Design S (56 / 56 regs, 0 spills).
5-shape abs-TOPS comparison, same session on RTX 4090 sm_89, kernel
file swapped between 2dbad72 (Design-S) and HEAD (S+1/2P) via
checkout-revert, identical thermal state both sides.

| shape            | Design-S | S+1/2P | speedup |
|------------------|----------|--------|---------|
| decode_m1        | 3.2      | 3.3    | 1.04x   |
| decode_m64       | 3.2      | 3.3    | 1.04x   |
| decode_m64_large | 9.8      | 10.2   | 1.04x   |
| prefill_m512     | 31.4     | 35.6   | 1.13x   |
| prefill_m2048    | 38.8     | 44.6   | 1.15x   |

S+1/2P wins across all 5 shapes. Gain scales with K-tile count per
launch: decode (few K-tiles, barriers not dominant) gains ~4%;
prefill_m2048 (256 K-tiles, barriers dominant in Design-S) gains
15% - exactly the threshold the sprint plan set for INT4, which
INT4 narrowly missed at 1.05x vs 3x standalone matmul_int4.

Different outcome between the variants at the same shape localises
the INT4 gap: the INT8 path recovers cleanly under barrier
reduction; the INT4 path (per-column f16 scale hoist + nibble-
extract dequant) exposes memory-issue latency as the new binding
constraint once barriers no longer dominate. cp.async targets that
class of stall and is the right candidate for a follow-up
contingency sprint - not as an alternative to S+1/2P but as the
next bottleneck after S+1/2P succeeds.

- sprint_7_3_5.md Performance section: INT8 table replaced with
  the like-for-like comparison + interpretation paragraph tying
  the INT8/INT4 asymmetry to the cp.async follow-up.
- PHASE_7_LOG.md 7.3.5 headline: added the 1.15x like-for-like
  result alongside the INT4 Measured-not-shipped status.
Moves cudarc's dynamic-loading / dynamic-linking choice out of the
workspace-level features list and into per-crate feature flags on
kaio-runtime, kaio-ops, and the kaio umbrella. Default stays on
dynamic-loading — standalone KAIO users see no behavior change.

Prep for the kaio-candle bridge: candle-core activates
cudarc/dynamic-linking, and cudarc's build.rs panics when both
loading strategies are simultaneously active in the feature-unified
dep graph. With this refactor, kaio-candle consumes the KAIO crates
via default-features = false, features = ["dynamic-linking"] — both
sides converge on dynamic-linking in the kaio-candle build graph,
no collision.

kaio-ops dev-dep on cudarc consolidated onto the workspace pin with
features = ["cublas"] — loading strategy now driven uniformly by
the crate feature block rather than duplicated in the dev-dep line.
Follow-on fix to the cudarc loading-strategy feature-gate refactor
(b0213c2). Without `default-features = false` on kaio's kaio-runtime
dep, kaio-runtime's default (dynamic-loading) stays active when
downstream consumers opt into dynamic-linking, defeating the opt-in
and triggering cudarc's dual-loading-strategy panic in the unified
dep graph.

Verified: candle-probe scratch crate pulls kaio + kaio-ops with
default-features = false, features = ["dynamic-linking"] alongside
candle-core's cuda feature. cargo check succeeds, cudarc builds
once with only dynamic-linking active. G3-1 resolved.
Two parts:

1. kaio-runtime: additive bridge-facing API on GpuBuffer<T>.
   - #[repr(transparent)] attr with load-bearing docstring; compile-time
     size+align asserts for f32/f16/i8/u32 via static_assertions dev-dep.
   - pub fn from_cuda_slice + pub fn into_cuda_slice replace the old
     pub(crate) from_raw (device.rs callers updated).
   - Layout-invariant in place so the kaio-candle bridge can transmute
     &CudaSlice<T> to &GpuBuffer<T> for pass-through to kaio-ops without
     round-tripping through an owned clone.

2. kaio-candle: new standalone crate at kaio-candle/, excluded from the
   main workspace (cudarc dynamic-loading vs dynamic-linking mutual
   exclusion — see Sprint 7.4a plan Pre-flight P2 / G3-1).
   - Cargo.toml: package metadata, optional kaio + kaio-ops + candle-core
     deps gated behind `cuda` feature.
   - src/lib.rs: crate docs, feature-gated module declarations.
   - src/{bridge,matmul_tc,matmul_tc_async,matmul_int4,attention_tc}.rs:
     stub modules; content lands D2-D5.

Gates (all green):
- Main workspace: fmt + clippy -D warnings + tests (static_assertions
  fire at compile time in kaio-runtime lib tests).
- Standalone kaio-candle: fmt + clippy --features cuda + clippy
  --no-default-features + cargo check on both feature paths + doc on
  both feature paths. Ship gate #10 (no-cuda build succeeds on
  no-toolkit host) satisfied — no compile_error! on the default path;
  bridge surface silently absent when `cuda` feature off.

.gitignore: kaio-candle/target/ + kaio-candle/Cargo.lock (standalone
library crate convention).
Adds the crate-private plumbing that D3-D5 per-op CustomOp impls
will compose:

- slice_ref_from_storage / storage_from_slice: cudarc CudaSlice<T>
  round-trip across candle's CudaDType trait (as_cuda_slice /
  wrap_cuda_slice).
- buffer_ref_from_slice_readonly: #[repr(transparent)] cast from
  &CudaSlice<T> to &GpuBuffer<T> with full Codex R1 aliasing contract
  + Gemini G3-2 lifetime invariant docstring.
- ensure_ordinal_match: AD1 enforcement via candle DeviceLocation
  pattern match + KaioDevice::ordinal() (new public getter; P4
  preflight showed ordinal equality + primary-context sharing is
  the right mechanism, not Arc identity).
- sync_before_launch / sync_after_launch: AD9 stream-safety fences
  via BackendDevice::synchronize. CUDA Graph limitation noted per
  Gemini G3-3.
- ensure_rank2_contiguous_zero_offset: AD4 input gate rejecting
  non-rank-2, non-contiguous, and non-zero-offset layouts with
  concrete reshape hints per Opus #6 + #7 and Codex R3.
- kaio_err: KaioError -> candle_core::Error helper (orphan rule
  prevents impl From).

cudarc promoted from dev-dep to regular optional dep (gated on the
cuda feature) so the bridge can name CudaSlice<T> directly.
KaioDevice gets a pub fn ordinal() getter (additive, non-breaking).

#![allow(dead_code)] on bridge.rs while per-op modules catch up.
Gates: cargo check + clippy -D warnings green on both feature paths.
MatmulTcOp (CustomOp2) + matmul_tc() wrapper fn and MatmulTcAsyncOp
(CustomOp2) + matmul_tc_async() wrapper fn, both following the
AD-spec'd flow:
  1. AD4 shape gate (rank-2 + contiguous + zero-offset) per input.
  2. K-mismatch check across the two inputs.
  3. AD1 ordinal equality via ensure_ordinal_match.
  4. Dtype gate via bridge::slice_ref_from_storage::<f16>.
  5. AD2-audited read-only shared borrow into candle-owned buffers.
  6. Bridge-owned f32 output allocation via kaio_device.alloc_zeros.
  7. AD9 pre-launch sync fence.
  8. kaio-ops kernel call.
  9. AD9 post-launch sync fence.
 10. Output consumed back to CudaStorage via storage_from_slice.

cpu_fwd on both ops returns a loud error; Q-CPU-Fallback in the plan
resolves to "reject over silent candle.matmul fallback."

Tests:
- 4 host tests inline in bridge.rs for shape-gate happy path +
  rank-1/3/4 rejection with concrete reshape hints.
- 8 GPU integration tests in tests/candle_gpu_roundtrip.rs:
  - 3 matmul_tc bit-exact cross-checks (64x64x64, 256x256x256,
    1024x1024x1024) between bridge and raw kaio-ops calls on
    identical host-side bits.
  - 3 matmul_tc_async bit-exact cross-checks, same shapes.
  - 2 rejection-path tests exercising the contiguity (.t()) and
    non-zero-offset (.narrow(..)) paths from Codex R3 / Opus #7.
  All #[ignore]-gated per kaio-ops convention; run with
  `cargo test --features cuda -- --ignored` on a GPU host.

Gates:
- Main workspace: fmt + clippy -D warnings + lib tests green.
- Standalone kaio-candle on cuda feature: fmt + clippy + cargo
  check + doc + 4 host tests + 8 GPU ignored tests ALL PASS BIT-
  EXACT on RTX 4090 sm_89.
- Standalone kaio-candle on no-default-features: fmt + clippy +
  cargo check + doc (empty-shell build).

New dev-dep: anyhow = "1" for terse test error handling across
candle + cudarc + kaio error types.
MatmulInt4Op (CustomOp3) + matmul_int4() wrapper for GPTQ-style INT4
dequantize-matmul (W4A16). group_size locked at 128 per kaio-ops
contract; K must be multiple of 128.

Shape contract validated against kaio_ops::matmul_int4:
  a:        f16 [M, K]
  b_packed: u32 [K/8, N]   (8 INT4 per u32)
  scales:   f16 [K/128, N]
  out:      f32 [M, N]

Bridge enforces the shape implications explicitly:
  - packed_rows == K/8, scale_rows == K/GROUP_SIZE, n_b == n_s.
  - Each with a concrete error message that tells the user exactly
    which derived dimension is off.

AD2-Audit D4: kaio-ops kernel is read-only on A / B_packed / scales
(validate_dims_int4 + dequant-fused inner loop reads through shared
memory stagings; inputs never mutated). Same readonly-transmute
contract as D3.

Tests:
- 2 bit-exact cross-check tests (256x256x128, 1024x1024x512).
- 1 rejection test for K-not-multiple-of-128.
- All #[ignore]-gated; run via cargo test --features cuda --
  --ignored.

Full 11 GPU integration tests pass on RTX 4090 sm_89:
  8 matmul_tc / matmul_tc_async (D3) + 3 matmul_int4 (D4).
Host tests unchanged at 4 (bridge shape gate).

Also swapped `k % 128 == 0` to `.is_multiple_of(128)` per
clippy::manual_is_multiple_of lint — stabilized helper.
…dings

Single AttentionTcOp (CustomOp3) struct with a `causal: bool` field;
two user-facing wrappers attention_tc() and attention_tc_causal() that
flip the flag and select kernel variant. name() returns distinct
kaio::attention_tc / kaio::attention_tc_causal strings so candle's
module cache keys don't collide.

Shape contract (P5):
  q:   f16 [seq_q, d_k]
  k:   f16 [seq_k, d_k]
  v:   f16 [seq_k, d_v]
  out: f32 [seq_q, d_v]
Bridge explicitly cross-checks d_k (Q vs K) and seq_k (K vs V) with
targeted error messages.

kaio-ops attention_tc has a seq_k ≤ 384 shared-memory cap (noted in
the kernel's error message; FlashAttention-TC lifts this in a later
sprint). Integration tests use 256x128 to fit cleanly.

AD2-Audit D5: attention_tc / attention_tc_causal kernels are
read-only on Q/K/V — validate_dims_attn + Q·Kᵀ/softmax/P·V all READ
from shared stagings + fragments, inputs never mutated.

Tests:
- 4 bit-exact cross-check tests:
  - attention_tc_bit_exact_64x64 / 256x128
  - attention_tc_causal_bit_exact_64x64 / 256x128
- All #[ignore]-gated; run via cargo test --features cuda --
  --ignored.

Full 15 GPU integration tests pass bit-exact on RTX 4090 sm_89:
  3 matmul_tc + 3 matmul_tc_async + 2 matmul_int4
  + 2 attention_tc + 2 attention_tc_causal
  + 2 matmul_tc rejection + 1 matmul_int4 rejection.
Host tests: 4 bridge shape-gate tests.
Main workspace gates + kaio-candle gates all green on both
feature paths.
README covers standalone-crate rationale, build, quickstart, op surface,
device lifetime, candle pin policy, and the v0.1 limitations list.
examples/matmul_tc_candle.rs and examples/attention_tc_candle.rs are
runnable with small shapes; ship-gate #14 satisfied (non-NaN output on
RTX 4090 sm_89). Cargo.toml include list extended to publish examples.
.github/workflows/candle-head.yml builds kaio-candle against
candle-core's git main once per Monday. Step-level continue-on-error
plus peter-evans/create-issue-from-file@v5 on failure keeps upstream
breakage visible without blocking other CI. README gets a workflow
badge and a candle-version-policy paragraph pointing at the job.
sprint_7_4a.md writes up the forward-only kaio-candle bridge —
context, scope, architectural decisions (standalone crate rationale,
explicit KaioDevice, repr(transparent) aliasing contract, AD4 input
gate, cuCtxSynchronize fences), results, and follow-ups (7.4b / 7.4c).
PHASE_7_LOG gets a 7.4a row plus a candle-bridge ops table.
CHANGELOG Unreleased gains a Sprint 7.4a section at the top.
Root README notes kaio-candle in the crate table, flips the
"inference only" bullet to point at the forward bridge, and expands
the Phase 7 roadmap line to mention 7.4a / 7.4b / 7.4c.
Sixth forward CustomOp binding in kaio-candle: W8A8 symmetric-quant
matmul via CustomOp2 with a scalar f32 scale threaded through the op
struct (same pattern as AttentionTcOp::causal).

Candle has no DType::I8, so the candle-side convention is DType::U8
tensors whose bytes are interpreted as signed INT8. The bridge
reinterprets &CudaSlice<u8> as &CudaSlice<i8> via a metadata-only
same-layout transmute (cudarc's CudaSlice<T> carries T only as
PhantomData, u8 and i8 are 1-byte identical).

5 new GPU integration tests - 3 bit-exact cross-checks at 256/1024/4096
cubed with scale values 0.00125 / 1.0 / 47.3 (spread across regimes so
a dropped-scale bug surfaces obviously-wrong in at least two of the
three tests), 2 rejection tests. Brings total kaio-candle GPU tests
to 20. All green on RTX 4090 sm_89.

Zero new bridge primitives - every helper reused from 7.4a. Only the
small reinterpret_u8_slice_as_i8 helper lives local to the op file.
New direct-call bridge pattern for ops that exceed candle's CustomOpN
arity (max 3 inputs, single output). Free functions extract CudaStorage
from &Tensor via storage_and_layout(), validate with named-parameter
error messages, call the fused kernel, and return (Tensor, Tensor, Tensor)
via Tensor::from_storage with BackpropOp::none().

qkv_project_int8: W8A16, 4 tensor + 3 scalar inputs, 3 f16 outputs.
qkv_project_int4: W4A16, 7 tensor inputs, 3 f16 outputs, group_size=128.

Output is DType::F16 (kernel does f32->f16 internally), unlike the
CustomOp-based ops which return f32. Gradient-tracked inputs rejected
with loud .detach() error to prevent silent autograd disconnection.

12 new GPU tests (5 bit-exact + 2 canaries + 5 rejection), 32 total.
All green on RTX 4090 sm_89.
Replaced cuCtxSynchronize fences with event-based cross-stream sync
via cudarc's CudaStream::join() (cuEventRecord + cuStreamWaitEvent
internally). GPU-side only, no CPU blocking.

sync_before_launch: KAIO stream joins candle stream (waits for
candle's prior work before kernel launch).
sync_after_launch: candle stream joins KAIO stream (waits for kernel
completion before candle uses output).

Both functions gain a &KaioDevice parameter; 7 call sites updated
across all op modules. CUDA Graph capture partially unblocked -
the cuCtxSynchronize blocker is removed, but default-stream capture
is still banned by CUDA itself. Full Graph capture requires
non-default streams on both sides (unverified).

33 GPU tests pass (32 bit-exact + 1 sync smoke test).
@dmriding dmriding merged commit c5ed152 into main Apr 17, 2026
4 checks passed
@dmriding dmriding mentioned this pull request Apr 17, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant