Phase7 wrap#14
Merged
Merged
Conversation
…y byte-identical)
… builtins + whole-warp-multiple guard
Bridge sprint between quant milestones. Adds the warp-level reduction primitives that close the single biggest DSL expressiveness gap after matmul (softmax / layer-norm / RMS-norm / loss functions are now writable in pure DSL without manual shfl.sync ladders), plus the missing min variants at both warp and block level. Ships on the phase7-rest branch with no independent crates.io release; Phase 7 closes with an aggregate release after 7.2, 7.3, 7.4. New DSL builtins (f32-only this sprint; integer and half-precision variants are a clean follow-up): - warp_reduce_sum/max/min: per-warp reduction via butterfly shuffle tree, every lane in the warp gets the full result in 5 rounds. - block_reduce_min: completes the sum/max/min trio at block level. Whole-warp-multiple compile-time guard: kernels with block sizes whose total thread count is not a multiple of 32 (1D: 16; 2D: (4,4) or (8,6)) fail to compile with a span pointing at the warp_reduce_* call. Catches the partial-warp shfl.sync UB statically. Covered by 4 trybuild compile-fail fixtures. Rustdoc documents the per-warp semantic clearly, the convergent- control-flow requirement (data-dependent branches are UB the macro cannot detect), and the identity-value padding pattern for ragged boundaries. Helper emit_warp_tree_reduce factored from the existing block_reduce path, parameterized by shuffle variant (down for block_reduce lane-0- only, bfly for warp_reduce all-lanes), combine op, and PTX type. Leaves a clean seam for future i32/f16/bf16 reductions. Pre-refactor TokenStream snapshot canary locks the D2 refactor to byte-identical PTX; snapshot canary remains green. Tests: 12 new #[ignore]d GPU tests in reduce_macro.rs pass bit-exact on sm_89 including a 64-thread two-warp independence canary (warp 0 and warp 1 fed different patterns; asserts each warp gets its own reduction and the lowering does not blend across warps). 4 new trybuild compile-fail fixtures for the guard and the f32-type boundary. Host tests 330 pass 199 ignored (up from 187 baseline = 12 new GPU tests). No regression on showcases or matmul benches.
… A, B-packed, scales
…t4 module (64 regs/thread, 3.2 KB smem)
…89, max_rel 4e-6)
…atmul showcase + xtask integration
… xtask-bench runs (no tc/int8 regressions)
…fied tile constants
…fy sm_80 + sm_89)
…, 0 spills, sm_80/89)
…6.s8 + 2 MovPack)
…s/call + frag_c in-place canary)
…80 regs (Rollback #1 next) Full INT8 W8A16 tri-output kernel: 13 params, 2 shared decls (tile_x 2048B + tile_w 512B), K-loop with 7-barrier Design-S cadence, 12 mma.sync m16n8k16.f16.f16.f32 per K-tile across 3 fragment-C grids, store-out epilogue with 24 packed b32 stores + 48 cvt.rn.f16.f32 + 48 mul.f32 (scalar scale fold per projection). ptxas (sm_80 + sm_89): 0 spills, valid PTX, but reports 80 registers per thread. The extra 16-40 regs over the D2.5 skeleton baseline (32-40) come from cooperative-loader intermediate state, fragment-A reload regs per mma, and per-K-tile address-math intermediates that the skeleton did not model. 80 > the 64-reg full-occupancy cliff. Plan Rollback #1 trigger fires - drop MMAS_PER_WARP_N from 2 to 1, halving the per-block output tile to 64x16. That rollback lands in the next commit.
…_80/89, 0 spills) ptxas_verify gate now PASSED both targets: sm_80: 48 registers, 0 spills (16-reg headroom under the 64-reg cliff) sm_89: 48 registers, 0 spills (16-reg headroom under the 64-reg cliff) Rollback #1 was authorized in the Sprint 7.3 plan as the explicit recovery for "register pressure overshoot at D2.5 OR D3 full kernel". D3.4 hit 80 regs/0 spills both targets at the original 64x32 output tile; halving MMAS_PER_WARP_N reclaims 24 frag_c regs/lane (3 grids x 4 frag_d regs x 2 fewer sub-tiles per warp) and drops the post-rollback measurement to 48 regs/lane. Per-block output tile: 64x32 -> 64x16. More blocks launch, each block lighter. X-reuse economics still favor the fused tri-output design over 3x standalone matmul_int8; the absolute-vs-relative bench tier in D8 will quantify the trade. Implementation: - BN_BLOCK 32 -> 16, WARP_QUAD_N 16 -> 8, MMAS_PER_WARP_N 2 -> 1 - TILE_W_BYTES kept at 512 (col-stride padded to 32) so cooperative pre-zero keeps its single-issue-per-thread shape; W loader gates writes on col_start < BN_BLOCK so only the first 16 cols receive real data - Test counts updated to derive from MMAS_PER_WARP_* rather than hard-code the 64x32 numbers (future-proof for further rollback iterations)
…7.3 MVS ship point) Wires `kaio_ops::qkv_project_int8` (W8A16: f16 activations × i8 weights, scalar per-projection scales) as the Sprint 7.3 minimum-viable-sprint deliverable. 13-param signature: device, x_f16, w_q/k/v_i8, scale_q/k/v, q/k/v_out_f16, m/n/k. LaunchConfig grid (n.div_ceil(BN_BLOCK), m.div_ceil(BM_BLOCK), 1), block (32,4,1). Rustdoc spells out W8A16-vs-W8A8 explicitly in the first paragraph (Codex round 2 discipline) so callers don't conflate this with matmul_int8's W8A8 contract; points users needing W8A8 at three matmul_int8 calls + their own activation quant. GPU launch smoke test added: zero inputs → zero outputs canary on canonical shape (M=64, N=16, K=16). Verifies module loads + kernel launches + writes outputs without driver error. Correctness vs reference is the D7 e2e suite's job. Quality gates GREEN at the MVS ship point: - cargo fmt clean - cargo clippy -D warnings clean - 73 host tests pass (up from 72 with the launch smoke) - cargo doc clean (no broken intra-doc links) - ptxas_verify PASSED sm_80 + sm_89 (48 regs, 0 spills) Continuing to D5 (INT4 contingent) - D2.5 + D3.4 stayed clean post-rollback.
…, 0 spills)
W4A16 tri-output INT4 module: f16 activations × 3 packed-INT4 weight tensors ×
3 group-scale tensors → 3 f16 outputs. Single kernel launch with Design-S
serial fusion across Q/K/V and group-scale reload cadence (group_size=128,
8 K-tiles per group).
Reuses matmul_int4 helpers via super::matmul_int4_kernel:: path
(emit_mw_load_tile_a_f16_64x16, emit_unpack_s4_x2_scale_to_f16_pair,
emit_fragment_b_int4_per_lane). Made the latter two pub(crate) for
cross-kernel sharing.
New helpers in qkv_project_int4_kernel.rs:
- emit_mw_load_tile_w_packed_int4_2x16 (16-col packed-INT4 cooperative load)
- emit_cooperative_load_group_scales_int4 (16-cell f16 scale load, 8 active threads)
- emit_pre_zero_shared_tiles_qkv_int4 (3-tile pre-zero with predicated scales path)
- emit_warp_quadrant_mma_int4_per_projection (2x1 mma sweep with INT4 dequant feed)
- build_qkv_project_int4_module (full IR module - 13 params, 3 shared decls,
K-loop with 7-barrier cadence + group-boundary scale reload)
Same Rollback #1 as INT8 path (BN_BLOCK=16, MMAS_PER_WARP_N=1) - applied
preemptively since INT4 has identical fragment-C live state plus extra
register pressure from group-scale + nibble-extract chains. 56 regs/thread
vs INT8's 48 (the 8-reg gap is the dequant + scale-reload state).
Quality gates GREEN:
- cargo fmt clean
- cargo clippy -D warnings clean
- 83 host tests pass (up from 73 - added 11 D5 emit/module/canary tests)
- ptxas_verify PASSED sm_80 + sm_89 (56 regs, 0 spills, 8-reg headroom under
the 64-reg cliff)
- disjoint-register canary (24 distinct mma destination regs across 3 grids)
D6 next: public qkv_project_int4 host API (contingent ship).
…7.3 contingent ship) Wires `kaio_ops::qkv_project_int4` (W4A16: f16 activations × packed-INT4 weights × f16 group scales) as Sprint 7.3's contingent second deliverable. 14-param signature: device, x_f16, w_q/k/v_packed_u32, scales_q/k/v_f16, q/k/v_out_f16, m/n/k, group_size. LaunchConfig grid (n.div_ceil(BN_BLOCK), m.div_ceil(BM_BLOCK), 1). Rustdoc spells out the packing convention (8 nibbles per u32, K-contiguous, col-major [K/8, N]) and group-scale layout ([K/group_size, N] row-major, group_size=128 fixed). Cross-links to matmul_int4 since both share the convention. Explicit "not a drop-in for AutoGPTQ/exllama/GGUF" disclaimer. GPU launch smoke test: zero inputs → zero outputs canary on canonical shape (M=64, N=16, K=128 = exactly one group). Verifies module loads + kernel launches. Quality gates GREEN: - cargo fmt + clippy -D warnings clean - 84 host tests pass (+1 D6 smoke) - cargo doc clean - ptxas_verify PASSED sm_80 + sm_89 (56 regs, 0 spills) - already verified at D5 Both INT8 (D4 MVS) and INT4 (D6 contingent) public APIs now shipped on phase7-rest. D7 (GPU e2e + Q/K/V differentiation canary) next.
…pass on sm_89)
Two new test files with CPU-reference round-trip correctness + canary tests:
kaio-ops/tests/qkv_project_int8_e2e.rs (7 tests):
- 6 correctness tests across canonical shapes (smallest / one-block / multi-N /
multi-M / larger-K / medium 256x128x256)
- 1 Q/K/V differentiation canary (W_Q=+1, W_K=+2, W_V=+3, X=1, scale=1/K
→ outputs 1, 2, 3 per cell; catches fragment-C grid aliasing)
kaio-ops/tests/qkv_project_int4_e2e.rs (8 tests):
- 3 group-boundary shapes (1 group / 2 groups / 8 groups) exercising the
group_idx = k_tile / K_TILE_GROUP_RATIO cadence
- 3 multi-block shapes (multi-N / multi-M / larger)
- 1 sign-extend canary (all weights = -8, output should be -K*8 per cell;
catches shr.s32 → shr.u32 collapse)
- 1 Q/K/V differentiation canary
All 7 + 8 = 15 e2e tests PASSED on RTX 4090 sm_89. Tolerance gate
max_rel_err < 1e-3 (loose enough for f16-rounding-step difference between
kernel and reference; observed max_rel stays under 3e-4 in practice).
TWO BUGS CAUGHT + FIXED in D7:
1. INT8 W loader double-counted block_col
The emit_mw_load_tile_w_int8_16x32 loader added block_col to its global
offset computation, but w_block_base_global already carried the
block_col byte shift from build_qkv_project_int8_module. Result: every
non-zero N-block computed from WRONG global W addresses, corrupting all
multi-N-block shapes. Single-N-block tests passed because block_col=0
makes the double-count a no-op. Fix: use col_start (within-tile offset)
for the global addr computation instead of col_global.
2. INT4 shared scales slot raced across projections
build_qkv_project_int4_module originally reloaded scales only on group
boundaries (every 8 K-tiles) but the single tile_scales slot was reused
across Q/K/V per K-tile. After mma_V on K-tile 0, the slot held
scales_V; on K-tile 1 (non-boundary) all three projections' mmas read
scales_V instead of their own scales. The sign-extend + Q/K/V canaries
passed because they use constant scales, but random-data correctness
tests failed. Fix: reload scales every K-tile, unconditionally (32 B per
projection per K-tile = 96 B/K-tile additional global bandwidth,
negligible vs 384 B/K-tile for the W loads). Group-boundary-only
optimization deferred to future bench-driven work with three separate
scales slots.
Quality gates GREEN:
- cargo fmt + clippy -D warnings clean (factored QkvOutTuple alias for
e2e round_trip return types)
- 84 host tests pass
- 15 GPU e2e tests PASS on sm_89
- ptxas_verify INT8: 48 regs, INT4: 56 regs, 0 spills both variants
- Both kernels validated across sm_80 + sm_89 (W-loader fix did not
change register count)
…iers)
Sprint 7.3 D8. Compares fused qkv_project_int4 vs 3x sequential matmul_int4
(clean W4A16 apples-to-apples baseline), and reports absolute TOPS for
qkv_project_int8 (W8A16 has no fair standalone baseline - matmul_int8 is W8A8).
Shape sweep: decode tier (M ∈ {1, 64}, K/N ∈ {2048, 4096}) + prefill tier
(M ∈ {512, 2048}, K/N = 4096). Plus 4 diagnostic shapes around the initial
decode_m64 K/N=2048 anomaly to confirm it was measurement noise, not a kernel
bug.
Stable results on RTX 4090 sm_89 (release mode):
Decode tier (M ≤ 128): fused ~3.0x baseline, consistent across shapes
prefill_m512 (M=512): 1.19x (above ship threshold 1.15x)
prefill_m2048 (M=2048): 0.85x (below 1.00x - fused loses 15% at largest prefill)
Ship-B call (maintainer 2026-04-16): ship 7.3 with honest framing
("fused wins big at decode, ties/loses at prefill - use 3x standalone for
prefill-heavy workloads"), plan Design S+1/2P optimization as Sprint 7.3.5
with full review cadence. Target for 7.3.5: recover prefill_m2048 from 0.85x
to ≥1.15x by overlapping W_{P+1} load with mma_P (barriers 7 → 4 per K-tile).
Outlier investigation (decode_m64 K/N=2048 reported 0.40x in first run):
NOT reproducible on re-run. Diagnostic sweep across 4 nearby shapes shows
consistent 3.0x throughout - confirmed transient measurement noise in the
first run (likely first-call CUDA context warmup bleeding through despite
5-iter warmup loop). No kernel bug.
Absolute TOPS numbers intentionally NOT highlighted - repeated back-to-back
bench runs cause thermal throttling on the 4090 that crushes absolute
throughput by 50-100x for all workloads including cuBLAS sgemm reference.
Ratios are throttle-invariant (fused + baseline throttle equally) so ratio
numbers remain the valid headline. Clean absolute-TOPS run deferred to D10
doc with cold-system methodology.
…ttention_tc) End-to-end Phase 7 pipeline demo. GPTQ-lite quantizes random f32 weights into KAIO's packed-INT4 + group-scale layout, runs fused qkv_project_int4 to produce three f16 outputs, feeds them to attention_tc, and compares against an f16 reference path (3x matmul_tc → attention_tc). Reports three quality metrics per the Sprint 7.3 plan D9: - cosine similarity (primary pass/fail, plan threshold 0.98 INT4) - max absolute error (worst-case row outlier) - mean relative error (aggregate quality) Plus a projection-stage cosine for localization - distinguishes "kernel bug" from "softmax amplified a small projection error" when the final number is below threshold. Showcase shape: SEQ=64, D_MODEL=128, D_HEAD=64, GROUP_SIZE=128 (decode-tier shape where fusion wins cleanly per D8 bench). Observed on RTX 4090 sm_89 with random f16 weights: - Projection Q cos_sim = 0.9975 (kernel working correctly) - Final attention cos_sim = 0.9655 (below 0.98 plan threshold) README + in-example printout frame this honestly per the plan: "Random f16 weights are a worst case for group-scale quantization fidelity. Real trained LLM weights have much tighter group statistics and land measurably better than these synthetic numbers." The kernel's job is correctly performing the quantized projection; softmax amplifies any projection-stage error non-linearly in the final output. This is expected behavior for random weights and will improve substantially on real model weights where per-group magnitude variance is much lower. Wired into `cargo xtask showcase qkvattn`.
…7.3.5 plan stub
Completes Sprint 7.3 documentation pass:
- docs/development/sprints/phase7/sprint_7_3.md - full decision log (D1-D10),
bench table (all 9 shapes, no cherry-picking), architectural decisions
(AD1-AD5), bug-catch trail (cvt.rn.f16.s8, INT8 W-loader double-count,
INT4 scales-slot race), and the maintainer Ship-B decision.
- docs/development/sprints/phase7/PHASE_7_LOG.md - new index for Phase 7
(Phase 6 has one, Phase 7 didn't); populated with 7.0 through 7.3 + the
7.3.5 / 7.4 planning rows.
- docs/development/sprints/phase7/sprint_7_3_5.md - stub plan for Design
S+1/2P (two W slots ping-pong, barriers 7→4). Success criterion pinned:
prefill_m2048 from 0.85x to ≥1.15x on sm_89. Full reviewer cadence
required (synchronization semantics change in inner loop); not a hotfix.
- CHANGELOG.md [Unreleased] block - Sprint 7.3 entry covering both public
APIs, the shared store-out helper, the kaio-core cvt.rn fix, the
showcase example, and the honest decode-wins-prefill-loses framing.
- README.md ops table - two new rows for qkv_project_int{4,8}.
Quality gates at D10 sprint-close:
- 84 kaio-ops host tests pass
- 15 GPU e2e tests pass on sm_89 (7 INT8 + 8 INT4)
- ptxas_verify PASSED sm_80 + sm_89 for both kernels + skeleton + store_out
- cargo doc --no-deps clean
- cargo fmt + clippy -D warnings clean across workspace
…s / sm_89 40 regs / 0 spills) Port qkv_skeleton.rs from 7.3 Design-S to 7.3.5 S+1/2P peak-pressure model: - 2 tile_w shared slots with 64B bank-phase padding (stride 576B, non-multiple of 128 per Design invariant #4, Gemini R3-1) - 3 unrolled K-iterations with frag_C back-edge state threaded across iterations (Opus R1b-2 / R3b-3 cross-iteration modelling requirement - forces ptxas to count frag_C at peak, not elide) - Runtime ctaid.x-indexed slot pointers for ping-pong (ptxas can't constant-fold slot selection - Design invariant #3) - Scales register-hoist modelled via runtime k_tile_group counter (6 regs total, Pre-review clarification #2) - In-flight cooperative-load address registers kept live across the mma epoch to model the overlap window D1 gate PASSED on both architectures with comfortable headroom: sm_80: 32 registers, 0 spills, 3136 B smem, 404 B cmem[0] sm_89: 40 registers, 0 spills, 3136 B smem, 404 B cmem[0] cmem[0] = 404 B (kernel params in constant bank 0) confirms the R3b-2 case (a) upfront: kernel args arrive uniform-broadcast- friendly by default, no GPR pressure from base pointers, escape hatch has no handle to grab. SASS inspection for the escape hatch is not needed at the skeleton stage. Rollback #2 not triggered. Proceeding to D2 (INT8 S+1/2P body). Quality gates: cargo fmt, cargo clippy -D warnings, cargo test -p kaio-ops --lib (84 passed, 7 ignored). Shared budget 3136 B matches plan (2048 tile_x + 512 slot0 + 64 pad + 512 slot1).
…9 56 regs / 0 spills)
Port qkv_project_int8_kernel build function from 7.3 Design-S to
7.3.5 S+1/2P. Design-S helper emit_warp_quadrant_mma_int8_per_projection
stays untouched (preserves 7.3 tests + callers); two new siblings
added for the S+1/2P path:
emit_warp_quadrant_mma_int8_per_projection_hoisted
Per-projection mma sweep that reads frag_A from caller-provided
pre-hoisted registers instead of reloading from tile_x. Required
so the three projection mmas share a single K-tile-start frag_A
hoist, making Design invariant #1 (tile_x overwrite-after-hoist)
expressible.
hoist_frag_as_for_warp_quadrant
Loads both m_stripe frag_As from tile_x at K-tile start. The
caller is contractually required to fire B1 bar.sync AFTER this
call and BEFORE any tile_x write.
emit_pre_zero_shared_tiles_qkv_int8_sp_half_p
3-slot pre-zero (tile_x + tile_w_slot0 + tile_w_slot1; the 64 B
pad between slots is write-only padding, never pre-zeroed).
Shared layout: tile_x (2048 B) + tile_w_slot0 (512 B) + tile_w_pad
(64 B, Design invariant #4 bank-phase) + tile_w_slot1 (512 B) =
3136 B. Stride slot0 -> slot1 = 576 B, non-multiple of 128 (no
cross-warp SMEM bank-port contention during overlap).
K-loop body 4-barrier cadence (plan worked-example table):
B1 (post-hoist) INVARIANT #1 — tile_x safe to overwrite
Epoch 1: load W_K(k) -> NEXT slot; mma_Q uses CUR slot
B2 (post-mma_Q) CUR slot safe to overwrite
Epoch 2: load W_V(k) -> CUR slot; mma_K uses NEXT slot
B3 (post-mma_K)
Epoch 3: [overlap loads -> W_Q(k+1) + X(k+1), skipped on final
K-tile via uniform @!p_not_last branch];
mma_V uses CUR slot
B4 (post-mma_V) closes K-tile epoch
Runtime ping-pong slot selection via and.b32 + setp.eq + selp.u32
(slot_idx = k_tile & 1; p_slot0 = slot_idx == 0; cur/next chosen
via selp against r_tile_w_slot{0,1}). Overlap-skip predicate
p_not_last = k_tile + 1 < num_k_tiles; uniform branch, no
divergence.
Total barrier count for 1-K-tile module = 2 setup (pre-zero + B0)
+ 4 per K-tile = 6. 7.3 Design S was 1 + 7 = 8. Per K-tile savings
3 barriers; scales with K.
ptxas --verbose results (S+1/2P, archived R2-5 artifacts):
sm_80: Used 64 registers, 0 bytes spill stores, 0 bytes spill loads
sm_89: Used 56 registers, 0 bytes spill stores, 0 bytes spill loads
vs 7.3 Design-S INT8 post-Rollback-#1 baseline (48 regs on both
architectures): sm_89 +8 regs (hoisted frag_A 2-stripes + slot
selection + overlap-load addr math), sm_80 +16 regs. sm_80 at the
64-reg ceiling exactly; sm_89 has 8 reg headroom. Both architectures
clear the D2 gate. Escape hatch (Gemini R3-4) not needed — ptxas
already placed base pointers in cmem[0] per D1 skeleton evidence.
Test updates:
- build_qkv_project_int8_module_structure now asserts 4 shared decls
(tile_x + slot0 + pad + slot1), new K-loop label
K_LOOP_QKV_INT8_SP_HALF_P, and OVERLAP_SKIP_QKV_INT8_SP_HALF_P label.
- build_qkv_project_int8_module_emits_seven_barriers_per_k_tile
renamed to _emits_four_barriers_per_k_tile_sp_half_p, expects 6
total bar.sync (was 8).
Quality gates green: cargo fmt, cargo clippy -D warnings, cargo test
-p kaio-ops --lib (84 passed, 7 ignored). D3 (GPU e2e regression +
slot-mapping canary + determinism stress) comes next.
…s, 100x bit-exact)
Add two test classes for the S+1/2P kernel beyond the 7.3 regression
suite:
- Slot-mapping canary (2 shapes): deliberately-distinguishable
weights per projection (W_Q=1, W_K=2, W_V=3, X=all-1, scale=1)
so expected output is exactly {K, 2K, 3K} per element. Catches
deterministic ping-pong mis-wiring that random-data tests can
pass by coincidence. Two K-tile variants:
- K=32 (2 K-tiles) exercises the steady -> final transition.
- K=48 (3 K-tiles) exercises steady -> steady -> final, covering
the back-edge transition that the 2-K-tile version doesn't hit.
- Determinism stress (2 shapes): runs the same input 100x back-to-
back in release build and asserts bit-exact output across all
runs (from run 1 - no warmup discards). Catches barrier-
misplacement races that manifest as cross-launch nondeterminism
under varying warp scheduling.
- Short: M=256, N=512, K=1024 (64 K-tiles per launch).
- Prefill: M=2048, N=512, K=4096 (256 K-tiles, the regime that
motivated the S+1/2P rework).
Results on RTX 4090 sm_89:
- 7 existing e2e regression tests: pass (max_rel <= 4.88e-4, most
bit-exact).
- Q/K/V differentiation canary: pass.
- Slot-mapping canary K=32: Q=32 K=64 V=96 exact.
- Slot-mapping canary K=48: Q=48 K=96 V=144 exact.
- Determinism short 100x: bit-exact.
- Determinism prefill 100x: bit-exact.
11/11 INT8 GPU e2e tests pass. No kernel code changed in D3 - only
tests added.
Diagnostic in ptxas_verify: also report allocation at
-maxrregcount 128 to distinguish natural register allocation from
a ptxas compression-under-cap. Confirmed sm_80 64 regs is a genuine
register-reuse compression below the 66-reg natural allocation
(zero spills at default), not a hidden-spill pathology. sm_89 at
56 regs at default vs 66 natural - same pattern, larger headroom.
Quality gates: cargo fmt, cargo clippy -D warnings, cargo test -p
kaio-ops --lib (84 passed, 7 ignored).
…G.yml Sprint 7.3.5 closes with the asymmetric outcome the plan's 5-tier table pre-declared: INT8 ships Design S+1/2P (correctness clean, ptxas clean, abs TOPS baseline captured); INT4 S+1/2P was measured at median prefill_m2048 = 1.05x across 3 runs, improved from Design S's 0.85x but below the 1.15x plan-locked ship gate and the 1.10x ship-narrow floor - landed in "Measured, not shipped". INT4 retained at Design S on main; INT4 S+1/2P port archived on phase7-rest for potential future cp.async contingency work. Docs added / updated: - docs/development/sprints/phase7/sprint_7_3_5.md (new) - terse post-delivery outline with full measured-bench table, 5 architectural decisions, follow-up next steps. - docs/development/sprints/phase7/phase7_master_plan.md - status header, section 4 (Quant+Attention integration) rewritten for 7.3 / 7.3.5 outcome, Sprint Breakdown table rows added, dependency graph updated, Success Criterion 4 marked complete. - docs/development/sprints/phase7/PHASE_7_LOG.md - 7.3.5 status row + ops-shipped table notes. - CHANGELOG.md - Sprint 7.3.5 section under [Unreleased]. - README.md - INT8 row updated for S+1/2P; INT4 row explicitly notes Design S retained and reason. - kaio-ops/src/qkv_project_int8_kernel.rs - module-level rustdoc refreshed (Tri-output design and Register budget sections) to reflect post-Rollback-#1 + Sprint 7.3.5 S+1/2P state. Editorial polish pass across phase 1 / 4 / 6 / 7 sprint docs and across kaio-ops / kaio test-file comments - tightening voice for outward-facing readers. Technical rationale and measured results unchanged. .github/FUNDING.yml added for the Sponsor button - lands on main via the phase7-ship merge. Quality gates (cargo fmt --check, clippy --workspace --all-targets -D warnings, doc --workspace --no-deps, test --workspace) all green. ptxas_verify green on sm_80 + sm_89 for INT8 S+1/2P (64 / 56 regs, 0 spills) and INT4 Design S (56 / 56 regs, 0 spills).
5-shape abs-TOPS comparison, same session on RTX 4090 sm_89, kernel file swapped between 2dbad72 (Design-S) and HEAD (S+1/2P) via checkout-revert, identical thermal state both sides. | shape | Design-S | S+1/2P | speedup | |------------------|----------|--------|---------| | decode_m1 | 3.2 | 3.3 | 1.04x | | decode_m64 | 3.2 | 3.3 | 1.04x | | decode_m64_large | 9.8 | 10.2 | 1.04x | | prefill_m512 | 31.4 | 35.6 | 1.13x | | prefill_m2048 | 38.8 | 44.6 | 1.15x | S+1/2P wins across all 5 shapes. Gain scales with K-tile count per launch: decode (few K-tiles, barriers not dominant) gains ~4%; prefill_m2048 (256 K-tiles, barriers dominant in Design-S) gains 15% - exactly the threshold the sprint plan set for INT4, which INT4 narrowly missed at 1.05x vs 3x standalone matmul_int4. Different outcome between the variants at the same shape localises the INT4 gap: the INT8 path recovers cleanly under barrier reduction; the INT4 path (per-column f16 scale hoist + nibble- extract dequant) exposes memory-issue latency as the new binding constraint once barriers no longer dominate. cp.async targets that class of stall and is the right candidate for a follow-up contingency sprint - not as an alternative to S+1/2P but as the next bottleneck after S+1/2P succeeds. - sprint_7_3_5.md Performance section: INT8 table replaced with the like-for-like comparison + interpretation paragraph tying the INT8/INT4 asymmetry to the cp.async follow-up. - PHASE_7_LOG.md 7.3.5 headline: added the 1.15x like-for-like result alongside the INT4 Measured-not-shipped status.
Moves cudarc's dynamic-loading / dynamic-linking choice out of the workspace-level features list and into per-crate feature flags on kaio-runtime, kaio-ops, and the kaio umbrella. Default stays on dynamic-loading — standalone KAIO users see no behavior change. Prep for the kaio-candle bridge: candle-core activates cudarc/dynamic-linking, and cudarc's build.rs panics when both loading strategies are simultaneously active in the feature-unified dep graph. With this refactor, kaio-candle consumes the KAIO crates via default-features = false, features = ["dynamic-linking"] — both sides converge on dynamic-linking in the kaio-candle build graph, no collision. kaio-ops dev-dep on cudarc consolidated onto the workspace pin with features = ["cublas"] — loading strategy now driven uniformly by the crate feature block rather than duplicated in the dev-dep line.
Follow-on fix to the cudarc loading-strategy feature-gate refactor (b0213c2). Without `default-features = false` on kaio's kaio-runtime dep, kaio-runtime's default (dynamic-loading) stays active when downstream consumers opt into dynamic-linking, defeating the opt-in and triggering cudarc's dual-loading-strategy panic in the unified dep graph. Verified: candle-probe scratch crate pulls kaio + kaio-ops with default-features = false, features = ["dynamic-linking"] alongside candle-core's cuda feature. cargo check succeeds, cudarc builds once with only dynamic-linking active. G3-1 resolved.
Two parts:
1. kaio-runtime: additive bridge-facing API on GpuBuffer<T>.
- #[repr(transparent)] attr with load-bearing docstring; compile-time
size+align asserts for f32/f16/i8/u32 via static_assertions dev-dep.
- pub fn from_cuda_slice + pub fn into_cuda_slice replace the old
pub(crate) from_raw (device.rs callers updated).
- Layout-invariant in place so the kaio-candle bridge can transmute
&CudaSlice<T> to &GpuBuffer<T> for pass-through to kaio-ops without
round-tripping through an owned clone.
2. kaio-candle: new standalone crate at kaio-candle/, excluded from the
main workspace (cudarc dynamic-loading vs dynamic-linking mutual
exclusion — see Sprint 7.4a plan Pre-flight P2 / G3-1).
- Cargo.toml: package metadata, optional kaio + kaio-ops + candle-core
deps gated behind `cuda` feature.
- src/lib.rs: crate docs, feature-gated module declarations.
- src/{bridge,matmul_tc,matmul_tc_async,matmul_int4,attention_tc}.rs:
stub modules; content lands D2-D5.
Gates (all green):
- Main workspace: fmt + clippy -D warnings + tests (static_assertions
fire at compile time in kaio-runtime lib tests).
- Standalone kaio-candle: fmt + clippy --features cuda + clippy
--no-default-features + cargo check on both feature paths + doc on
both feature paths. Ship gate #10 (no-cuda build succeeds on
no-toolkit host) satisfied — no compile_error! on the default path;
bridge surface silently absent when `cuda` feature off.
.gitignore: kaio-candle/target/ + kaio-candle/Cargo.lock (standalone
library crate convention).
Adds the crate-private plumbing that D3-D5 per-op CustomOp impls will compose: - slice_ref_from_storage / storage_from_slice: cudarc CudaSlice<T> round-trip across candle's CudaDType trait (as_cuda_slice / wrap_cuda_slice). - buffer_ref_from_slice_readonly: #[repr(transparent)] cast from &CudaSlice<T> to &GpuBuffer<T> with full Codex R1 aliasing contract + Gemini G3-2 lifetime invariant docstring. - ensure_ordinal_match: AD1 enforcement via candle DeviceLocation pattern match + KaioDevice::ordinal() (new public getter; P4 preflight showed ordinal equality + primary-context sharing is the right mechanism, not Arc identity). - sync_before_launch / sync_after_launch: AD9 stream-safety fences via BackendDevice::synchronize. CUDA Graph limitation noted per Gemini G3-3. - ensure_rank2_contiguous_zero_offset: AD4 input gate rejecting non-rank-2, non-contiguous, and non-zero-offset layouts with concrete reshape hints per Opus #6 + #7 and Codex R3. - kaio_err: KaioError -> candle_core::Error helper (orphan rule prevents impl From). cudarc promoted from dev-dep to regular optional dep (gated on the cuda feature) so the bridge can name CudaSlice<T> directly. KaioDevice gets a pub fn ordinal() getter (additive, non-breaking). #![allow(dead_code)] on bridge.rs while per-op modules catch up. Gates: cargo check + clippy -D warnings green on both feature paths.
MatmulTcOp (CustomOp2) + matmul_tc() wrapper fn and MatmulTcAsyncOp
(CustomOp2) + matmul_tc_async() wrapper fn, both following the
AD-spec'd flow:
1. AD4 shape gate (rank-2 + contiguous + zero-offset) per input.
2. K-mismatch check across the two inputs.
3. AD1 ordinal equality via ensure_ordinal_match.
4. Dtype gate via bridge::slice_ref_from_storage::<f16>.
5. AD2-audited read-only shared borrow into candle-owned buffers.
6. Bridge-owned f32 output allocation via kaio_device.alloc_zeros.
7. AD9 pre-launch sync fence.
8. kaio-ops kernel call.
9. AD9 post-launch sync fence.
10. Output consumed back to CudaStorage via storage_from_slice.
cpu_fwd on both ops returns a loud error; Q-CPU-Fallback in the plan
resolves to "reject over silent candle.matmul fallback."
Tests:
- 4 host tests inline in bridge.rs for shape-gate happy path +
rank-1/3/4 rejection with concrete reshape hints.
- 8 GPU integration tests in tests/candle_gpu_roundtrip.rs:
- 3 matmul_tc bit-exact cross-checks (64x64x64, 256x256x256,
1024x1024x1024) between bridge and raw kaio-ops calls on
identical host-side bits.
- 3 matmul_tc_async bit-exact cross-checks, same shapes.
- 2 rejection-path tests exercising the contiguity (.t()) and
non-zero-offset (.narrow(..)) paths from Codex R3 / Opus #7.
All #[ignore]-gated per kaio-ops convention; run with
`cargo test --features cuda -- --ignored` on a GPU host.
Gates:
- Main workspace: fmt + clippy -D warnings + lib tests green.
- Standalone kaio-candle on cuda feature: fmt + clippy + cargo
check + doc + 4 host tests + 8 GPU ignored tests ALL PASS BIT-
EXACT on RTX 4090 sm_89.
- Standalone kaio-candle on no-default-features: fmt + clippy +
cargo check + doc (empty-shell build).
New dev-dep: anyhow = "1" for terse test error handling across
candle + cudarc + kaio error types.
MatmulInt4Op (CustomOp3) + matmul_int4() wrapper for GPTQ-style INT4
dequantize-matmul (W4A16). group_size locked at 128 per kaio-ops
contract; K must be multiple of 128.
Shape contract validated against kaio_ops::matmul_int4:
a: f16 [M, K]
b_packed: u32 [K/8, N] (8 INT4 per u32)
scales: f16 [K/128, N]
out: f32 [M, N]
Bridge enforces the shape implications explicitly:
- packed_rows == K/8, scale_rows == K/GROUP_SIZE, n_b == n_s.
- Each with a concrete error message that tells the user exactly
which derived dimension is off.
AD2-Audit D4: kaio-ops kernel is read-only on A / B_packed / scales
(validate_dims_int4 + dequant-fused inner loop reads through shared
memory stagings; inputs never mutated). Same readonly-transmute
contract as D3.
Tests:
- 2 bit-exact cross-check tests (256x256x128, 1024x1024x512).
- 1 rejection test for K-not-multiple-of-128.
- All #[ignore]-gated; run via cargo test --features cuda --
--ignored.
Full 11 GPU integration tests pass on RTX 4090 sm_89:
8 matmul_tc / matmul_tc_async (D3) + 3 matmul_int4 (D4).
Host tests unchanged at 4 (bridge shape gate).
Also swapped `k % 128 == 0` to `.is_multiple_of(128)` per
clippy::manual_is_multiple_of lint — stabilized helper.
…dings Single AttentionTcOp (CustomOp3) struct with a `causal: bool` field; two user-facing wrappers attention_tc() and attention_tc_causal() that flip the flag and select kernel variant. name() returns distinct kaio::attention_tc / kaio::attention_tc_causal strings so candle's module cache keys don't collide. Shape contract (P5): q: f16 [seq_q, d_k] k: f16 [seq_k, d_k] v: f16 [seq_k, d_v] out: f32 [seq_q, d_v] Bridge explicitly cross-checks d_k (Q vs K) and seq_k (K vs V) with targeted error messages. kaio-ops attention_tc has a seq_k ≤ 384 shared-memory cap (noted in the kernel's error message; FlashAttention-TC lifts this in a later sprint). Integration tests use 256x128 to fit cleanly. AD2-Audit D5: attention_tc / attention_tc_causal kernels are read-only on Q/K/V — validate_dims_attn + Q·Kᵀ/softmax/P·V all READ from shared stagings + fragments, inputs never mutated. Tests: - 4 bit-exact cross-check tests: - attention_tc_bit_exact_64x64 / 256x128 - attention_tc_causal_bit_exact_64x64 / 256x128 - All #[ignore]-gated; run via cargo test --features cuda -- --ignored. Full 15 GPU integration tests pass bit-exact on RTX 4090 sm_89: 3 matmul_tc + 3 matmul_tc_async + 2 matmul_int4 + 2 attention_tc + 2 attention_tc_causal + 2 matmul_tc rejection + 1 matmul_int4 rejection. Host tests: 4 bridge shape-gate tests. Main workspace gates + kaio-candle gates all green on both feature paths.
README covers standalone-crate rationale, build, quickstart, op surface, device lifetime, candle pin policy, and the v0.1 limitations list. examples/matmul_tc_candle.rs and examples/attention_tc_candle.rs are runnable with small shapes; ship-gate #14 satisfied (non-NaN output on RTX 4090 sm_89). Cargo.toml include list extended to publish examples.
.github/workflows/candle-head.yml builds kaio-candle against candle-core's git main once per Monday. Step-level continue-on-error plus peter-evans/create-issue-from-file@v5 on failure keeps upstream breakage visible without blocking other CI. README gets a workflow badge and a candle-version-policy paragraph pointing at the job.
sprint_7_4a.md writes up the forward-only kaio-candle bridge — context, scope, architectural decisions (standalone crate rationale, explicit KaioDevice, repr(transparent) aliasing contract, AD4 input gate, cuCtxSynchronize fences), results, and follow-ups (7.4b / 7.4c). PHASE_7_LOG gets a 7.4a row plus a candle-bridge ops table. CHANGELOG Unreleased gains a Sprint 7.4a section at the top. Root README notes kaio-candle in the crate table, flips the "inference only" bullet to point at the forward bridge, and expands the Phase 7 roadmap line to mention 7.4a / 7.4b / 7.4c.
Sixth forward CustomOp binding in kaio-candle: W8A8 symmetric-quant matmul via CustomOp2 with a scalar f32 scale threaded through the op struct (same pattern as AttentionTcOp::causal). Candle has no DType::I8, so the candle-side convention is DType::U8 tensors whose bytes are interpreted as signed INT8. The bridge reinterprets &CudaSlice<u8> as &CudaSlice<i8> via a metadata-only same-layout transmute (cudarc's CudaSlice<T> carries T only as PhantomData, u8 and i8 are 1-byte identical). 5 new GPU integration tests - 3 bit-exact cross-checks at 256/1024/4096 cubed with scale values 0.00125 / 1.0 / 47.3 (spread across regimes so a dropped-scale bug surfaces obviously-wrong in at least two of the three tests), 2 rejection tests. Brings total kaio-candle GPU tests to 20. All green on RTX 4090 sm_89. Zero new bridge primitives - every helper reused from 7.4a. Only the small reinterpret_u8_slice_as_i8 helper lives local to the op file.
New direct-call bridge pattern for ops that exceed candle's CustomOpN arity (max 3 inputs, single output). Free functions extract CudaStorage from &Tensor via storage_and_layout(), validate with named-parameter error messages, call the fused kernel, and return (Tensor, Tensor, Tensor) via Tensor::from_storage with BackpropOp::none(). qkv_project_int8: W8A16, 4 tensor + 3 scalar inputs, 3 f16 outputs. qkv_project_int4: W4A16, 7 tensor inputs, 3 f16 outputs, group_size=128. Output is DType::F16 (kernel does f32->f16 internally), unlike the CustomOp-based ops which return f32. Gradient-tracked inputs rejected with loud .detach() error to prevent silent autograd disconnection. 12 new GPU tests (5 bit-exact + 2 canaries + 5 rejection), 32 total. All green on RTX 4090 sm_89.
Replaced cuCtxSynchronize fences with event-based cross-stream sync via cudarc's CudaStream::join() (cuEventRecord + cuStreamWaitEvent internally). GPU-side only, no CPU blocking. sync_before_launch: KAIO stream joins candle stream (waits for candle's prior work before kernel launch). sync_after_launch: candle stream joins KAIO stream (waits for kernel completion before candle uses output). Both functions gain a &KaioDevice parameter; 7 call sites updated across all op modules. CUDA Graph capture partially unblocked - the cuCtxSynchronize blocker is removed, but default-stream capture is still banned by CUDA itself. Full Graph capture requires non-default streams on both sides (unverified). 33 GPU tests pass (32 bit-exact + 1 sync smoke test).
Closed
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
No description provided.