Phase7 wrap by dmriding · Pull Request #14 · dmriding/kaio

dmriding · 2026-04-17T23:43:10Z

No description provided.

…arp_reduce)

…k_reduce_sum

…y byte-identical)

… builtins + whole-warp-multiple guard

…4 trybuild)

Bridge sprint between quant milestones. Adds the warp-level reduction primitives that close the single biggest DSL expressiveness gap after matmul (softmax / layer-norm / RMS-norm / loss functions are now writable in pure DSL without manual shfl.sync ladders), plus the missing min variants at both warp and block level. Ships on the phase7-rest branch with no independent crates.io release; Phase 7 closes with an aggregate release after 7.2, 7.3, 7.4. New DSL builtins (f32-only this sprint; integer and half-precision variants are a clean follow-up): - warp_reduce_sum/max/min: per-warp reduction via butterfly shuffle tree, every lane in the warp gets the full result in 5 rounds. - block_reduce_min: completes the sum/max/min trio at block level. Whole-warp-multiple compile-time guard: kernels with block sizes whose total thread count is not a multiple of 32 (1D: 16; 2D: (4,4) or (8,6)) fail to compile with a span pointing at the warp_reduce_* call. Catches the partial-warp shfl.sync UB statically. Covered by 4 trybuild compile-fail fixtures. Rustdoc documents the per-warp semantic clearly, the convergent- control-flow requirement (data-dependent branches are UB the macro cannot detect), and the identity-value padding pattern for ragged boundaries. Helper emit_warp_tree_reduce factored from the existing block_reduce path, parameterized by shuffle variant (down for block_reduce lane-0- only, bfly for warp_reduce all-lanes), combine op, and PTX type. Leaves a clean seam for future i32/f16/bf16 reductions. Pre-refactor TokenStream snapshot canary locks the D2 refactor to byte-identical PTX; snapshot canary remains green. Tests: 12 new #[ignore]d GPU tests in reduce_macro.rs pass bit-exact on sm_89 including a 64-thread two-warp independence canary (warp 0 and warp 1 fed different patterns; asserts each warp gets its own reduction and the lowering does not blend across warps). 4 new trybuild compile-fail fixtures for the guard and the f32-type boundary. Host tests 330 pass 199 ignored (up from 187 baseline = 12 new GPU tests). No regression on showcases or matmul benches.

…nd canary

… A, B-packed, scales

…t4 module (64 regs/thread, 3.2 KB smem)

…cking spec

…89, max_rel 4e-6)

…atmul showcase + xtask integration

… xtask-bench runs (no tc/int8 regressions)

…fied tile constants

…fy sm_80 + sm_89)

…, 0 spills, sm_80/89)

…pre-zero

…6.s8 + 2 MovPack)

…s/call + frag_c in-place canary)

…80 regs (Rollback #1 next) Full INT8 W8A16 tri-output kernel: 13 params, 2 shared decls (tile_x 2048B + tile_w 512B), K-loop with 7-barrier Design-S cadence, 12 mma.sync m16n8k16.f16.f16.f32 per K-tile across 3 fragment-C grids, store-out epilogue with 24 packed b32 stores + 48 cvt.rn.f16.f32 + 48 mul.f32 (scalar scale fold per projection). ptxas (sm_80 + sm_89): 0 spills, valid PTX, but reports 80 registers per thread. The extra 16-40 regs over the D2.5 skeleton baseline (32-40) come from cooperative-loader intermediate state, fragment-A reload regs per mma, and per-K-tile address-math intermediates that the skeleton did not model. 80 > the 64-reg full-occupancy cliff. Plan Rollback #1 trigger fires - drop MMAS_PER_WARP_N from 2 to 1, halving the per-block output tile to 64x16. That rollback lands in the next commit.

…_80/89, 0 spills) ptxas_verify gate now PASSED both targets: sm_80: 48 registers, 0 spills (16-reg headroom under the 64-reg cliff) sm_89: 48 registers, 0 spills (16-reg headroom under the 64-reg cliff) Rollback #1 was authorized in the Sprint 7.3 plan as the explicit recovery for "register pressure overshoot at D2.5 OR D3 full kernel". D3.4 hit 80 regs/0 spills both targets at the original 64x32 output tile; halving MMAS_PER_WARP_N reclaims 24 frag_c regs/lane (3 grids x 4 frag_d regs x 2 fewer sub-tiles per warp) and drops the post-rollback measurement to 48 regs/lane. Per-block output tile: 64x32 -> 64x16. More blocks launch, each block lighter. X-reuse economics still favor the fused tri-output design over 3x standalone matmul_int8; the absolute-vs-relative bench tier in D8 will quantify the trade. Implementation: - BN_BLOCK 32 -> 16, WARP_QUAD_N 16 -> 8, MMAS_PER_WARP_N 2 -> 1 - TILE_W_BYTES kept at 512 (col-stride padded to 32) so cooperative pre-zero keeps its single-issue-per-thread shape; W loader gates writes on col_start < BN_BLOCK so only the first 16 cols receive real data - Test counts updated to derive from MMAS_PER_WARP_* rather than hard-code the 64x32 numbers (future-proof for further rollback iterations)

…7.3 MVS ship point) Wires `kaio_ops::qkv_project_int8` (W8A16: f16 activations × i8 weights, scalar per-projection scales) as the Sprint 7.3 minimum-viable-sprint deliverable. 13-param signature: device, x_f16, w_q/k/v_i8, scale_q/k/v, q/k/v_out_f16, m/n/k. LaunchConfig grid (n.div_ceil(BN_BLOCK), m.div_ceil(BM_BLOCK), 1), block (32,4,1). Rustdoc spells out W8A16-vs-W8A8 explicitly in the first paragraph (Codex round 2 discipline) so callers don't conflate this with matmul_int8's W8A8 contract; points users needing W8A8 at three matmul_int8 calls + their own activation quant. GPU launch smoke test added: zero inputs → zero outputs canary on canonical shape (M=64, N=16, K=16). Verifies module loads + kernel launches + writes outputs without driver error. Correctness vs reference is the D7 e2e suite's job. Quality gates GREEN at the MVS ship point: - cargo fmt clean - cargo clippy -D warnings clean - 73 host tests pass (up from 72 with the launch smoke) - cargo doc clean (no broken intra-doc links) - ptxas_verify PASSED sm_80 + sm_89 (48 regs, 0 spills) Continuing to D5 (INT4 contingent) - D2.5 + D3.4 stayed clean post-rollback.

…, 0 spills) W4A16 tri-output INT4 module: f16 activations × 3 packed-INT4 weight tensors × 3 group-scale tensors → 3 f16 outputs. Single kernel launch with Design-S serial fusion across Q/K/V and group-scale reload cadence (group_size=128, 8 K-tiles per group). Reuses matmul_int4 helpers via super::matmul_int4_kernel:: path (emit_mw_load_tile_a_f16_64x16, emit_unpack_s4_x2_scale_to_f16_pair, emit_fragment_b_int4_per_lane). Made the latter two pub(crate) for cross-kernel sharing. New helpers in qkv_project_int4_kernel.rs: - emit_mw_load_tile_w_packed_int4_2x16 (16-col packed-INT4 cooperative load) - emit_cooperative_load_group_scales_int4 (16-cell f16 scale load, 8 active threads) - emit_pre_zero_shared_tiles_qkv_int4 (3-tile pre-zero with predicated scales path) - emit_warp_quadrant_mma_int4_per_projection (2x1 mma sweep with INT4 dequant feed) - build_qkv_project_int4_module (full IR module - 13 params, 3 shared decls, K-loop with 7-barrier cadence + group-boundary scale reload) Same Rollback #1 as INT8 path (BN_BLOCK=16, MMAS_PER_WARP_N=1) - applied preemptively since INT4 has identical fragment-C live state plus extra register pressure from group-scale + nibble-extract chains. 56 regs/thread vs INT8's 48 (the 8-reg gap is the dequant + scale-reload state). Quality gates GREEN: - cargo fmt clean - cargo clippy -D warnings clean - 83 host tests pass (up from 73 - added 11 D5 emit/module/canary tests) - ptxas_verify PASSED sm_80 + sm_89 (56 regs, 0 spills, 8-reg headroom under the 64-reg cliff) - disjoint-register canary (24 distinct mma destination regs across 3 grids) D6 next: public qkv_project_int4 host API (contingent ship).

…7.3 contingent ship) Wires `kaio_ops::qkv_project_int4` (W4A16: f16 activations × packed-INT4 weights × f16 group scales) as Sprint 7.3's contingent second deliverable. 14-param signature: device, x_f16, w_q/k/v_packed_u32, scales_q/k/v_f16, q/k/v_out_f16, m/n/k, group_size. LaunchConfig grid (n.div_ceil(BN_BLOCK), m.div_ceil(BM_BLOCK), 1). Rustdoc spells out the packing convention (8 nibbles per u32, K-contiguous, col-major [K/8, N]) and group-scale layout ([K/group_size, N] row-major, group_size=128 fixed). Cross-links to matmul_int4 since both share the convention. Explicit "not a drop-in for AutoGPTQ/exllama/GGUF" disclaimer. GPU launch smoke test: zero inputs → zero outputs canary on canonical shape (M=64, N=16, K=128 = exactly one group). Verifies module loads + kernel launches. Quality gates GREEN: - cargo fmt + clippy -D warnings clean - 84 host tests pass (+1 D6 smoke) - cargo doc clean - ptxas_verify PASSED sm_80 + sm_89 (56 regs, 0 spills) - already verified at D5 Both INT8 (D4 MVS) and INT4 (D6 contingent) public APIs now shipped on phase7-rest. D7 (GPU e2e + Q/K/V differentiation canary) next.

…pass on sm_89) Two new test files with CPU-reference round-trip correctness + canary tests: kaio-ops/tests/qkv_project_int8_e2e.rs (7 tests): - 6 correctness tests across canonical shapes (smallest / one-block / multi-N / multi-M / larger-K / medium 256x128x256) - 1 Q/K/V differentiation canary (W_Q=+1, W_K=+2, W_V=+3, X=1, scale=1/K → outputs 1, 2, 3 per cell; catches fragment-C grid aliasing) kaio-ops/tests/qkv_project_int4_e2e.rs (8 tests): - 3 group-boundary shapes (1 group / 2 groups / 8 groups) exercising the group_idx = k_tile / K_TILE_GROUP_RATIO cadence - 3 multi-block shapes (multi-N / multi-M / larger) - 1 sign-extend canary (all weights = -8, output should be -K*8 per cell; catches shr.s32 → shr.u32 collapse) - 1 Q/K/V differentiation canary All 7 + 8 = 15 e2e tests PASSED on RTX 4090 sm_89. Tolerance gate max_rel_err < 1e-3 (loose enough for f16-rounding-step difference between kernel and reference; observed max_rel stays under 3e-4 in practice). TWO BUGS CAUGHT + FIXED in D7: 1. INT8 W loader double-counted block_col The emit_mw_load_tile_w_int8_16x32 loader added block_col to its global offset computation, but w_block_base_global already carried the block_col byte shift from build_qkv_project_int8_module. Result: every non-zero N-block computed from WRONG global W addresses, corrupting all multi-N-block shapes. Single-N-block tests passed because block_col=0 makes the double-count a no-op. Fix: use col_start (within-tile offset) for the global addr computation instead of col_global. 2. INT4 shared scales slot raced across projections build_qkv_project_int4_module originally reloaded scales only on group boundaries (every 8 K-tiles) but the single tile_scales slot was reused across Q/K/V per K-tile. After mma_V on K-tile 0, the slot held scales_V; on K-tile 1 (non-boundary) all three projections' mmas read scales_V instead of their own scales. The sign-extend + Q/K/V canaries passed because they use constant scales, but random-data correctness tests failed. Fix: reload scales every K-tile, unconditionally (32 B per projection per K-tile = 96 B/K-tile additional global bandwidth, negligible vs 384 B/K-tile for the W loads). Group-boundary-only optimization deferred to future bench-driven work with three separate scales slots. Quality gates GREEN: - cargo fmt + clippy -D warnings clean (factored QkvOutTuple alias for e2e round_trip return types) - 84 host tests pass - 15 GPU e2e tests PASS on sm_89 - ptxas_verify INT8: 48 regs, INT4: 56 regs, 0 spills both variants - Both kernels validated across sm_80 + sm_89 (W-loader fix did not change register count)

…iers) Sprint 7.3 D8. Compares fused qkv_project_int4 vs 3x sequential matmul_int4 (clean W4A16 apples-to-apples baseline), and reports absolute TOPS for qkv_project_int8 (W8A16 has no fair standalone baseline - matmul_int8 is W8A8). Shape sweep: decode tier (M ∈ {1, 64}, K/N ∈ {2048, 4096}) + prefill tier (M ∈ {512, 2048}, K/N = 4096). Plus 4 diagnostic shapes around the initial decode_m64 K/N=2048 anomaly to confirm it was measurement noise, not a kernel bug. Stable results on RTX 4090 sm_89 (release mode): Decode tier (M ≤ 128): fused ~3.0x baseline, consistent across shapes prefill_m512 (M=512): 1.19x (above ship threshold 1.15x) prefill_m2048 (M=2048): 0.85x (below 1.00x - fused loses 15% at largest prefill) Ship-B call (maintainer 2026-04-16): ship 7.3 with honest framing ("fused wins big at decode, ties/loses at prefill - use 3x standalone for prefill-heavy workloads"), plan Design S+1/2P optimization as Sprint 7.3.5 with full review cadence. Target for 7.3.5: recover prefill_m2048 from 0.85x to ≥1.15x by overlapping W_{P+1} load with mma_P (barriers 7 → 4 per K-tile). Outlier investigation (decode_m64 K/N=2048 reported 0.40x in first run): NOT reproducible on re-run. Diagnostic sweep across 4 nearby shapes shows consistent 3.0x throughout - confirmed transient measurement noise in the first run (likely first-call CUDA context warmup bleeding through despite 5-iter warmup loop). No kernel bug. Absolute TOPS numbers intentionally NOT highlighted - repeated back-to-back bench runs cause thermal throttling on the 4090 that crushes absolute throughput by 50-100x for all workloads including cuBLAS sgemm reference. Ratios are throttle-invariant (fused + baseline throttle equally) so ratio numbers remain the valid headline. Clean absolute-TOPS run deferred to D10 doc with cold-system methodology.

…ttention_tc) End-to-end Phase 7 pipeline demo. GPTQ-lite quantizes random f32 weights into KAIO's packed-INT4 + group-scale layout, runs fused qkv_project_int4 to produce three f16 outputs, feeds them to attention_tc, and compares against an f16 reference path (3x matmul_tc → attention_tc). Reports three quality metrics per the Sprint 7.3 plan D9: - cosine similarity (primary pass/fail, plan threshold 0.98 INT4) - max absolute error (worst-case row outlier) - mean relative error (aggregate quality) Plus a projection-stage cosine for localization - distinguishes "kernel bug" from "softmax amplified a small projection error" when the final number is below threshold. Showcase shape: SEQ=64, D_MODEL=128, D_HEAD=64, GROUP_SIZE=128 (decode-tier shape where fusion wins cleanly per D8 bench). Observed on RTX 4090 sm_89 with random f16 weights: - Projection Q cos_sim = 0.9975 (kernel working correctly) - Final attention cos_sim = 0.9655 (below 0.98 plan threshold) README + in-example printout frame this honestly per the plan: "Random f16 weights are a worst case for group-scale quantization fidelity. Real trained LLM weights have much tighter group statistics and land measurably better than these synthetic numbers." The kernel's job is correctly performing the quantized projection; softmax amplifies any projection-stage error non-linearly in the final output. This is expected behavior for random weights and will improve substantially on real model weights where per-group magnitude variance is much lower. Wired into `cargo xtask showcase qkvattn`.

…7.3.5 plan stub Completes Sprint 7.3 documentation pass: - docs/development/sprints/phase7/sprint_7_3.md - full decision log (D1-D10), bench table (all 9 shapes, no cherry-picking), architectural decisions (AD1-AD5), bug-catch trail (cvt.rn.f16.s8, INT8 W-loader double-count, INT4 scales-slot race), and the maintainer Ship-B decision. - docs/development/sprints/phase7/PHASE_7_LOG.md - new index for Phase 7 (Phase 6 has one, Phase 7 didn't); populated with 7.0 through 7.3 + the 7.3.5 / 7.4 planning rows. - docs/development/sprints/phase7/sprint_7_3_5.md - stub plan for Design S+1/2P (two W slots ping-pong, barriers 7→4). Success criterion pinned: prefill_m2048 from 0.85x to ≥1.15x on sm_89. Full reviewer cadence required (synchronization semantics change in inner loop); not a hotfix. - CHANGELOG.md [Unreleased] block - Sprint 7.3 entry covering both public APIs, the shared store-out helper, the kaio-core cvt.rn fix, the showcase example, and the honest decode-wins-prefill-loses framing. - README.md ops table - two new rows for qkv_project_int{4,8}. Quality gates at D10 sprint-close: - 84 kaio-ops host tests pass - 15 GPU e2e tests pass on sm_89 (7 INT8 + 8 INT4) - ptxas_verify PASSED sm_80 + sm_89 for both kernels + skeleton + store_out - cargo doc --no-deps clean - cargo fmt + clippy -D warnings clean across workspace

…s / sm_89 40 regs / 0 spills) Port qkv_skeleton.rs from 7.3 Design-S to 7.3.5 S+1/2P peak-pressure model: - 2 tile_w shared slots with 64B bank-phase padding (stride 576B, non-multiple of 128 per Design invariant #4, Gemini R3-1) - 3 unrolled K-iterations with frag_C back-edge state threaded across iterations (Opus R1b-2 / R3b-3 cross-iteration modelling requirement - forces ptxas to count frag_C at peak, not elide) - Runtime ctaid.x-indexed slot pointers for ping-pong (ptxas can't constant-fold slot selection - Design invariant #3) - Scales register-hoist modelled via runtime k_tile_group counter (6 regs total, Pre-review clarification #2) - In-flight cooperative-load address registers kept live across the mma epoch to model the overlap window D1 gate PASSED on both architectures with comfortable headroom: sm_80: 32 registers, 0 spills, 3136 B smem, 404 B cmem[0] sm_89: 40 registers, 0 spills, 3136 B smem, 404 B cmem[0] cmem[0] = 404 B (kernel params in constant bank 0) confirms the R3b-2 case (a) upfront: kernel args arrive uniform-broadcast- friendly by default, no GPR pressure from base pointers, escape hatch has no handle to grab. SASS inspection for the escape hatch is not needed at the skeleton stage. Rollback #2 not triggered. Proceeding to D2 (INT8 S+1/2P body). Quality gates: cargo fmt, cargo clippy -D warnings, cargo test -p kaio-ops --lib (84 passed, 7 ignored). Shared budget 3136 B matches plan (2048 tile_x + 512 slot0 + 64 pad + 512 slot1).

…9 56 regs / 0 spills) Port qkv_project_int8_kernel build function from 7.3 Design-S to 7.3.5 S+1/2P. Design-S helper emit_warp_quadrant_mma_int8_per_projection stays untouched (preserves 7.3 tests + callers); two new siblings added for the S+1/2P path: emit_warp_quadrant_mma_int8_per_projection_hoisted Per-projection mma sweep that reads frag_A from caller-provided pre-hoisted registers instead of reloading from tile_x. Required so the three projection mmas share a single K-tile-start frag_A hoist, making Design invariant #1 (tile_x overwrite-after-hoist) expressible. hoist_frag_as_for_warp_quadrant Loads both m_stripe frag_As from tile_x at K-tile start. The caller is contractually required to fire B1 bar.sync AFTER this call and BEFORE any tile_x write. emit_pre_zero_shared_tiles_qkv_int8_sp_half_p 3-slot pre-zero (tile_x + tile_w_slot0 + tile_w_slot1; the 64 B pad between slots is write-only padding, never pre-zeroed). Shared layout: tile_x (2048 B) + tile_w_slot0 (512 B) + tile_w_pad (64 B, Design invariant #4 bank-phase) + tile_w_slot1 (512 B) = 3136 B. Stride slot0 -> slot1 = 576 B, non-multiple of 128 (no cross-warp SMEM bank-port contention during overlap). K-loop body 4-barrier cadence (plan worked-example table): B1 (post-hoist) INVARIANT #1 — tile_x safe to overwrite Epoch 1: load W_K(k) -> NEXT slot; mma_Q uses CUR slot B2 (post-mma_Q) CUR slot safe to overwrite Epoch 2: load W_V(k) -> CUR slot; mma_K uses NEXT slot B3 (post-mma_K) Epoch 3: [overlap loads -> W_Q(k+1) + X(k+1), skipped on final K-tile via uniform @!p_not_last branch]; mma_V uses CUR slot B4 (post-mma_V) closes K-tile epoch Runtime ping-pong slot selection via and.b32 + setp.eq + selp.u32 (slot_idx = k_tile & 1; p_slot0 = slot_idx == 0; cur/next chosen via selp against r_tile_w_slot{0,1}). Overlap-skip predicate p_not_last = k_tile + 1 < num_k_tiles; uniform branch, no divergence. Total barrier count for 1-K-tile module = 2 setup (pre-zero + B0) + 4 per K-tile = 6. 7.3 Design S was 1 + 7 = 8. Per K-tile savings 3 barriers; scales with K. ptxas --verbose results (S+1/2P, archived R2-5 artifacts): sm_80: Used 64 registers, 0 bytes spill stores, 0 bytes spill loads sm_89: Used 56 registers, 0 bytes spill stores, 0 bytes spill loads vs 7.3 Design-S INT8 post-Rollback-#1 baseline (48 regs on both architectures): sm_89 +8 regs (hoisted frag_A 2-stripes + slot selection + overlap-load addr math), sm_80 +16 regs. sm_80 at the 64-reg ceiling exactly; sm_89 has 8 reg headroom. Both architectures clear the D2 gate. Escape hatch (Gemini R3-4) not needed — ptxas already placed base pointers in cmem[0] per D1 skeleton evidence. Test updates: - build_qkv_project_int8_module_structure now asserts 4 shared decls (tile_x + slot0 + pad + slot1), new K-loop label K_LOOP_QKV_INT8_SP_HALF_P, and OVERLAP_SKIP_QKV_INT8_SP_HALF_P label. - build_qkv_project_int8_module_emits_seven_barriers_per_k_tile renamed to _emits_four_barriers_per_k_tile_sp_half_p, expects 6 total bar.sync (was 8). Quality gates green: cargo fmt, cargo clippy -D warnings, cargo test -p kaio-ops --lib (84 passed, 7 ignored). D3 (GPU e2e regression + slot-mapping canary + determinism stress) comes next.

…s, 100x bit-exact) Add two test classes for the S+1/2P kernel beyond the 7.3 regression suite: - Slot-mapping canary (2 shapes): deliberately-distinguishable weights per projection (W_Q=1, W_K=2, W_V=3, X=all-1, scale=1) so expected output is exactly {K, 2K, 3K} per element. Catches deterministic ping-pong mis-wiring that random-data tests can pass by coincidence. Two K-tile variants: - K=32 (2 K-tiles) exercises the steady -> final transition. - K=48 (3 K-tiles) exercises steady -> steady -> final, covering the back-edge transition that the 2-K-tile version doesn't hit. - Determinism stress (2 shapes): runs the same input 100x back-to- back in release build and asserts bit-exact output across all runs (from run 1 - no warmup discards). Catches barrier- misplacement races that manifest as cross-launch nondeterminism under varying warp scheduling. - Short: M=256, N=512, K=1024 (64 K-tiles per launch). - Prefill: M=2048, N=512, K=4096 (256 K-tiles, the regime that motivated the S+1/2P rework). Results on RTX 4090 sm_89: - 7 existing e2e regression tests: pass (max_rel <= 4.88e-4, most bit-exact). - Q/K/V differentiation canary: pass. - Slot-mapping canary K=32: Q=32 K=64 V=96 exact. - Slot-mapping canary K=48: Q=48 K=96 V=144 exact. - Determinism short 100x: bit-exact. - Determinism prefill 100x: bit-exact. 11/11 INT8 GPU e2e tests pass. No kernel code changed in D3 - only tests added. Diagnostic in ptxas_verify: also report allocation at -maxrregcount 128 to distinguish natural register allocation from a ptxas compression-under-cap. Confirmed sm_80 64 regs is a genuine register-reuse compression below the 66-reg natural allocation (zero spills at default), not a hidden-spill pathology. sm_89 at 56 regs at default vs 66 natural - same pattern, larger headroom. Quality gates: cargo fmt, cargo clippy -D warnings, cargo test -p kaio-ops --lib (84 passed, 7 ignored).

…G.yml Sprint 7.3.5 closes with the asymmetric outcome the plan's 5-tier table pre-declared: INT8 ships Design S+1/2P (correctness clean, ptxas clean, abs TOPS baseline captured); INT4 S+1/2P was measured at median prefill_m2048 = 1.05x across 3 runs, improved from Design S's 0.85x but below the 1.15x plan-locked ship gate and the 1.10x ship-narrow floor - landed in "Measured, not shipped". INT4 retained at Design S on main; INT4 S+1/2P port archived on phase7-rest for potential future cp.async contingency work. Docs added / updated: - docs/development/sprints/phase7/sprint_7_3_5.md (new) - terse post-delivery outline with full measured-bench table, 5 architectural decisions, follow-up next steps. - docs/development/sprints/phase7/phase7_master_plan.md - status header, section 4 (Quant+Attention integration) rewritten for 7.3 / 7.3.5 outcome, Sprint Breakdown table rows added, dependency graph updated, Success Criterion 4 marked complete. - docs/development/sprints/phase7/PHASE_7_LOG.md - 7.3.5 status row + ops-shipped table notes. - CHANGELOG.md - Sprint 7.3.5 section under [Unreleased]. - README.md - INT8 row updated for S+1/2P; INT4 row explicitly notes Design S retained and reason. - kaio-ops/src/qkv_project_int8_kernel.rs - module-level rustdoc refreshed (Tri-output design and Register budget sections) to reflect post-Rollback-#1 + Sprint 7.3.5 S+1/2P state. Editorial polish pass across phase 1 / 4 / 6 / 7 sprint docs and across kaio-ops / kaio test-file comments - tightening voice for outward-facing readers. Technical rationale and measured results unchanged. .github/FUNDING.yml added for the Sponsor button - lands on main via the phase7-ship merge. Quality gates (cargo fmt --check, clippy --workspace --all-targets -D warnings, doc --workspace --no-deps, test --workspace) all green. ptxas_verify green on sm_80 + sm_89 for INT8 S+1/2P (64 / 56 regs, 0 spills) and INT4 Design S (56 / 56 regs, 0 spills).

5-shape abs-TOPS comparison, same session on RTX 4090 sm_89, kernel file swapped between 2dbad72 (Design-S) and HEAD (S+1/2P) via checkout-revert, identical thermal state both sides. | shape | Design-S | S+1/2P | speedup | |------------------|----------|--------|---------| | decode_m1 | 3.2 | 3.3 | 1.04x | | decode_m64 | 3.2 | 3.3 | 1.04x | | decode_m64_large | 9.8 | 10.2 | 1.04x | | prefill_m512 | 31.4 | 35.6 | 1.13x | | prefill_m2048 | 38.8 | 44.6 | 1.15x | S+1/2P wins across all 5 shapes. Gain scales with K-tile count per launch: decode (few K-tiles, barriers not dominant) gains ~4%; prefill_m2048 (256 K-tiles, barriers dominant in Design-S) gains 15% - exactly the threshold the sprint plan set for INT4, which INT4 narrowly missed at 1.05x vs 3x standalone matmul_int4. Different outcome between the variants at the same shape localises the INT4 gap: the INT8 path recovers cleanly under barrier reduction; the INT4 path (per-column f16 scale hoist + nibble- extract dequant) exposes memory-issue latency as the new binding constraint once barriers no longer dominate. cp.async targets that class of stall and is the right candidate for a follow-up contingency sprint - not as an alternative to S+1/2P but as the next bottleneck after S+1/2P succeeds. - sprint_7_3_5.md Performance section: INT8 table replaced with the like-for-like comparison + interpretation paragraph tying the INT8/INT4 asymmetry to the cp.async follow-up. - PHASE_7_LOG.md 7.3.5 headline: added the 1.15x like-for-like result alongside the INT4 Measured-not-shipped status.

Moves cudarc's dynamic-loading / dynamic-linking choice out of the workspace-level features list and into per-crate feature flags on kaio-runtime, kaio-ops, and the kaio umbrella. Default stays on dynamic-loading — standalone KAIO users see no behavior change. Prep for the kaio-candle bridge: candle-core activates cudarc/dynamic-linking, and cudarc's build.rs panics when both loading strategies are simultaneously active in the feature-unified dep graph. With this refactor, kaio-candle consumes the KAIO crates via default-features = false, features = ["dynamic-linking"] — both sides converge on dynamic-linking in the kaio-candle build graph, no collision. kaio-ops dev-dep on cudarc consolidated onto the workspace pin with features = ["cublas"] — loading strategy now driven uniformly by the crate feature block rather than duplicated in the dev-dep line.

Follow-on fix to the cudarc loading-strategy feature-gate refactor (b0213c2). Without `default-features = false` on kaio's kaio-runtime dep, kaio-runtime's default (dynamic-loading) stays active when downstream consumers opt into dynamic-linking, defeating the opt-in and triggering cudarc's dual-loading-strategy panic in the unified dep graph. Verified: candle-probe scratch crate pulls kaio + kaio-ops with default-features = false, features = ["dynamic-linking"] alongside candle-core's cuda feature. cargo check succeeds, cudarc builds once with only dynamic-linking active. G3-1 resolved.

Two parts: 1. kaio-runtime: additive bridge-facing API on GpuBuffer<T>. - #[repr(transparent)] attr with load-bearing docstring; compile-time size+align asserts for f32/f16/i8/u32 via static_assertions dev-dep. - pub fn from_cuda_slice + pub fn into_cuda_slice replace the old pub(crate) from_raw (device.rs callers updated). - Layout-invariant in place so the kaio-candle bridge can transmute &CudaSlice<T> to &GpuBuffer<T> for pass-through to kaio-ops without round-tripping through an owned clone. 2. kaio-candle: new standalone crate at kaio-candle/, excluded from the main workspace (cudarc dynamic-loading vs dynamic-linking mutual exclusion — see Sprint 7.4a plan Pre-flight P2 / G3-1). - Cargo.toml: package metadata, optional kaio + kaio-ops + candle-core deps gated behind `cuda` feature. - src/lib.rs: crate docs, feature-gated module declarations. - src/{bridge,matmul_tc,matmul_tc_async,matmul_int4,attention_tc}.rs: stub modules; content lands D2-D5. Gates (all green): - Main workspace: fmt + clippy -D warnings + tests (static_assertions fire at compile time in kaio-runtime lib tests). - Standalone kaio-candle: fmt + clippy --features cuda + clippy --no-default-features + cargo check on both feature paths + doc on both feature paths. Ship gate #10 (no-cuda build succeeds on no-toolkit host) satisfied — no compile_error! on the default path; bridge surface silently absent when `cuda` feature off. .gitignore: kaio-candle/target/ + kaio-candle/Cargo.lock (standalone library crate convention).

Adds the crate-private plumbing that D3-D5 per-op CustomOp impls will compose: - slice_ref_from_storage / storage_from_slice: cudarc CudaSlice<T> round-trip across candle's CudaDType trait (as_cuda_slice / wrap_cuda_slice). - buffer_ref_from_slice_readonly: #[repr(transparent)] cast from &CudaSlice<T> to &GpuBuffer<T> with full Codex R1 aliasing contract + Gemini G3-2 lifetime invariant docstring. - ensure_ordinal_match: AD1 enforcement via candle DeviceLocation pattern match + KaioDevice::ordinal() (new public getter; P4 preflight showed ordinal equality + primary-context sharing is the right mechanism, not Arc identity). - sync_before_launch / sync_after_launch: AD9 stream-safety fences via BackendDevice::synchronize. CUDA Graph limitation noted per Gemini G3-3. - ensure_rank2_contiguous_zero_offset: AD4 input gate rejecting non-rank-2, non-contiguous, and non-zero-offset layouts with concrete reshape hints per Opus #6 + #7 and Codex R3. - kaio_err: KaioError -> candle_core::Error helper (orphan rule prevents impl From). cudarc promoted from dev-dep to regular optional dep (gated on the cuda feature) so the bridge can name CudaSlice<T> directly. KaioDevice gets a pub fn ordinal() getter (additive, non-breaking). #![allow(dead_code)] on bridge.rs while per-op modules catch up. Gates: cargo check + clippy -D warnings green on both feature paths.

MatmulTcOp (CustomOp2) + matmul_tc() wrapper fn and MatmulTcAsyncOp (CustomOp2) + matmul_tc_async() wrapper fn, both following the AD-spec'd flow: 1. AD4 shape gate (rank-2 + contiguous + zero-offset) per input. 2. K-mismatch check across the two inputs. 3. AD1 ordinal equality via ensure_ordinal_match. 4. Dtype gate via bridge::slice_ref_from_storage::<f16>. 5. AD2-audited read-only shared borrow into candle-owned buffers. 6. Bridge-owned f32 output allocation via kaio_device.alloc_zeros. 7. AD9 pre-launch sync fence. 8. kaio-ops kernel call. 9. AD9 post-launch sync fence. 10. Output consumed back to CudaStorage via storage_from_slice. cpu_fwd on both ops returns a loud error; Q-CPU-Fallback in the plan resolves to "reject over silent candle.matmul fallback." Tests: - 4 host tests inline in bridge.rs for shape-gate happy path + rank-1/3/4 rejection with concrete reshape hints. - 8 GPU integration tests in tests/candle_gpu_roundtrip.rs: - 3 matmul_tc bit-exact cross-checks (64x64x64, 256x256x256, 1024x1024x1024) between bridge and raw kaio-ops calls on identical host-side bits. - 3 matmul_tc_async bit-exact cross-checks, same shapes. - 2 rejection-path tests exercising the contiguity (.t()) and non-zero-offset (.narrow(..)) paths from Codex R3 / Opus #7. All #[ignore]-gated per kaio-ops convention; run with `cargo test --features cuda -- --ignored` on a GPU host. Gates: - Main workspace: fmt + clippy -D warnings + lib tests green. - Standalone kaio-candle on cuda feature: fmt + clippy + cargo check + doc + 4 host tests + 8 GPU ignored tests ALL PASS BIT- EXACT on RTX 4090 sm_89. - Standalone kaio-candle on no-default-features: fmt + clippy + cargo check + doc (empty-shell build). New dev-dep: anyhow = "1" for terse test error handling across candle + cudarc + kaio error types.

MatmulInt4Op (CustomOp3) + matmul_int4() wrapper for GPTQ-style INT4 dequantize-matmul (W4A16). group_size locked at 128 per kaio-ops contract; K must be multiple of 128. Shape contract validated against kaio_ops::matmul_int4: a: f16 [M, K] b_packed: u32 [K/8, N] (8 INT4 per u32) scales: f16 [K/128, N] out: f32 [M, N] Bridge enforces the shape implications explicitly: - packed_rows == K/8, scale_rows == K/GROUP_SIZE, n_b == n_s. - Each with a concrete error message that tells the user exactly which derived dimension is off. AD2-Audit D4: kaio-ops kernel is read-only on A / B_packed / scales (validate_dims_int4 + dequant-fused inner loop reads through shared memory stagings; inputs never mutated). Same readonly-transmute contract as D3. Tests: - 2 bit-exact cross-check tests (256x256x128, 1024x1024x512). - 1 rejection test for K-not-multiple-of-128. - All #[ignore]-gated; run via cargo test --features cuda -- --ignored. Full 11 GPU integration tests pass on RTX 4090 sm_89: 8 matmul_tc / matmul_tc_async (D3) + 3 matmul_int4 (D4). Host tests unchanged at 4 (bridge shape gate). Also swapped `k % 128 == 0` to `.is_multiple_of(128)` per clippy::manual_is_multiple_of lint — stabilized helper.

…dings Single AttentionTcOp (CustomOp3) struct with a `causal: bool` field; two user-facing wrappers attention_tc() and attention_tc_causal() that flip the flag and select kernel variant. name() returns distinct kaio::attention_tc / kaio::attention_tc_causal strings so candle's module cache keys don't collide. Shape contract (P5): q: f16 [seq_q, d_k] k: f16 [seq_k, d_k] v: f16 [seq_k, d_v] out: f32 [seq_q, d_v] Bridge explicitly cross-checks d_k (Q vs K) and seq_k (K vs V) with targeted error messages. kaio-ops attention_tc has a seq_k ≤ 384 shared-memory cap (noted in the kernel's error message; FlashAttention-TC lifts this in a later sprint). Integration tests use 256x128 to fit cleanly. AD2-Audit D5: attention_tc / attention_tc_causal kernels are read-only on Q/K/V — validate_dims_attn + Q·Kᵀ/softmax/P·V all READ from shared stagings + fragments, inputs never mutated. Tests: - 4 bit-exact cross-check tests: - attention_tc_bit_exact_64x64 / 256x128 - attention_tc_causal_bit_exact_64x64 / 256x128 - All #[ignore]-gated; run via cargo test --features cuda -- --ignored. Full 15 GPU integration tests pass bit-exact on RTX 4090 sm_89: 3 matmul_tc + 3 matmul_tc_async + 2 matmul_int4 + 2 attention_tc + 2 attention_tc_causal + 2 matmul_tc rejection + 1 matmul_int4 rejection. Host tests: 4 bridge shape-gate tests. Main workspace gates + kaio-candle gates all green on both feature paths.

README covers standalone-crate rationale, build, quickstart, op surface, device lifetime, candle pin policy, and the v0.1 limitations list. examples/matmul_tc_candle.rs and examples/attention_tc_candle.rs are runnable with small shapes; ship-gate #14 satisfied (non-NaN output on RTX 4090 sm_89). Cargo.toml include list extended to publish examples.

.github/workflows/candle-head.yml builds kaio-candle against candle-core's git main once per Monday. Step-level continue-on-error plus peter-evans/create-issue-from-file@v5 on failure keeps upstream breakage visible without blocking other CI. README gets a workflow badge and a candle-version-policy paragraph pointing at the job.

sprint_7_4a.md writes up the forward-only kaio-candle bridge — context, scope, architectural decisions (standalone crate rationale, explicit KaioDevice, repr(transparent) aliasing contract, AD4 input gate, cuCtxSynchronize fences), results, and follow-ups (7.4b / 7.4c). PHASE_7_LOG gets a 7.4a row plus a candle-bridge ops table. CHANGELOG Unreleased gains a Sprint 7.4a section at the top. Root README notes kaio-candle in the crate table, flips the "inference only" bullet to point at the forward bridge, and expands the Phase 7 roadmap line to mention 7.4a / 7.4b / 7.4c.

Sixth forward CustomOp binding in kaio-candle: W8A8 symmetric-quant matmul via CustomOp2 with a scalar f32 scale threaded through the op struct (same pattern as AttentionTcOp::causal). Candle has no DType::I8, so the candle-side convention is DType::U8 tensors whose bytes are interpreted as signed INT8. The bridge reinterprets &CudaSlice<u8> as &CudaSlice<i8> via a metadata-only same-layout transmute (cudarc's CudaSlice<T> carries T only as PhantomData, u8 and i8 are 1-byte identical). 5 new GPU integration tests - 3 bit-exact cross-checks at 256/1024/4096 cubed with scale values 0.00125 / 1.0 / 47.3 (spread across regimes so a dropped-scale bug surfaces obviously-wrong in at least two of the three tests), 2 rejection tests. Brings total kaio-candle GPU tests to 20. All green on RTX 4090 sm_89. Zero new bridge primitives - every helper reused from 7.4a. Only the small reinterpret_u8_slice_as_i8 helper lives local to the op file.

…dation helper

New direct-call bridge pattern for ops that exceed candle's CustomOpN arity (max 3 inputs, single output). Free functions extract CudaStorage from &Tensor via storage_and_layout(), validate with named-parameter error messages, call the fused kernel, and return (Tensor, Tensor, Tensor) via Tensor::from_storage with BackpropOp::none(). qkv_project_int8: W8A16, 4 tensor + 3 scalar inputs, 3 f16 outputs. qkv_project_int4: W4A16, 7 tensor inputs, 3 f16 outputs, group_size=128. Output is DType::F16 (kernel does f32->f16 internally), unlike the CustomOp-based ops which return f32. Gradient-tracked inputs rejected with loud .detach() error to prevent silent autograd disconnection. 12 new GPU tests (5 bit-exact + 2 canaries + 5 rejection), 32 total. All green on RTX 4090 sm_89.

Replaced cuCtxSynchronize fences with event-based cross-stream sync via cudarc's CudaStream::join() (cuEventRecord + cuStreamWaitEvent internally). GPU-side only, no CPU blocking. sync_before_launch: KAIO stream joins candle stream (waits for candle's prior work before kernel launch). sync_after_launch: candle stream joins KAIO stream (waits for kernel completion before candle uses output). Both functions gain a &KaioDevice parameter; 7 call sites updated across all op modules. CUDA Graph capture partially unblocked - the cuCtxSynchronize blocker is removed, but default-stream capture is still banned by CUDA itself. Full Graph capture requires non-default streams on both sides (unverified). 33 GPU tests pass (32 bit-exact + 1 sync smoke test).

dmriding added 30 commits April 15, 2026 03:03

wip(phase7): D1 - Sprint 7.1.5 stub + IR audit (ShflSyncBfly covers w…

c39ba45

…arp_reduce)

wip(phase7): D2.0 - pre-refactor TokenStream snapshot canary for bloc…

2f28dfa

…k_reduce_sum

wip(phase7): D2 - factor emit_warp_tree_reduce helper (snapshot canar…

3edb5c7

…y byte-identical)

wip(phase7): D3+D4 - warp_reduce_{sum,max,min} + block_reduce_min DSL…

b850ee7

… builtins + whole-warp-multiple guard

wip(phase7): D5 - warp_reduce + block_reduce_min tests (12 new GPU + …

c5d2548

…4 trybuild)

wip(phase7): D1 - Sprint 7.2 stub + MovPack IR primitive

25d5314

wip(phase7): D2 - emit_unpack_s4_x8_scale_to_f16x8 helper + sign-exte…

1ab9b4b

…nd canary

wip(phase7): D3 - shared-memory layout + group-scale broadcast design

df5f771

wip(phase7): D4.1 - validate_dims_int4 + cooperative load helpers for…

ca371c4

… A, B-packed, scales

wip(phase7): D4.2+D4.3 - fragment-B per-lane dequant + full matmul_in…

666ff6d

…t4 module (64 regs/thread, 3.2 KB smem)

wip(phase7): D5 - matmul_int4 public API + host dispatch + rustdoc pa…

f2e45e9

…cking spec

wip(phase7): D6 - matmul_int4 GPU e2e tests (12 pass bit-exact on sm_…

12ee369

…89, max_rel 4e-6)

wip(phase7): D7 - matmul_int4_bench (42-57 TOPS 4096³ sm_89) + int4_m…

ba59e55

…atmul showcase + xtask integration

feat(phase7): Sprint 7.2 - INT4 dequantize-matmul

332cc83

docs(phase7): refine INT4 bench range to 49-58 TOPS median ~57 from 6…

e2f221b

… xtask-bench runs (no tc/int8 regressions)

wip(phase7): D1 - Sprint 7.3 stubs + validate_dims_qkv_int{4,8} + uni…

f96c1b6

…fied tile constants

wip(phase7): D2 - emit_store_fragment_c_f32_to_f16_packed (ptxas_veri…

49bdcaa

…fy sm_80 + sm_89)

wip(phase7): D2.5 - tri-output register-pressure skeleton (32-40 regs…

94c5558

…, 0 spills, sm_80/89)

wip(phase7): D3.1 - cooperative loaders for X (f16) + W (i8) tiles + …

a31797e

…pre-zero

wip(phase7): D3.2 - emit_fragment_b_int8_per_lane (4 ld.s8 + 4 cvt.f1…

2934d27

…6.s8 + 2 MovPack)

wip(phase7): D3.3 - emit_warp_quadrant_mma_int8_per_projection (4 mma…

dfa22a5

…s/call + frag_c in-place canary)

fix(core): emit cvt.rn for s8 source (ptxas requires rounding modifier)

3739d38

dmriding added 26 commits April 16, 2026 02:21

chore(phase7): remove pre-delivery sprint_7_3_5.md stub

2dbad72

fix(ci): catch cudarc dlopen panic in qkv test helpers

fd5cf64

refactor(bridge): promote reinterpret_u8_slice_as_i8 + add named vali…

fb749de

…dation helper

feat(phase7): 7.4d - matmul_tc + matmul_tc_async backward

ca0849a

chore: pre-release hygiene sweep + weighted gradient tests

f280d93

bump v0.4.0 + DSL docs clarification + RFC-0001 pointer syntax

760a375

dmriding merged commit c5ed152 into main Apr 17, 2026
4 checks passed

dmriding mentioned this pull request Apr 17, 2026

UB in kernel signature #13

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Phase7 wrap#14

Phase7 wrap#14
dmriding merged 56 commits into
mainfrom
phase7-wrap

dmriding commented Apr 17, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

dmriding commented Apr 17, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant