Phase7 ship by dmriding · Pull Request #12 · dmriding/kaio

dmriding · 2026-04-16T18:09:42Z

No description provided.

…arp_reduce)

…k_reduce_sum

…y byte-identical)

… builtins + whole-warp-multiple guard

…4 trybuild)

Bridge sprint between quant milestones. Adds the warp-level reduction primitives that close the single biggest DSL expressiveness gap after matmul (softmax / layer-norm / RMS-norm / loss functions are now writable in pure DSL without manual shfl.sync ladders), plus the missing min variants at both warp and block level. Ships on the phase7-rest branch with no independent crates.io release; Phase 7 closes with an aggregate release after 7.2, 7.3, 7.4. New DSL builtins (f32-only this sprint; integer and half-precision variants are a clean follow-up): - warp_reduce_sum/max/min: per-warp reduction via butterfly shuffle tree, every lane in the warp gets the full result in 5 rounds. - block_reduce_min: completes the sum/max/min trio at block level. Whole-warp-multiple compile-time guard: kernels with block sizes whose total thread count is not a multiple of 32 (1D: 16; 2D: (4,4) or (8,6)) fail to compile with a span pointing at the warp_reduce_* call. Catches the partial-warp shfl.sync UB statically. Covered by 4 trybuild compile-fail fixtures. Rustdoc documents the per-warp semantic clearly, the convergent- control-flow requirement (data-dependent branches are UB the macro cannot detect), and the identity-value padding pattern for ragged boundaries. Helper emit_warp_tree_reduce factored from the existing block_reduce path, parameterized by shuffle variant (down for block_reduce lane-0- only, bfly for warp_reduce all-lanes), combine op, and PTX type. Leaves a clean seam for future i32/f16/bf16 reductions. Pre-refactor TokenStream snapshot canary locks the D2 refactor to byte-identical PTX; snapshot canary remains green. Tests: 12 new #[ignore]d GPU tests in reduce_macro.rs pass bit-exact on sm_89 including a 64-thread two-warp independence canary (warp 0 and warp 1 fed different patterns; asserts each warp gets its own reduction and the lowering does not blend across warps). 4 new trybuild compile-fail fixtures for the guard and the f32-type boundary. Host tests 330 pass 199 ignored (up from 187 baseline = 12 new GPU tests). No regression on showcases or matmul benches.

…nd canary

… A, B-packed, scales

…t4 module (64 regs/thread, 3.2 KB smem)

…cking spec

…89, max_rel 4e-6)

…atmul showcase + xtask integration

… xtask-bench runs (no tc/int8 regressions)

…fied tile constants

…fy sm_80 + sm_89)

…, 0 spills, sm_80/89)

…pre-zero

…6.s8 + 2 MovPack)

…s/call + frag_c in-place canary)

…80 regs (Rollback #1 next) Full INT8 W8A16 tri-output kernel: 13 params, 2 shared decls (tile_x 2048B + tile_w 512B), K-loop with 7-barrier Design-S cadence, 12 mma.sync m16n8k16.f16.f16.f32 per K-tile across 3 fragment-C grids, store-out epilogue with 24 packed b32 stores + 48 cvt.rn.f16.f32 + 48 mul.f32 (scalar scale fold per projection). ptxas (sm_80 + sm_89): 0 spills, valid PTX, but reports 80 registers per thread. The extra 16-40 regs over the D2.5 skeleton baseline (32-40) come from cooperative-loader intermediate state, fragment-A reload regs per mma, and per-K-tile address-math intermediates that the skeleton did not model. 80 > the 64-reg full-occupancy cliff. Plan Rollback #1 trigger fires - drop MMAS_PER_WARP_N from 2 to 1, halving the per-block output tile to 64x16. That rollback lands in the next commit.

…_80/89, 0 spills) ptxas_verify gate now PASSED both targets: sm_80: 48 registers, 0 spills (16-reg headroom under the 64-reg cliff) sm_89: 48 registers, 0 spills (16-reg headroom under the 64-reg cliff) Rollback #1 was authorized in the Sprint 7.3 plan as the explicit recovery for "register pressure overshoot at D2.5 OR D3 full kernel". D3.4 hit 80 regs/0 spills both targets at the original 64x32 output tile; halving MMAS_PER_WARP_N reclaims 24 frag_c regs/lane (3 grids x 4 frag_d regs x 2 fewer sub-tiles per warp) and drops the post-rollback measurement to 48 regs/lane. Per-block output tile: 64x32 -> 64x16. More blocks launch, each block lighter. X-reuse economics still favor the fused tri-output design over 3x standalone matmul_int8; the absolute-vs-relative bench tier in D8 will quantify the trade. Implementation: - BN_BLOCK 32 -> 16, WARP_QUAD_N 16 -> 8, MMAS_PER_WARP_N 2 -> 1 - TILE_W_BYTES kept at 512 (col-stride padded to 32) so cooperative pre-zero keeps its single-issue-per-thread shape; W loader gates writes on col_start < BN_BLOCK so only the first 16 cols receive real data - Test counts updated to derive from MMAS_PER_WARP_* rather than hard-code the 64x32 numbers (future-proof for further rollback iterations)

…7.3 MVS ship point) Wires `kaio_ops::qkv_project_int8` (W8A16: f16 activations × i8 weights, scalar per-projection scales) as the Sprint 7.3 minimum-viable-sprint deliverable. 13-param signature: device, x_f16, w_q/k/v_i8, scale_q/k/v, q/k/v_out_f16, m/n/k. LaunchConfig grid (n.div_ceil(BN_BLOCK), m.div_ceil(BM_BLOCK), 1), block (32,4,1). Rustdoc spells out W8A16-vs-W8A8 explicitly in the first paragraph (Codex round 2 discipline) so callers don't conflate this with matmul_int8's W8A8 contract; points users needing W8A8 at three matmul_int8 calls + their own activation quant. GPU launch smoke test added: zero inputs → zero outputs canary on canonical shape (M=64, N=16, K=16). Verifies module loads + kernel launches + writes outputs without driver error. Correctness vs reference is the D7 e2e suite's job. Quality gates GREEN at the MVS ship point: - cargo fmt clean - cargo clippy -D warnings clean - 73 host tests pass (up from 72 with the launch smoke) - cargo doc clean (no broken intra-doc links) - ptxas_verify PASSED sm_80 + sm_89 (48 regs, 0 spills) Continuing to D5 (INT4 contingent) - D2.5 + D3.4 stayed clean post-rollback.

…, 0 spills) W4A16 tri-output INT4 module: f16 activations × 3 packed-INT4 weight tensors × 3 group-scale tensors → 3 f16 outputs. Single kernel launch with Design-S serial fusion across Q/K/V and group-scale reload cadence (group_size=128, 8 K-tiles per group). Reuses matmul_int4 helpers via super::matmul_int4_kernel:: path (emit_mw_load_tile_a_f16_64x16, emit_unpack_s4_x2_scale_to_f16_pair, emit_fragment_b_int4_per_lane). Made the latter two pub(crate) for cross-kernel sharing. New helpers in qkv_project_int4_kernel.rs: - emit_mw_load_tile_w_packed_int4_2x16 (16-col packed-INT4 cooperative load) - emit_cooperative_load_group_scales_int4 (16-cell f16 scale load, 8 active threads) - emit_pre_zero_shared_tiles_qkv_int4 (3-tile pre-zero with predicated scales path) - emit_warp_quadrant_mma_int4_per_projection (2x1 mma sweep with INT4 dequant feed) - build_qkv_project_int4_module (full IR module - 13 params, 3 shared decls, K-loop with 7-barrier cadence + group-boundary scale reload) Same Rollback #1 as INT8 path (BN_BLOCK=16, MMAS_PER_WARP_N=1) - applied preemptively since INT4 has identical fragment-C live state plus extra register pressure from group-scale + nibble-extract chains. 56 regs/thread vs INT8's 48 (the 8-reg gap is the dequant + scale-reload state). Quality gates GREEN: - cargo fmt clean - cargo clippy -D warnings clean - 83 host tests pass (up from 73 - added 11 D5 emit/module/canary tests) - ptxas_verify PASSED sm_80 + sm_89 (56 regs, 0 spills, 8-reg headroom under the 64-reg cliff) - disjoint-register canary (24 distinct mma destination regs across 3 grids) D6 next: public qkv_project_int4 host API (contingent ship).

…7.3 contingent ship) Wires `kaio_ops::qkv_project_int4` (W4A16: f16 activations × packed-INT4 weights × f16 group scales) as Sprint 7.3's contingent second deliverable. 14-param signature: device, x_f16, w_q/k/v_packed_u32, scales_q/k/v_f16, q/k/v_out_f16, m/n/k, group_size. LaunchConfig grid (n.div_ceil(BN_BLOCK), m.div_ceil(BM_BLOCK), 1). Rustdoc spells out the packing convention (8 nibbles per u32, K-contiguous, col-major [K/8, N]) and group-scale layout ([K/group_size, N] row-major, group_size=128 fixed). Cross-links to matmul_int4 since both share the convention. Explicit "not a drop-in for AutoGPTQ/exllama/GGUF" disclaimer. GPU launch smoke test: zero inputs → zero outputs canary on canonical shape (M=64, N=16, K=128 = exactly one group). Verifies module loads + kernel launches. Quality gates GREEN: - cargo fmt + clippy -D warnings clean - 84 host tests pass (+1 D6 smoke) - cargo doc clean - ptxas_verify PASSED sm_80 + sm_89 (56 regs, 0 spills) - already verified at D5 Both INT8 (D4 MVS) and INT4 (D6 contingent) public APIs now shipped on phase7-rest. D7 (GPU e2e + Q/K/V differentiation canary) next.

…pass on sm_89) Two new test files with CPU-reference round-trip correctness + canary tests: kaio-ops/tests/qkv_project_int8_e2e.rs (7 tests): - 6 correctness tests across canonical shapes (smallest / one-block / multi-N / multi-M / larger-K / medium 256x128x256) - 1 Q/K/V differentiation canary (W_Q=+1, W_K=+2, W_V=+3, X=1, scale=1/K → outputs 1, 2, 3 per cell; catches fragment-C grid aliasing) kaio-ops/tests/qkv_project_int4_e2e.rs (8 tests): - 3 group-boundary shapes (1 group / 2 groups / 8 groups) exercising the group_idx = k_tile / K_TILE_GROUP_RATIO cadence - 3 multi-block shapes (multi-N / multi-M / larger) - 1 sign-extend canary (all weights = -8, output should be -K*8 per cell; catches shr.s32 → shr.u32 collapse) - 1 Q/K/V differentiation canary All 7 + 8 = 15 e2e tests PASSED on RTX 4090 sm_89. Tolerance gate max_rel_err < 1e-3 (loose enough for f16-rounding-step difference between kernel and reference; observed max_rel stays under 3e-4 in practice). TWO BUGS CAUGHT + FIXED in D7: 1. INT8 W loader double-counted block_col The emit_mw_load_tile_w_int8_16x32 loader added block_col to its global offset computation, but w_block_base_global already carried the block_col byte shift from build_qkv_project_int8_module. Result: every non-zero N-block computed from WRONG global W addresses, corrupting all multi-N-block shapes. Single-N-block tests passed because block_col=0 makes the double-count a no-op. Fix: use col_start (within-tile offset) for the global addr computation instead of col_global. 2. INT4 shared scales slot raced across projections build_qkv_project_int4_module originally reloaded scales only on group boundaries (every 8 K-tiles) but the single tile_scales slot was reused across Q/K/V per K-tile. After mma_V on K-tile 0, the slot held scales_V; on K-tile 1 (non-boundary) all three projections' mmas read scales_V instead of their own scales. The sign-extend + Q/K/V canaries passed because they use constant scales, but random-data correctness tests failed. Fix: reload scales every K-tile, unconditionally (32 B per projection per K-tile = 96 B/K-tile additional global bandwidth, negligible vs 384 B/K-tile for the W loads). Group-boundary-only optimization deferred to future bench-driven work with three separate scales slots. Quality gates GREEN: - cargo fmt + clippy -D warnings clean (factored QkvOutTuple alias for e2e round_trip return types) - 84 host tests pass - 15 GPU e2e tests PASS on sm_89 - ptxas_verify INT8: 48 regs, INT4: 56 regs, 0 spills both variants - Both kernels validated across sm_80 + sm_89 (W-loader fix did not change register count)

…iers) Sprint 7.3 D8. Compares fused qkv_project_int4 vs 3x sequential matmul_int4 (clean W4A16 apples-to-apples baseline), and reports absolute TOPS for qkv_project_int8 (W8A16 has no fair standalone baseline - matmul_int8 is W8A8). Shape sweep: decode tier (M ∈ {1, 64}, K/N ∈ {2048, 4096}) + prefill tier (M ∈ {512, 2048}, K/N = 4096). Plus 4 diagnostic shapes around the initial decode_m64 K/N=2048 anomaly to confirm it was measurement noise, not a kernel bug. Stable results on RTX 4090 sm_89 (release mode): Decode tier (M ≤ 128): fused ~3.0x baseline, consistent across shapes prefill_m512 (M=512): 1.19x (above ship threshold 1.15x) prefill_m2048 (M=2048): 0.85x (below 1.00x - fused loses 15% at largest prefill) Ship-B call (maintainer 2026-04-16): ship 7.3 with honest framing ("fused wins big at decode, ties/loses at prefill - use 3x standalone for prefill-heavy workloads"), plan Design S+1/2P optimization as Sprint 7.3.5 with full review cadence. Target for 7.3.5: recover prefill_m2048 from 0.85x to ≥1.15x by overlapping W_{P+1} load with mma_P (barriers 7 → 4 per K-tile). Outlier investigation (decode_m64 K/N=2048 reported 0.40x in first run): NOT reproducible on re-run. Diagnostic sweep across 4 nearby shapes shows consistent 3.0x throughout - confirmed transient measurement noise in the first run (likely first-call CUDA context warmup bleeding through despite 5-iter warmup loop). No kernel bug. Absolute TOPS numbers intentionally NOT highlighted - repeated back-to-back bench runs cause thermal throttling on the 4090 that crushes absolute throughput by 50-100x for all workloads including cuBLAS sgemm reference. Ratios are throttle-invariant (fused + baseline throttle equally) so ratio numbers remain the valid headline. Clean absolute-TOPS run deferred to D10 doc with cold-system methodology.

…ttention_tc) End-to-end Phase 7 pipeline demo. GPTQ-lite quantizes random f32 weights into KAIO's packed-INT4 + group-scale layout, runs fused qkv_project_int4 to produce three f16 outputs, feeds them to attention_tc, and compares against an f16 reference path (3x matmul_tc → attention_tc). Reports three quality metrics per the Sprint 7.3 plan D9: - cosine similarity (primary pass/fail, plan threshold 0.98 INT4) - max absolute error (worst-case row outlier) - mean relative error (aggregate quality) Plus a projection-stage cosine for localization - distinguishes "kernel bug" from "softmax amplified a small projection error" when the final number is below threshold. Showcase shape: SEQ=64, D_MODEL=128, D_HEAD=64, GROUP_SIZE=128 (decode-tier shape where fusion wins cleanly per D8 bench). Observed on RTX 4090 sm_89 with random f16 weights: - Projection Q cos_sim = 0.9975 (kernel working correctly) - Final attention cos_sim = 0.9655 (below 0.98 plan threshold) README + in-example printout frame this honestly per the plan: "Random f16 weights are a worst case for group-scale quantization fidelity. Real trained LLM weights have much tighter group statistics and land measurably better than these synthetic numbers." The kernel's job is correctly performing the quantized projection; softmax amplifies any projection-stage error non-linearly in the final output. This is expected behavior for random weights and will improve substantially on real model weights where per-group magnitude variance is much lower. Wired into `cargo xtask showcase qkvattn`.

…7.3.5 plan stub Completes Sprint 7.3 documentation pass: - docs/development/sprints/phase7/sprint_7_3.md - full decision log (D1-D10), bench table (all 9 shapes, no cherry-picking), architectural decisions (AD1-AD5), bug-catch trail (cvt.rn.f16.s8, INT8 W-loader double-count, INT4 scales-slot race), and the maintainer Ship-B decision. - docs/development/sprints/phase7/PHASE_7_LOG.md - new index for Phase 7 (Phase 6 has one, Phase 7 didn't); populated with 7.0 through 7.3 + the 7.3.5 / 7.4 planning rows. - docs/development/sprints/phase7/sprint_7_3_5.md - stub plan for Design S+1/2P (two W slots ping-pong, barriers 7→4). Success criterion pinned: prefill_m2048 from 0.85x to ≥1.15x on sm_89. Full reviewer cadence required (synchronization semantics change in inner loop); not a hotfix. - CHANGELOG.md [Unreleased] block - Sprint 7.3 entry covering both public APIs, the shared store-out helper, the kaio-core cvt.rn fix, the showcase example, and the honest decode-wins-prefill-loses framing. - README.md ops table - two new rows for qkv_project_int{4,8}. Quality gates at D10 sprint-close: - 84 kaio-ops host tests pass - 15 GPU e2e tests pass on sm_89 (7 INT8 + 8 INT4) - ptxas_verify PASSED sm_80 + sm_89 for both kernels + skeleton + store_out - cargo doc --no-deps clean - cargo fmt + clippy -D warnings clean across workspace

…s / sm_89 40 regs / 0 spills) Port qkv_skeleton.rs from 7.3 Design-S to 7.3.5 S+1/2P peak-pressure model: - 2 tile_w shared slots with 64B bank-phase padding (stride 576B, non-multiple of 128 per Design invariant #4, Gemini R3-1) - 3 unrolled K-iterations with frag_C back-edge state threaded across iterations (Opus R1b-2 / R3b-3 cross-iteration modelling requirement - forces ptxas to count frag_C at peak, not elide) - Runtime ctaid.x-indexed slot pointers for ping-pong (ptxas can't constant-fold slot selection - Design invariant #3) - Scales register-hoist modelled via runtime k_tile_group counter (6 regs total, Pre-review clarification #2) - In-flight cooperative-load address registers kept live across the mma epoch to model the overlap window D1 gate PASSED on both architectures with comfortable headroom: sm_80: 32 registers, 0 spills, 3136 B smem, 404 B cmem[0] sm_89: 40 registers, 0 spills, 3136 B smem, 404 B cmem[0] cmem[0] = 404 B (kernel params in constant bank 0) confirms the R3b-2 case (a) upfront: kernel args arrive uniform-broadcast- friendly by default, no GPR pressure from base pointers, escape hatch has no handle to grab. SASS inspection for the escape hatch is not needed at the skeleton stage. Rollback #2 not triggered. Proceeding to D2 (INT8 S+1/2P body). Quality gates: cargo fmt, cargo clippy -D warnings, cargo test -p kaio-ops --lib (84 passed, 7 ignored). Shared budget 3136 B matches plan (2048 tile_x + 512 slot0 + 64 pad + 512 slot1).

…9 56 regs / 0 spills) Port qkv_project_int8_kernel build function from 7.3 Design-S to 7.3.5 S+1/2P. Design-S helper emit_warp_quadrant_mma_int8_per_projection stays untouched (preserves 7.3 tests + callers); two new siblings added for the S+1/2P path: emit_warp_quadrant_mma_int8_per_projection_hoisted Per-projection mma sweep that reads frag_A from caller-provided pre-hoisted registers instead of reloading from tile_x. Required so the three projection mmas share a single K-tile-start frag_A hoist, making Design invariant #1 (tile_x overwrite-after-hoist) expressible. hoist_frag_as_for_warp_quadrant Loads both m_stripe frag_As from tile_x at K-tile start. The caller is contractually required to fire B1 bar.sync AFTER this call and BEFORE any tile_x write. emit_pre_zero_shared_tiles_qkv_int8_sp_half_p 3-slot pre-zero (tile_x + tile_w_slot0 + tile_w_slot1; the 64 B pad between slots is write-only padding, never pre-zeroed). Shared layout: tile_x (2048 B) + tile_w_slot0 (512 B) + tile_w_pad (64 B, Design invariant #4 bank-phase) + tile_w_slot1 (512 B) = 3136 B. Stride slot0 -> slot1 = 576 B, non-multiple of 128 (no cross-warp SMEM bank-port contention during overlap). K-loop body 4-barrier cadence (plan worked-example table): B1 (post-hoist) INVARIANT #1 — tile_x safe to overwrite Epoch 1: load W_K(k) -> NEXT slot; mma_Q uses CUR slot B2 (post-mma_Q) CUR slot safe to overwrite Epoch 2: load W_V(k) -> CUR slot; mma_K uses NEXT slot B3 (post-mma_K) Epoch 3: [overlap loads -> W_Q(k+1) + X(k+1), skipped on final K-tile via uniform @!p_not_last branch]; mma_V uses CUR slot B4 (post-mma_V) closes K-tile epoch Runtime ping-pong slot selection via and.b32 + setp.eq + selp.u32 (slot_idx = k_tile & 1; p_slot0 = slot_idx == 0; cur/next chosen via selp against r_tile_w_slot{0,1}). Overlap-skip predicate p_not_last = k_tile + 1 < num_k_tiles; uniform branch, no divergence. Total barrier count for 1-K-tile module = 2 setup (pre-zero + B0) + 4 per K-tile = 6. 7.3 Design S was 1 + 7 = 8. Per K-tile savings 3 barriers; scales with K. ptxas --verbose results (S+1/2P, archived R2-5 artifacts): sm_80: Used 64 registers, 0 bytes spill stores, 0 bytes spill loads sm_89: Used 56 registers, 0 bytes spill stores, 0 bytes spill loads vs 7.3 Design-S INT8 post-Rollback-#1 baseline (48 regs on both architectures): sm_89 +8 regs (hoisted frag_A 2-stripes + slot selection + overlap-load addr math), sm_80 +16 regs. sm_80 at the 64-reg ceiling exactly; sm_89 has 8 reg headroom. Both architectures clear the D2 gate. Escape hatch (Gemini R3-4) not needed — ptxas already placed base pointers in cmem[0] per D1 skeleton evidence. Test updates: - build_qkv_project_int8_module_structure now asserts 4 shared decls (tile_x + slot0 + pad + slot1), new K-loop label K_LOOP_QKV_INT8_SP_HALF_P, and OVERLAP_SKIP_QKV_INT8_SP_HALF_P label. - build_qkv_project_int8_module_emits_seven_barriers_per_k_tile renamed to _emits_four_barriers_per_k_tile_sp_half_p, expects 6 total bar.sync (was 8). Quality gates green: cargo fmt, cargo clippy -D warnings, cargo test -p kaio-ops --lib (84 passed, 7 ignored). D3 (GPU e2e regression + slot-mapping canary + determinism stress) comes next.

…s, 100x bit-exact) Add two test classes for the S+1/2P kernel beyond the 7.3 regression suite: - Slot-mapping canary (2 shapes): deliberately-distinguishable weights per projection (W_Q=1, W_K=2, W_V=3, X=all-1, scale=1) so expected output is exactly {K, 2K, 3K} per element. Catches deterministic ping-pong mis-wiring that random-data tests can pass by coincidence. Two K-tile variants: - K=32 (2 K-tiles) exercises the steady -> final transition. - K=48 (3 K-tiles) exercises steady -> steady -> final, covering the back-edge transition that the 2-K-tile version doesn't hit. - Determinism stress (2 shapes): runs the same input 100x back-to- back in release build and asserts bit-exact output across all runs (from run 1 - no warmup discards). Catches barrier- misplacement races that manifest as cross-launch nondeterminism under varying warp scheduling. - Short: M=256, N=512, K=1024 (64 K-tiles per launch). - Prefill: M=2048, N=512, K=4096 (256 K-tiles, the regime that motivated the S+1/2P rework). Results on RTX 4090 sm_89: - 7 existing e2e regression tests: pass (max_rel <= 4.88e-4, most bit-exact). - Q/K/V differentiation canary: pass. - Slot-mapping canary K=32: Q=32 K=64 V=96 exact. - Slot-mapping canary K=48: Q=48 K=96 V=144 exact. - Determinism short 100x: bit-exact. - Determinism prefill 100x: bit-exact. 11/11 INT8 GPU e2e tests pass. No kernel code changed in D3 - only tests added. Diagnostic in ptxas_verify: also report allocation at -maxrregcount 128 to distinguish natural register allocation from a ptxas compression-under-cap. Confirmed sm_80 64 regs is a genuine register-reuse compression below the 66-reg natural allocation (zero spills at default), not a hidden-spill pathology. sm_89 at 56 regs at default vs 66 natural - same pattern, larger headroom. Quality gates: cargo fmt, cargo clippy -D warnings, cargo test -p kaio-ops --lib (84 passed, 7 ignored).

…G.yml Sprint 7.3.5 closes with the asymmetric outcome the plan's 5-tier table pre-declared: INT8 ships Design S+1/2P (correctness clean, ptxas clean, abs TOPS baseline captured); INT4 S+1/2P was measured at median prefill_m2048 = 1.05x across 3 runs, improved from Design S's 0.85x but below the 1.15x plan-locked ship gate and the 1.10x ship-narrow floor - landed in "Measured, not shipped". INT4 retained at Design S on main; INT4 S+1/2P port archived on phase7-rest for potential future cp.async contingency work. Docs added / updated: - docs/development/sprints/phase7/sprint_7_3_5.md (new) - terse post-delivery outline with full measured-bench table, 5 architectural decisions, follow-up next steps. - docs/development/sprints/phase7/phase7_master_plan.md - status header, section 4 (Quant+Attention integration) rewritten for 7.3 / 7.3.5 outcome, Sprint Breakdown table rows added, dependency graph updated, Success Criterion 4 marked complete. - docs/development/sprints/phase7/PHASE_7_LOG.md - 7.3.5 status row + ops-shipped table notes. - CHANGELOG.md - Sprint 7.3.5 section under [Unreleased]. - README.md - INT8 row updated for S+1/2P; INT4 row explicitly notes Design S retained and reason. - kaio-ops/src/qkv_project_int8_kernel.rs - module-level rustdoc refreshed (Tri-output design and Register budget sections) to reflect post-Rollback-#1 + Sprint 7.3.5 S+1/2P state. Editorial polish pass across phase 1 / 4 / 6 / 7 sprint docs and across kaio-ops / kaio test-file comments - tightening voice for outward-facing readers. Technical rationale and measured results unchanged. .github/FUNDING.yml added for the Sponsor button - lands on main via the phase7-ship merge. Quality gates (cargo fmt --check, clippy --workspace --all-targets -D warnings, doc --workspace --no-deps, test --workspace) all green. ptxas_verify green on sm_80 + sm_89 for INT8 S+1/2P (64 / 56 regs, 0 spills) and INT4 Design S (56 / 56 regs, 0 spills).

5-shape abs-TOPS comparison, same session on RTX 4090 sm_89, kernel file swapped between 2dbad72 (Design-S) and HEAD (S+1/2P) via checkout-revert, identical thermal state both sides. | shape | Design-S | S+1/2P | speedup | |------------------|----------|--------|---------| | decode_m1 | 3.2 | 3.3 | 1.04x | | decode_m64 | 3.2 | 3.3 | 1.04x | | decode_m64_large | 9.8 | 10.2 | 1.04x | | prefill_m512 | 31.4 | 35.6 | 1.13x | | prefill_m2048 | 38.8 | 44.6 | 1.15x | S+1/2P wins across all 5 shapes. Gain scales with K-tile count per launch: decode (few K-tiles, barriers not dominant) gains ~4%; prefill_m2048 (256 K-tiles, barriers dominant in Design-S) gains 15% - exactly the threshold the sprint plan set for INT4, which INT4 narrowly missed at 1.05x vs 3x standalone matmul_int4. Different outcome between the variants at the same shape localises the INT4 gap: the INT8 path recovers cleanly under barrier reduction; the INT4 path (per-column f16 scale hoist + nibble- extract dequant) exposes memory-issue latency as the new binding constraint once barriers no longer dominate. cp.async targets that class of stall and is the right candidate for a follow-up contingency sprint - not as an alternative to S+1/2P but as the next bottleneck after S+1/2P succeeds. - sprint_7_3_5.md Performance section: INT8 table replaced with the like-for-like comparison + interpretation paragraph tying the INT8/INT4 asymmetry to the cp.async follow-up. - PHASE_7_LOG.md 7.3.5 headline: added the 1.15x like-for-like result alongside the INT4 Measured-not-shipped status.

dmriding added 30 commits April 15, 2026 03:03

wip(phase7): D1 - Sprint 7.1.5 stub + IR audit (ShflSyncBfly covers w…

c39ba45

…arp_reduce)

wip(phase7): D2.0 - pre-refactor TokenStream snapshot canary for bloc…

2f28dfa

…k_reduce_sum

wip(phase7): D2 - factor emit_warp_tree_reduce helper (snapshot canar…

3edb5c7

…y byte-identical)

wip(phase7): D3+D4 - warp_reduce_{sum,max,min} + block_reduce_min DSL…

b850ee7

… builtins + whole-warp-multiple guard

wip(phase7): D5 - warp_reduce + block_reduce_min tests (12 new GPU + …

c5d2548

…4 trybuild)

wip(phase7): D1 - Sprint 7.2 stub + MovPack IR primitive

25d5314

wip(phase7): D2 - emit_unpack_s4_x8_scale_to_f16x8 helper + sign-exte…

1ab9b4b

…nd canary

wip(phase7): D3 - shared-memory layout + group-scale broadcast design

df5f771

wip(phase7): D4.1 - validate_dims_int4 + cooperative load helpers for…

ca371c4

… A, B-packed, scales

wip(phase7): D4.2+D4.3 - fragment-B per-lane dequant + full matmul_in…

666ff6d

…t4 module (64 regs/thread, 3.2 KB smem)

wip(phase7): D5 - matmul_int4 public API + host dispatch + rustdoc pa…

f2e45e9

…cking spec

wip(phase7): D6 - matmul_int4 GPU e2e tests (12 pass bit-exact on sm_…

12ee369

…89, max_rel 4e-6)

wip(phase7): D7 - matmul_int4_bench (42-57 TOPS 4096³ sm_89) + int4_m…

ba59e55

…atmul showcase + xtask integration

feat(phase7): Sprint 7.2 - INT4 dequantize-matmul

332cc83

docs(phase7): refine INT4 bench range to 49-58 TOPS median ~57 from 6…

e2f221b

… xtask-bench runs (no tc/int8 regressions)

wip(phase7): D1 - Sprint 7.3 stubs + validate_dims_qkv_int{4,8} + uni…

f96c1b6

…fied tile constants

wip(phase7): D2 - emit_store_fragment_c_f32_to_f16_packed (ptxas_veri…

49bdcaa

…fy sm_80 + sm_89)

wip(phase7): D2.5 - tri-output register-pressure skeleton (32-40 regs…

94c5558

…, 0 spills, sm_80/89)

wip(phase7): D3.1 - cooperative loaders for X (f16) + W (i8) tiles + …

a31797e

…pre-zero

wip(phase7): D3.2 - emit_fragment_b_int8_per_lane (4 ld.s8 + 4 cvt.f1…

2934d27

…6.s8 + 2 MovPack)

wip(phase7): D3.3 - emit_warp_quadrant_mma_int8_per_projection (4 mma…

dfa22a5

…s/call + frag_c in-place canary)

fix(core): emit cvt.rn for s8 source (ptxas requires rounding modifier)

3739d38

dmriding added 9 commits April 16, 2026 02:21

chore(phase7): remove pre-delivery sprint_7_3_5.md stub

2dbad72

fix(ci): catch cudarc dlopen panic in qkv test helpers

fd5cf64

dmriding closed this Apr 17, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Phase7 ship#12

Phase7 ship#12
dmriding wants to merge 39 commits into
mainfrom
phase7-ship

dmriding commented Apr 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

dmriding commented Apr 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant