Phase7 ship#12
Closed
dmriding wants to merge 39 commits into
Closed
Conversation
…y byte-identical)
… builtins + whole-warp-multiple guard
Bridge sprint between quant milestones. Adds the warp-level reduction primitives that close the single biggest DSL expressiveness gap after matmul (softmax / layer-norm / RMS-norm / loss functions are now writable in pure DSL without manual shfl.sync ladders), plus the missing min variants at both warp and block level. Ships on the phase7-rest branch with no independent crates.io release; Phase 7 closes with an aggregate release after 7.2, 7.3, 7.4. New DSL builtins (f32-only this sprint; integer and half-precision variants are a clean follow-up): - warp_reduce_sum/max/min: per-warp reduction via butterfly shuffle tree, every lane in the warp gets the full result in 5 rounds. - block_reduce_min: completes the sum/max/min trio at block level. Whole-warp-multiple compile-time guard: kernels with block sizes whose total thread count is not a multiple of 32 (1D: 16; 2D: (4,4) or (8,6)) fail to compile with a span pointing at the warp_reduce_* call. Catches the partial-warp shfl.sync UB statically. Covered by 4 trybuild compile-fail fixtures. Rustdoc documents the per-warp semantic clearly, the convergent- control-flow requirement (data-dependent branches are UB the macro cannot detect), and the identity-value padding pattern for ragged boundaries. Helper emit_warp_tree_reduce factored from the existing block_reduce path, parameterized by shuffle variant (down for block_reduce lane-0- only, bfly for warp_reduce all-lanes), combine op, and PTX type. Leaves a clean seam for future i32/f16/bf16 reductions. Pre-refactor TokenStream snapshot canary locks the D2 refactor to byte-identical PTX; snapshot canary remains green. Tests: 12 new #[ignore]d GPU tests in reduce_macro.rs pass bit-exact on sm_89 including a 64-thread two-warp independence canary (warp 0 and warp 1 fed different patterns; asserts each warp gets its own reduction and the lowering does not blend across warps). 4 new trybuild compile-fail fixtures for the guard and the f32-type boundary. Host tests 330 pass 199 ignored (up from 187 baseline = 12 new GPU tests). No regression on showcases or matmul benches.
… A, B-packed, scales
…t4 module (64 regs/thread, 3.2 KB smem)
…89, max_rel 4e-6)
…atmul showcase + xtask integration
… xtask-bench runs (no tc/int8 regressions)
…fied tile constants
…fy sm_80 + sm_89)
…, 0 spills, sm_80/89)
…6.s8 + 2 MovPack)
…s/call + frag_c in-place canary)
…80 regs (Rollback #1 next) Full INT8 W8A16 tri-output kernel: 13 params, 2 shared decls (tile_x 2048B + tile_w 512B), K-loop with 7-barrier Design-S cadence, 12 mma.sync m16n8k16.f16.f16.f32 per K-tile across 3 fragment-C grids, store-out epilogue with 24 packed b32 stores + 48 cvt.rn.f16.f32 + 48 mul.f32 (scalar scale fold per projection). ptxas (sm_80 + sm_89): 0 spills, valid PTX, but reports 80 registers per thread. The extra 16-40 regs over the D2.5 skeleton baseline (32-40) come from cooperative-loader intermediate state, fragment-A reload regs per mma, and per-K-tile address-math intermediates that the skeleton did not model. 80 > the 64-reg full-occupancy cliff. Plan Rollback #1 trigger fires - drop MMAS_PER_WARP_N from 2 to 1, halving the per-block output tile to 64x16. That rollback lands in the next commit.
…_80/89, 0 spills) ptxas_verify gate now PASSED both targets: sm_80: 48 registers, 0 spills (16-reg headroom under the 64-reg cliff) sm_89: 48 registers, 0 spills (16-reg headroom under the 64-reg cliff) Rollback #1 was authorized in the Sprint 7.3 plan as the explicit recovery for "register pressure overshoot at D2.5 OR D3 full kernel". D3.4 hit 80 regs/0 spills both targets at the original 64x32 output tile; halving MMAS_PER_WARP_N reclaims 24 frag_c regs/lane (3 grids x 4 frag_d regs x 2 fewer sub-tiles per warp) and drops the post-rollback measurement to 48 regs/lane. Per-block output tile: 64x32 -> 64x16. More blocks launch, each block lighter. X-reuse economics still favor the fused tri-output design over 3x standalone matmul_int8; the absolute-vs-relative bench tier in D8 will quantify the trade. Implementation: - BN_BLOCK 32 -> 16, WARP_QUAD_N 16 -> 8, MMAS_PER_WARP_N 2 -> 1 - TILE_W_BYTES kept at 512 (col-stride padded to 32) so cooperative pre-zero keeps its single-issue-per-thread shape; W loader gates writes on col_start < BN_BLOCK so only the first 16 cols receive real data - Test counts updated to derive from MMAS_PER_WARP_* rather than hard-code the 64x32 numbers (future-proof for further rollback iterations)
…7.3 MVS ship point) Wires `kaio_ops::qkv_project_int8` (W8A16: f16 activations × i8 weights, scalar per-projection scales) as the Sprint 7.3 minimum-viable-sprint deliverable. 13-param signature: device, x_f16, w_q/k/v_i8, scale_q/k/v, q/k/v_out_f16, m/n/k. LaunchConfig grid (n.div_ceil(BN_BLOCK), m.div_ceil(BM_BLOCK), 1), block (32,4,1). Rustdoc spells out W8A16-vs-W8A8 explicitly in the first paragraph (Codex round 2 discipline) so callers don't conflate this with matmul_int8's W8A8 contract; points users needing W8A8 at three matmul_int8 calls + their own activation quant. GPU launch smoke test added: zero inputs → zero outputs canary on canonical shape (M=64, N=16, K=16). Verifies module loads + kernel launches + writes outputs without driver error. Correctness vs reference is the D7 e2e suite's job. Quality gates GREEN at the MVS ship point: - cargo fmt clean - cargo clippy -D warnings clean - 73 host tests pass (up from 72 with the launch smoke) - cargo doc clean (no broken intra-doc links) - ptxas_verify PASSED sm_80 + sm_89 (48 regs, 0 spills) Continuing to D5 (INT4 contingent) - D2.5 + D3.4 stayed clean post-rollback.
…, 0 spills)
W4A16 tri-output INT4 module: f16 activations × 3 packed-INT4 weight tensors ×
3 group-scale tensors → 3 f16 outputs. Single kernel launch with Design-S
serial fusion across Q/K/V and group-scale reload cadence (group_size=128,
8 K-tiles per group).
Reuses matmul_int4 helpers via super::matmul_int4_kernel:: path
(emit_mw_load_tile_a_f16_64x16, emit_unpack_s4_x2_scale_to_f16_pair,
emit_fragment_b_int4_per_lane). Made the latter two pub(crate) for
cross-kernel sharing.
New helpers in qkv_project_int4_kernel.rs:
- emit_mw_load_tile_w_packed_int4_2x16 (16-col packed-INT4 cooperative load)
- emit_cooperative_load_group_scales_int4 (16-cell f16 scale load, 8 active threads)
- emit_pre_zero_shared_tiles_qkv_int4 (3-tile pre-zero with predicated scales path)
- emit_warp_quadrant_mma_int4_per_projection (2x1 mma sweep with INT4 dequant feed)
- build_qkv_project_int4_module (full IR module - 13 params, 3 shared decls,
K-loop with 7-barrier cadence + group-boundary scale reload)
Same Rollback #1 as INT8 path (BN_BLOCK=16, MMAS_PER_WARP_N=1) - applied
preemptively since INT4 has identical fragment-C live state plus extra
register pressure from group-scale + nibble-extract chains. 56 regs/thread
vs INT8's 48 (the 8-reg gap is the dequant + scale-reload state).
Quality gates GREEN:
- cargo fmt clean
- cargo clippy -D warnings clean
- 83 host tests pass (up from 73 - added 11 D5 emit/module/canary tests)
- ptxas_verify PASSED sm_80 + sm_89 (56 regs, 0 spills, 8-reg headroom under
the 64-reg cliff)
- disjoint-register canary (24 distinct mma destination regs across 3 grids)
D6 next: public qkv_project_int4 host API (contingent ship).
…7.3 contingent ship) Wires `kaio_ops::qkv_project_int4` (W4A16: f16 activations × packed-INT4 weights × f16 group scales) as Sprint 7.3's contingent second deliverable. 14-param signature: device, x_f16, w_q/k/v_packed_u32, scales_q/k/v_f16, q/k/v_out_f16, m/n/k, group_size. LaunchConfig grid (n.div_ceil(BN_BLOCK), m.div_ceil(BM_BLOCK), 1). Rustdoc spells out the packing convention (8 nibbles per u32, K-contiguous, col-major [K/8, N]) and group-scale layout ([K/group_size, N] row-major, group_size=128 fixed). Cross-links to matmul_int4 since both share the convention. Explicit "not a drop-in for AutoGPTQ/exllama/GGUF" disclaimer. GPU launch smoke test: zero inputs → zero outputs canary on canonical shape (M=64, N=16, K=128 = exactly one group). Verifies module loads + kernel launches. Quality gates GREEN: - cargo fmt + clippy -D warnings clean - 84 host tests pass (+1 D6 smoke) - cargo doc clean - ptxas_verify PASSED sm_80 + sm_89 (56 regs, 0 spills) - already verified at D5 Both INT8 (D4 MVS) and INT4 (D6 contingent) public APIs now shipped on phase7-rest. D7 (GPU e2e + Q/K/V differentiation canary) next.
…pass on sm_89)
Two new test files with CPU-reference round-trip correctness + canary tests:
kaio-ops/tests/qkv_project_int8_e2e.rs (7 tests):
- 6 correctness tests across canonical shapes (smallest / one-block / multi-N /
multi-M / larger-K / medium 256x128x256)
- 1 Q/K/V differentiation canary (W_Q=+1, W_K=+2, W_V=+3, X=1, scale=1/K
→ outputs 1, 2, 3 per cell; catches fragment-C grid aliasing)
kaio-ops/tests/qkv_project_int4_e2e.rs (8 tests):
- 3 group-boundary shapes (1 group / 2 groups / 8 groups) exercising the
group_idx = k_tile / K_TILE_GROUP_RATIO cadence
- 3 multi-block shapes (multi-N / multi-M / larger)
- 1 sign-extend canary (all weights = -8, output should be -K*8 per cell;
catches shr.s32 → shr.u32 collapse)
- 1 Q/K/V differentiation canary
All 7 + 8 = 15 e2e tests PASSED on RTX 4090 sm_89. Tolerance gate
max_rel_err < 1e-3 (loose enough for f16-rounding-step difference between
kernel and reference; observed max_rel stays under 3e-4 in practice).
TWO BUGS CAUGHT + FIXED in D7:
1. INT8 W loader double-counted block_col
The emit_mw_load_tile_w_int8_16x32 loader added block_col to its global
offset computation, but w_block_base_global already carried the
block_col byte shift from build_qkv_project_int8_module. Result: every
non-zero N-block computed from WRONG global W addresses, corrupting all
multi-N-block shapes. Single-N-block tests passed because block_col=0
makes the double-count a no-op. Fix: use col_start (within-tile offset)
for the global addr computation instead of col_global.
2. INT4 shared scales slot raced across projections
build_qkv_project_int4_module originally reloaded scales only on group
boundaries (every 8 K-tiles) but the single tile_scales slot was reused
across Q/K/V per K-tile. After mma_V on K-tile 0, the slot held
scales_V; on K-tile 1 (non-boundary) all three projections' mmas read
scales_V instead of their own scales. The sign-extend + Q/K/V canaries
passed because they use constant scales, but random-data correctness
tests failed. Fix: reload scales every K-tile, unconditionally (32 B per
projection per K-tile = 96 B/K-tile additional global bandwidth,
negligible vs 384 B/K-tile for the W loads). Group-boundary-only
optimization deferred to future bench-driven work with three separate
scales slots.
Quality gates GREEN:
- cargo fmt + clippy -D warnings clean (factored QkvOutTuple alias for
e2e round_trip return types)
- 84 host tests pass
- 15 GPU e2e tests PASS on sm_89
- ptxas_verify INT8: 48 regs, INT4: 56 regs, 0 spills both variants
- Both kernels validated across sm_80 + sm_89 (W-loader fix did not
change register count)
…iers)
Sprint 7.3 D8. Compares fused qkv_project_int4 vs 3x sequential matmul_int4
(clean W4A16 apples-to-apples baseline), and reports absolute TOPS for
qkv_project_int8 (W8A16 has no fair standalone baseline - matmul_int8 is W8A8).
Shape sweep: decode tier (M ∈ {1, 64}, K/N ∈ {2048, 4096}) + prefill tier
(M ∈ {512, 2048}, K/N = 4096). Plus 4 diagnostic shapes around the initial
decode_m64 K/N=2048 anomaly to confirm it was measurement noise, not a kernel
bug.
Stable results on RTX 4090 sm_89 (release mode):
Decode tier (M ≤ 128): fused ~3.0x baseline, consistent across shapes
prefill_m512 (M=512): 1.19x (above ship threshold 1.15x)
prefill_m2048 (M=2048): 0.85x (below 1.00x - fused loses 15% at largest prefill)
Ship-B call (maintainer 2026-04-16): ship 7.3 with honest framing
("fused wins big at decode, ties/loses at prefill - use 3x standalone for
prefill-heavy workloads"), plan Design S+1/2P optimization as Sprint 7.3.5
with full review cadence. Target for 7.3.5: recover prefill_m2048 from 0.85x
to ≥1.15x by overlapping W_{P+1} load with mma_P (barriers 7 → 4 per K-tile).
Outlier investigation (decode_m64 K/N=2048 reported 0.40x in first run):
NOT reproducible on re-run. Diagnostic sweep across 4 nearby shapes shows
consistent 3.0x throughout - confirmed transient measurement noise in the
first run (likely first-call CUDA context warmup bleeding through despite
5-iter warmup loop). No kernel bug.
Absolute TOPS numbers intentionally NOT highlighted - repeated back-to-back
bench runs cause thermal throttling on the 4090 that crushes absolute
throughput by 50-100x for all workloads including cuBLAS sgemm reference.
Ratios are throttle-invariant (fused + baseline throttle equally) so ratio
numbers remain the valid headline. Clean absolute-TOPS run deferred to D10
doc with cold-system methodology.
…ttention_tc) End-to-end Phase 7 pipeline demo. GPTQ-lite quantizes random f32 weights into KAIO's packed-INT4 + group-scale layout, runs fused qkv_project_int4 to produce three f16 outputs, feeds them to attention_tc, and compares against an f16 reference path (3x matmul_tc → attention_tc). Reports three quality metrics per the Sprint 7.3 plan D9: - cosine similarity (primary pass/fail, plan threshold 0.98 INT4) - max absolute error (worst-case row outlier) - mean relative error (aggregate quality) Plus a projection-stage cosine for localization - distinguishes "kernel bug" from "softmax amplified a small projection error" when the final number is below threshold. Showcase shape: SEQ=64, D_MODEL=128, D_HEAD=64, GROUP_SIZE=128 (decode-tier shape where fusion wins cleanly per D8 bench). Observed on RTX 4090 sm_89 with random f16 weights: - Projection Q cos_sim = 0.9975 (kernel working correctly) - Final attention cos_sim = 0.9655 (below 0.98 plan threshold) README + in-example printout frame this honestly per the plan: "Random f16 weights are a worst case for group-scale quantization fidelity. Real trained LLM weights have much tighter group statistics and land measurably better than these synthetic numbers." The kernel's job is correctly performing the quantized projection; softmax amplifies any projection-stage error non-linearly in the final output. This is expected behavior for random weights and will improve substantially on real model weights where per-group magnitude variance is much lower. Wired into `cargo xtask showcase qkvattn`.
…7.3.5 plan stub
Completes Sprint 7.3 documentation pass:
- docs/development/sprints/phase7/sprint_7_3.md - full decision log (D1-D10),
bench table (all 9 shapes, no cherry-picking), architectural decisions
(AD1-AD5), bug-catch trail (cvt.rn.f16.s8, INT8 W-loader double-count,
INT4 scales-slot race), and the maintainer Ship-B decision.
- docs/development/sprints/phase7/PHASE_7_LOG.md - new index for Phase 7
(Phase 6 has one, Phase 7 didn't); populated with 7.0 through 7.3 + the
7.3.5 / 7.4 planning rows.
- docs/development/sprints/phase7/sprint_7_3_5.md - stub plan for Design
S+1/2P (two W slots ping-pong, barriers 7→4). Success criterion pinned:
prefill_m2048 from 0.85x to ≥1.15x on sm_89. Full reviewer cadence
required (synchronization semantics change in inner loop); not a hotfix.
- CHANGELOG.md [Unreleased] block - Sprint 7.3 entry covering both public
APIs, the shared store-out helper, the kaio-core cvt.rn fix, the
showcase example, and the honest decode-wins-prefill-loses framing.
- README.md ops table - two new rows for qkv_project_int{4,8}.
Quality gates at D10 sprint-close:
- 84 kaio-ops host tests pass
- 15 GPU e2e tests pass on sm_89 (7 INT8 + 8 INT4)
- ptxas_verify PASSED sm_80 + sm_89 for both kernels + skeleton + store_out
- cargo doc --no-deps clean
- cargo fmt + clippy -D warnings clean across workspace
…s / sm_89 40 regs / 0 spills) Port qkv_skeleton.rs from 7.3 Design-S to 7.3.5 S+1/2P peak-pressure model: - 2 tile_w shared slots with 64B bank-phase padding (stride 576B, non-multiple of 128 per Design invariant #4, Gemini R3-1) - 3 unrolled K-iterations with frag_C back-edge state threaded across iterations (Opus R1b-2 / R3b-3 cross-iteration modelling requirement - forces ptxas to count frag_C at peak, not elide) - Runtime ctaid.x-indexed slot pointers for ping-pong (ptxas can't constant-fold slot selection - Design invariant #3) - Scales register-hoist modelled via runtime k_tile_group counter (6 regs total, Pre-review clarification #2) - In-flight cooperative-load address registers kept live across the mma epoch to model the overlap window D1 gate PASSED on both architectures with comfortable headroom: sm_80: 32 registers, 0 spills, 3136 B smem, 404 B cmem[0] sm_89: 40 registers, 0 spills, 3136 B smem, 404 B cmem[0] cmem[0] = 404 B (kernel params in constant bank 0) confirms the R3b-2 case (a) upfront: kernel args arrive uniform-broadcast- friendly by default, no GPR pressure from base pointers, escape hatch has no handle to grab. SASS inspection for the escape hatch is not needed at the skeleton stage. Rollback #2 not triggered. Proceeding to D2 (INT8 S+1/2P body). Quality gates: cargo fmt, cargo clippy -D warnings, cargo test -p kaio-ops --lib (84 passed, 7 ignored). Shared budget 3136 B matches plan (2048 tile_x + 512 slot0 + 64 pad + 512 slot1).
…9 56 regs / 0 spills)
Port qkv_project_int8_kernel build function from 7.3 Design-S to
7.3.5 S+1/2P. Design-S helper emit_warp_quadrant_mma_int8_per_projection
stays untouched (preserves 7.3 tests + callers); two new siblings
added for the S+1/2P path:
emit_warp_quadrant_mma_int8_per_projection_hoisted
Per-projection mma sweep that reads frag_A from caller-provided
pre-hoisted registers instead of reloading from tile_x. Required
so the three projection mmas share a single K-tile-start frag_A
hoist, making Design invariant #1 (tile_x overwrite-after-hoist)
expressible.
hoist_frag_as_for_warp_quadrant
Loads both m_stripe frag_As from tile_x at K-tile start. The
caller is contractually required to fire B1 bar.sync AFTER this
call and BEFORE any tile_x write.
emit_pre_zero_shared_tiles_qkv_int8_sp_half_p
3-slot pre-zero (tile_x + tile_w_slot0 + tile_w_slot1; the 64 B
pad between slots is write-only padding, never pre-zeroed).
Shared layout: tile_x (2048 B) + tile_w_slot0 (512 B) + tile_w_pad
(64 B, Design invariant #4 bank-phase) + tile_w_slot1 (512 B) =
3136 B. Stride slot0 -> slot1 = 576 B, non-multiple of 128 (no
cross-warp SMEM bank-port contention during overlap).
K-loop body 4-barrier cadence (plan worked-example table):
B1 (post-hoist) INVARIANT #1 — tile_x safe to overwrite
Epoch 1: load W_K(k) -> NEXT slot; mma_Q uses CUR slot
B2 (post-mma_Q) CUR slot safe to overwrite
Epoch 2: load W_V(k) -> CUR slot; mma_K uses NEXT slot
B3 (post-mma_K)
Epoch 3: [overlap loads -> W_Q(k+1) + X(k+1), skipped on final
K-tile via uniform @!p_not_last branch];
mma_V uses CUR slot
B4 (post-mma_V) closes K-tile epoch
Runtime ping-pong slot selection via and.b32 + setp.eq + selp.u32
(slot_idx = k_tile & 1; p_slot0 = slot_idx == 0; cur/next chosen
via selp against r_tile_w_slot{0,1}). Overlap-skip predicate
p_not_last = k_tile + 1 < num_k_tiles; uniform branch, no
divergence.
Total barrier count for 1-K-tile module = 2 setup (pre-zero + B0)
+ 4 per K-tile = 6. 7.3 Design S was 1 + 7 = 8. Per K-tile savings
3 barriers; scales with K.
ptxas --verbose results (S+1/2P, archived R2-5 artifacts):
sm_80: Used 64 registers, 0 bytes spill stores, 0 bytes spill loads
sm_89: Used 56 registers, 0 bytes spill stores, 0 bytes spill loads
vs 7.3 Design-S INT8 post-Rollback-#1 baseline (48 regs on both
architectures): sm_89 +8 regs (hoisted frag_A 2-stripes + slot
selection + overlap-load addr math), sm_80 +16 regs. sm_80 at the
64-reg ceiling exactly; sm_89 has 8 reg headroom. Both architectures
clear the D2 gate. Escape hatch (Gemini R3-4) not needed — ptxas
already placed base pointers in cmem[0] per D1 skeleton evidence.
Test updates:
- build_qkv_project_int8_module_structure now asserts 4 shared decls
(tile_x + slot0 + pad + slot1), new K-loop label
K_LOOP_QKV_INT8_SP_HALF_P, and OVERLAP_SKIP_QKV_INT8_SP_HALF_P label.
- build_qkv_project_int8_module_emits_seven_barriers_per_k_tile
renamed to _emits_four_barriers_per_k_tile_sp_half_p, expects 6
total bar.sync (was 8).
Quality gates green: cargo fmt, cargo clippy -D warnings, cargo test
-p kaio-ops --lib (84 passed, 7 ignored). D3 (GPU e2e regression +
slot-mapping canary + determinism stress) comes next.
…s, 100x bit-exact)
Add two test classes for the S+1/2P kernel beyond the 7.3 regression
suite:
- Slot-mapping canary (2 shapes): deliberately-distinguishable
weights per projection (W_Q=1, W_K=2, W_V=3, X=all-1, scale=1)
so expected output is exactly {K, 2K, 3K} per element. Catches
deterministic ping-pong mis-wiring that random-data tests can
pass by coincidence. Two K-tile variants:
- K=32 (2 K-tiles) exercises the steady -> final transition.
- K=48 (3 K-tiles) exercises steady -> steady -> final, covering
the back-edge transition that the 2-K-tile version doesn't hit.
- Determinism stress (2 shapes): runs the same input 100x back-to-
back in release build and asserts bit-exact output across all
runs (from run 1 - no warmup discards). Catches barrier-
misplacement races that manifest as cross-launch nondeterminism
under varying warp scheduling.
- Short: M=256, N=512, K=1024 (64 K-tiles per launch).
- Prefill: M=2048, N=512, K=4096 (256 K-tiles, the regime that
motivated the S+1/2P rework).
Results on RTX 4090 sm_89:
- 7 existing e2e regression tests: pass (max_rel <= 4.88e-4, most
bit-exact).
- Q/K/V differentiation canary: pass.
- Slot-mapping canary K=32: Q=32 K=64 V=96 exact.
- Slot-mapping canary K=48: Q=48 K=96 V=144 exact.
- Determinism short 100x: bit-exact.
- Determinism prefill 100x: bit-exact.
11/11 INT8 GPU e2e tests pass. No kernel code changed in D3 - only
tests added.
Diagnostic in ptxas_verify: also report allocation at
-maxrregcount 128 to distinguish natural register allocation from
a ptxas compression-under-cap. Confirmed sm_80 64 regs is a genuine
register-reuse compression below the 66-reg natural allocation
(zero spills at default), not a hidden-spill pathology. sm_89 at
56 regs at default vs 66 natural - same pattern, larger headroom.
Quality gates: cargo fmt, cargo clippy -D warnings, cargo test -p
kaio-ops --lib (84 passed, 7 ignored).
…G.yml Sprint 7.3.5 closes with the asymmetric outcome the plan's 5-tier table pre-declared: INT8 ships Design S+1/2P (correctness clean, ptxas clean, abs TOPS baseline captured); INT4 S+1/2P was measured at median prefill_m2048 = 1.05x across 3 runs, improved from Design S's 0.85x but below the 1.15x plan-locked ship gate and the 1.10x ship-narrow floor - landed in "Measured, not shipped". INT4 retained at Design S on main; INT4 S+1/2P port archived on phase7-rest for potential future cp.async contingency work. Docs added / updated: - docs/development/sprints/phase7/sprint_7_3_5.md (new) - terse post-delivery outline with full measured-bench table, 5 architectural decisions, follow-up next steps. - docs/development/sprints/phase7/phase7_master_plan.md - status header, section 4 (Quant+Attention integration) rewritten for 7.3 / 7.3.5 outcome, Sprint Breakdown table rows added, dependency graph updated, Success Criterion 4 marked complete. - docs/development/sprints/phase7/PHASE_7_LOG.md - 7.3.5 status row + ops-shipped table notes. - CHANGELOG.md - Sprint 7.3.5 section under [Unreleased]. - README.md - INT8 row updated for S+1/2P; INT4 row explicitly notes Design S retained and reason. - kaio-ops/src/qkv_project_int8_kernel.rs - module-level rustdoc refreshed (Tri-output design and Register budget sections) to reflect post-Rollback-#1 + Sprint 7.3.5 S+1/2P state. Editorial polish pass across phase 1 / 4 / 6 / 7 sprint docs and across kaio-ops / kaio test-file comments - tightening voice for outward-facing readers. Technical rationale and measured results unchanged. .github/FUNDING.yml added for the Sponsor button - lands on main via the phase7-ship merge. Quality gates (cargo fmt --check, clippy --workspace --all-targets -D warnings, doc --workspace --no-deps, test --workspace) all green. ptxas_verify green on sm_80 + sm_89 for INT8 S+1/2P (64 / 56 regs, 0 spills) and INT4 Design S (56 / 56 regs, 0 spills).
5-shape abs-TOPS comparison, same session on RTX 4090 sm_89, kernel file swapped between 2dbad72 (Design-S) and HEAD (S+1/2P) via checkout-revert, identical thermal state both sides. | shape | Design-S | S+1/2P | speedup | |------------------|----------|--------|---------| | decode_m1 | 3.2 | 3.3 | 1.04x | | decode_m64 | 3.2 | 3.3 | 1.04x | | decode_m64_large | 9.8 | 10.2 | 1.04x | | prefill_m512 | 31.4 | 35.6 | 1.13x | | prefill_m2048 | 38.8 | 44.6 | 1.15x | S+1/2P wins across all 5 shapes. Gain scales with K-tile count per launch: decode (few K-tiles, barriers not dominant) gains ~4%; prefill_m2048 (256 K-tiles, barriers dominant in Design-S) gains 15% - exactly the threshold the sprint plan set for INT4, which INT4 narrowly missed at 1.05x vs 3x standalone matmul_int4. Different outcome between the variants at the same shape localises the INT4 gap: the INT8 path recovers cleanly under barrier reduction; the INT4 path (per-column f16 scale hoist + nibble- extract dequant) exposes memory-issue latency as the new binding constraint once barriers no longer dominate. cp.async targets that class of stall and is the right candidate for a follow-up contingency sprint - not as an alternative to S+1/2P but as the next bottleneck after S+1/2P succeeds. - sprint_7_3_5.md Performance section: INT8 table replaced with the like-for-like comparison + interpretation paragraph tying the INT8/INT4 asymmetry to the cp.async follow-up. - PHASE_7_LOG.md 7.3.5 headline: added the 1.15x like-for-like result alongside the INT4 Measured-not-shipped status.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
No description provided.