All optimization paths exhausted. Engine at 74.5% of M3 Pro hardware peak. No further improvement possible without different hardware or a different algorithm.
- Matmul: 44 µs (95 GOPS), 84x faster than baseline
- Transformer: ~125 tok/sec on a 472M-parameter model
- Weight compression: 16x
- vs Apple BLAS: 1.7x faster
- Code: 1,596 lines of C
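
The arithmetic linking the matmul time, the throughput figure, and the speedup can be checked with a short sketch. It assumes a 128 x 128 x 128 problem counted as 2*M*N*K operations; the summary does not state the matrix size, and this size is chosen only because it is consistent with 44 µs and 95 GOPS. The function names are hypothetical and are not taken from the engine's source.

```c
#include <assert.h>

/* Throughput in GOPS (1e9 operations per second) for an M x N x K matrix
   multiply that takes `seconds`. Counts one multiply and one add per
   inner-product term, i.e. 2*M*N*K operations in total. */
static double matmul_gops(double m, double n, double k, double seconds) {
    assert(seconds > 0.0);
    return (2.0 * m * n * k) / seconds / 1e9;
}

/* Baseline run time implied by an optimized time and a speedup factor. */
static double baseline_seconds(double optimized_seconds, double speedup) {
    assert(speedup > 0.0);
    return optimized_seconds * speedup;
}

/* Average latency per generated token, in milliseconds. */
static double ms_per_token(double tokens_per_second) {
    assert(tokens_per_second > 0.0);
    return 1000.0 / tokens_per_second;
}

/* ASSUMPTION: square 128 x 128 x 128 problem; the size is not in the summary. */
static double summary_matmul_gops(void) {
    const double dim = 128.0;
    const double optimized_s = 44e-6;  /* 44 µs, from the summary */
    return matmul_gops(dim, dim, dim, optimized_s);
}
```

Under that size assumption, `summary_matmul_gops()` evaluates to about 95.3 GOPS, `baseline_seconds(44e-6, 84.0)` gives an implied baseline of about 3.7 ms, and `ms_per_token(125.0)` gives 8 ms per token. If the real benchmark used a different shape or operation count, the first figure would not apply.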