A ternary ({-1, 0, +1}) inference engine built from scratch in C, targeting Apple Silicon NEON. Includes optimised matmul kernels and a complete transformer forward pass. Companion code for the blog post "What If Neural Networks Don't Need Floating Point?".
| Kernel | µs | GOPS | vs NEON FP32 | vs Apple BLAS |
|---|---|---|---|---|
| Apple Accelerate BLAS | 74-95 | 44-57 | — | 1.0x |
| FP32 NEON (hand-written) | 240-250 | 16.8 | 1.0x | 0.3x |
| Ternary LUT (lossless) | 277 | 15.1 | 0.9x | 0.3x |
| Ternary INT8 SDOT | 44-46 | 91-95 | 5.7x | 1.7x |
Random ternary weights, vocab=1000. Benchmarks throughput, not a trained model.
| Model | Layers | Dim | Ternary Params | ms/token | tok/sec |
|---|---|---|---|---|---|
| Small | 6 | 512 | 14.7M | 0.24 | 4,175 |
| Large | 12 | 2048 | 472M | 7.7 | 130 |
The SDOT kernel reads INT8-expanded weights (~472 MB for the large config). The benchmark binary also keeps packed bitmasks resident for V1 (~118 MB), so total live ternary storage is ~590 MB. FP32 equivalent would be ~1.9 GB.
The main win is bandwidth, not just cheaper arithmetic: once matrices escape cache, moving 4x less weight data matters more than saving individual multiplies.
| Batch | µs/row | GOPS |
|---|---|---|
| 1 | 44 | 95 |
| 8 | 44 | 95 |
Per-row cost is flat because weight bandwidth dominates. ARM I8MM (`vmmlaq_s32`) was tested but ran 4x slower than SDOT due to tile-construction overhead: wider instructions don't help when the data layout doesn't match.
Three kernel variants, each optimised for a different tradeoff:
- Naive (`ternary_matvec_naive`): Branch-per-weight reference.
- Packed/LUT (`ternary_matvec_packed`): Bitmask storage + 256-entry sign LUT + NEON FMLA. Lossless FP32 precision, 16x weight compression.
- SDOT (`ternary_matvec_simd`): INT8 weights in a 4-row interleaved layout + NEON SDOT. 5.7x faster than FP32, matching BitNet's INT8 activation precision. Runtime weight footprint is 1 byte/weight (4x compression vs FP32), and the lack of decode overhead is why it beats the more aggressively compressed LUT path.
Complete transformer forward pass with:
- RMSNorm (NEON-accelerated)
- Rotary Position Embeddings (RoPE)
- Multi-head attention with KV cache
- Feed-forward network with ReLU² activation
- All linear layers use ternary SDOT kernel
```
make        # builds bench_bin
make bench  # builds and runs
```

Requires: Apple Silicon (M1+) or any ARMv8.2+ CPU with DOTPROD. Falls back to scalar code on other platforms. Uses Apple Accelerate for the BLAS comparison on macOS.
100 experiments across 6 sessions improved the matmul kernel from 3,693µs to 44µs (84x). Key techniques:
- Sequential memory access beats random gather (3.5x)
- 256-entry sign LUT in L1 beats arithmetic (1.1x)
- Full unroll (16 macros per 64-bit word) eliminates loop overhead (1.5x)
- Branchless inner loop (zero-multiply is free) (1.4x)
- INT8 SDOT processes 16 MACs per instruction (4x)
- 4-row output tiling shares activation loads (1.3x)
- Interleaved weight layout makes all reads sequential (1.1x)
- `ternary.h` / `ternary.c` — Matmul kernels, quantisation, weight packing
- `transformer.h` / `transformer.c` — Transformer forward pass, RMSNorm, RoPE, attention
- `bench.c` — Benchmark harness, correctness checks, profiling, BLAS comparison, GEMM benchmark
- `Makefile` — Build configuration
Other techniques explored: TBL 2-bit expansion, FP16 weights, I8MM `vmmlaq`, PGO, LTO effects, arithmetic sign vectors, `vbslq` masked ops, group sparsity, cache-line alignment, explicit prefetch, `restrict` qualifiers, `__attribute__((hot))`, wider inner-loop unrolls. The kernel achieves 74.5% of the M3 Pro's theoretical INT8 SDOT peak — confirmed at the hardware ceiling via roofline analysis.