devesh-shetty/1bit-inference-engine
1-Bit Inference Engine

A ternary ({-1, 0, +1}) inference engine built from scratch in C, targeting Apple Silicon NEON. Includes optimised matmul kernels and a complete transformer forward pass. Companion code for the blog post "What If Neural Networks Don't Need Floating Point?".

Results (Apple M3 Pro)

Kernel Benchmark (2048×2048 MatVec)

| Kernel | µs | GOPS | vs NEON FP32 | vs Apple BLAS |
|---|---|---|---|---|
| Apple Accelerate BLAS | 74-95 | 44-57 | — | 1.0x |
| FP32 NEON (hand-written) | 240-250 | 16.8 | 1.0x | 0.3x |
| Ternary LUT (lossless) | 277 | 15.1 | 0.9x | 0.3x |
| Ternary INT8 SDOT | 44-46 | 91-95 | 5.7x | 1.7x |

Transformer Forward Pass (Synthetic Benchmark)

Random ternary weights, vocab=1000. Benchmarks throughput, not a trained model.

| Model | Layers | Dim | Ternary Params | ms/token | tok/sec |
|---|---|---|---|---|---|
| Small | 6 | 512 | 14.7M | 0.24 | 4,175 |
| Large | 12 | 2048 | 472M | 7.7 | 130 |

The SDOT kernel reads INT8-expanded weights (~472 MB for the large config). The benchmark binary also keeps packed bitmasks resident for V1 (~118 MB), so total live ternary storage is ~590 MB. FP32 equivalent would be ~1.9 GB.

The main win is bandwidth, not just cheaper arithmetic: once matrices escape cache, moving 4x less weight data matters more than saving individual multiplies.

Batch GEMM (Prefill)

| Batch | µs/row | GOPS |
|---|---|---|
| 1 | 44 | 95 |
| 8 | 44 | 95 |

Per-row cost is constant: weight bandwidth dominates. ARM I8MM (vmmlaq_s32) was tested but was 4x slower than SDOT due to tile-construction overhead — instruction width doesn't help when data layout doesn't match.

Architecture

Matmul Kernels

Three kernel variants, each optimised for a different tradeoff:

  1. Naive (ternary_matvec_naive): Branch-per-weight reference.
  2. Packed/LUT (ternary_matvec_packed): Bitmask storage + 256-entry sign LUT + NEON FMLA. Lossless FP32 precision, 16x weight compression.
  3. SDOT (ternary_matvec_simd): INT8 weights in a 4-row interleaved layout + NEON SDOT. 5.7x faster than the FP32 NEON baseline, and matches BitNet's INT8 activation precision. Runtime weight footprint is 1 byte/weight (4x compression vs FP32); the lack of decode overhead is why it beats the more aggressively compressed LUT path.

Transformer

Complete transformer forward pass with:

  • RMSNorm (NEON-accelerated)
  • Rotary Position Embeddings (RoPE)
  • Multi-head attention with KV cache
  • Feed-forward network with ReLU² activation
  • All linear layers use ternary SDOT kernel

Building

```
make          # builds bench_bin
make bench    # builds and runs
```

Requires: Apple Silicon (M1+) or any ARMv8.2+ with DOTPROD. Falls back to scalar on other platforms. Uses Apple Accelerate for BLAS comparison on macOS.

Key Optimisations (100 experiments)

100 experiments across 6 sessions improved the matmul kernel from 3,693µs to 44µs (84x). Key techniques, with measured speedups:

  1. Sequential memory access beats random gather (3.5x)
  2. 256-entry sign LUT in L1 beats arithmetic (1.1x)
  3. Full unroll (16 macros per 64-bit word) eliminates loop overhead (1.5x)
  4. Branchless inner loop (zero-multiply is free) (1.4x)
  5. INT8 SDOT processes 16 MACs per instruction (4x)
  6. 4-row output tiling shares activation loads (1.3x)
  7. Interleaved weight layout makes all reads sequential (1.1x)

Files

  • ternary.h / ternary.c — Matmul kernels, quantisation, weight packing
  • transformer.h / transformer.c — Transformer forward pass, RMSNorm, RoPE, attention
  • bench.c — Benchmark harness, correctness checks, profiling, BLAS comparison, GEMM benchmark
  • Makefile — Build configuration

Dead Ends (25+ confirmed)

Approaches tested and rejected: TBL 2-bit expansion, FP16 weights, I8MM vmmlaq, PGO, LTO, arithmetic sign vectors, vbslq masked ops, group sparsity, cache-line alignment, explicit prefetch, restrict qualifiers, __attribute__((hot)), and wider inner-loop unrolls. The kernel achieves 74.5% of the M3 Pro's theoretical INT8 SDOT peak — roofline analysis confirms it is at the hardware ceiling.
