A ternary ({-1, 0, +1}) inference engine built from scratch in C, targeting Apple Silicon NEON. Includes optimised matmul kernels and a complete transformer forward pass. Companion code for the blog post "What If Neural Networks Don't Need Floating Point?".
| Kernel | µs | GOPS | vs NEON FP32 | vs Apple BLAS |
|---|---|---|---|---|
| Apple Accelerate BLAS | 74-95 | 44-57 | — | 1.0x |
| FP32 NEON (hand-written) | 240-250 | 16.8 | 1.0x | 0.3x |
| Ternary LUT (lossless) | 277 | 15.1 | 0.9x | 0.3x |
| Ternary INT8 SDOT | 44-46 | 91-95 | 5.7x | 1.7x |
Random ternary weights, vocab=1000. Benchmarks throughput, not a trained model.
| Model | Layers | Dim | Ternary Params | ms/token | tok/sec |
|---|---|---|---|---|---|
| Small | 6 | 512 | 14.7M | 0.24 | 4,175 |
| Large | 12 | 2048 | 472M | 7.7 | 130 |
The SDOT kernel reads INT8-expanded weights (~472 MB for the large config). The benchmark binary also keeps packed bitmasks resident for V1 (~118 MB), so total live ternary storage is ~590 MB. FP32 equivalent would be ~1.9 GB.
The main win is bandwidth, not just cheaper arithmetic: once matrices escape cache, moving 4x less weight data matters more than saving individual multiplies.
| Batch | µs/row | GOPS |
|---|---|---|
| 1 | 44 | 95 |
| 8 | 44 | 95 |
Per-row cost is flat because weight bandwidth dominates. ARM I8MM (`vmmlaq_s32`) was tested but ran 4x slower than SDOT due to tile-construction overhead: wider instructions don't help when the data layout doesn't match.
Three kernel variants, each optimised for a different tradeoff:
- Naive (`ternary_matvec_naive`): Branch-per-weight reference.
- Packed/LUT (`ternary_matvec_packed`): Bitmask storage + 256-entry sign LUT + NEON FMLA. Lossless FP32 precision, 16x weight compression.
- SDOT (`ternary_matvec_simd`): INT8 weights in a 4-row interleaved layout + NEON SDOT. 5.7x faster than FP32, matching BitNet's INT8 activation precision. Runtime weight footprint is 1 byte/weight (4x compression vs FP32), and the lack of decode overhead is why it beats the more aggressively compressed LUT path.
Complete transformer forward pass with:
- RMSNorm (NEON-accelerated)
- Rotary Position Embeddings (RoPE)
- Multi-head attention with KV cache
- Feed-forward network with ReLU² activation
- All linear layers use ternary SDOT kernel
```
make        # builds bench_bin
make bench  # builds and runs
```

Requires: Apple Silicon (M1+) or any ARMv8.2+ CPU with DOTPROD. Falls back to scalar code on other platforms. Uses Apple Accelerate for the BLAS comparison on macOS.
100 experiments across 6 sessions improved the matmul kernel from 3,693µs to 44µs (84x). Key techniques:
- Sequential memory access beats random gather (3.5x)
- 256-entry sign LUT in L1 beats arithmetic (1.1x)
- Full unroll (16 macros per 64-bit word) eliminates loop overhead (1.5x)
- Branchless inner loop (zero-multiply is free) (1.4x)
- INT8 SDOT processes 16 MACs per instruction (4x)
- 4-row output tiling shares activation loads (1.3x)
- Interleaved weight layout makes all reads sequential (1.1x)
- `ternary.h` / `ternary.c` — Matmul kernels, quantisation, weight packing
- `transformer.h` / `transformer.c` — Transformer forward pass, RMSNorm, RoPE, attention
- `bench.c` — Benchmark harness, correctness checks, profiling, BLAS comparison, GEMM benchmark
- `Makefile` — Build configuration
Other techniques explored: TBL 2-bit expansion, FP16 weights, I8MM `vmmlaq`, PGO, LTO effects, arithmetic sign vectors, `vbslq` masked ops, group sparsity, cache-line alignment, explicit prefetch, `restrict` qualifiers, `__attribute__((hot))`, wider inner-loop unrolls. The kernel achieves 74.5% of the M3 Pro's theoretical INT8 SDOT peak — confirmed at the hardware ceiling via roofline analysis.