Production-grade async AI inference server in Rust — real LLM inference via HuggingFace Candle, dynamic request batching, SSE token streaming, multi-model registry, Prometheus metrics, a Rayon worker pool, and TurboQuant KV cache compression for 7× longer context windows.
axon ships with a three-tier inference engine that degrades gracefully:
| Tier | When active | Behavior |
|---|---|---|
| 1: Candle + configured model | hf_repo set in config + model reachable | Real LLM inference via HuggingFace Candle (Metal / CUDA / CPU) |
| 2: Candle + open-model autodiscovery | hf_repo empty, network available | Downloads SmolLM2-1.7B or Mistral-7B automatically (no token required) |
| 3: ndarray simulation | No network / no GPU / unsupported arch | Original fast simulation (work ∝ dim² × max_tokens); server always starts |
The server always starts. Tier selection is logged at startup. Every layer below the engine — batching, back-pressure, streaming, arena memory, HTTP routes — is unchanged across all three tiers.
- Hardware requirements and model selection
- HuggingFace account setup
- Quickstart — simulation mode (zero setup)
- Real inference — open models (no token)
- Real inference — gated models (Llama 2 / 3)
- Configuration reference
- API reference
- Runtime model loading
- Use cases
- Workspace layout
- TurboQuant KV cache compression
- Architecture decisions
- Key concepts by location
- Running tests and benchmarks
- Docker
- AWS deployment
- Safety
- Known limitations
To run simulation mode (Tier 3) — which requires nothing beyond a modern laptop:
- Any CPU
- 512 MB RAM
- No GPU required
- No network required
| Model | Parameters | Disk (f16) | RAM needed | Minimum hardware | Recommended |
|---|---|---|---|---|---|
| SmolLM2-1.7B | 1.7 B | 3.4 GB | 4 GB | Any modern laptop (2019+) | Best starting point |
| Phi-3 Mini | 3.8 B | 7.6 GB | 9 GB | 16 GB RAM laptop | Strong reasoning for size |
| Mistral 7B | 7 B | 14 GB | 16 GB | 32 GB RAM or GPU | Best open 7B |
| Model | Parameters | Disk (f16) | RAM needed | Minimum hardware | Notes |
|---|---|---|---|---|---|
| Llama 3.2 3B | 3 B | 6 GB | 8 GB | 16 GB RAM | Best Llama for dev machines |
| Llama 2 7B | 7 B | 13.5 GB | 16 GB | 32 GB RAM or 8 GB VRAM | Classic baseline |
| Llama 3 8B | 8 B | 16 GB | 18 GB | 32 GB RAM or 10 GB VRAM | Best open-weights 8B |
| Llama 2 13B | 13 B | 26 GB | 28 GB | 32 GB RAM + Metal/CUDA | High quality, heavy |
| Llama 3 70B | 70 B | 140 GB (f16) / ~40 GB (Q4) | 48 GB+ | Mac Studio M2 Ultra 192 GB or A100 | Not recommended for dev |
Apple Silicon has unified memory shared between CPU and GPU. There is no separate VRAM limit — the model lives in the same pool as the OS and other apps. Use device = "metal" for GPU acceleration via Candle's Metal backend.
| Chip | Total memory | What fits comfortably | Recommended model |
|---|---|---|---|
| M1 / M2 8 GB | 8 GB | SmolLM2-1.7B (f16) only | SmolLM2-1.7B |
| M1 / M2 16 GB | 16 GB | SmolLM2, Phi-3, Llama 3.2 3B | Llama 3.2 3B |
| M1 Pro / M2 Pro 32 GB | 32 GB | Any 7B f16, Llama 3 8B | Llama 2 7B or Mistral 7B |
| M2 Max / M3 Max 64 GB | 64 GB | Llama 3 8B, 13B | Llama 3 8B or 13B |
| M4 Max 128 GB | 128 GB | Everything up to 70B Q4 | Llama 3 8B recommended for daily use |
| M2 Ultra 192 GB | 192 GB | 70B models in f16 | Llama 3 70B |
Expected inference speed on Apple Silicon with Metal (f16):
| Model | M1 16 GB | M2 Pro 32 GB | M3 Max 64 GB | M4 Max 128 GB |
|---|---|---|---|---|
| SmolLM2-1.7B | ~50 tok/s | ~70 tok/s | ~80 tok/s | ~90 tok/s |
| Llama 3.2 3B | — | ~45 tok/s | ~55 tok/s | ~60 tok/s |
| Llama 2 7B | — | ~22 tok/s | ~28 tok/s | ~30 tok/s |
| Llama 3 8B | — | ~18 tok/s | ~22 tok/s | ~25 tok/s |
| GPU | VRAM | What fits (f16) | Notes |
|---|---|---|---|
| RTX 3060 / 4060 | 12 GB | SmolLM2, Phi-3, Llama 3.2 3B | Good dev GPU |
| RTX 3090 / 4090 | 24 GB | Any 7B / 8B f16 | Ideal workstation |
| A10G (AWS g5) | 24 GB | Any 7B / 8B f16 | Cloud standard |
| A100 40 GB | 40 GB | Llama 3 8B, Llama 2 13B | Production cloud |
| A100 80 GB | 80 GB | Llama 2 13B f16; Llama 2 70B only when quantized (f16 needs ~140 GB) | Production cloud |
| H100 80 GB | 80 GB | Llama 2 13B f16; Llama 2 70B only when quantized (f16 needs ~140 GB) | Fastest available |
To use CUDA, set device = "cuda:0" in the model config and replace the metal feature with features = ["cuda"] on the candle-core workspace dependency.
CPU inference works with no feature changes. Use device = "cpu" and prefer smaller models. f32 is faster than f16 on CPU because f16 operations are emulated:
| Model | RAM | Speed on modern 8-core CPU | Practical? |
|---|---|---|---|
| SmolLM2-1.7B (f32) | 7 GB | ~4 tok/s | Yes — usable for dev |
| Phi-3-mini (f32) | 15 GB | ~2 tok/s | Slow but works |
| Llama 2 7B (f32) | 28 GB | ~0.5 tok/s | Very slow; prefer a smaller or quantized model |
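The disk and RAM figures in the tables above track a simple rule of thumb: parameter count times bytes per weight. The sketch below is illustrative only; `DType` and `weights_gb` are hypothetical names, not part of axon, and the estimate excludes KV cache and runtime overhead.

```rust
/// Weight precisions listed in the configuration reference.
#[derive(Clone, Copy)]
pub enum DType {
    F16,
    BF16,
    F32,
}

impl DType {
    /// Bytes used to store one weight at this precision.
    pub fn bytes_per_param(self) -> f64 {
        match self {
            DType::F16 | DType::BF16 => 2.0,
            DType::F32 => 4.0,
        }
    }
}

/// Approximate weight footprint in GB (10^9 bytes).
/// Excludes KV cache, activations, and allocator overhead.
pub fn weights_gb(params_billion: f64, dtype: DType) -> f64 {
    params_billion * dtype.bytes_per_param()
}
```

For example, `weights_gb(1.7, DType::F16)` gives about 3.4 GB for SmolLM2-1.7B, and `weights_gb(7.0, DType::F32)` gives 28 GB for a 7B model on CPU.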
The following models require no HuggingFace account, no token:
- HuggingFaceTB/SmolLM2-1.7B-Instruct (Apache 2.0)
- microsoft/Phi-3-mini-4k-instruct (MIT)
- mistralai/Mistral-7B-Instruct-v0.3 (Apache 2.0)
Just set hf_repo in your config and run. Files download on first startup.
Models from Meta (Llama family) and Google (Gemma) are gated and require four steps:
Step 1 — Create a HuggingFace account
Go to huggingface.co and sign up. Free accounts work.
Step 2 — Accept the model license
Visit the model card and click "Agree and access repository":
| Model | License page |
|---|---|
| Llama 3.2 3B | huggingface.co/meta-llama/Llama-3.2-3B-Instruct |
| Llama 3 8B | huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct |
| Llama 2 7B | huggingface.co/meta-llama/Llama-2-7b-chat-hf |
| Llama 2 13B | huggingface.co/meta-llama/Llama-2-13b-chat-hf |
| Gemma 7B | huggingface.co/google/gemma-7b-it |
Approval is usually instant for Llama 3.x. Llama 2 may take a few minutes. You receive a confirmation email.
Step 3 — Create a read access token
- Go to huggingface.co/settings/tokens
- Click New token
- Type: Read
- Name it anything (e.g., axon-local)
- Copy the hf_... value
Step 4 — Set the token in your environment
export HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Add to ~/.zshrc or ~/.bashrc to persist across terminal sessions:
echo 'export HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx' >> ~/.zshrc
source ~/.zshrc
Never commit the token to source control. axon reads it from the environment, never from config files.
Models download once to ~/.cache/huggingface/hub/. Subsequent server startups load from cache instantly.
# Check what has been downloaded
du -sh ~/.cache/huggingface/hub/
# List cached model snapshots (snapshot directories are named by commit hash)
ls ~/.cache/huggingface/hub/models--*/snapshots/*/
# Remove a cached model to free disk space
rm -rf ~/.cache/huggingface/hub/models--HuggingFaceTB--SmolLM2-1.7B-Instruct/
# Move cache to a different drive (e.g., external SSD)
export HF_HOME=/Volumes/SSD/hf-cache
Approximate download times on a 100 Mbps connection:
| Model | Size | Download time |
|---|---|---|
| SmolLM2-1.7B | 3.4 GB | ~5 min |
| Phi-3 Mini 3.8B | 7.6 GB | ~10 min |
| Llama 3.2 3B | 6 GB | ~8 min |
| Mistral 7B / Llama 2 7B | 14 GB | ~20 min |
| Llama 3 8B | 16 GB | ~22 min |
No GPU, no internet, no token required. Works on any machine.
Prerequisites: Rust 1.75+ (rustup update stable)
git clone https://github.com/dattgoswami/axon
cd axon
cargo run --release -p axon-server
Expected startup output:
WARN axon_server: HF_TOKEN not set — gated models (Llama 2/3/3.2) will fail; open models autodiscovered automatically
WARN axon_worker::engine: real inference unavailable — using ndarray simulation, model_id: default, error: all autodiscovery candidates unreachable — falling back to simulation
INFO axon_server: axon starting port=3000
INFO axon_server: listening on 0.0.0.0:3000
The WARN messages are expected — axon is falling back to simulation because there's no network or token configured. The server is fully functional.
# Health check
curl -s http://localhost:3000/v1/health | jq
# {
# "status": "ok",
# "models_loaded": 1,
# "requests_total": 0,
# "requests_active": 0
# }
# Non-streaming inference (model_id "default" is pre-loaded)
curl -s -X POST http://localhost:3000/v1/generate \
-H 'Content-Type: application/json' \
-d '{
"id": "00000000-0000-0000-0000-000000000001",
"model_id": "default",
"prompt": "Hello, axon",
"max_tokens": 20,
"temperature": 0.7,
"stream": false
}' | jq
# {
# "id": "00000000-0000-0000-0000-000000000001",
# "model_id": "default",
# "text": "[axon/sim] generated 20 tokens for model 'default' (dim=64)",
# "tokens_generated": 20,
# "latency_ms": 0
# }
# SSE streaming
curl -N -X POST http://localhost:3000/v1/generate/stream \
-H 'Content-Type: application/json' \
-d '{
"id": "00000000-0000-0000-0000-000000000002",
"model_id": "default",
"prompt": "Stream me tokens",
"max_tokens": 5,
"temperature": 0.0,
"stream": true
}'
# data: token_0
# data: token_1
# data: token_2
# data: token_3
# data: token_4
# event: done
# data:
# Prometheus metrics
curl -s http://localhost:3000/v1/metrics
Best choice for first real inference. Small enough to fit on any 8 GB machine, fast on Apple Silicon or any GPU, no token required.
Step 1 — Edit config/default.toml
Uncomment and save the SmolLM2 block:
[[models]]
id = "smollm2"
hf_repo = "HuggingFaceTB/SmolLM2-1.7B-Instruct"
revision = "main"
dtype = "f16"
device = "metal" # Apple Silicon GPU — change to "cpu" if no Metal
dim = 2048
simulated_latency_ms = 0
max_rps = 0
For CPU-only machines (Intel Mac, Linux without GPU, Windows):
device = "cpu"
dtype = "f32" # f32 is faster than f16 on CPU
Step 2: Run
cargo run --release -p axon-server
# First run: downloads 3.4 GB to ~/.cache/huggingface/hub/
# Subsequent runs: loads from cache in ~3-5 seconds
Expected startup output (first run):
WARN axon_server: HF_TOKEN not set — ...open models autodiscovered automatically
INFO axon_worker::engine: real inference engine ready, model_id: smollm2, hf_repo: HuggingFaceTB/SmolLM2-1.7B-Instruct, device: metal
INFO axon_server: axon starting port=3000
INFO axon_server: listening on 0.0.0.0:3000
Step 3 — Test
curl -s -X POST http://localhost:3000/v1/generate \
-H 'Content-Type: application/json' \
-d '{
"id": "00000000-0000-0000-0000-000000000003",
"model_id": "smollm2",
"prompt": "Explain Rust ownership in one sentence:",
"max_tokens": 80,
"temperature": 0.3,
"stream": false
}' | jq
Better reasoning than SmolLM2 at the cost of 2x the size. Good for 16 GB+ machines.
Edit config/default.toml:
[[models]]
id = "phi3"
hf_repo = "microsoft/Phi-3-mini-4k-instruct"
revision = "main"
dtype = "f16"
device = "metal"
dim = 3072
simulated_latency_ms = 0
max_rps = 0
Note: Phi-3 uses a different model architecture (phi3) than Llama. The current engine supports Llama-compatible architectures only. If config.json parsing fails (error: non-Llama arch), axon falls back to simulation and logs a WARN. Phi-3 support requires adding the candle_transformers::models::phi3 code path to engine.rs; see Extending axon — adding new architectures.
Best fully open 7B model. Llama-compatible architecture — works with the current engine. Requires 16 GB+ RAM or a GPU with 14 GB VRAM.
Edit config/default.toml:
[[models]]
id = "mistral-7b"
hf_repo = "mistralai/Mistral-7B-Instruct-v0.3"
revision = "main"
dtype = "f16"
device = "metal"
dim = 4096
simulated_latency_ms = 0
max_rps = 0
You can serve multiple models at once. Add multiple [[models]] blocks:
[[models]]
id = "fast"
hf_repo = "HuggingFaceTB/SmolLM2-1.7B-Instruct"
dtype = "f16"
device = "metal"
dim = 2048
[[models]]
id = "quality"
hf_repo = "mistralai/Mistral-7B-Instruct-v0.3"
dtype = "f16"
device = "metal"
dim = 4096
Route requests to different models by model_id:
# Fast model for autocomplete
curl -X POST http://localhost:3000/v1/generate \
-H 'Content-Type: application/json' \
-d '{"id":"...","model_id":"fast","prompt":"Complete: def fibonacci(","max_tokens":20,"temperature":0.1,"stream":false}'
# Quality model for complex tasks
curl -X POST http://localhost:3000/v1/generate \
-H 'Content-Type: application/json' \
-d '{"id":"...","model_id":"quality","prompt":"Explain gradient descent:","max_tokens":200,"temperature":0.5,"stream":false}'
These require HF_TOKEN to be set. See HuggingFace account setup first.
Best balance of quality and size for developer machines. The f16 weights need about 8 GB of RAM, so plan for a 16 GB machine.
Prerequisites: HF_TOKEN set, license accepted at huggingface.co/meta-llama/Llama-3.2-3B-Instruct
export HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Edit config/default.toml:
[[models]]
id = "llama3-3b"
hf_repo = "meta-llama/Llama-3.2-3B-Instruct"
revision = "main"
dtype = "f16"
device = "metal"
dim = 3072
simulated_latency_ms = 0
max_rps = 0
Run:
cargo run --release -p axon-server
# First run: downloads ~6 GB
Inference with Llama 3 chat template:
Llama 3 uses a special chat format with control tokens. The prompt must include the correct template for instruct models:
curl -s -X POST http://localhost:3000/v1/generate \
-H 'Content-Type: application/json' \
-d '{
"id": "00000000-0000-0000-0000-000000000004",
"model_id": "llama3-3b",
"prompt": "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nWhat is ownership in Rust?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
"max_tokens": 200,
"temperature": 0.7,
"stream": false
}' | jq .text
The classic Meta 7B model. Higher quality than Llama 3.2 3B for complex tasks but requires more RAM.
Prerequisites: HF_TOKEN set, license accepted at huggingface.co/meta-llama/Llama-2-7b-chat-hf
[[models]]
id = "llama2-7b"
hf_repo = "meta-llama/Llama-2-7b-chat-hf"
revision = "main"
dtype = "f16"
device = "metal"
dim = 4096
simulated_latency_ms = 0
max_rps = 0
Inference with Llama 2 chat template:
curl -s -X POST http://localhost:3000/v1/generate \
-H 'Content-Type: application/json' \
-d '{
"id": "00000000-0000-0000-0000-000000000005",
"model_id": "llama2-7b",
"prompt": "[INST] Explain Rust lifetimes as if I am a Java engineer [/INST]",
"max_tokens": 300,
"temperature": 0.7,
"stream": false
}' | jq .text
Latest generation, best capability-per-parameter open-weights model. Requires 18 GB RAM or a GPU with 16 GB VRAM. ~25 tok/s on M4 Max.
Prerequisites: HF_TOKEN set, license accepted at huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct
[[models]]
id = "llama3-8b"
hf_repo = "meta-llama/Meta-Llama-3-8B-Instruct"
revision = "main"
dtype = "f16"
device = "metal"
dim = 4096
simulated_latency_ms = 0
max_rps = 0
Use the same Llama 3 chat template as above (<|begin_of_text|>...).
For machines with 32 GB+ RAM. Noticeably better output quality than 7B for multi-step reasoning.
[[models]]
id = "llama2-13b"
hf_repo = "meta-llama/Llama-2-13b-chat-hf"
revision = "main"
dtype = "f16"
device = "metal"
dim = 5120
simulated_latency_ms = 0
max_rps = 0
[server]
port = 3000
# Timeout in ms before returning 504 to the caller.
# Increase for large models: Llama 3 8B generating 512 tokens takes ~20 seconds.
request_timeout_ms = 30000
[batching]
# Flush when this many requests accumulate (biased toward full batches under load).
# For real models, keep this low (2–8) — GPU inference is compute-bound, not I/O-bound.
max_batch_size = 4
# Flush after this many ms regardless of batch fill.
max_wait_ms = 5
[worker_pool]
# Rayon CPU threads. 0 = num_cpus. Real inference on GPU rarely benefits from > 1 thread.
threads = 0
[kv_cache]
# Pre-allocated f32 slots for the ndarray simulation arena (Tier 3 only).
# 1_048_576 = 4 MB dev default. 67_108_864 = 256 MB for heavy simulation load.
# Real Candle inference ignores this value (Candle manages its own tensor memory).
capacity_f32_slots = 1048576
[kv_quantization]
# TurboQuant KV cache compression. Disabled by default.
# When enabled, the QuantizedKvArena is sized to the same byte budget as the f32 arena
# (capacity_f32_slots × 4 bytes), but stores ~7× more tokens at 4-bit.
enabled = false
bits = 4 # 4 = zero quality loss (paper result); 3 = 9× compression, minor degradation
# ── [[models]] — one block per model to pre-load at startup ──────────────────
#
# All fields except `id` have defaults; you only need to set what differs
# from the default.
#
# Field descriptions:
# id Unique name; used as model_id in API requests
# hf_repo HuggingFace repo string (e.g. "meta-llama/Llama-2-7b-hf")
# Empty string = autodiscover open model (Tier 2)
# revision Git branch, tag, or commit hash. Default: "main"
# dtype Weight precision. "f16" (default) | "bf16" | "f32"
# f16: best for Metal/CUDA. f32: better for CPU-only.
# device "metal" (Apple GPU, default on macOS)
# "cpu" (any machine, slower)
# "cuda:N" (NVIDIA GPU N, Linux/Windows)
# dim Hidden dimension from model config.json (hidden_size).
# Unused by real inference; only affects Tier 3 simulation cost.
# simulated_latency_ms Artificial delay per batch in ms. Set to 0 for real inference.
# max_rps Max requests/second for this model. 0 = unlimited.
All config fields can be overridden with environment variables named AXON__<SECTION>__<KEY>:
AXON__SERVER__PORT=8080
AXON__SERVER__REQUEST_TIMEOUT_MS=60000
AXON__BATCHING__MAX_BATCH_SIZE=8
AXON__BATCHING__MAX_WAIT_MS=10
AXON__WORKER_POOL__THREADS=4
AXON__KV_CACHE__CAPACITY_F32_SLOTS=67108864
HF_TOKEN is not read from config/default.toml; always pass it as a shell environment variable.
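To make the AXON__<SECTION>__<KEY> naming convention concrete, here is a minimal sketch of how such a variable name could be split into a config section and key. `parse_override` is a hypothetical helper written for illustration; axon's actual config loader may work differently.

```rust
/// Split an `AXON__<SECTION>__<KEY>` variable name into lowercase
/// (section, key). Returns None for names outside the convention.
pub fn parse_override(name: &str) -> Option<(String, String)> {
    // Strip the fixed prefix, then split at the first double underscore.
    let rest = name.strip_prefix("AXON__")?;
    let (section, key) = rest.split_once("__")?;
    if section.is_empty() || key.is_empty() {
        return None;
    }
    Some((section.to_lowercase(), key.to_lowercase()))
}
```

Under this sketch, `AXON__KV_CACHE__CAPACITY_F32_SLOTS` maps to section `kv_cache` and key `capacity_f32_slots`, while `HF_TOKEN` is not matched at all.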
Apple Silicon (Metal)
[server]
request_timeout_ms = 30000 # 30 s is safe for 7B models
[batching]
max_batch_size = 4 # Metal serialises GPU work; small batches reduce latency
max_wait_ms = 5
NVIDIA GPU (CUDA)
[batching]
max_batch_size = 16 # GPUs parallelise batch dimensions efficiently
max_wait_ms = 10
CPU only
[server]
request_timeout_ms = 120000 # CPU inference is slow; allow 2 minutes for large models
[batching]
max_batch_size = 1 # One at a time; CPU can't parallelise model layers
max_wait_ms = 1
| Method | Path | Body | Success |
|---|---|---|---|
| GET | /v1/health | none | 200 HealthResponse |
| POST | /v1/generate | GenerateRequest | 200 InferenceResponse |
| POST | /v1/generate/stream | GenerateRequest | 200 SSE stream |
| GET | /v1/models | none | 200 [ModelConfig] |
| POST | /v1/models/load | ModelConfig | 201 Created |
| GET | /v1/metrics | none | 200 Prometheus text |
{
"id": "UUID v4 — unique per request",
"model_id": "string — must match a loaded model",
"prompt": "string — 1 to 32 768 characters",
"max_tokens": "integer — 1 to 4096",
"temperature": "float — 0.0 (greedy/deterministic) to 2.0 (highly random)",
"stream": "boolean — true for SSE token stream, false for full response"
}
temperature = 0.0 uses argmax (greedy decoding), which is deterministic and best for code or factual answers.
temperature = 0.7 is a good default for natural language.
temperature > 1.0 produces increasingly random output.
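The effect of temperature can be illustrated with a short sketch: at 0.0 the largest logit is picked directly, and at any positive value the logits are divided by the temperature before softmax. This is a generic illustration, not axon's sampling code, and the function names are assumptions.

```rust
/// Index of the largest logit (greedy decoding, the temperature = 0.0 case).
/// Expects a non-empty slice.
pub fn argmax(logits: &[f32]) -> usize {
    let mut best = 0;
    for (i, &v) in logits.iter().enumerate() {
        if v > logits[best] {
            best = i;
        }
    }
    best
}

/// Softmax over `logits / temperature`. Requires temperature > 0.0.
/// Lower temperatures concentrate probability on the top logit.
pub fn softmax_with_temperature(logits: &[f32], temperature: f32) -> Vec<f32> {
    // Subtract the maximum first for numerical stability.
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits
        .iter()
        .map(|&l| ((l - max) / temperature).exp())
        .collect();
    let sum: f32 = exps.iter().sum();
    exps.into_iter().map(|e| e / sum).collect()
}
```

With logits `[1.0, 3.0, 2.0]`, the top token receives a larger share of the probability mass at temperature 0.5 than at 2.0, which is why higher temperatures read as more random.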
{
"id": "UUID — echoed from request",
"model_id": "string",
"text": "string — generated text (empty on engine error)",
"tokens_generated": "integer — actual tokens produced (may be < max_tokens if EOS hit)",
"latency_ms": "integer — wall time from engine start to response"
}
All fields except id have defaults, so you only need to set fields that differ:
{
"id": "string — unique model name (required)",
"hf_repo": "string — HuggingFace repo (default: empty = autodiscover)",
"revision": "string — git ref (default: main)",
"dtype": "string — f16 | bf16 | f32 (default: f16)",
"device": "string — metal | cpu | cuda:N (default: metal on macOS, cpu elsewhere)",
"dim": "integer — hidden_size from config.json (default: 0; unused by real engine)",
"simulated_latency_ms": "integer — artificial delay, ms (default: 0)",
"max_rps": "integer — rate cap, 0 = unlimited (default: 0)"
}
{
"status": "ok",
"models_loaded": 2,
"requests_total": 1024,
"requests_active": 3
}
data: Once
data: upon
data: a
data: time
event: done
data:
Each data: line is one decoded token string. The final event: done signals end of generation.
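A client consuming this stream only has to collect `data:` payloads until it sees `event: done`. The following is a minimal sketch of that loop over an already-received body; `parse_sse` is a hypothetical helper, and a real client would read the response incrementally.

```rust
/// Collect token payloads from an SSE body until `event: done` is seen.
/// Returns the tokens and whether the stream finished cleanly.
pub fn parse_sse(body: &str) -> (Vec<String>, bool) {
    let mut tokens = Vec::new();
    for line in body.lines() {
        if let Some(event) = line.strip_prefix("event:") {
            if event.trim() == "done" {
                return (tokens, true);
            }
        } else if let Some(data) = line.strip_prefix("data:") {
            // SSE drops a single leading space after the colon, if present.
            tokens.push(data.strip_prefix(' ').unwrap_or(data).to_string());
        }
    }
    // The body ended without a done event (for example, a dropped connection).
    (tokens, false)
}
```

Returning the `finished` flag separately lets a caller distinguish a complete generation from a truncated one.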
| Status | Condition |
|---|---|
| 404 Not Found | model_id not in registry |
| 422 Unprocessable Entity | Request validation failed (prompt too long, max_tokens out of range, etc.) |
| 429 Too Many Requests | max_rps exceeded for this model |
| 503 Service Unavailable | Internal dispatch channel full (server overloaded) |
| 504 Gateway Timeout | Inference did not complete within request_timeout_ms |
In addition to pre-configuring models in config/default.toml, you can load models dynamically at runtime via the API. The engine initialisation (including any weight download) runs on a background thread so the POST returns immediately with 201 Created.
# Load SmolLM2 at runtime (no token needed)
curl -X POST http://localhost:3000/v1/models/load \
-H 'Content-Type: application/json' \
-d '{
"id": "smollm2",
"hf_repo": "HuggingFaceTB/SmolLM2-1.7B-Instruct",
"dtype": "f16",
"device": "metal"
}'
# 201 Created
# Load a simulation model (no download, instant)
curl -X POST http://localhost:3000/v1/models/load \
-H 'Content-Type: application/json' \
-d '{
"id": "sim-large",
"dim": 512,
"simulated_latency_ms": 100
}'
# 201 Created
# Load a gated model (HF_TOKEN must be in the server's environment)
curl -X POST http://localhost:3000/v1/models/load \
-H 'Content-Type: application/json' \
-d '{
"id": "llama3-3b",
"hf_repo": "meta-llama/Llama-3.2-3B-Instruct",
"dtype": "f16",
"device": "metal"
}'
# 201 Created (download starts in background)
# List all loaded models
curl -s http://localhost:3000/v1/models | jq '.[].id'
Note: after POST /v1/models/load returns, the engine download may still be in progress in the background. Requests to that model will be served by Tier 3 simulation until the real engine is ready.
axon coalesces concurrent requests into batches automatically. A batch flushes when it reaches max_batch_size or max_wait_ms elapses. Under heavy load, full batches dispatch immediately. Under light load, no request waits longer than max_wait_ms (5 ms default).
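The deadline-or-size rule can be sketched with standard-library channels. axon itself uses a Tokio `biased select!` loop in axon-batch; the version below swaps in `std::sync::mpsc` purely so the example is self-contained, and `next_batch` is a hypothetical name.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::{Duration, Instant};

/// Collect one batch: block for the first item, then keep receiving until
/// `max_batch_size` items have arrived or `max_wait` has elapsed since the
/// first item. Returns None once the channel is closed and drained.
pub fn next_batch<T>(
    rx: &Receiver<T>,
    max_batch_size: usize,
    max_wait: Duration,
) -> Option<Vec<T>> {
    let first = rx.recv().ok()?;
    let deadline = Instant::now() + max_wait;
    let mut batch = vec![first];
    while batch.len() < max_batch_size {
        // Time left before the deadline; zero once it has passed.
        let remaining = deadline.saturating_duration_since(Instant::now());
        match rx.recv_timeout(remaining) {
            Ok(item) => batch.push(item),
            Err(RecvTimeoutError::Timeout) | Err(RecvTimeoutError::Disconnected) => break,
        }
    }
    Some(batch)
}
```

Starting the deadline at the first item, rather than at the previous flush, is what bounds the wait of any single request to `max_wait` under light load.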
# Ten services fire concurrently — axon assembles them into one batch
for i in $(seq 1 10); do
curl -s -X POST http://localhost:3000/v1/generate \
-H 'Content-Type: application/json' \
-d "{\"id\":\"$(uuidgen | tr '[:upper:]' '[:lower:]')\",\"model_id\":\"smollm2\",\"prompt\":\"Summarize: $i\",\"max_tokens\":50,\"temperature\":0.5,\"stream\":false}" &
done
wait
curl -N -X POST http://localhost:3000/v1/generate/stream \
-H 'Content-Type: application/json' \
-d '{
"id": "00000000-0000-0000-0000-000000000010",
"model_id": "smollm2",
"prompt": "Write a short poem about Rust:",
"max_tokens": 60,
"temperature": 0.8,
"stream": true
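}'
Browser integration (EventSource only issues GET requests, so the POST stream is read with fetch; request is a GenerateRequest object with "stream": true):
const res = await fetch('/v1/generate/stream', { method: 'POST',
  headers: { 'Content-Type': 'application/json' }, body: JSON.stringify(request) });
const reader = res.body.pipeThrough(new TextDecoderStream()).getReader();
// Each chunk is raw SSE text ("data: ..." lines); stop when the stream ends.
for (let c = await reader.read(); !c.done; c = await reader.read()) console.log(c.value);
# config/default.toml: serve two models simultaneously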
[[models]]
id = "fast"
hf_repo = "HuggingFaceTB/SmolLM2-1.7B-Instruct"
dtype = "f16"
device = "metal"
[[models]]
id = "quality"
hf_repo = "mistralai/Mistral-7B-Instruct-v0.3"
dtype = "f16"
device = "metal"
max_rps = 5 # protect the larger model from overload
curl -s http://localhost:3000/v1/metrics
# axon_requests_total 4200
# axon_requests_active 12
# axon_tokens_generated_total 86400
# axon_batches_dispatched_total 318
# axon_mean_batch_size 13.2
# axon_arena_utilization_ratio 0.32
Use axon_requests_active as an autoscaling signal: scale out when > 50, scale in when < 5.
# Load a model with a cap of 10 requests per second (max_rps)
curl -X POST http://localhost:3000/v1/models/load \
-H 'Content-Type: application/json' \
-d '{"id":"expensive","hf_repo":"mistralai/Mistral-7B-Instruct-v0.3","max_rps":10}'
# 11th request within the same second → 429 Too Many Requests
axon/
├── Cargo.toml workspace root — all shared dependency versions declared here
├── config/
│ └── default.toml server, batching, kv_cache, [[models]] configuration
├── axon-core/ shared domain types (InferenceRequest, InferenceResponse,
│ ModelConfig), typed errors, lock-free atomic metrics,
│ JSON and bincode codecs
├── axon-batch/ BatchAssembler — deadline-or-size flush loop
│ biased select! races size-trigger vs deadline-trigger
├── axon-worker/ WorkerPool — spawn_blocking → Rayon CPU bridge
│ InferenceEngine — three-tier Candle/simulation engine
├── axon-cache/ KvArena — 64-byte-aligned bump allocator
│ QuantizedKvArena — byte-level bump allocator for compressed KV slots
│ ALL unsafe code in the repo is isolated here
├── axon-quant/ TurboQuant algorithm — rotation, codebooks, MSE/Prod encode/decode,
│ bit packing, slot serialisation (no server deps; pure algorithm crate)
├── axon-macros/ proc macros: #[derive(Validated)], #[inference_route], schema!{}
└── axon-server/ axum 0.7 HTTP server
├── src/
│ ├── main.rs startup sequence (config → arena → worker pool → routes)
│ ├── config.rs AppConfig with [[models]] array support
│ ├── state.rs AppState — all shared handles (registry, assembler, pool, arena)
│ ├── dispatcher.rs dispatch loop: batch → worker pool → response routing
│ └── routes/
│ ├── generate.rs POST /v1/generate (non-streaming, with timeout)
│ ├── stream.rs POST /v1/generate/stream (SSE)
│ ├── models.rs GET/POST /v1/models and /v1/models/load
│ ├── health.rs GET /v1/health
│ └── metrics.rs GET /v1/metrics (Prometheus text format)
└── benches/ Criterion benchmarks: batch throughput, serialization, KV cache
Dependency graph (no cycles):
axon-server → axon-batch → axon-core
→ axon-worker → axon-cache
→ axon-quant
→ axon-core
→ axon-macros
→ axon-cache
→ axon-core
Without compression, every generated token costs ~128 KB of KV cache memory (for a 128-dimensional head). A 256 MB arena (capacity_f32_slots = 67108864) fills after roughly 2,000 tokens (256 MB / 128 KB ≈ 2,048) and inference degrades to heap allocation. Long-context tasks such as summarising a long document, multi-turn chat, or needle-in-haystack retrieval are effectively out of reach.
TurboQuant compresses the KV cache on the fly, with no training and no accuracy loss at 4-bit, letting the same 256 MB arena hold 14,000+ tokens instead.
TurboQuant is a two-stage vector quantizer from Google Research (ICLR 2026):
- Random rotation — multiply the KV vector by a random orthogonal matrix Π. This spreads the energy uniformly across coordinates, so each one looks like an independent Gaussian. That makes scalar quantization provably near-optimal.
- Per-coordinate quantization — apply a precomputed Lloyd-Max codebook (1–4 bits per coordinate). The codebook is fixed and tiny; there is nothing to learn or fine-tune.
The optional TurboQuantProd variant adds a third step: a 1-bit QJL sketch of the quantization residual. This makes attention dot-products unbiased in expectation — important for long-context needle-in-haystack tasks where a slightly wrong attention score can cause the model to miss the relevant token.
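The per-coordinate step can be illustrated on its own. The sketch below covers only nearest-centroid quantization against a fixed 2-bit codebook, assuming the coordinates have already been rotated and scaled to roughly unit variance; the rotation and the QJL residual sketch are omitted. The centroid values are the standard Lloyd-Max levels for a unit Gaussian, and the function names are illustrative rather than the axon-quant API (which, per the module table, uses binary search instead of the linear scan shown here).

```rust
/// Lloyd-Max centroids for a zero-mean, unit-variance Gaussian at 2 bits.
pub const CODEBOOK_2BIT: [f32; 4] = [-1.510, -0.4528, 0.4528, 1.510];

/// Index of the centroid nearest to `x` (linear scan for clarity).
pub fn quantize(x: f32, codebook: &[f32]) -> u8 {
    let mut best = 0usize;
    for (i, &c) in codebook.iter().enumerate() {
        if (x - c).abs() < (x - codebook[best]).abs() {
            best = i;
        }
    }
    best as u8
}

/// Centroid value for a stored index.
pub fn dequantize(index: u8, codebook: &[f32]) -> f32 {
    codebook[index as usize]
}

/// Quantize every coordinate of an (already rotated) vector.
pub fn encode(v: &[f32], codebook: &[f32]) -> Vec<u8> {
    v.iter().map(|&x| quantize(x, codebook)).collect()
}

/// Reconstruct an approximation of the rotated vector from its indices.
pub fn decode(indices: &[u8], codebook: &[f32]) -> Vec<f32> {
    indices.iter().map(|&i| dequantize(i, codebook)).collect()
}
```

Because the codebook is a fixed table, encoding needs no per-model calibration, which matches the "nothing to learn or fine-tune" point above.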
axon-quant is a standalone Rust crate (no HTTP or server dependencies) containing the full algorithm:
| Module | What it does |
|---|---|
| codebook.rs | Lloyd-Max centroid tables for b=1,2,3,4 bits; Codebook trait with binary-search quantize / dequantize |
| rotation.rs | TurboQuantState: holds the rotation matrix Π (built once at engine init via QR factorisation); rotate / unrotate; next_bits() for 3.5-bit alternation |
| mse.rs | mse_encode / mse_decode: MSE-optimal path; returns MseEncoded { packed_indices, bits, dim } |
| prod.rs | prod_encode / prod_decode: inner-product-optimal path; stores MSE part + QJL sketch + residual norm; attention scores are unbiased |
| pack.rs | pack_bits / unpack_bits: sub-byte bit packing for arbitrary bit-widths 1–8 |
| slot.rs | Slot header serialisation: method byte, bits, dim, residual norm, packed indices, optional QJL bytes |
| arena.rs | QuantizedKvArena: same bump-allocator pattern as KvArena but over raw u8; allocate(bytes) → QuantizedKvSlot |
| error.rs | QuantError: DimMismatch, UnsupportedBits, PackingError |
| Mode | Bits/coord | KV compression | 256 MB arena context | Quality |
|---|---|---|---|---|
| None (f32 arena) | 32 | 1× | ~2,000 tokens | Baseline |
| TurboQuantMse b=4 | 4 | 7.1× | ~14,000 tokens | Zero loss (paper Table 1) |
| TurboQuantMse b=3.5 (alternating) | 3.5 | 8× | ~16,000 tokens | Zero loss (paper result) |
| TurboQuantMse b=3 | 3 | 9.1× | ~18,000 tokens | Minor degradation on long tasks |
| TurboQuantProd b=4 | ~4 | 2.8× | ~5,600 tokens | Unbiased attention — use for NIAH tasks |
The 3.5-bit target is achieved by TurboQuantState::next_bits() alternating between returning 3 and 4 via an atomic counter. Exact per-vector alternation order is not load-bearing for quality.
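A minimal sketch of that alternation, assuming a single shared counter: `BitAlternator` is a hypothetical stand-in for the relevant part of `TurboQuantState`, and whether 3 or 4 comes first is an arbitrary choice here.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hands out 3 and 4 alternately so the average is 3.5 bits per coordinate.
pub struct BitAlternator {
    counter: AtomicUsize,
}

impl BitAlternator {
    pub fn new() -> Self {
        Self { counter: AtomicUsize::new(0) }
    }

    /// Safe to call from many threads; Relaxed ordering is enough because
    /// only the long-run ratio matters, not which caller gets which width.
    pub fn next_bits(&self) -> u8 {
        if self.counter.fetch_add(1, Ordering::Relaxed) % 2 == 0 {
            3
        } else {
            4
        }
    }
}
```

Over any even number of calls the widths sum to exactly 3.5 bits per call on average, regardless of how the calls interleave across threads.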
The feature is disabled by default. To enable it, add to config/default.toml:
[kv_quantization]
enabled = true
bits = 4 # 4 = zero quality loss; alternate 3/4 by setting to 3 (see next_bits)
Per-model override inside a [[models]] block:
[[models]]
id = "llama3-8b"
hf_repo = "meta-llama/Meta-Llama-3-8B-Instruct"
dtype = "f16"
device = "metal"
dim = 4096
[models.kv_quant]
method = "turbo_quant_mse" # or "turbo_quant_prod" for unbiased attention
bits = 4
rotation_seed = 0xCAFEBABE # same seed → same rotation matrix across restarts; TOML integers must fit in a signed 64-bit value
When enabled, three new Prometheus metrics appear:
axon_kv_quant_encodes_total # cumulative encode operations
axon_kv_quant_encode_latency_us_total # cumulative µs spent encoding
axon_quant_arena_utilization_ratio # quant arena fill fraction (0.0–1.0)
Each compressed KV vector is stored in a QuantizedKvSlot with this binary layout:
Offset Size Field
0 1 B method (0 = MSE, 1 = Prod)
1 1 B bits
2 2 B dim (u16 little-endian)
4 4 B residual_norm (f32; 0.0 for MSE)
8 N B packed_indices N = ceil(dim × bits / 8)
8+N dim B qjl_sketch one i8 per coord; only present for Prod variant
Example slot sizes at d=128 (one KV head):
| Variant | Bytes/slot | vs f16 (256 B) |
|---|---|---|
| MSE b=4 | 72 B | 3.6× smaller |
| MSE b=3 | 56 B | 4.6× smaller |
| Prod b=4 | 184 B | 1.4× smaller + unbiased |
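The slot sizes in this table can be reproduced from the binary layout. The sketch below assumes that the Prod variant spends one of its bits on the QJL sketch, so its packed indices use `bits - 1` bits per coordinate; that assumption is inferred from the 184 B figure rather than stated in the source, and `slot_bytes` is a hypothetical helper.

```rust
#[derive(Clone, Copy)]
pub enum Method {
    Mse,
    Prod,
}

/// method (1) + bits (1) + dim (2) + residual_norm (4).
const HEADER_BYTES: usize = 8;

/// Bytes occupied by one compressed KV slot. Requires bits >= 1.
pub fn slot_bytes(method: Method, bits: usize, dim: usize) -> usize {
    match method {
        // Header plus ceil(dim * bits / 8) bytes of packed indices.
        Method::Mse => HEADER_BYTES + (dim * bits + 7) / 8,
        // Assumption: indices packed at bits - 1, plus one i8 per
        // coordinate for the QJL sketch.
        Method::Prod => HEADER_BYTES + (dim * (bits - 1) + 7) / 8 + dim,
    }
}
```

At d=128 this gives 72 B for MSE b=4, 56 B for MSE b=3, and 184 B for Prod b=4, matching the table.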
# Encode/decode latency and arena throughput
cargo bench -p axon-server quant_throughput
# Unit tests for the algorithm crate
cargo test -p axon-quant
# Round-trip quality gate (debug builds only — checks ‖x - decode(encode(x))‖² ≤ 1.1 × theoretical bound)
cargo test -p axon-quant -- --nocapture
axon-worker/src/engine.rs is the only file that was modified to add real inference. Every other layer (batching, pool, dispatcher, all routes) is unchanged.
InferenceEngine::new(config, kv_arena) → always returns Self (infallible)
│
├─ try_build_candle(config)
│ │
│ ├─ if config.hf_repo is non-empty:
│ │ download config.json, tokenizer.json, safetensors shards via hf-hub
│ │ parse LlamaConfig → Config (must be Llama-compatible architecture)
│ │ load model weights via memory-mapped safetensors
│ │ → Ok(CandleState) → EngineInner::Candle
│ │
│ ├─ if config.hf_repo is empty (autodiscovery):
│ │ try HuggingFaceTB/SmolLM2-1.7B-Instruct
│ │ try mistralai/Mistral-7B-Instruct-v0.3
│ │ first reachable model wins
│ │ → Ok(CandleState) → EngineInner::Candle
│ │
│ └─ any error (no network, no token, non-Llama arch, OOM, missing GPU)
│ → Err(e)
│
└─ on Err: log WARN, build ndarray weight matrix
→ EngineInner::Simulation { weight_matrix }
compute(&self, req) → InferenceResponse (infallible)
├─ Candle path: tokenize → prefill KV cache → autoregressive generation → decode
└─ Simulation: seed hidden state from prompt length → dim² matmul × max_tokens
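The infallible-constructor pattern in this flow can be sketched without Candle: a fallible build step whose error is logged and converted into the simulation variant. All names below are simplified stand-ins, and the `network_up` flag is a hypothetical substitute for the real download and load attempt.

```rust
#[derive(Debug)]
pub enum EngineInner {
    Real { repo: String },
    Simulation { dim: usize },
}

/// Stand-in for the fallible build step; the real one downloads and
/// loads weights, and can fail for many more reasons.
fn try_build_real(hf_repo: &str, network_up: bool) -> Result<EngineInner, String> {
    if !network_up {
        return Err("no network".to_string());
    }
    if hf_repo.is_empty() {
        return Err("no repository configured".to_string());
    }
    Ok(EngineInner::Real { repo: hf_repo.to_string() })
}

/// Always returns an engine: any build error degrades to simulation.
pub fn new_engine(hf_repo: &str, dim: usize, network_up: bool) -> EngineInner {
    match try_build_real(hf_repo, network_up) {
        Ok(engine) => engine,
        Err(e) => {
            eprintln!("WARN real inference unavailable, using simulation: {e}");
            EngineInner::Simulation { dim }
        }
    }
}
```

Keeping the error inside the constructor means callers never handle a startup failure, which is how the server can promise that it always starts.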
enum EngineInner {
Candle(CandleState),
Simulation { weight_matrix: Array2<f32> },
}
struct CandleState {
device: candle_core::Device, // Metal / CUDA / CPU
dtype: candle_core::DType, // F16 / BF16 / F32
tokenizer: tokenizers::Tokenizer, // HF fast tokenizer
eos_token_id: u32, // token that stops generation
model: Mutex<Llama>, // model weights — Mutex serialises concurrent requests
llama_config: llama::Config, // kept to create per-request Cache
}
The current engine uses candle_transformers::models::llama::Llama for real inference. This covers:
| Architecture | Example models | Support |
|---|---|---|
| Llama 2 | meta-llama/Llama-2-*-hf | Full |
| Llama 3 / 3.1 / 3.2 | meta-llama/Meta-Llama-3-*, meta-llama/Llama-3.* | Full |
| Mistral | mistralai/Mistral-7B-* | Full (Llama-compatible) |
| SmolLM2 | HuggingFaceTB/SmolLM2-* | Full (Llama-based) |
| Phi-3 | microsoft/Phi-3-* | Falls back to simulation (needs candle_transformers::models::phi3 code path) |
| Gemma | google/gemma-* | Falls back to simulation (needs candle_transformers::models::gemma code path) |
| Falcon | tiiuae/falcon-* | Falls back to simulation |
To add Phi-3 support, edit axon-worker/src/engine.rs:
- Add a new variant to EngineInner:
  Phi3(Phi3State), // similar to CandleState but with candle_transformers::models::phi3::Phi3
- In try_build_candle, read config.json's "model_type" field and dispatch:
  let model_type = hf_config["model_type"].as_str().unwrap_or("");
  match model_type {
      "llama" | "mistral" => build_llama_engine(...)?,
      "phi3" => build_phi3_engine(...)?,
      _ => anyhow::bail!("unsupported model_type: {model_type}"),
  }
- Implement compute_phi3() analogously to compute_candle().
The WorkerPool, BatchAssembler, dispatcher, and all HTTP routes remain unchanged.
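The dispatch step can be isolated from Candle and tested on its own. The following is a minimal sketch using only the standard library; the `Arch` enum and `select_arch` function are hypothetical names for illustration and are not part of engine.rs.

```rust
/// Architectures the engine could dispatch on.
/// Hypothetical enum for this sketch; the real engine would hold model state.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Arch {
    Llama,
    Phi3,
}

/// Map the `model_type` string from config.json to an architecture.
/// Unknown types return an error, which the caller can turn into the
/// simulation fallback.
pub fn select_arch(model_type: &str) -> Result<Arch, String> {
    match model_type {
        // Mistral checkpoints are loaded through the Llama code path.
        "llama" | "mistral" => Ok(Arch::Llama),
        "phi3" => Ok(Arch::Phi3),
        other => Err(format!("unsupported model_type: {other}")),
    }
}
```

Keeping the string match in one small function means a new architecture only touches this match and its builder.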
| ADR | Decision | Reason |
|---|---|---|
| 001 | Tokio multi-threaded runtime | I/O-bound HTTP and CPU-bound inference require separate thread pools |
| 002 | axum 0.7 | Extractor model, first-class SSE, Tower middleware composability |
| 003 | Three-tier inference engine | Graceful degradation: always starts; Candle when available, ndarray otherwise |
| 004 | Dynamic batching (deadline-or-size) | biased select! coalesces concurrent requests; prevents starvation under load |
| 005 | tokio::sync::RwLock<ModelRegistry> | Read-heavy; async-aware to avoid holding guard across .await |
| 006 | spawn_blocking + Rayon | Two thread pools; Tokio threads never blocked by CPU/GPU-bound inference |
| 007 | SSE + unbounded per-request channel | Worker progress not gated on client read speed |
| 008 | KV cache memory arena (unsafe) | Eliminates per-allocation overhead; isolated in one crate |
| 009 | Mutex<Llama> for concurrent inference | Metal serialises GPU work anyway; Mutex is simpler than RwLock with &self forward |
| 010 | LlamaConfig → Config two-step | HF JSON format (LlamaConfig) is serde-able; internal Config is what Llama::load and Cache::new take |
| 011 | hf-hub for weight management | Handles auth, sharding, caching, retries; no manual download logic |
| 012 | Memory-mapped safetensors (unsafe) | Avoids loading entire model into heap; OS pages in only accessed weights |
| 013 | anyhow for engine errors → simulation fallback | Library errors (candle) are opaque; we only need the message string to log and degrade |
| 014 | TurboQuant as a standalone axon-quant crate | Pure algorithm with no server deps; independently testable; can be benchmarked or reused without pulling in axum/Candle |
| 015 | One TurboQuantState per engine (not per layer) | Rotation matrix is 128² × 4 B = 65 KB; shared via Arc across Rayon threads at zero clone cost |
| 016 | Disabled by default (kv_quantization.enabled = false) | Allows users to opt in after verifying the baseline; avoids surprising behaviour changes on existing deployments |
| 017 | QuantizedKvArena separate from KvArena | Keeps the existing unsafe bump allocator unchanged; separation of concerns between f32 slots and raw byte slots |
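ADR 004's deadline-or-size rule can be illustrated without Tokio. The sketch below is an assumption-laden stand-in: it uses a blocking `std::sync::mpsc` channel and `recv_timeout` instead of the biased `select!` in axon-batch, and `next_batch` is a hypothetical name. It shows only the policy: a batch closes when it reaches `max_batch_size` or when `max_wait` has elapsed since its first request, whichever comes first.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::{Duration, Instant};

/// Collect one batch from `rx`.
/// Returns an empty Vec only when the channel is closed and drained.
pub fn next_batch<T>(rx: &Receiver<T>, max_batch_size: usize, max_wait: Duration) -> Vec<T> {
    let mut batch = Vec::with_capacity(max_batch_size);

    // Block until the first request arrives; the deadline starts from here,
    // so an idle server does not spin on empty batches.
    match rx.recv() {
        Ok(item) => batch.push(item),
        Err(_) => return batch, // all senders dropped
    }

    let deadline = Instant::now() + max_wait;
    while batch.len() < max_batch_size {
        // Time left before the deadline; zero once it has passed.
        let remaining = deadline.saturating_duration_since(Instant::now());
        match rx.recv_timeout(remaining) {
            Ok(item) => batch.push(item),
            // Deadline hit or producers gone: ship what we have.
            Err(RecvTimeoutError::Timeout) | Err(RecvTimeoutError::Disconnected) => break,
        }
    }
    batch
}
```

Starting the deadline at the first request bounds the extra latency any single request can accumulate to roughly `max_wait`.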
| Concept | Location |
|---|---|
| Three-tier engine fallback | axon-worker/src/engine.rs:try_build_candle() |
| Candle LLM generation loop | axon-worker/src/engine.rs:run_generation() |
| Open-model autodiscovery | axon-worker/src/engine.rs:autodiscover_repo() |
| Safetensors shard loading | axon-worker/src/engine.rs:download_weight_shards() |
| Device selection (Metal / CUDA / CPU) | axon-worker/src/engine.rs:build_device() |
| RwLock<T> — read-heavy concurrency | axon-server/src/routes/models.rs |
| AtomicU64 — lock-free metrics | axon-core/src/metrics.rs |
| select! with biased — deadline batching | axon-batch/src/assembler.rs |
| tokio_stream + axum SSE | axon-server/src/routes/stream.rs |
| spawn_blocking → Rayon bridge | axon-worker/src/pool.rs |
| par_iter CPU parallelism | axon-worker/src/pool.rs |
| unsafe bump allocator + PhantomData lifetime | axon-cache/src/arena.rs |
| #[derive(Validated)] proc macro | axon-macros/src/lib.rs |
| Graceful shutdown | axon-server/src/main.rs |
| Config hierarchy (TOML + env) | axon-server/src/config.rs:AppConfig::load() |
| [[models]] array deserialization | axon-server/src/config.rs:AppConfig.models |
| TurboQuant rotation matrix init | axon-quant/src/rotation.rs:TurboQuantState::new() |
| TurboQuant MSE encode/decode | axon-quant/src/mse.rs:mse_encode() / mse_decode() |
| TurboQuant Prod (unbiased attention) | axon-quant/src/prod.rs:prod_encode() / prod_decode() |
| Lloyd-Max codebooks (b=1..4) | axon-quant/src/codebook.rs |
| Sub-byte bit packing | axon-quant/src/pack.rs:pack_bits() / unpack_bits() |
| Quantized KV slot layout | axon-quant/src/slot.rs |
| Quantized bump allocator | axon-quant/src/arena.rs:QuantizedKvArena |
# Unit + integration tests across the workspace
cargo test --workspace
# Run with verbose output
cargo test --workspace -- --nocapture
# Criterion benchmarks (HTML report written to target/criterion/)
cargo bench -p axon-server
# Verify unsafe surface area (should show unsafe only in axon-cache)
cargo install cargo-geiger
cargo geiger --workspace
# Check for unused dependencies
cargo install cargo-udeps
cargo +nightly udeps --workspace

The current Dockerfile uses gcr.io/distroless/cc-debian12 as the runtime image, which works for the simulation tier. For real inference with Metal (macOS), Docker is not applicable: containers on macOS run inside a Linux VM, and Metal is only available when running natively on Apple hardware.
For Linux CPU inference:
docker build -t axon .
docker run -p 3000:3000 \
-e AXON__BATCHING__MAX_BATCH_SIZE=4 \
-e AXON__SERVER__REQUEST_TIMEOUT_MS=120000 \
-e HF_TOKEN=hf_xxx \
-v $HOME/.cache/huggingface:/root/.cache/huggingface \
  axon

The -v mount reuses the host's HuggingFace cache so models aren't re-downloaded on every docker run.
For NVIDIA GPU (CUDA):
The runtime image needs CUDA libraries. Modify Dockerfile:
# Builder stage stays the same; change the runtime base:
FROM nvcr.io/nvidia/cuda:12.4.0-runtime-ubuntu22.04

Also change candle-core in Cargo.toml from features = ["metal"] to features = ["cuda"].

docker run --gpus all -p 3000:3000 -e HF_TOKEN=hf_xxx axon

Internet → ALB (:443) → ECS Fargate (axon-server)
→ CloudWatch Logs (tracing JSON)
→ EFS mount (~/.cache/huggingface — shared across tasks)
Autoscale: axon_requests_active > 50 for 2 min → scale out
axon_requests_active < 5 for 5 min → scale in
For real inference on AWS, use GPU instances. Fargate does not provide GPUs, so GPU tasks need to run on EC2-backed capacity such as:
| Instance | GPU | VRAM | Recommended model |
|---|---|---|---|
| g4dn.xlarge | T4 | 16 GB | SmolLM2, Llama 3.2 3B |
| g5.xlarge | A10G | 24 GB | Llama 2 7B, Llama 3 8B |
| p3.2xlarge | V100 | 16 GB | SmolLM2, Llama 3.2 3B |
| p4d.24xlarge | 8x A100 | 8×40 GB | Multi-model serving |
Recommended task definition env vars:
AXON__SERVER__PORT=3000
AXON__SERVER__REQUEST_TIMEOUT_MS=30000
AXON__BATCHING__MAX_BATCH_SIZE=8
AXON__BATCHING__MAX_WAIT_MS=10
AXON__KV_CACHE__CAPACITY_F32_SLOTS=67108864
RUST_LOG=axon_server=info,axon_worker=info
HF_TOKEN=<from AWS Secrets Manager>

Hand-written unsafe code is isolated to axon-cache/src/arena.rs, and every unsafe block there has a // SAFETY: comment. The rest of the workspace, including the Candle integration code in axon-worker/src/engine.rs, is safe Rust with one exception: the call to VarBuilder::from_mmaped_safetensors. That function is an unsafe fn because a memory-mapped file is only sound to read while nothing else modifies the underlying file, so the weight files in the HuggingFace cache must not be rewritten while the server is running.
Arena invariants:
- buf is non-null, 64-byte aligned, valid for capacity f32 reads/writes
- AtomicUsize bump allocation guarantees non-overlapping ranges across threads
- unsafe impl Send + Sync is sound: shared state is only the atomic bump pointer
- Drop deallocates with the exact layout used at construction
cargo install cargo-geiger
cargo geiger --workspace
# unsafe: 0 in all crates except axon-cache

1. Streaming delivers all tokens after full generation (not true per-token)
The dispatcher loop receives the complete InferenceResponse from the worker pool, then simulates per-token delivery by splitting response.text. Real token-by-token streaming (first token arriving in ~100 ms) requires passing the token_tx sender into the engine's generation loop. This is a dispatcher.rs + pool.rs follow-on change and does not affect any public API.
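The shape of that follow-on change can be sketched with standard-library channels. This is an illustration under stated assumptions, not the dispatcher's code: `generate_streaming` and `spawn_generation` are hypothetical names, an OS thread stands in for the worker pool, and upper-casing each word stands in for sampling a token. The point is that the sender is passed into the generation loop, so each token is emitted as soon as it exists.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};
use std::thread;

/// Generation loop that emits each token through `token_tx` as it is produced
/// and also returns the full text at the end.
pub fn generate_streaming(prompt: &str, max_tokens: usize, token_tx: Sender<String>) -> String {
    let mut text = String::new();
    for word in prompt.split_whitespace().take(max_tokens) {
        let token = word.to_uppercase(); // stand-in for one decode step
        // If the receiver is gone (client disconnected), stop generating.
        if token_tx.send(token.clone()).is_err() {
            break;
        }
        if !text.is_empty() {
            text.push(' ');
        }
        text.push_str(&token);
    }
    text
    // `token_tx` is dropped here, which closes the stream for the receiver.
}

/// Run generation on a worker thread and hand the receiving end to the caller.
pub fn spawn_generation(
    prompt: String,
    max_tokens: usize,
) -> (Receiver<String>, thread::JoinHandle<String>) {
    let (tx, rx) = channel(); // unbounded, like the per-request channel in ADR 007
    let handle = thread::spawn(move || generate_streaming(&prompt, max_tokens, tx));
    (rx, handle)
}
```

The receiver can be drained while generation is still running; the stream ends when the sender is dropped at the end of the loop.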
2. Rate limiter is a lifetime counter, not a per-second window
rate_counters is a DashMap<ModelId, AtomicU64> that monotonically increments. Once a model has served max_rps total requests since startup, it is permanently rate-limited. A production fix uses a sliding window or token bucket with a background reset task.
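One way to get per-second behaviour is a token bucket that refills lazily on each call, which avoids the background reset task altogether. The sketch below is a single-bucket illustration with hypothetical names (`TokenBucket`, `try_acquire`). It assumes one bucket per model held behind whatever map the server already uses, and it takes the current time as a parameter so it can be tested deterministically.

```rust
use std::time::Instant;

/// Token bucket sized to `max_rps`: holds at most `max_rps` tokens and
/// refills at `max_rps` tokens per second.
pub struct TokenBucket {
    capacity: f64,
    tokens: f64,
    refill_per_sec: f64,
    last_refill: Instant,
}

impl TokenBucket {
    /// Start with a full bucket so an initial burst of `max_rps` is allowed.
    pub fn new(max_rps: u32, now: Instant) -> Self {
        let capacity = f64::from(max_rps);
        Self { capacity, tokens: capacity, refill_per_sec: capacity, last_refill: now }
    }

    /// Returns true if the request may proceed, consuming one token.
    pub fn try_acquire(&mut self, now: Instant) -> bool {
        // Credit tokens for the time elapsed since the last call, capped at capacity.
        let elapsed = now.saturating_duration_since(self.last_refill);
        if !elapsed.is_zero() {
            let refill = elapsed.as_secs_f64() * self.refill_per_sec;
            self.tokens = (self.tokens + refill).min(self.capacity);
            self.last_refill = now;
        }
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}
```

Unlike the lifetime counter, a bucket that has been exhausted becomes usable again as time passes.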
3. KV arena not reset between batches (simulation only)
The KvArena bump pointer only grows. After enough requests the arena fills and engine.compute() falls back to heap allocation silently. Fix: call kv_arena.reset() in dispatcher.rs after each batch. This affects Tier 3 simulation only — Candle manages its own memory.
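The effect of resetting between batches can be seen in a safe-Rust model of the bump pointer. This sketch tracks offsets only and owns no buffer; `BumpOffsets` is a hypothetical type for illustration and does not reproduce the real KvArena in axon-cache.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Offset-only model of a bump allocator over `capacity` f32 slots.
pub struct BumpOffsets {
    capacity: usize,
    next: AtomicUsize,
}

impl BumpOffsets {
    pub fn new(capacity: usize) -> Self {
        Self { capacity, next: AtomicUsize::new(0) }
    }

    /// Reserve `len` slots and return the start offset, or None when the
    /// arena is full. The compare-exchange loop hands out disjoint ranges
    /// even when called from several threads.
    pub fn alloc(&self, len: usize) -> Option<usize> {
        let mut current = self.next.load(Ordering::Relaxed);
        loop {
            let end = current.checked_add(len)?;
            if end > self.capacity {
                return None; // caller falls back to the heap
            }
            match self.next.compare_exchange_weak(
                current,
                end,
                Ordering::AcqRel,
                Ordering::Relaxed,
            ) {
                Ok(_) => return Some(current),
                Err(actual) => current = actual, // lost a race; retry
            }
        }
    }

    /// Rewind to the start. Taking `&mut self` means the compiler rejects a
    /// reset while another reference to the arena is still in use.
    pub fn reset(&mut self) {
        *self.next.get_mut() = 0;
    }
}
```

Without the reset, the second batch in this model would see a full arena; with it, every batch starts from offset zero.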
4. Phi-3, Gemma, Falcon architectures fall back to simulation
The current Candle engine path handles Llama-family models only (config.json's model_type: llama | mistral). Other architectures fail to deserialize LlamaConfig and trigger Tier 3 fallback with a WARN log. Extend engine.rs with additional candle_transformers model modules to support them.
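The fallback itself is a small pattern: try the fallible Candle build, and on any error keep the message for the log and construct the simulation engine instead. A minimal sketch follows. The variants here carry placeholder fields rather than the real CandleState and weight matrix, and `build_engine` is a hypothetical name.

```rust
/// Simplified stand-in for the engine enum; fields are placeholders.
#[derive(Debug, PartialEq)]
pub enum EngineInner {
    Candle { repo: String },
    Simulation { dim: usize },
}

/// Infallible constructor: returns the engine plus an optional warning
/// that the caller would log at WARN level.
pub fn build_engine<F>(try_build_candle: F, dim: usize) -> (EngineInner, Option<String>)
where
    F: FnOnce() -> Result<EngineInner, String>,
{
    match try_build_candle() {
        Ok(engine) => (engine, None),
        Err(e) => {
            // Any failure (no network, unsupported architecture, OOM) lands here.
            let warning = format!("candle unavailable, using simulation: {e}");
            (EngineInner::Simulation { dim }, Some(warning))
        }
    }
}
```

Because the constructor cannot fail, the server start-up path needs no error handling for model loading.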
5. No OpenAI-compatible API
Request and response schemas are axon-native. To use axon as a drop-in replacement for the OpenAI chat completions API, wrap the routes in an adapter at the axum router level — the batching pipeline below the routes is unchanged.
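The core of such an adapter is a pure mapping from a chat-style request to a prompt-style one. The sketch below is illustrative only: `AxonRequest` and its fields are hypothetical placeholders, since the axon-native schema is not reproduced here. Joining messages as "role: content" lines is a naive stand-in for a model-specific chat template, and the default of 256 tokens is an arbitrary assumption.

```rust
/// One message in an OpenAI-style chat request.
pub struct ChatMessage {
    pub role: String,
    pub content: String,
}

/// Hypothetical axon-native request; the real schema may differ.
pub struct AxonRequest {
    pub model: String,
    pub prompt: String,
    pub max_tokens: usize,
}

/// Flatten chat messages into a single prompt string.
pub fn to_axon_request(
    model: &str,
    messages: &[ChatMessage],
    max_tokens: Option<usize>,
) -> AxonRequest {
    let prompt = messages
        .iter()
        .map(|m| format!("{}: {}", m.role, m.content))
        .collect::<Vec<_>>()
        .join("\n");
    AxonRequest {
        model: model.to_string(),
        prompt,
        // Assumed default for this sketch when the client omits max_tokens.
        max_tokens: max_tokens.unwrap_or(256),
    }
}
```

Keeping the mapping free of HTTP types makes it easy to unit-test before wiring it into a router.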