dattgoswami/axon

axon

Production-grade async AI inference server in Rust — real LLM inference via HuggingFace Candle, dynamic request batching, SSE token streaming, multi-model registry, Prometheus metrics, a Rayon worker pool, and TurboQuant KV cache compression for 7× longer context windows.

axon ships with a three-tier inference engine that degrades gracefully:

| Tier | When active | Behavior |
|---|---|---|
| 1 — Candle + configured model | hf_repo set in config + model reachable | Real LLM inference via HuggingFace Candle (Metal / CUDA / CPU) |
| 2 — Candle + open-model autodiscovery | hf_repo empty, network available | Downloads SmolLM2-1.7B or Mistral-7B automatically (no token required) |
| 3 — ndarray simulation | No network / no GPU / unsupported arch | Original fast simulation — work ∝ dim² × max_tokens; server always starts |

The server always starts. Tier selection is logged at startup. Every layer below the engine — batching, back-pressure, streaming, arena memory, HTTP routes — is unchanged across all three tiers.
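
The selection order in the table can be sketched as a pure function. This is an illustrative sketch only, not axon's code: the `Tier` enum, `select_tier`, and the `candle_load_ok` flag are hypothetical names. The real engine decides by attempting the load and falling back on error.

```rust
/// Hypothetical mirror of the three tiers in the table above.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Tier {
    ConfiguredModel, // Tier 1: hf_repo set and the model loaded
    Autodiscovery,   // Tier 2: hf_repo empty and an open model loaded
    Simulation,      // Tier 3: any failure, so the server still starts
}

/// `candle_load_ok` stands in for "network reachable, supported
/// architecture, enough memory"; it is an assumption of this sketch.
pub fn select_tier(hf_repo: &str, candle_load_ok: bool) -> Tier {
    match (hf_repo.is_empty(), candle_load_ok) {
        (_, false) => Tier::Simulation,
        (false, true) => Tier::ConfiguredModel,
        (true, true) => Tier::Autodiscovery,
    }
}
```

Because every failure maps to `Tier::Simulation`, the function has no error case, which is the property the startup guarantee relies on.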


Hardware requirements and model selection

Minimum requirements

Simulation mode (Tier 3) requires nothing beyond a modern laptop:

  • Any CPU
  • 512 MB RAM
  • No GPU required
  • No network required

Open-model requirements (Tier 2 autodiscovery or Tier 1 with open repos)

| Model | Parameters | Disk (f16) | RAM needed | Minimum hardware | Recommended |
|---|---|---|---|---|---|
| SmolLM2-1.7B | 1.7 B | 3.4 GB | 4 GB | Any modern laptop (2019+) | Best starting point |
| Phi-3 Mini | 3.8 B | 7.6 GB | 9 GB | 16 GB RAM laptop | Strong reasoning for size |
| Mistral 7B | 7 B | 14 GB | 16 GB | 32 GB RAM or GPU | Best open 7B |

Gated-model requirements (Meta/Google license, token required)

| Model | Parameters | Disk (f16) | RAM needed | Minimum hardware | Notes |
|---|---|---|---|---|---|
| Llama 3.2 3B | 3 B | 6 GB | 8 GB | 16 GB RAM | Best Llama for dev machines |
| Llama 2 7B | 7 B | 13.5 GB | 16 GB | 32 GB RAM or 8 GB VRAM | Classic baseline |
| Llama 3 8B | 8 B | 16 GB | 18 GB | 32 GB RAM or 10 GB VRAM | Best open-weights 8B |
| Llama 2 13B | 13 B | 26 GB | 28 GB | 32 GB RAM + Metal/CUDA | High quality, heavy |
| Llama 3 70B | 70 B | 140 GB (f16) / ~40 GB (Q4) | 48 GB+ | Mac Studio M2 Ultra 192 GB or A100 | Not recommended for dev |

Apple Silicon (M1 / M2 / M3 / M4) — recommended path

Apple Silicon has unified memory shared between CPU and GPU. There is no separate VRAM limit — the model lives in the same pool as the OS and other apps. Use device = "metal" for GPU acceleration via Candle's Metal backend.

| Chip | Total memory | What fits comfortably | Recommended model |
|---|---|---|---|
| M1 / M2 8 GB | 8 GB | SmolLM2-1.7B (f16) only | SmolLM2-1.7B |
| M1 / M2 16 GB | 16 GB | SmolLM2, Phi-3, Llama 3.2 3B | Llama 3.2 3B |
| M1 Pro / M2 Pro 32 GB | 32 GB | Any 7B f16, Llama 3 8B | Llama 2 7B or Mistral 7B |
| M2 Max / M3 Max 64 GB | 64 GB | Llama 3 8B, 13B | Llama 3 8B or 13B |
| M4 Max 128 GB | 128 GB | Everything up to 70B Q4 | Llama 3 8B recommended for daily use |
| M2 Ultra 192 GB | 192 GB | 70B models in f16 | Llama 3 70B |

Expected inference speed on Apple Silicon with Metal (f16):

| Model | M1 16 GB | M2 Pro 32 GB | M3 Max 64 GB | M4 Max 128 GB |
|---|---|---|---|---|
| SmolLM2-1.7B | ~50 tok/s | ~70 tok/s | ~80 tok/s | ~90 tok/s |
| Llama 3.2 3B | | ~45 tok/s | ~55 tok/s | ~60 tok/s |
| Llama 2 7B | | ~22 tok/s | ~28 tok/s | ~30 tok/s |
| Llama 3 8B | | ~18 tok/s | ~22 tok/s | ~25 tok/s |

NVIDIA CUDA (Linux / Windows)

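| GPU | VRAM | What fits (f16) | Notes |
|---|---|---|---|
| RTX 3060 / 4060 | 12 GB | SmolLM2, Phi-3, Llama 3.2 3B | Good dev GPU |
| RTX 3090 / 4090 | 24 GB | Any 7B / 8B f16 | Ideal workstation |
| A10G (AWS g5) | 24 GB | Any 7B / 8B f16 | Cloud standard |
| A100 40 GB | 40 GB | Llama 3 8B, Llama 2 13B | Production cloud |
| A100 80 GB | 80 GB | Llama 2 70B at Q4 only (~40 GB); f16 weights (~140 GB) do not fit on one card | Production cloud |
| H100 80 GB | 80 GB | Llama 2 70B at Q4 only (~40 GB); f16 weights (~140 GB) do not fit on one card | Fastest available |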

To use CUDA, set device = "cuda:0" in the model config and enable features = ["cuda"] on the candle-core workspace dependency in place of metal.

CPU-only (Linux, Windows, Intel Mac)

CPU inference works with no feature changes. Use device = "cpu" and prefer smaller models. f32 is faster than f16 on CPU because f16 operations are emulated:

| Model | RAM | Speed on modern 8-core CPU | Practical? |
|---|---|---|---|
| SmolLM2-1.7B (f32) | 7 GB | ~4 tok/s | Yes — usable for dev |
| Phi-3-mini (f32) | 15 GB | ~2 tok/s | Slow but works |
| Llama 2 7B (f32) | 28 GB | ~0.5 tok/s | Very slow; use f16 or quantized |
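
The disk and RAM figures in these tables follow from parameter count times bytes per weight. A minimal sketch of that arithmetic, assuming 1 GB = 10^9 bytes; `bytes_per_param` and `weight_gb` are hypothetical helper names, not part of axon:

```rust
/// Bytes per weight for the dtypes the model config accepts.
pub fn bytes_per_param(dtype: &str) -> Option<u64> {
    match dtype {
        "f16" | "bf16" => Some(2),
        "f32" => Some(4),
        _ => None, // unknown dtype: no estimate
    }
}

/// Approximate weight size in GB for a model with `params_billions`
/// billion parameters. Covers weights only, not activations or KV cache.
pub fn weight_gb(params_billions: f64, dtype: &str) -> Option<f64> {
    bytes_per_param(dtype).map(|b| params_billions * b as f64)
}
```

For example, `weight_gb(1.7, "f16")` gives about 3.4 GB and `weight_gb(7.0, "f32")` gives 28 GB. The "RAM needed" columns are slightly higher than the raw weight size, presumably to leave room for activations and the KV cache.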

HuggingFace account setup

Open models — no account needed

The following models require no HuggingFace account and no token:

  • HuggingFaceTB/SmolLM2-1.7B-Instruct — Apache 2.0
  • microsoft/Phi-3-mini-4k-instruct — MIT
  • mistralai/Mistral-7B-Instruct-v0.3 — Apache 2.0

Just set hf_repo in your config and run. Files download on first startup.

Gated models — account + token + license acceptance required

Models from Meta (Llama family) and Google (Gemma) require four steps:

Step 1 — Create a HuggingFace account

Go to huggingface.co and sign up. Free accounts work.

Step 2 — Accept the model license

Visit the model card and click "Agree and access repository":

| Model | License page |
|---|---|
| Llama 3.2 3B | huggingface.co/meta-llama/Llama-3.2-3B-Instruct |
| Llama 3 8B | huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct |
| Llama 2 7B | huggingface.co/meta-llama/Llama-2-7b-chat-hf |
| Llama 2 13B | huggingface.co/meta-llama/Llama-2-13b-chat-hf |
| Gemma 7B | huggingface.co/google/gemma-7b-it |

Approval is usually instant for Llama 3.x. Llama 2 may take a few minutes. You receive a confirmation email.

Step 3 — Create a read access token

  1. Go to huggingface.co/settings/tokens
  2. Click New token
  3. Type: Read
  4. Name it anything (e.g., axon-local)
  5. Copy the hf_... value

Step 4 — Set the token in your environment

export HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

Add to ~/.zshrc or ~/.bashrc to persist across terminal sessions:

echo 'export HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx' >> ~/.zshrc
source ~/.zshrc

Never commit the token to source control. axon reads it from the environment — never from config files.

Model cache

Models download once to ~/.cache/huggingface/hub/. Subsequent server startups load from the cache without re-downloading.

# Check what has been downloaded
du -sh ~/.cache/huggingface/hub/

# List cached model snapshots (snapshot directories are named by commit hash, not by branch)
ls ~/.cache/huggingface/hub/models--*/snapshots/*/

# Remove a cached model to free disk space
rm -rf ~/.cache/huggingface/hub/models--HuggingFaceTB--SmolLM2-1.7B-Instruct/

# Move cache to a different drive (e.g., external SSD)
export HF_HOME=/Volumes/SSD/hf-cache

Approximate download times on a 100 Mbps connection:

| Model | Size | Download time |
|---|---|---|
| SmolLM2-1.7B | 3.4 GB | ~5 min |
| Phi-3 Mini 3.8B | 7.6 GB | ~10 min |
| Llama 3.2 3B | 6 GB | ~8 min |
| Mistral 7B / Llama 2 7B | 14 GB | ~20 min |
| Llama 3 8B | 16 GB | ~22 min |

Quickstart — simulation mode (zero setup)

No GPU, no internet, no token required. Works on any machine.

Prerequisites: Rust 1.75+ (rustup update stable)

git clone https://github.com/dattgoswami/axon
cd axon
cargo run --release -p axon-server

Expected startup output:

WARN  axon_server: HF_TOKEN not set — gated models (Llama 2/3/3.2) will fail; open models autodiscovered automatically
WARN  axon_worker::engine: real inference unavailable — using ndarray simulation, model_id: default, error: all autodiscovery candidates unreachable — falling back to simulation
INFO  axon_server: axon starting port=3000
INFO  axon_server: listening on 0.0.0.0:3000

The WARN messages are expected — axon is falling back to simulation because there's no network or token configured. The server is fully functional.

Test the simulation

# Health check
curl -s http://localhost:3000/v1/health | jq
# {
#   "status": "ok",
#   "models_loaded": 1,
#   "requests_total": 0,
#   "requests_active": 0
# }

# Non-streaming inference (model_id "default" is pre-loaded)
curl -s -X POST http://localhost:3000/v1/generate \
  -H 'Content-Type: application/json' \
  -d '{
    "id": "00000000-0000-0000-0000-000000000001",
    "model_id": "default",
    "prompt": "Hello, axon",
    "max_tokens": 20,
    "temperature": 0.7,
    "stream": false
  }' | jq
# {
#   "id": "00000000-0000-0000-0000-000000000001",
#   "model_id": "default",
#   "text": "[axon/sim] generated 20 tokens for model 'default' (dim=64)",
#   "tokens_generated": 20,
#   "latency_ms": 0
# }

# SSE streaming
curl -N -X POST http://localhost:3000/v1/generate/stream \
  -H 'Content-Type: application/json' \
  -d '{
    "id": "00000000-0000-0000-0000-000000000002",
    "model_id": "default",
    "prompt": "Stream me tokens",
    "max_tokens": 5,
    "temperature": 0.0,
    "stream": true
  }'
# data: token_0
# data: token_1
# data: token_2
# data: token_3
# data: token_4
# event: done
# data:

# Prometheus metrics
curl -s http://localhost:3000/v1/metrics

Real inference — open models (no token)

SmolLM2-1.7B (recommended starting point)

Best choice for first real inference. Small enough to fit on any 8 GB machine, fast on Apple Silicon or any GPU, no token required.

Step 1 — Edit config/default.toml

Uncomment and save the SmolLM2 block:

[[models]]
id      = "smollm2"
hf_repo = "HuggingFaceTB/SmolLM2-1.7B-Instruct"
revision = "main"
dtype   = "f16"
device  = "metal"    # Apple Silicon GPU — change to "cpu" if no Metal
dim     = 2048
simulated_latency_ms = 0
max_rps = 0

For CPU-only machines (Intel Mac, Linux without GPU, Windows):

device = "cpu"
dtype  = "f32"    # f32 is faster than f16 on CPU

Step 2 — Run

cargo run --release -p axon-server
# First run: downloads 3.4 GB to ~/.cache/huggingface/hub/
# Subsequent runs: loads from cache in ~3-5 seconds

Expected startup output (first run):

WARN  axon_server: HF_TOKEN not set — ...open models autodiscovered automatically
INFO  axon_worker::engine: real inference engine ready, model_id: smollm2, hf_repo: HuggingFaceTB/SmolLM2-1.7B-Instruct, device: metal
INFO  axon_server: axon starting port=3000
INFO  axon_server: listening on 0.0.0.0:3000

Step 3 — Test

curl -s -X POST http://localhost:3000/v1/generate \
  -H 'Content-Type: application/json' \
  -d '{
    "id": "00000000-0000-0000-0000-000000000003",
    "model_id": "smollm2",
    "prompt": "Explain Rust ownership in one sentence:",
    "max_tokens": 80,
    "temperature": 0.3,
    "stream": false
  }' | jq

Phi-3 Mini 3.8B

Better reasoning than SmolLM2 at the cost of 2x the size. Good for 16 GB+ machines.

Edit config/default.toml:

[[models]]
id      = "phi3"
hf_repo = "microsoft/Phi-3-mini-4k-instruct"
revision = "main"
dtype   = "f16"
device  = "metal"
dim     = 3072
simulated_latency_ms = 0
max_rps = 0

Note: Phi-3 uses a different model architecture (phi3) than Llama. The current engine supports Llama-compatible architectures only. If config.json parsing fails (error: non-Llama arch), axon falls back to simulation and logs a WARN. Phi-3 support requires adding the candle_transformers::models::phi3 code path to engine.rs — see Extending axon — adding new architectures.


Mistral 7B Instruct v0.3

Best fully open 7B model. Llama-compatible architecture — works with the current engine. Requires 16 GB+ RAM or a GPU with 14 GB VRAM.

Edit config/default.toml:

[[models]]
id      = "mistral-7b"
hf_repo = "mistralai/Mistral-7B-Instruct-v0.3"
revision = "main"
dtype   = "f16"
device  = "metal"
dim     = 4096
simulated_latency_ms = 0
max_rps = 0

Multiple open models simultaneously

You can serve multiple models at once. Add multiple [[models]] blocks:

[[models]]
id      = "fast"
hf_repo = "HuggingFaceTB/SmolLM2-1.7B-Instruct"
dtype   = "f16"
device  = "metal"
dim     = 2048

[[models]]
id      = "quality"
hf_repo = "mistralai/Mistral-7B-Instruct-v0.3"
dtype   = "f16"
device  = "metal"
dim     = 4096

Route requests to different models by model_id:

# Fast model for autocomplete
curl -X POST http://localhost:3000/v1/generate \
  -d '{"id":"...","model_id":"fast","prompt":"Complete: def fibonacci(","max_tokens":20,"temperature":0.1,"stream":false}'

# Quality model for complex tasks
curl -X POST http://localhost:3000/v1/generate \
  -d '{"id":"...","model_id":"quality","prompt":"Explain gradient descent:","max_tokens":200,"temperature":0.5,"stream":false}'

Real inference — gated models (Llama 2 / 3)

These models require HF_TOKEN to be set. See HuggingFace account setup first.

Llama 3.2 3B Instruct (recommended gated model)

Best balance of quality and size for developer machines. The f16 weights take about 6 GB, so plan for at least 8 GB of free memory; a 16 GB machine leaves comfortable headroom.

Prerequisites: HF_TOKEN set, license accepted at huggingface.co/meta-llama/Llama-3.2-3B-Instruct

export HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

Edit config/default.toml:

[[models]]
id       = "llama3-3b"
hf_repo  = "meta-llama/Llama-3.2-3B-Instruct"
revision = "main"
dtype    = "f16"
device   = "metal"
dim      = 3072
simulated_latency_ms = 0
max_rps  = 0

Run:

cargo run --release -p axon-server
# First run: downloads ~6 GB

Inference with Llama 3 chat template:

Llama 3 uses a special chat format with control tokens. The prompt must include the correct template for instruct models:

curl -s -X POST http://localhost:3000/v1/generate \
  -H 'Content-Type: application/json' \
  -d '{
    "id": "00000000-0000-0000-0000-000000000004",
    "model_id": "llama3-3b",
    "prompt": "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nWhat is ownership in Rust?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
    "max_tokens": 200,
    "temperature": 0.7,
    "stream": false
  }' | jq .text
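
Building the template by hand is error-prone, so a client can wrap it in a helper. A minimal sketch for the single-turn case used in the request above; `llama3_prompt` is a hypothetical client-side function, and axon itself is only shown receiving the finished prompt string:

```rust
/// Wraps one user message in the Llama 3 instruct template from the
/// example request: a user turn followed by an open assistant header.
pub fn llama3_prompt(user_message: &str) -> String {
    format!(
        "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n{user_message}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
    )
}
```

The returned string goes into the `prompt` field of the request. When it is embedded in JSON, the newlines must be escaped as `\n`, which a JSON serializer does automatically.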

Llama 2 7B Chat

The classic Meta 7B model. Higher quality than Llama 3.2 3B for complex tasks but requires more RAM.

Prerequisites: HF_TOKEN set, license accepted at huggingface.co/meta-llama/Llama-2-7b-chat-hf

[[models]]
id       = "llama2-7b"
hf_repo  = "meta-llama/Llama-2-7b-chat-hf"
revision = "main"
dtype    = "f16"
device   = "metal"
dim      = 4096
simulated_latency_ms = 0
max_rps  = 0

Inference with Llama 2 chat template:

curl -s -X POST http://localhost:3000/v1/generate \
  -H 'Content-Type: application/json' \
  -d '{
    "id": "00000000-0000-0000-0000-000000000005",
    "model_id": "llama2-7b",
    "prompt": "[INST] Explain Rust lifetimes as if I am a Java engineer [/INST]",
    "max_tokens": 300,
    "temperature": 0.7,
    "stream": false
  }' | jq .text

Llama 3 8B Instruct

Latest generation, best capability-per-parameter open-weights model. Requires 18 GB RAM or a GPU with 16 GB VRAM. ~25 tok/s on M4 Max.

Prerequisites: HF_TOKEN set, license accepted at huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct

[[models]]
id       = "llama3-8b"
hf_repo  = "meta-llama/Meta-Llama-3-8B-Instruct"
revision = "main"
dtype    = "f16"
device   = "metal"
dim      = 4096
simulated_latency_ms = 0
max_rps  = 0

Use the same Llama 3 chat template as above (<|begin_of_text|>...).


Llama 2 13B Chat

For machines with 32 GB+ RAM. Noticeably better output quality than 7B for multi-step reasoning.

[[models]]
id       = "llama2-13b"
hf_repo  = "meta-llama/Llama-2-13b-chat-hf"
revision = "main"
dtype    = "f16"
device   = "metal"
dim      = 5120
simulated_latency_ms = 0
max_rps  = 0

Configuration reference

config/default.toml — full schema

[server]
port               = 3000
# Timeout in ms before returning 504 to the caller.
# Increase for large models: Llama 3 8B generating 512 tokens takes ~20 seconds.
request_timeout_ms = 30000

[batching]
# Flush when this many requests accumulate (biased toward full batches under load).
# For real models, keep this low (2–8) — GPU inference is compute-bound, not I/O-bound.
max_batch_size = 4
# Flush after this many ms regardless of batch fill.
max_wait_ms    = 5

[worker_pool]
# Rayon CPU threads. 0 = num_cpus. Real inference on GPU rarely benefits from > 1 thread.
threads = 0

[kv_cache]
# Pre-allocated f32 slots for the ndarray simulation arena (Tier 3 only).
# 1_048_576 = 4 MB dev default. 67_108_864 = 256 MB for heavy simulation load.
# Real Candle inference ignores this value (Candle manages its own tensor memory).
capacity_f32_slots = 1048576

[kv_quantization]
# TurboQuant KV cache compression. Disabled by default.
# When enabled, the QuantizedKvArena is sized to the same byte budget as the f32 arena
# (capacity_f32_slots × 4 bytes), but stores ~7× more tokens at 4-bit.
enabled = false
bits    = 4     # 4 = zero quality loss (paper result); 3 = 9× compression, minor degradation

# ── [[models]] — one block per model to pre-load at startup ──────────────────
#
# All fields except `id` have defaults; you only need to set what differs
# from the default.
#
# Field descriptions:
#   id                   Unique name; used as model_id in API requests
#   hf_repo              HuggingFace repo string (e.g. "meta-llama/Llama-2-7b-hf")
#                        Empty string = autodiscover open model (Tier 2)
#   revision             Git branch, tag, or commit hash. Default: "main"
#   dtype                Weight precision. "f16" (default) | "bf16" | "f32"
#                        f16: best for Metal/CUDA. f32: better for CPU-only.
#   device               "metal" (Apple GPU, default on macOS)
#                        "cpu"   (any machine, slower)
#                        "cuda:N" (NVIDIA GPU N, Linux/Windows)
#   dim                  Hidden dimension from model config.json (hidden_size).
#                        Unused by real inference; only affects Tier 3 simulation cost.
#   simulated_latency_ms Artificial delay per batch in ms. Set to 0 for real inference.
#   max_rps              Max requests/second for this model. 0 = unlimited.

Environment variable overrides

All config fields can be overridden at runtime with AXON__<SECTION>__<KEY>:

AXON__SERVER__PORT=8080
AXON__SERVER__REQUEST_TIMEOUT_MS=60000
AXON__BATCHING__MAX_BATCH_SIZE=8
AXON__BATCHING__MAX_WAIT_MS=10
AXON__WORKER_POOL__THREADS=4
AXON__KV_CACHE__CAPACITY_F32_SLOTS=67108864

HF_TOKEN is not read from config/default.toml — always pass it as a shell environment variable.
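
The naming pattern is mechanical: upper-case the section and the key, then join them with double underscores. A small sketch of that mapping; `env_override_name` is a hypothetical helper for scripts and tests, not a function axon exposes:

```rust
/// Builds the override variable name for a config section and key,
/// following the AXON__<SECTION>__<KEY> pattern described above.
pub fn env_override_name(section: &str, key: &str) -> String {
    format!("AXON__{}__{}", section.to_uppercase(), key.to_uppercase())
}
```

For instance, `env_override_name("batching", "max_wait_ms")` yields `AXON__BATCHING__MAX_WAIT_MS`, matching the list above.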

Per-machine tuning guide

Apple Silicon (Metal)

[server]
request_timeout_ms = 30000   # 30 s is safe for 7B models

[batching]
max_batch_size = 4            # Metal serialises GPU work; small batches reduce latency
max_wait_ms    = 5

NVIDIA GPU (CUDA)

[batching]
max_batch_size = 16           # GPUs parallelise batch dimensions efficiently
max_wait_ms    = 10

CPU only

[server]
request_timeout_ms = 120000  # CPU inference is slow; allow 2 minutes for large models

[batching]
max_batch_size = 1            # One at a time; CPU can't parallelise model layers
max_wait_ms    = 1

API reference

Endpoints

| Method | Path | Body | Success |
|---|---|---|---|
| GET | /v1/health | | 200 HealthResponse |
| POST | /v1/generate | GenerateRequest | 200 InferenceResponse |
| POST | /v1/generate/stream | GenerateRequest | 200 SSE stream |
| GET | /v1/models | | 200 [ModelConfig] |
| POST | /v1/models/load | ModelConfig | 201 Created |
| GET | /v1/metrics | | 200 Prometheus text |

GenerateRequest

{
  "id":          "UUID v4 — unique per request",
  "model_id":    "string — must match a loaded model",
  "prompt":      "string — 1 to 32 768 characters",
  "max_tokens":  "integer — 1 to 4096",
  "temperature": "float — 0.0 (greedy/deterministic) to 2.0 (highly random)",
  "stream":      "boolean — true for SSE token stream, false for full response"
}

temperature = 0.0 uses argmax (greedy decoding) — deterministic, best for code or factual answers. temperature = 0.7 is a good default for natural language. temperature > 1.0 produces increasingly random output.
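
The documented ranges translate directly into a validation check that a client can run before sending a request. This is a hand-written sketch, not the server's `#[derive(Validated)]` logic; `ValidationError` and `validate` are hypothetical names, and counting the prompt limit in characters rather than bytes is an assumption taken from the schema text.

```rust
#[derive(Debug, PartialEq)]
pub enum ValidationError {
    PromptLength(usize),
    MaxTokens(u32),
    Temperature(f32),
}

/// Checks the three ranged fields of a GenerateRequest.
pub fn validate(prompt: &str, max_tokens: u32, temperature: f32) -> Result<(), ValidationError> {
    let chars = prompt.chars().count();
    if !(1..=32_768).contains(&chars) {
        return Err(ValidationError::PromptLength(chars));
    }
    if !(1..=4096).contains(&max_tokens) {
        return Err(ValidationError::MaxTokens(max_tokens));
    }
    // A NaN temperature also fails this range check.
    if !(0.0..=2.0).contains(&temperature) {
        return Err(ValidationError::Temperature(temperature));
    }
    Ok(())
}
```

A request that fails any of these checks is the kind the server answers with 422 Unprocessable Entity.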

InferenceResponse

{
  "id":               "UUID — echoed from request",
  "model_id":         "string",
  "text":             "string — generated text (empty on engine error)",
  "tokens_generated": "integer — actual tokens produced (may be < max_tokens if EOS hit)",
  "latency_ms":       "integer — wall time from engine start to response"
}

ModelConfig

All fields except id have defaults — you only need to set fields that differ:

{
  "id":                    "string — unique model name (required)",
  "hf_repo":               "string — HuggingFace repo (default: empty = autodiscover)",
  "revision":              "string — git ref (default: main)",
  "dtype":                 "string — f16 | bf16 | f32 (default: f16)",
  "device":                "string — metal | cpu | cuda:N (default: metal on macOS, cpu elsewhere)",
  "dim":                   "integer — hidden_size from config.json (default: 0; unused by real engine)",
  "simulated_latency_ms":  "integer — artificial delay, ms (default: 0)",
  "max_rps":               "integer — rate cap, 0 = unlimited (default: 0)"
}

HealthResponse

{
  "status":           "ok",
  "models_loaded":    2,
  "requests_total":   1024,
  "requests_active":  3
}

Streaming response format (SSE)

data: Once

data:  upon

data:  a

data:  time

event: done
data:

Each data: line is one decoded token string. The final event: done signals end of generation.
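
A client can reassemble the generated text by collecting the data: payloads until the done event arrives. A minimal sketch, assuming the whole response body is already in memory; `parse_sse_tokens` is a hypothetical helper, and it ignores SSE features the format above does not use (multi-line data fields, ids, retries):

```rust
/// Collects token strings from an SSE body, stopping at `event: done`.
pub fn parse_sse_tokens(body: &str) -> Vec<String> {
    let mut tokens = Vec::new();
    for line in body.lines() {
        if let Some(name) = line.strip_prefix("event:") {
            if name.trim() == "done" {
                break; // end of generation
            }
        }
        if let Some(rest) = line.strip_prefix("data:") {
            // Per the SSE format, exactly one leading space after the colon
            // is a separator; any further spaces belong to the token.
            let value = rest.strip_prefix(' ').unwrap_or(rest);
            tokens.push(value.to_string());
        }
    }
    tokens
}
```

Concatenating the returned tokens for the example stream above yields "Once upon a time": the second space in `data:  upon` is part of the token itself.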

Error responses

| Status | Condition |
|---|---|
| 404 Not Found | model_id not in registry |
| 422 Unprocessable Entity | Request validation failed (prompt too long, max_tokens out of range, etc.) |
| 429 Too Many Requests | max_rps exceeded for this model |
| 503 Service Unavailable | Internal dispatch channel full — server overloaded |
| 504 Gateway Timeout | Inference did not complete within request_timeout_ms |

Runtime model loading

In addition to pre-configuring models in config/default.toml, you can load models dynamically at runtime via the API. The engine initialisation (including any weight download) runs on a background thread so the POST returns immediately with 201 Created.

# Load SmolLM2 at runtime (no token needed)
curl -X POST http://localhost:3000/v1/models/load \
  -H 'Content-Type: application/json' \
  -d '{
    "id": "smollm2",
    "hf_repo": "HuggingFaceTB/SmolLM2-1.7B-Instruct",
    "dtype": "f16",
    "device": "metal"
  }'
# 201 Created

# Load a simulation model (no download, instant)
curl -X POST http://localhost:3000/v1/models/load \
  -H 'Content-Type: application/json' \
  -d '{
    "id": "sim-large",
    "dim": 512,
    "simulated_latency_ms": 100
  }'
# 201 Created

# Load a gated model (HF_TOKEN must be in the server's environment)
curl -X POST http://localhost:3000/v1/models/load \
  -H 'Content-Type: application/json' \
  -d '{
    "id": "llama3-3b",
    "hf_repo": "meta-llama/Llama-3.2-3B-Instruct",
    "dtype": "f16",
    "device": "metal"
  }'
# 201 Created (download starts in background)

# List all loaded models
curl -s http://localhost:3000/v1/models | jq '.[].id'

Note: after POST /v1/models/load returns, the engine download may still be in progress in the background. Requests to that model will be served by Tier 3 simulation until the real engine is ready.


Use cases

1. Batch inference for high-concurrency workloads

axon coalesces concurrent requests into batches automatically. A batch flushes when it reaches max_batch_size or max_wait_ms elapses. Under heavy load, full batches dispatch immediately. Under light load, no request waits longer than max_wait_ms (5 ms default).

# Ten services fire concurrently; axon coalesces them into batches of up to max_batch_size
for i in $(seq 1 10); do
  curl -s -X POST http://localhost:3000/v1/generate \
    -H 'Content-Type: application/json' \
    -d "{\"id\":\"$(uuidgen | tr '[:upper:]' '[:lower:]')\",\"model_id\":\"smollm2\",\"prompt\":\"Summarize: $i\",\"max_tokens\":50,\"temperature\":0.5,\"stream\":false}" &
done
wait
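
The flush rule (size reached or deadline elapsed, whichever comes first) can be sketched with a standard-library channel. This is not the BatchAssembler implementation, which the workspace layout describes as an async loop built on a biased select!. `next_batch` is a hypothetical blocking equivalent, and starting the deadline at the first request of each batch is an assumption.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::{Duration, Instant};

/// Collects one batch: returns when `max_batch_size` items have arrived
/// or `max_wait` has elapsed since the first item, whichever comes first.
/// Returns `None` once the channel is closed and drained.
pub fn next_batch<T>(rx: &Receiver<T>, max_batch_size: usize, max_wait: Duration) -> Option<Vec<T>> {
    // Block until the first request arrives; the deadline starts here.
    let first = rx.recv().ok()?;
    let deadline = Instant::now() + max_wait;
    let mut batch = vec![first];
    while batch.len() < max_batch_size {
        let remaining = deadline.saturating_duration_since(Instant::now());
        match rx.recv_timeout(remaining) {
            Ok(item) => batch.push(item),
            // Deadline hit or all senders gone: flush what we have.
            Err(RecvTimeoutError::Timeout) | Err(RecvTimeoutError::Disconnected) => break,
        }
    }
    Some(batch)
}
```

With `max_batch_size = 4`, ten queued requests come out as batches of 4, 4, and 2. Under light load a lone request is flushed after at most `max_wait`.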

2. Real-time token streaming for chat UIs

curl -N -X POST http://localhost:3000/v1/generate/stream \
  -H 'Content-Type: application/json' \
  -d '{
    "id": "00000000-0000-0000-0000-000000000010",
    "model_id": "smollm2",
    "prompt": "Write a short poem about Rust:",
    "max_tokens": 60,
    "temperature": 0.8,
    "stream": true
  }'

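Browser integration (fetch-based, because EventSource can only issue GET requests and this endpoint is POST):

// request: a GenerateRequest object with stream set to true
const res = await fetch('/v1/generate/stream', {
  method: 'POST', headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify(request) });
const reader = res.body.pipeThrough(new TextDecoderStream()).getReader();
for (let c = await reader.read(); !c.done; c = await reader.read()) console.log(c.value); // raw SSE frames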

3. Multi-model serving

# config/default.toml — serve two models simultaneously
[[models]]
id = "fast"
hf_repo = "HuggingFaceTB/SmolLM2-1.7B-Instruct"
dtype = "f16"
device = "metal"

[[models]]
id = "quality"
hf_repo = "mistralai/Mistral-7B-Instruct-v0.3"
dtype = "f16"
device = "metal"
max_rps = 5    # protect the larger model from overload

4. Prometheus monitoring

curl -s http://localhost:3000/v1/metrics
# axon_requests_total 4200
# axon_requests_active 12
# axon_tokens_generated_total 86400
# axon_batches_dispatched_total 318
# axon_mean_batch_size 13.2
# axon_arena_utilization_ratio 0.32

Use axon_requests_active as an autoscaling signal: scale out when > 50, scale in when < 5.
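
A sketch of how an external autoscaler might apply those thresholds to the scraped text. `metric_value`, `ScaleAction`, and `scale_action` are hypothetical names, and the parser assumes unlabelled `name value` lines like the sample output above.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum ScaleAction {
    Out,
    In,
    Hold,
}

/// Finds an unlabelled metric in Prometheus text output.
pub fn metric_value(metrics_text: &str, name: &str) -> Option<f64> {
    metrics_text
        .lines()
        .filter(|line| !line.starts_with('#')) // skip HELP/TYPE comments
        .find_map(|line| {
            let (metric, value) = line.split_once(' ')?;
            if metric == name { value.trim().parse::<f64>().ok() } else { None }
        })
}

/// Applies the thresholds from the text: scale out above 50 active
/// requests, scale in below 5, otherwise hold.
pub fn scale_action(requests_active: f64) -> ScaleAction {
    if requests_active > 50.0 {
        ScaleAction::Out
    } else if requests_active < 5.0 {
        ScaleAction::In
    } else {
        ScaleAction::Hold
    }
}
```

In practice the decision would be smoothed over several scrapes so that a single spike does not trigger scaling.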

5. Per-model rate limiting

# Load a model capped at 10 requests per second
curl -X POST http://localhost:3000/v1/models/load \
  -H 'Content-Type: application/json' \
  -d '{"id":"expensive","hf_repo":"mistralai/Mistral-7B-Instruct-v0.3","max_rps":10}'
# Requests beyond 10 within the same second → 429 Too Many Requests
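
The configuration reference defines max_rps as a requests-per-second cap with 0 meaning unlimited. One simple way to enforce such a cap is a fixed one-second window, sketched below. How axon actually implements the limit is not stated here, so `RateLimiter` and the fixed-window choice are assumptions of this example.

```rust
use std::time::{Duration, Instant};

/// Fixed-window limiter: at most `max_rps` admissions per one-second window.
pub struct RateLimiter {
    max_rps: u32,
    window_start: Instant,
    count: u32,
}

impl RateLimiter {
    pub fn new(max_rps: u32, now: Instant) -> Self {
        Self { max_rps, window_start: now, count: 0 }
    }

    /// Returns true if the request is admitted, false if it should get a 429.
    /// `now` is passed in so the limiter is easy to test deterministically.
    pub fn try_acquire(&mut self, now: Instant) -> bool {
        if self.max_rps == 0 {
            return true; // 0 = unlimited
        }
        if now.duration_since(self.window_start) >= Duration::from_secs(1) {
            // A new window begins: reset the counter.
            self.window_start = now;
            self.count = 0;
        }
        if self.count < self.max_rps {
            self.count += 1;
            true
        } else {
            false
        }
    }
}
```

A fixed window allows short bursts at window boundaries. A token bucket would smooth those out at the cost of slightly more state.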

Workspace layout

axon/
├── Cargo.toml           workspace root — all shared dependency versions declared here
├── config/
│   └── default.toml     server, batching, kv_cache, [[models]] configuration
├── axon-core/           shared domain types (InferenceRequest, InferenceResponse,
│                        ModelConfig), typed errors, lock-free atomic metrics,
│                        JSON and bincode codecs
├── axon-batch/          BatchAssembler — deadline-or-size flush loop
│                        biased select! races size-trigger vs deadline-trigger
├── axon-worker/         WorkerPool — spawn_blocking → Rayon CPU bridge
│                        InferenceEngine — three-tier Candle/simulation engine
├── axon-cache/          KvArena — 64-byte-aligned bump allocator
│                        QuantizedKvArena — byte-level bump allocator for compressed KV slots
│                        ALL unsafe code in the repo is isolated here
├── axon-quant/          TurboQuant algorithm — rotation, codebooks, MSE/Prod encode/decode,
│                        bit packing, slot serialisation (no server deps; pure algorithm crate)
├── axon-macros/         proc macros: #[derive(Validated)], #[inference_route], schema!{}
└── axon-server/         axum 0.7 HTTP server
    ├── src/
    │   ├── main.rs      startup sequence (config → arena → worker pool → routes)
    │   ├── config.rs    AppConfig with [[models]] array support
    │   ├── state.rs     AppState — all shared handles (registry, assembler, pool, arena)
    │   ├── dispatcher.rs dispatch loop: batch → worker pool → response routing
    │   └── routes/
    │       ├── generate.rs   POST /v1/generate (non-streaming, with timeout)
    │       ├── stream.rs     POST /v1/generate/stream (SSE)
    │       ├── models.rs     GET/POST /v1/models and /v1/models/load
    │       ├── health.rs     GET /v1/health
    │       └── metrics.rs    GET /v1/metrics (Prometheus text format)
    └── benches/         Criterion benchmarks: batch throughput, serialization, KV cache

Dependency graph (no cycles):

axon-server → axon-batch  → axon-core
           → axon-worker → axon-cache
                         → axon-quant
                         → axon-core
           → axon-macros
           → axon-cache
           → axon-core

TurboQuant KV cache compression

The problem it solves

Without compression, every generated token costs ~128 KB of KV cache memory (for a 128-dimensional head). A 256 MB arena (capacity_f32_slots = 67108864) fills after roughly 2,000 tokens, and inference then degrades to heap allocation. Long-context tasks such as summarising a long document, multi-turn chat, and needle-in-haystack retrieval are effectively out of reach.

TurboQuant compresses the KV cache on the fly, with no training and no accuracy loss at 4-bit, letting the same 256 MB arena hold 14,000+ tokens instead.

How it works (briefly)

TurboQuant is a two-stage vector quantizer from Google Research (ICLR 2026):

  1. Random rotation — multiply the KV vector by a random orthogonal matrix Π. This spreads the energy uniformly across coordinates, so each one looks like an independent Gaussian. That makes scalar quantization provably near-optimal.
  2. Per-coordinate quantization — apply a precomputed Lloyd-Max codebook (1–4 bits per coordinate). The codebook is fixed and tiny; there is nothing to learn or fine-tune.

The optional TurboQuantProd variant adds a third step: a 1-bit QJL sketch of the quantization residual. This makes attention dot-products unbiased in expectation — important for long-context needle-in-haystack tasks where a slightly wrong attention score can cause the model to miss the relevant token.
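
The per-coordinate step can be illustrated with a 2-bit codebook. The sketch below uses the classic 4-level Lloyd-Max centroids for a unit-variance Gaussian as a stand-in. The tables axon-quant actually ships, and how it scales coordinates before lookup, are not shown in this README, so `CENTROIDS_2BIT`, `quantize`, and `dequantize` are illustrative assumptions. The rotation step is omitted.

```rust
/// 4-level Lloyd-Max centroids for a unit-variance Gaussian, ascending.
/// Stand-in values for illustration; not taken from axon-quant.
pub const CENTROIDS_2BIT: [f32; 4] = [-1.510, -0.4528, 0.4528, 1.510];

/// Maps a (rotated, unit-variance) coordinate to its nearest centroid index.
pub fn quantize(x: f32, centroids: &[f32]) -> u8 {
    // For a nearest-centroid quantizer the decision boundaries are the
    // midpoints between adjacent centroids.
    let boundaries: Vec<f32> = centroids.windows(2).map(|w| (w[0] + w[1]) / 2.0).collect();
    // Binary search: count the boundaries at or below x.
    boundaries.partition_point(|&b| b <= x) as u8
}

/// Reconstructs the coordinate as the centroid for `index`.
pub fn dequantize(index: u8, centroids: &[f32]) -> f32 {
    centroids[index as usize]
}
```

Because the codebook is fixed, encoding is a table lookup per coordinate, which is why the scheme needs no training or calibration data.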

What was implemented — axon-quant

axon-quant is a standalone Rust crate (no HTTP or server dependencies) containing the full algorithm:

| Module | What it does |
|---|---|
| codebook.rs | Lloyd-Max centroid tables for b=1,2,3,4 bits; Codebook trait with binary-search quantize / dequantize |
| rotation.rs | TurboQuantState — holds the rotation matrix Π (built once at engine init via QR factorisation); rotate / unrotate; next_bits() for 3.5-bit alternation |
| mse.rs | mse_encode / mse_decode — MSE-optimal path; returns MseEncoded { packed_indices, bits, dim } |
| prod.rs | prod_encode / prod_decode — inner-product-optimal path; stores MSE part + QJL sketch + residual norm; attention scores are unbiased |
| pack.rs | pack_bits / unpack_bits — sub-byte bit packing for arbitrary bit-widths 1–8 |
| slot.rs | Slot header serialisation — method byte, bits, dim, residual norm, packed indices, optional QJL bytes |
| arena.rs | QuantizedKvArena — same bump-allocator pattern as KvArena but over raw u8; allocate(bytes) → QuantizedKvSlot |
| error.rs | QuantError — DimMismatch, UnsupportedBits, PackingError |
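
To make the sub-byte packing in pack.rs concrete, here is a minimal sketch for bit-widths 1 to 8. The function names match the table, but the signatures and the LSB-first bit order are assumptions of this example; the crate's actual layout may differ.

```rust
/// Packs `indices` (each < 2^bits) into a byte vector, LSB first.
pub fn pack_bits(indices: &[u8], bits: u32) -> Vec<u8> {
    assert!((1..=8).contains(&bits));
    let mut out = vec![0u8; (indices.len() * bits as usize + 7) / 8];
    let mut bit_pos = 0usize;
    for &idx in indices {
        for b in 0..bits {
            if (idx >> b) & 1 == 1 {
                out[bit_pos / 8] |= 1 << (bit_pos % 8);
            }
            bit_pos += 1;
        }
    }
    out
}

/// Inverse of `pack_bits`: recovers `count` indices of width `bits`.
pub fn unpack_bits(packed: &[u8], bits: u32, count: usize) -> Vec<u8> {
    assert!((1..=8).contains(&bits));
    let mut out = Vec::with_capacity(count);
    let mut bit_pos = 0usize;
    for _ in 0..count {
        let mut value = 0u8;
        for b in 0..bits {
            if (packed[bit_pos / 8] >> (bit_pos % 8)) & 1 == 1 {
                value |= 1 << b;
            }
            bit_pos += 1;
        }
        out.push(value);
    }
    out
}
```

Five 3-bit indices occupy 15 bits and therefore 2 bytes, which matches the ceil(dim × bits / 8) size used in the slot layout.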

Compression ratios and context window gains

| Mode | Bits/coord | KV compression | 256 MB arena context | Quality |
|---|---|---|---|---|
| None (f16) | 16 | | ~2,000 tokens | Baseline |
| TurboQuantMse b=4 | 4 | 7.1× | ~14,000 tokens | Zero loss (paper Table 1) |
| TurboQuantMse b=3.5 (alternating) | 3.5 | | ~16,000 tokens | Zero loss (paper result) |
| TurboQuantMse b=3 | 3 | 9.1× | ~18,000 tokens | Minor degradation on long tasks |
| TurboQuantProd b=4 | ~4 | 2.8× | ~5,600 tokens | Unbiased attention — use for NIAH tasks |

The 3.5-bit target is achieved by TurboQuantState::next_bits() alternating between returning 3 and 4 via an atomic counter. Exact per-vector alternation order is not load-bearing for quality.
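
A sketch of how such an alternation could look. `BitAlternator` is a hypothetical stand-in for the relevant part of TurboQuantState, and which of 3 or 4 comes first is an arbitrary choice here.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Alternates between 3 and 4 bits so the average is 3.5 bits per coordinate.
pub struct BitAlternator {
    counter: AtomicUsize,
}

impl BitAlternator {
    pub const fn new() -> Self {
        Self { counter: AtomicUsize::new(0) }
    }

    /// Returns 3 on even-numbered calls and 4 on odd-numbered calls.
    /// Relaxed ordering suffices: only the counter's own value matters,
    /// and no other memory is synchronised through it.
    pub fn next_bits(&self) -> u8 {
        if self.counter.fetch_add(1, Ordering::Relaxed) % 2 == 0 { 3 } else { 4 }
    }
}
```

Under concurrent callers the order in which individual vectors receive 3 or 4 bits is not deterministic, but the overall split stays even, which is all the 3.5-bit average needs.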

Enabling TurboQuant

The feature is disabled by default. To enable it, add to config/default.toml:

[kv_quantization]
enabled = true
bits    = 4       # 4 = zero quality loss; alternate 3/4 by setting to 3 (see next_bits)

Per-model override inside a [[models]] block:

[[models]]
id      = "llama3-8b"
hf_repo = "meta-llama/Meta-Llama-3-8B-Instruct"
dtype   = "f16"
device  = "metal"
dim     = 4096

[models.kv_quant]
method         = "turbo_quant_mse"   # or "turbo_quant_prod" for unbiased attention
bits           = 4
rotation_seed  = 0x1EADBEEFCAFEBABE  # same seed → same rotation matrix across restarts; TOML integers must fit in a signed 64-bit value

When enabled, three new Prometheus metrics appear:

axon_kv_quant_encodes_total           # cumulative encode operations
axon_kv_quant_encode_latency_us_total # cumulative µs spent encoding
axon_quant_arena_utilization_ratio    # quant arena fill fraction (0.0–1.0)

Slot memory layout

Each compressed KV vector is stored in a QuantizedKvSlot with this binary layout:

Offset  Size   Field
0       1 B    method  (0 = MSE, 1 = Prod)
1       1 B    bits
2       2 B    dim     (u16 little-endian)
4       4 B    residual_norm (f32; 0.0 for MSE)
8       N B    packed_indices    N = ceil(dim × bits / 8)
8+N     dim B  qjl_sketch        one i8 per coord; only present for Prod variant

Example slot sizes at d=128 (one KV head):

| Variant | Bytes/slot | vs f16 (256 B) |
|---|---|---|
| MSE b=4 | 72 B | 3.6× smaller |
| MSE b=3 | 56 B | 4.6× smaller |
| Prod b=4 | 184 B | 1.4× smaller + unbiased |
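
These sizes can be reproduced from the slot layout: an 8-byte header, ceil(dim × bits / 8) bytes of packed indices, and dim more bytes for the Prod sketch. The helpers below are hypothetical. Note that the 184 B figure for "Prod b=4" only falls out of the layout if the packed indices use 3 bits per coordinate; that is an inference from the numbers, not something the table states.

```rust
/// Header: method (1) + bits (1) + dim (2) + residual_norm (4).
pub const HEADER_BYTES: usize = 8;

/// Slot size for the MSE variant.
pub fn mse_slot_bytes(dim: usize, bits: usize) -> usize {
    HEADER_BYTES + (dim * bits + 7) / 8
}

/// Slot size for the Prod variant: the MSE part plus one i8 per coordinate.
/// `mse_bits` is the width of the packed indices inside the slot.
pub fn prod_slot_bytes(dim: usize, mse_bits: usize) -> usize {
    mse_slot_bytes(dim, mse_bits) + dim
}

/// Ratio of an f16 vector (2 bytes per coordinate) to a slot of `slot_bytes`.
pub fn ratio_vs_f16(dim: usize, slot_bytes: usize) -> f64 {
    (dim * 2) as f64 / slot_bytes as f64
}
```

At dim = 128 this gives 72 B for MSE b=4 (256 / 72 ≈ 3.6×), 56 B for MSE b=3 (≈ 4.6×), and 184 B for Prod with 3-bit indices (≈ 1.4×).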

Running the quantization benchmarks

# Encode/decode latency and arena throughput
cargo bench -p axon-server quant_throughput

# Unit tests for the algorithm crate
cargo test -p axon-quant

# Round-trip quality gate (debug builds only — checks ‖x - decode(encode(x))‖² ≤ 1.1 × theoretical bound)
cargo test -p axon-quant -- --nocapture

Inference engine internals

Three-tier fallback chain

axon-worker/src/engine.rs is the only file that was modified to add real inference. Every other layer (batching, pool, dispatcher, all routes) is unchanged.

InferenceEngine::new(config, kv_arena) → always returns Self (infallible)
  │
  ├─ try_build_candle(config)
  │    │
  │    ├─ if config.hf_repo is non-empty:
  │    │    download config.json, tokenizer.json, safetensors shards via hf-hub
  │    │    parse LlamaConfig → Config (must be Llama-compatible architecture)
  │    │    load model weights via memory-mapped safetensors
  │    │    → Ok(CandleState) → EngineInner::Candle
  │    │
  │    ├─ if config.hf_repo is empty (autodiscovery):
  │    │    try HuggingFaceTB/SmolLM2-1.7B-Instruct
  │    │    try mistralai/Mistral-7B-Instruct-v0.3
  │    │    first reachable model wins
  │    │    → Ok(CandleState) → EngineInner::Candle
  │    │
  │    └─ any error (no network, no token, non-Llama arch, OOM, missing GPU)
  │         → Err(e)
  │
  └─ on Err: log WARN, build ndarray weight matrix
       → EngineInner::Simulation { weight_matrix }

compute(&self, req) → InferenceResponse (infallible)
  ├─ Candle path:  tokenize → prefill KV cache → autoregressive generation → decode
  └─ Simulation:   seed hidden state from prompt length → dim² matmul × max_tokens
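
The shape of the fallback chain can be sketched in a few lines of standard-library Rust. This is a simplified illustration, not the code in engine.rs: `EngineConfig`, `describe`, and the stub body of `try_build_candle` are assumptions, the weight matrix is a plain `Vec<f32>` instead of an ndarray `Array2<f32>`, and autodiscovery is omitted (here an empty `hf_repo` simply fails).

```rust
/// Stand-in for the real Candle state (hypothetical: holds only the repo name).
struct CandleState {
    repo: String,
}

enum EngineInner {
    Candle(CandleState),
    Simulation { weight_matrix: Vec<f32> },
}

/// Hypothetical subset of the engine configuration.
struct EngineConfig {
    hf_repo: String,
    dim: usize,
}

/// Stub loader: fails when no repo is configured, standing in for
/// "no network / no token / unsupported architecture".
fn try_build_candle(config: &EngineConfig) -> Result<CandleState, String> {
    if config.hf_repo.is_empty() {
        return Err("no model reachable".to_string());
    }
    Ok(CandleState { repo: config.hf_repo.clone() })
}

struct InferenceEngine {
    inner: EngineInner,
}

impl InferenceEngine {
    /// Infallible constructor: any error on the Candle path degrades to simulation.
    fn new(config: &EngineConfig) -> Self {
        let inner = match try_build_candle(config) {
            Ok(state) => EngineInner::Candle(state),
            Err(e) => {
                eprintln!("WARN: Candle unavailable ({e}); falling back to simulation");
                EngineInner::Simulation {
                    weight_matrix: vec![0.0; config.dim * config.dim],
                }
            }
        };
        Self { inner }
    }

    /// Reports which tier is active.
    fn describe(&self) -> String {
        match &self.inner {
            EngineInner::Candle(state) => format!("candle:{}", state.repo),
            EngineInner::Simulation { weight_matrix } => {
                format!("simulation:{}", weight_matrix.len())
            }
        }
    }
}

fn main() {
    let offline = EngineConfig { hf_repo: String::new(), dim: 4 };
    let online = EngineConfig { hf_repo: "org/model".to_string(), dim: 4 };
    println!("{}", InferenceEngine::new(&offline).describe());
    println!("{}", InferenceEngine::new(&online).describe());
}
```

Because the error is consumed inside `new`, callers never see a `Result`, which is what lets the server start regardless of which tier is available.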

EngineInner enum (private to engine.rs)

enum EngineInner {
    Candle(CandleState),
    Simulation { weight_matrix: Array2<f32> },
}

struct CandleState {
    device: candle_core::Device,          // Metal / CUDA / CPU
    dtype: candle_core::DType,            // F16 / BF16 / F32
    tokenizer: tokenizers::Tokenizer,     // HF fast tokenizer
    eos_token_id: u32,                    // token that stops generation
    model: Mutex<Llama>,                  // model weights — Mutex serialises concurrent requests
    llama_config: llama::Config,          // kept to create per-request Cache
}

Supported model architectures

The current engine uses candle_transformers::models::llama::Llama for real inference. This covers:

| Architecture | Example models | Support |
| --- | --- | --- |
| Llama 2 | meta-llama/Llama-2-*-hf | Full |
| Llama 3 / 3.1 / 3.2 | meta-llama/Meta-Llama-3-*, meta-llama/Llama-3.* | Full |
| Mistral | mistralai/Mistral-7B-* | Full (Llama-compatible) |
| SmolLM2 | HuggingFaceTB/SmolLM2-* | Full (Llama-based) |
| Phi-3 | microsoft/Phi-3-* | Falls back to simulation (needs candle_transformers::models::phi3 code path) |
| Gemma | google/gemma-* | Falls back to simulation (needs candle_transformers::models::gemma code path) |
| Falcon | tiiuae/falcon-* | Falls back to simulation |

Extending axon — adding new architectures

To add Phi-3 support, edit axon-worker/src/engine.rs:

  1. Add a new variant to EngineInner:
    Phi3(Phi3State),  // similar to CandleState but with candle_transformers::models::phi3::Phi3
  2. In try_build_candle, read config.json's "model_type" field and dispatch:
    let model_type = hf_config["model_type"].as_str().unwrap_or("");
    match model_type {
        "llama" | "mistral" => build_llama_engine(...)?,
        "phi3" => build_phi3_engine(...)?,
        _ => anyhow::bail!("unsupported model_type: {model_type}"),
    }
  3. Implement compute_phi3() analogously to compute_candle().

The WorkerPool, BatchAssembler, dispatcher, and all HTTP routes remain unchanged.


Architecture decisions

| ADR | Decision | Reason |
| --- | --- | --- |
| 001 | Tokio multi-threaded runtime | I/O-bound HTTP and CPU-bound inference require separate thread pools |
| 002 | axum 0.7 | Extractor model, first-class SSE, Tower middleware composability |
| 003 | Three-tier inference engine | Graceful degradation: always starts; Candle when available, ndarray otherwise |
| 004 | Dynamic batching (deadline-or-size) | biased select! coalesces concurrent requests; prevents starvation under load |
| 005 | `tokio::sync::RwLock<ModelRegistry>` | Read-heavy; async-aware, so a guard held across .await does not block a runtime thread |
| 006 | spawn_blocking + Rayon | Two thread pools; Tokio threads never blocked by CPU/GPU-bound inference |
| 007 | SSE + unbounded per-request channel | Worker progress not gated on client read speed |
| 008 | KV cache memory arena (unsafe) | Eliminates per-allocation overhead; isolated in one crate |
| 009 | `Mutex<Llama>` for concurrent inference | Metal serialises GPU work anyway; Mutex is simpler than RwLock with &self forward |
| 010 | LlamaConfig → Config two-step | HF JSON format (LlamaConfig) is serde-able; internal Config is what Llama::load and Cache::new take |
| 011 | hf-hub for weight management | Handles auth, sharding, caching, retries; no manual download logic |
| 012 | Memory-mapped safetensors (unsafe) | Avoids loading entire model into heap; OS pages in only accessed weights |
| 013 | anyhow for engine errors → simulation fallback | Library errors (candle) are opaque; we only need the message string to log and degrade |
| 014 | TurboQuant as a standalone axon-quant crate | Pure algorithm with no server deps; independently testable; can be benchmarked or reused without pulling in axum/Candle |
| 015 | One TurboQuantState per engine (not per layer) | Rotation matrix is 128² × 4 B = 65,536 B (64 KiB); shared via Arc across Rayon threads at zero clone cost |
| 016 | Disabled by default (kv_quantization.enabled = false) | Allows users to opt in after verifying the baseline; avoids surprising behaviour changes on existing deployments |
| 017 | QuantizedKvArena separate from KvArena | Keeps the existing unsafe bump allocator unchanged; separation of concerns between f32 slots and raw byte slots |

Key concepts by location

| Concept | Location |
| --- | --- |
| Three-tier engine fallback | axon-worker/src/engine.rs:try_build_candle() |
| Candle LLM generation loop | axon-worker/src/engine.rs:run_generation() |
| Open-model autodiscovery | axon-worker/src/engine.rs:autodiscover_repo() |
| Safetensors shard loading | axon-worker/src/engine.rs:download_weight_shards() |
| Device selection (Metal / CUDA / CPU) | axon-worker/src/engine.rs:build_device() |
| `RwLock<T>`: read-heavy concurrency | axon-server/src/routes/models.rs |
| AtomicU64: lock-free metrics | axon-core/src/metrics.rs |
| select! with biased: deadline batching | axon-batch/src/assembler.rs |
| tokio_stream + axum SSE | axon-server/src/routes/stream.rs |
| spawn_blocking → Rayon bridge | axon-worker/src/pool.rs |
| par_iter CPU parallelism | axon-worker/src/pool.rs |
| unsafe bump allocator + PhantomData lifetime | axon-cache/src/arena.rs |
| #[derive(Validated)] proc macro | axon-macros/src/lib.rs |
| Graceful shutdown | axon-server/src/main.rs |
| Config hierarchy (TOML + env) | axon-server/src/config.rs:AppConfig::load() |
| [[models]] array deserialization | axon-server/src/config.rs:AppConfig.models |
| TurboQuant rotation matrix init | axon-quant/src/rotation.rs:TurboQuantState::new() |
| TurboQuant MSE encode/decode | axon-quant/src/mse.rs:mse_encode() / mse_decode() |
| TurboQuant Prod (unbiased attention) | axon-quant/src/prod.rs:prod_encode() / prod_decode() |
| Lloyd-Max codebooks (b=1..4) | axon-quant/src/codebook.rs |
| Sub-byte bit packing | axon-quant/src/pack.rs:pack_bits() / unpack_bits() |
| Quantized KV slot layout | axon-quant/src/slot.rs |
| Quantized bump allocator | axon-quant/src/arena.rs:QuantizedKvArena |

Running tests and benchmarks

# Unit + integration tests across the workspace
cargo test --workspace

# Run with verbose output
cargo test --workspace -- --nocapture

# Criterion benchmarks (HTML report written to target/criterion/)
cargo bench -p axon-server

# Verify unsafe surface area (should show unsafe only in axon-cache)
cargo install cargo-geiger
cargo geiger --workspace

# Check for unused dependencies
cargo install cargo-udeps
cargo +nightly udeps --workspace

Docker

The current Dockerfile uses gcr.io/distroless/cc-debian12 as the runtime image, which works for the simulation tier. Docker is not applicable to real inference with Metal: Metal is only available to processes running natively on macOS on Apple hardware, not inside a Linux container.

For Linux CPU inference:

docker build -t axon .

docker run -p 3000:3000 \
  -e AXON__BATCHING__MAX_BATCH_SIZE=4 \
  -e AXON__SERVER__REQUEST_TIMEOUT_MS=120000 \
  -e HF_TOKEN=hf_xxx \
  -v $HOME/.cache/huggingface:/root/.cache/huggingface \
  axon

The -v mount reuses the host's HuggingFace cache so models aren't re-downloaded on every docker run.

For NVIDIA GPU (CUDA):

The runtime image needs CUDA libraries. Modify Dockerfile:

# Builder stage stays the same; change the runtime base:
FROM nvcr.io/nvidia/cuda:12.4.0-runtime-ubuntu22.04

Also change candle-core in Cargo.toml from features = ["metal"] to features = ["cuda"].

docker run --gpus all -p 3000:3000 -e HF_TOKEN=hf_xxx axon

AWS deployment

Internet → ALB (:443) → ECS Fargate (axon-server)
                       → CloudWatch Logs (tracing JSON)
                       → EFS mount (~/.cache/huggingface — shared across tasks)

Autoscale: axon_requests_active > 50 for 2 min → scale out
           axon_requests_active < 5  for 5 min → scale in

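For real inference on AWS, use GPU instances. Fargate does not offer GPU-backed tasks, so this means running the ECS service on EC2 capacity rather than Fargate: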

| Instance | GPU | VRAM | Recommended model |
| --- | --- | --- | --- |
| g4dn.xlarge | T4 | 16 GB | SmolLM2, Llama 3.2 3B |
| g5.xlarge | A10G | 24 GB | Llama 2 7B, Llama 3 8B |
| p3.2xlarge | V100 | 16 GB | SmolLM2, Llama 3.2 3B |
| p4d.24xlarge | 8× A100 | 8×40 GB | Multi-model serving |

Recommended task definition env vars:

AXON__SERVER__PORT=3000
AXON__SERVER__REQUEST_TIMEOUT_MS=30000
AXON__BATCHING__MAX_BATCH_SIZE=8
AXON__BATCHING__MAX_WAIT_MS=10
AXON__KV_CACHE__CAPACITY_F32_SLOTS=67108864
RUST_LOG=axon_server=info,axon_worker=info
HF_TOKEN=<from AWS Secrets Manager>

Safety

Hand-written unsafe code is isolated to axon-cache/src/arena.rs, and every unsafe block has a // SAFETY: comment. The rest of the workspace is safe Rust, with one exception in the Candle integration code in axon-worker/src/engine.rs: the call to VarBuilder::from_mmaped_safetensors needs an unsafe block because that function is itself an unsafe fn. Memory-mapping is unsafe because the mapped file can be modified on disk while the mapping is live, which the compiler cannot rule out.

Arena invariants:

  • buf is non-null, 64-byte aligned, valid for capacity f32 reads/writes
  • AtomicUsize bump allocation guarantees non-overlapping ranges across threads
  • unsafe impl Send + Sync sound: shared state is only the atomic bump pointer
  • Drop deallocates with the exact layout used at construction

cargo install cargo-geiger
cargo geiger --workspace
# unsafe: 0 in all crates except axon-cache

Known limitations

1. Streaming delivers all tokens after full generation (not true per-token)

The dispatcher loop receives the complete InferenceResponse from the worker pool, then simulates per-token delivery by splitting response.text. Real token-by-token streaming (first token arriving in ~100 ms) requires passing the token_tx sender into the engine's generation loop. This is a dispatcher.rs + pool.rs follow-on change and does not affect any public API.
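
A minimal sketch of that follow-on change, assuming a generation loop that takes the sender as a parameter. `generate_streaming` is a hypothetical name, the "model" is a stub that fabricates tokens, and `std::sync::mpsc` stands in for whatever per-request channel the dispatcher actually uses.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical generation loop: sends each token as soon as it is produced
/// instead of returning the full text at the end.
fn generate_streaming(prompt: &str, max_tokens: usize, token_tx: mpsc::Sender<String>) {
    for i in 0..max_tokens {
        // Stand-in for one decode step of the model.
        let token = format!("{prompt}-{i}");
        // A send error means the receiving side was dropped (client gone),
        // so stop generating early.
        if token_tx.send(token).is_err() {
            break;
        }
    }
    // `token_tx` is dropped here, which closes the channel and ends the stream.
}

fn main() {
    let (tx, rx) = mpsc::channel();
    // The worker thread plays the role of the inference pool.
    let worker = thread::spawn(move || generate_streaming("tok", 3, tx));
    // The receiving loop plays the role of the SSE route: it sees each token
    // as it arrives rather than after generation completes.
    for token in rx {
        println!("{token}");
    }
    worker.join().unwrap();
}
```

With this shape, time to first token depends on one decode step rather than on the whole generation, and a disconnected client stops the loop instead of wasting the remaining steps.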

2. Rate limiter is a lifetime counter, not a per-second window

rate_counters is a DashMap<ModelId, AtomicU64> that monotonically increments. Once a model has served max_rps total requests since startup, it is permanently rate-limited. A production fix uses a sliding window or token bucket with a background reset task.
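
One way the token-bucket fix could look is sketched below. `TokenBucket` and its methods are hypothetical names, not existing axon types. The sketch refills lazily on each call instead of using a background reset task, and it takes the current time as a parameter so the behaviour is deterministic. In the server, one bucket per model would need to sit behind a lock or be rewritten with atomics.

```rust
use std::time::{Duration, Instant};

/// Token bucket: holds at most `capacity` tokens and refills at `rate`
/// tokens per second. Each admitted request consumes one token.
struct TokenBucket {
    capacity: f64,
    tokens: f64,
    rate: f64,
    last_refill: Instant,
}

impl TokenBucket {
    /// Starts full, so a burst of up to `max_rps` requests is admitted at once.
    fn new(max_rps: u32, now: Instant) -> Self {
        let cap = f64::from(max_rps);
        Self { capacity: cap, tokens: cap, rate: cap, last_refill: now }
    }

    /// Returns true if a request arriving at `now` is admitted.
    fn try_acquire(&mut self, now: Instant) -> bool {
        // Lazy refill: credit the tokens earned since the previous call,
        // capped at the bucket capacity.
        let elapsed = now.saturating_duration_since(self.last_refill).as_secs_f64();
        self.tokens = (self.tokens + elapsed * self.rate).min(self.capacity);
        self.last_refill = now;
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}

fn main() {
    let start = Instant::now();
    let mut bucket = TokenBucket::new(2, start);
    let later = start + Duration::from_secs(1);
    for (label, at) in [("t=0s", start), ("t=0s", start), ("t=0s", start), ("t=1s", later)] {
        println!("{label}: admitted = {}", bucket.try_acquire(at));
    }
}
```

Unlike the lifetime counter, the bucket recovers on its own: a model that was throttled at t=0 admits requests again once enough time has passed.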

3. KV arena not reset between batches (simulation only)

The KvArena bump pointer only grows. After enough requests the arena fills and engine.compute() falls back to heap allocation silently. Fix: call kv_arena.reset() in dispatcher.rs after each batch. This affects Tier 3 simulation only — Candle manages its own memory.
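
The effect of a missing reset can be seen in an offset-only model of a bump arena. `BumpOffsets` is a hypothetical type written in safe Rust for illustration: it tracks which f32 slot ranges are handed out but owns no memory, so it is not the KvArena implementation.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Offset-only model of a bump arena: records how many slots are handed out.
struct BumpOffsets {
    capacity: usize,
    next: AtomicUsize,
}

impl BumpOffsets {
    fn new(capacity: usize) -> Self {
        Self { capacity, next: AtomicUsize::new(0) }
    }

    /// Reserves `len` slots and returns the start offset, or None when the
    /// arena is full. The atomic update keeps ranges non-overlapping across threads.
    fn alloc(&self, len: usize) -> Option<usize> {
        self.next
            .fetch_update(Ordering::AcqRel, Ordering::Acquire, |cur| {
                let end = cur.checked_add(len)?;
                if end <= self.capacity { Some(end) } else { None }
            })
            .ok() // Ok carries the previous value, which is the start offset
    }

    /// Makes the whole arena reusable. Only correct once nothing from the
    /// previous batch still refers to the old ranges.
    fn reset(&self) {
        self.next.store(0, Ordering::Release);
    }
}

fn main() {
    let arena = BumpOffsets::new(8);
    println!("{:?}", arena.alloc(5)); // first batch
    println!("{:?}", arena.alloc(5)); // does not fit without a reset
    arena.reset();
    println!("{:?}", arena.alloc(5)); // fits again after the reset
}
```

The second allocation fails even though the first batch is finished, which is the point at which the real engine falls back to heap allocation. Resetting between batches avoids that, provided no slice from the previous batch is still in use.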

4. Phi-3, Gemma, Falcon architectures fall back to simulation

The current Candle engine path handles Llama-family models only (config.json's model_type: llama | mistral). Other architectures fail to deserialize LlamaConfig and trigger Tier 3 fallback with a WARN log. Extend engine.rs with additional candle_transformers model modules to support them.

5. No OpenAI-compatible API

Request and response schemas are axon-native. To use axon as a drop-in replacement for the OpenAI chat completions API, wrap the routes in an adapter at the axum router level — the batching pipeline below the routes is unchanged.
