axon

Production-grade async AI inference server in Rust — real LLM inference via HuggingFace Candle, dynamic request batching, SSE token streaming, multi-model registry, Prometheus metrics, a Rayon worker pool, and TurboQuant KV cache compression for 7× longer context windows.

axon ships with a three-tier inference engine that degrades gracefully:

Tier	When active	Behavior
1 — Candle + configured model	`hf_repo` set in config + model reachable	Real LLM inference via HuggingFace Candle (Metal / CUDA / CPU)
2 — Candle + open-model autodiscovery	`hf_repo` empty, network available	Downloads SmolLM2-1.7B or Mistral-7B automatically (no token required)
3 — ndarray simulation	No network / no GPU / unsupported arch	Original fast simulation — `work ∝ dim² × max_tokens`; server always starts

The server always starts. Tier selection is logged at startup. Every layer below the engine — batching, back-pressure, streaming, arena memory, HTTP routes — is unchanged across all three tiers.

Hardware requirements and model selection
HuggingFace account setup
Quickstart — simulation mode (zero setup)
Real inference — open models (no token)
Real inference — gated models (Llama 2 / 3)
Configuration reference
API reference
Runtime model loading
Use cases
Workspace layout
TurboQuant KV cache compression
Architecture decisions
Key concepts by location
Running tests and benchmarks
Docker
AWS deployment
Safety
Known limitations

Hardware requirements and model selection

Minimum requirements

To run simulation mode (Tier 3) — which requires nothing beyond a modern laptop:

Any CPU
512 MB RAM
No GPU required
No network required

Open-model requirements (Tier 2 autodiscovery or Tier 1 with open repos)

Model	Parameters	Disk (f16)	RAM needed	Minimum hardware	Recommended
SmolLM2-1.7B	1.7 B	3.4 GB	4 GB	Any modern laptop (2019+)	Best starting point
Phi-3 Mini	3.8 B	7.6 GB	9 GB	16 GB RAM laptop	Strong reasoning for size
Mistral 7B	7 B	14 GB	16 GB	32 GB RAM or GPU	Best open 7B

Gated-model requirements (Meta/Google license, token required)

Model	Parameters	Disk (f16)	RAM needed	Minimum hardware	Notes
Llama 3.2 3B	3 B	6 GB	8 GB	16 GB RAM	Best Llama for dev machines
Llama 2 7B	7 B	13.5 GB	16 GB	32 GB RAM or 16 GB VRAM	Classic baseline
Llama 3 8B	8 B	16 GB	18 GB	32 GB RAM or 24 GB VRAM	Best open-weights 8B
Llama 2 13B	13 B	26 GB	28 GB	32 GB RAM + Metal/CUDA	High quality, heavy
Llama 3 70B	70 B	140 GB (f16) / ~40 GB (Q4)	48 GB+	Mac Studio M2 Ultra 192 GB or A100	Not recommended for dev
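
The "Disk (f16)" columns above follow from simple arithmetic: parameter count times bytes per parameter. A minimal sketch of that rule of thumb follows; the `Dtype` enum and `weight_gb` helper are illustrative names and not part of axon, and the estimate covers weights only (no KV cache or runtime overhead).

```rust
/// Weight precision options, mirroring the `dtype` config values.
#[derive(Clone, Copy)]
enum Dtype {
    F16,
    Bf16,
    F32,
}

impl Dtype {
    /// Bytes used to store one parameter at this precision.
    fn bytes_per_param(self) -> f64 {
        match self {
            Dtype::F16 | Dtype::Bf16 => 2.0,
            Dtype::F32 => 4.0,
        }
    }
}

/// Approximate weight size in GB (10^9 bytes) for a model with
/// `params_billions` billion parameters. Runtime overhead is not included.
fn weight_gb(params_billions: f64, dtype: Dtype) -> f64 {
    params_billions * dtype.bytes_per_param()
}
```

For example, `weight_gb(1.7, Dtype::F16)` gives about 3.4, matching the SmolLM2 row, and `weight_gb(70.0, Dtype::F16)` gives 140.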

Apple Silicon (M1 / M2 / M3 / M4) — recommended path

Apple Silicon has unified memory shared between CPU and GPU. There is no separate VRAM limit — the model lives in the same pool as the OS and other apps. Use device = "metal" for GPU acceleration via Candle's Metal backend.

Chip	Total memory	What fits comfortably	Recommended model
M1 / M2 8 GB	8 GB	SmolLM2-1.7B (f16) only	SmolLM2-1.7B
M1 / M2 16 GB	16 GB	SmolLM2, Phi-3, Llama 3.2 3B	Llama 3.2 3B
M1 Pro / M2 Pro 32 GB	32 GB	Any 7B f16, Llama 3 8B	Llama 2 7B or Mistral 7B
M2 Max / M3 Max 64 GB	64 GB	Llama 3 8B, 13B	Llama 3 8B or 13B
M4 Max 128 GB	128 GB	Everything up to 70B Q4	Llama 3 8B recommended for daily use
M2 Ultra 192 GB	192 GB	70B models in f16	Llama 3 70B

Expected inference speed on Apple Silicon with Metal (f16):

Model	M1 16 GB	M2 Pro 32 GB	M3 Max 64 GB	M4 Max 128 GB
SmolLM2-1.7B	~50 tok/s	~70 tok/s	~80 tok/s	~90 tok/s
Llama 3.2 3B	—	~45 tok/s	~55 tok/s	~60 tok/s
Llama 2 7B	—	~22 tok/s	~28 tok/s	~30 tok/s
Llama 3 8B	—	~18 tok/s	~22 tok/s	~25 tok/s

NVIDIA CUDA (Linux / Windows)

GPU	VRAM	What fits (f16)	Notes
RTX 3060	12 GB	SmolLM2, Phi-3, Llama 3.2 3B	Good dev GPU
RTX 3090 / 4090	24 GB	Any 7B / 8B f16	Ideal workstation
A10G (AWS g5)	24 GB	Any 7B / 8B f16	Cloud standard
A100 40 GB	40 GB	Llama 3 8B, Llama 2 13B	Production cloud
A100 80 GB	80 GB	Up to 13B f16; 70B only when quantized (~40 GB at Q4, since 70B f16 needs 140 GB)	Production cloud
H100 80 GB	80 GB	Up to 13B f16; 70B only when quantized (~40 GB at Q4)	Fastest available

To use CUDA, set device = "cuda:0" in the model config and enable features = ["cuda"] (instead of "metal") on the candle-core workspace dependency.

CPU-only (Linux, Windows, Intel Mac)

CPU inference works with no feature changes. Use device = "cpu" and prefer smaller models. f32 is faster than f16 on CPU because f16 operations are emulated:

Model	RAM	Speed on modern 8-core CPU	Practical?
SmolLM2-1.7B (f32)	7 GB	~4 tok/s	Yes — usable for dev
Phi-3-mini (f32)	15 GB	~2 tok/s	Slow but works
Llama 2 7B (f32)	28 GB	~0.5 tok/s	Very slow; use a smaller model or a GPU

HuggingFace account setup

Open models — no account needed

The following models require neither a HuggingFace account nor a token:

HuggingFaceTB/SmolLM2-1.7B-Instruct — Apache 2.0
microsoft/Phi-3-mini-4k-instruct — MIT
mistralai/Mistral-7B-Instruct-v0.3 — Apache 2.0

Just set hf_repo in your config and run. Files download on first startup.

Gated models — account + token + license acceptance required

Models from Meta (Llama family) and Google (Gemma) require four steps:

Step 1 — Create a HuggingFace account

Go to huggingface.co and sign up. Free accounts work.

Step 2 — Accept the model license

Visit the model card and click "Agree and access repository":

Model	License page
Llama 3.2 3B	`huggingface.co/meta-llama/Llama-3.2-3B-Instruct`
Llama 3 8B	`huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct`
Llama 2 7B	`huggingface.co/meta-llama/Llama-2-7b-chat-hf`
Llama 2 13B	`huggingface.co/meta-llama/Llama-2-13b-chat-hf`
Gemma 7B	`huggingface.co/google/gemma-7b-it`

Approval is usually instant for Llama 3.x. Llama 2 may take a few minutes. You receive a confirmation email.

Step 3 — Create a read access token

Go to huggingface.co/settings/tokens
Click New token
Type: Read
Name it anything (e.g., axon-local)
Copy the hf_... value

Step 4 — Set the token in your environment

export HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

Add to ~/.zshrc or ~/.bashrc to persist across terminal sessions:

echo 'export HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx' >> ~/.zshrc
source ~/.zshrc

Never commit the token to source control. axon reads it from the environment — never from config files.

Model cache

Models download once to ~/.cache/huggingface/hub/. Subsequent server startups load from the cache without re-downloading.

# Check what has been downloaded
du -sh ~/.cache/huggingface/hub/

# List cached model snapshots (each snapshot directory is named after a commit hash)
ls ~/.cache/huggingface/hub/models--*/snapshots/*/

# Remove a cached model to free disk space
rm -rf ~/.cache/huggingface/hub/models--HuggingFaceTB--SmolLM2-1.7B-Instruct/

# Use a different cache location (e.g., external SSD); existing downloads are not moved
export HF_HOME=/Volumes/SSD/hf-cache
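
The directory names used in the commands above follow the Hugging Face hub cache convention of `models--<org>--<name>`. A small sketch of that mapping, useful when scripting cache cleanup; `cache_dir_for` is a hypothetical helper, not an axon or hf-hub function.

```rust
use std::path::PathBuf;

/// Directory the hub cache uses for a model repo:
/// "<hub_root>/models--<org>--<name>".
fn cache_dir_for(hub_root: &str, repo: &str) -> PathBuf {
    // Every '/' in the repo id becomes "--" in the directory name.
    PathBuf::from(hub_root).join(format!("models--{}", repo.replace('/', "--")))
}
```

With `hub_root` set to `~/.cache/huggingface/hub` (or `$HF_HOME/hub` when HF_HOME is set), the SmolLM2 repo maps to the directory removed in the example above.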

Approximate download times on a 100 Mbps connection:

Model	Size	Download time
SmolLM2-1.7B	3.4 GB	~5 min
Phi-3 Mini 3.8B	7.6 GB	~10 min
Llama 3.2 3B	6 GB	~8 min
Mistral 7B / Llama 2 7B	14 GB	~20 min
Llama 3 8B	16 GB	~22 min
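
The download times in the table are size divided by bandwidth. A minimal sketch for other link speeds, assuming 1 GB = 8,000 megabits and ignoring protocol overhead; `download_minutes` is an illustrative helper.

```rust
/// Estimated download time in minutes for `size_gb` gigabytes over a
/// link of `mbps` megabits per second.
fn download_minutes(size_gb: f64, mbps: f64) -> f64 {
    let megabits = size_gb * 1000.0 * 8.0;
    let seconds = megabits / mbps;
    seconds / 60.0
}
```

`download_minutes(3.4, 100.0)` is roughly 4.5, which the table rounds to ~5 min; on a 1 Gbps link the same download takes under half a minute.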

Quickstart — simulation mode (zero setup)

No GPU, no internet, no token required. Works on any machine.

Prerequisites: Rust 1.75+ (rustup update stable)

git clone https://github.com/dattgoswami/axon
cd axon
cargo run --release -p axon-server

Expected startup output:

WARN  axon_server: HF_TOKEN not set — gated models (Llama 2/3/3.2) will fail; open models autodiscovered automatically
WARN  axon_worker::engine: real inference unavailable — using ndarray simulation, model_id: default, error: all autodiscovery candidates unreachable — falling back to simulation
INFO  axon_server: axon starting port=3000
INFO  axon_server: listening on 0.0.0.0:3000

The WARN messages are expected — axon is falling back to simulation because there's no network or token configured. The server is fully functional.

Test the simulation

# Health check
curl -s http://localhost:3000/v1/health | jq
# {
#   "status": "ok",
#   "models_loaded": 1,
#   "requests_total": 0,
#   "requests_active": 0
# }

# Non-streaming inference (model_id "default" is pre-loaded)
curl -s -X POST http://localhost:3000/v1/generate \
  -H 'Content-Type: application/json' \
  -d '{
    "id": "00000000-0000-0000-0000-000000000001",
    "model_id": "default",
    "prompt": "Hello, axon",
    "max_tokens": 20,
    "temperature": 0.7,
    "stream": false
  }' | jq
# {
#   "id": "00000000-0000-0000-0000-000000000001",
#   "model_id": "default",
#   "text": "[axon/sim] generated 20 tokens for model 'default' (dim=64)",
#   "tokens_generated": 20,
#   "latency_ms": 0
# }

# SSE streaming
curl -N -X POST http://localhost:3000/v1/generate/stream \
  -H 'Content-Type: application/json' \
  -d '{
    "id": "00000000-0000-0000-0000-000000000002",
    "model_id": "default",
    "prompt": "Stream me tokens",
    "max_tokens": 5,
    "temperature": 0.0,
    "stream": true
  }'
# data: token_0
# data: token_1
# data: token_2
# data: token_3
# data: token_4
# event: done
# data:

# Prometheus metrics
curl -s http://localhost:3000/v1/metrics

Real inference — open models (no token)

SmolLM2-1.7B (recommended starting point)

Best choice for first real inference. Small enough to fit on any 8 GB machine, fast on Apple Silicon or any GPU, no token required.

Step 1 — Edit config/default.toml

Uncomment and save the SmolLM2 block:

[[models]]
id      = "smollm2"
hf_repo = "HuggingFaceTB/SmolLM2-1.7B-Instruct"
revision = "main"
dtype   = "f16"
device  = "metal"    # Apple Silicon GPU — change to "cpu" if no Metal
dim     = 2048
simulated_latency_ms = 0
max_rps = 0

For CPU-only machines (Intel Mac, Linux without GPU, Windows):

device = "cpu"
dtype  = "f32"    # f32 is faster than f16 on CPU

Step 2 — Run

cargo run --release -p axon-server
# First run: downloads 3.4 GB to ~/.cache/huggingface/hub/
# Subsequent runs: loads from cache in ~3-5 seconds

Expected startup output (first run):

WARN  axon_server: HF_TOKEN not set — ...open models autodiscovered automatically
INFO  axon_worker::engine: real inference engine ready, model_id: smollm2, hf_repo: HuggingFaceTB/SmolLM2-1.7B-Instruct, device: metal
INFO  axon_server: axon starting port=3000
INFO  axon_server: listening on 0.0.0.0:3000

Step 3 — Test

curl -s -X POST http://localhost:3000/v1/generate \
  -H 'Content-Type: application/json' \
  -d '{
    "id": "00000000-0000-0000-0000-000000000003",
    "model_id": "smollm2",
    "prompt": "Explain Rust ownership in one sentence:",
    "max_tokens": 80,
    "temperature": 0.3,
    "stream": false
  }' | jq

Phi-3 Mini 3.8B

Better reasoning than SmolLM2 at the cost of 2x the size. Good for 16 GB+ machines.

Edit config/default.toml:

[[models]]
id      = "phi3"
hf_repo = "microsoft/Phi-3-mini-4k-instruct"
revision = "main"
dtype   = "f16"
device  = "metal"
dim     = 3072
simulated_latency_ms = 0
max_rps = 0

Note: Phi-3 uses a different model architecture (phi3) than Llama. The current engine supports Llama-compatible architectures only. If config.json parsing fails (error: non-Llama arch), axon falls back to simulation and logs a WARN. Phi-3 support requires adding the candle_transformers::models::phi3 code path to engine.rs — see Extending axon — adding new architectures.

Mistral 7B Instruct v0.3

Best fully open 7B model. Llama-compatible architecture — works with the current engine. Requires 16 GB+ RAM or a GPU with 14 GB VRAM.

Edit config/default.toml:

[[models]]
id      = "mistral-7b"
hf_repo = "mistralai/Mistral-7B-Instruct-v0.3"
revision = "main"
dtype   = "f16"
device  = "metal"
dim     = 4096
simulated_latency_ms = 0
max_rps = 0

Multiple open models simultaneously

You can serve multiple models at once. Add multiple [[models]] blocks:

[[models]]
id      = "fast"
hf_repo = "HuggingFaceTB/SmolLM2-1.7B-Instruct"
dtype   = "f16"
device  = "metal"
dim     = 2048

[[models]]
id      = "quality"
hf_repo = "mistralai/Mistral-7B-Instruct-v0.3"
dtype   = "f16"
device  = "metal"
dim     = 4096

Route requests to different models by model_id:

# Fast model for autocomplete
curl -X POST http://localhost:3000/v1/generate \
  -H 'Content-Type: application/json' \
  -d '{"id":"...","model_id":"fast","prompt":"Complete: def fibonacci(","max_tokens":20,"temperature":0.1,"stream":false}'

# Quality model for complex tasks
curl -X POST http://localhost:3000/v1/generate \
  -H 'Content-Type: application/json' \
  -d '{"id":"...","model_id":"quality","prompt":"Explain gradient descent:","max_tokens":200,"temperature":0.5,"stream":false}'

Real inference — gated models (Llama 2 / 3)

These require HF_TOKEN to be set. See HuggingFace account setup first.

Llama 3.2 3B Instruct (recommended gated model)

Best balance of quality and size for developer machines. It needs about 8 GB of RAM for the f16 weights, so a 16 GB machine is the comfortable minimum.

Prerequisites: HF_TOKEN set, license accepted at huggingface.co/meta-llama/Llama-3.2-3B-Instruct

export HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

Edit config/default.toml:

[[models]]
id       = "llama3-3b"
hf_repo  = "meta-llama/Llama-3.2-3B-Instruct"
revision = "main"
dtype    = "f16"
device   = "metal"
dim      = 3072
simulated_latency_ms = 0
max_rps  = 0

Run:

cargo run --release -p axon-server
# First run: downloads ~6 GB

Inference with Llama 3 chat template:

Llama 3 uses a special chat format with control tokens. The prompt must include the correct template for instruct models:

curl -s -X POST http://localhost:3000/v1/generate \
  -H 'Content-Type: application/json' \
  -d '{
    "id": "00000000-0000-0000-0000-000000000004",
    "model_id": "llama3-3b",
    "prompt": "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nWhat is ownership in Rust?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
    "max_tokens": 200,
    "temperature": 0.7,
    "stream": false
  }' | jq .text

Llama 2 7B Chat

The classic Meta 7B model. Higher quality than Llama 3.2 3B for complex tasks but requires more RAM.

Prerequisites: HF_TOKEN set, license accepted at huggingface.co/meta-llama/Llama-2-7b-chat-hf

[[models]]
id       = "llama2-7b"
hf_repo  = "meta-llama/Llama-2-7b-chat-hf"
revision = "main"
dtype    = "f16"
device   = "metal"
dim      = 4096
simulated_latency_ms = 0
max_rps  = 0

Inference with Llama 2 chat template:

curl -s -X POST http://localhost:3000/v1/generate \
  -H 'Content-Type: application/json' \
  -d '{
    "id": "00000000-0000-0000-0000-000000000005",
    "model_id": "llama2-7b",
    "prompt": "[INST] Explain Rust lifetimes as if I am a Java engineer [/INST]",
    "max_tokens": 300,
    "temperature": 0.7,
    "stream": false
  }' | jq .text
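
The two chat formats used in this section can be built programmatically instead of hand-writing control tokens. A sketch with hypothetical helper names; it covers only the single-turn, no-system-prompt form shown in the curl examples.

```rust
/// Single-turn Llama 3 / 3.2 instruct prompt, as in the example above.
fn llama3_prompt(user: &str) -> String {
    format!(
        "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n{user}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
    )
}

/// Single-turn Llama 2 chat prompt, as in the example above.
fn llama2_prompt(user: &str) -> String {
    format!("[INST] {user} [/INST]")
}
```

The returned string goes into the `prompt` field of a GenerateRequest; when embedding it in JSON, the newlines must be escaped as `\n`, as in the curl body.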

Llama 3 8B Instruct

Latest generation, best capability-per-parameter open-weights model. Requires 18 GB RAM or a GPU with 24 GB VRAM (the f16 weights alone are 16 GB). ~25 tok/s on M4 Max.

Prerequisites: HF_TOKEN set, license accepted at huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct

[[models]]
id       = "llama3-8b"
hf_repo  = "meta-llama/Meta-Llama-3-8B-Instruct"
revision = "main"
dtype    = "f16"
device   = "metal"
dim      = 4096
simulated_latency_ms = 0
max_rps  = 0

Use the same Llama 3 chat template as above (<|begin_of_text|>...).

Llama 2 13B Chat

For machines with 32 GB+ RAM. Noticeably better output quality than 7B for multi-step reasoning.

[[models]]
id       = "llama2-13b"
hf_repo  = "meta-llama/Llama-2-13b-chat-hf"
revision = "main"
dtype    = "f16"
device   = "metal"
dim      = 5120
simulated_latency_ms = 0
max_rps  = 0

Configuration reference

`config/default.toml` — full schema

[server]
port               = 3000
# Timeout in ms before returning 504 to the caller.
# Increase for large models: Llama 3 8B generating 512 tokens takes ~20 seconds.
request_timeout_ms = 30000

[batching]
# Flush when this many requests accumulate (biased toward full batches under load).
# For real models, keep this low (2–8) — GPU inference is compute-bound, not I/O-bound.
max_batch_size = 4
# Flush after this many ms regardless of batch fill.
max_wait_ms    = 5

[worker_pool]
# Rayon CPU threads. 0 = num_cpus. Real inference on GPU rarely benefits from > 1 thread.
threads = 0

[kv_cache]
# Pre-allocated f32 slots for the ndarray simulation arena (Tier 3 only).
# 1_048_576 = 4 MB dev default. 67_108_864 = 256 MB for heavy simulation load.
# Real Candle inference ignores this value (Candle manages its own tensor memory).
capacity_f32_slots = 1048576

[kv_quantization]
# TurboQuant KV cache compression. Disabled by default.
# When enabled, the QuantizedKvArena is sized to the same byte budget as the f32 arena
# (capacity_f32_slots × 4 bytes), but stores ~7× more tokens at 4-bit.
enabled = false
bits    = 4     # 4 = zero quality loss (paper result); 3 = 9× compression, minor degradation

# ── [[models]] — one block per model to pre-load at startup ──────────────────
#
# All fields except `id` have defaults; you only need to set what differs
# from the default.
#
# Field descriptions:
#   id                   Unique name; used as model_id in API requests
#   hf_repo              HuggingFace repo string (e.g. "meta-llama/Llama-2-7b-hf")
#                        Empty string = autodiscover open model (Tier 2)
#   revision             Git branch, tag, or commit hash. Default: "main"
#   dtype                Weight precision. "f16" (default) | "bf16" | "f32"
#                        f16: best for Metal/CUDA. f32: better for CPU-only.
#   device               "metal" (Apple GPU, default on macOS)
#                        "cpu"   (any machine, slower)
#                        "cuda:N" (NVIDIA GPU N, Linux/Windows)
#   dim                  Hidden dimension from model config.json (hidden_size).
#                        Unused by real inference; only affects Tier 3 simulation cost.
#   simulated_latency_ms Artificial delay per batch in ms. Set to 0 for real inference.
#   max_rps              Max requests/second for this model. 0 = unlimited.

Environment variable overrides

All config fields can be overridden at runtime with AXON__<SECTION>__<KEY>:

AXON__SERVER__PORT=8080
AXON__SERVER__REQUEST_TIMEOUT_MS=60000
AXON__BATCHING__MAX_BATCH_SIZE=8
AXON__BATCHING__MAX_WAIT_MS=10
AXON__WORKER_POOL__THREADS=4
AXON__KV_CACHE__CAPACITY_F32_SLOTS=67108864

HF_TOKEN is not read from config/default.toml — always pass it as a shell environment variable.
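
The naming rule for overrides is mechanical: upper-case the section and key and join them with double underscores. The sketch below illustrates the rule and an "environment wins over file" lookup; both helpers are hypothetical, and axon's actual config loading may be implemented differently.

```rust
/// Name of the environment variable that overrides `[section] key`.
fn env_override_name(section: &str, key: &str) -> String {
    format!("AXON__{}__{}", section.to_uppercase(), key.to_uppercase())
}

/// Resolve a numeric setting: the environment value wins when it is present
/// and parses as a number; otherwise the config-file value is used.
/// `lookup` is injected so the function can be exercised without touching
/// the real process environment.
fn resolve_u64(
    section: &str,
    key: &str,
    file_value: u64,
    lookup: impl Fn(&str) -> Option<String>,
) -> u64 {
    lookup(&env_override_name(section, key))
        .and_then(|raw| raw.parse().ok())
        .unwrap_or(file_value)
}
```

In a real process the lookup would be `|name| std::env::var(name).ok()`, so `resolve_u64("server", "port", 3000, ...)` returns 8080 when AXON__SERVER__PORT=8080 is exported.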

Per-machine tuning guide

Apple Silicon (Metal)

[server]
request_timeout_ms = 30000   # 30 s is safe for 7B models

[batching]
max_batch_size = 4            # Metal serialises GPU work; small batches reduce latency
max_wait_ms    = 5

NVIDIA GPU (CUDA)

[batching]
max_batch_size = 16           # GPUs parallelise batch dimensions efficiently
max_wait_ms    = 10

CPU only

[server]
request_timeout_ms = 120000  # CPU inference is slow; allow 2 minutes for large models

[batching]
max_batch_size = 1            # One at a time; CPU can't parallelise model layers
max_wait_ms    = 1

API reference

Endpoints

Method	Path	Body	Success
`GET`	`/v1/health`	—	`200 HealthResponse`
`POST`	`/v1/generate`	`GenerateRequest`	`200 InferenceResponse`
`POST`	`/v1/generate/stream`	`GenerateRequest`	`200 SSE stream`
`GET`	`/v1/models`	—	`200 [ModelConfig]`
`POST`	`/v1/models/load`	`ModelConfig`	`201 Created`
`GET`	`/v1/metrics`	—	`200 Prometheus text`

`GenerateRequest`

{
  "id":          "UUID v4 — unique per request",
  "model_id":    "string — must match a loaded model",
  "prompt":      "string — 1 to 32 768 characters",
  "max_tokens":  "integer — 1 to 4096",
  "temperature": "float — 0.0 (greedy/deterministic) to 2.0 (highly random)",
  "stream":      "boolean — true for SSE token stream, false for full response"
}

temperature = 0.0 uses argmax (greedy decoding) — deterministic, best for code or factual answers. temperature = 0.7 is a good default for natural language. temperature > 1.0 produces increasingly random output.
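
To make the effect of temperature concrete, here is a std-only sketch of the two decoding modes described above: argmax for temperature 0.0, and a temperature-scaled softmax otherwise. It illustrates the general technique and is not axon's or Candle's sampling code.

```rust
/// Index of the largest logit (greedy decoding, used when temperature == 0.0).
fn argmax(logits: &[f32]) -> usize {
    let mut best = 0;
    for (i, &v) in logits.iter().enumerate() {
        if v > logits[best] {
            best = i;
        }
    }
    best
}

/// Token probabilities for `temperature` > 0.0. Lower temperatures sharpen
/// the distribution toward the top token; higher ones flatten it.
fn softmax_with_temperature(logits: &[f32], temperature: f32) -> Vec<f32> {
    // Subtract the max logit before exponentiating for numerical stability.
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits
        .iter()
        .map(|&l| ((l - max) / temperature).exp())
        .collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}
```

A sampler would draw the next token from the returned probabilities; with logits `[1.0, 3.0, 2.0]`, the top token is much more likely at temperature 0.5 than at 2.0.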

`InferenceResponse`

{
  "id":               "UUID — echoed from request",
  "model_id":         "string",
  "text":             "string — generated text (empty on engine error)",
  "tokens_generated": "integer — actual tokens produced (may be < max_tokens if EOS hit)",
  "latency_ms":       "integer — wall time from engine start to response"
}

`ModelConfig`

All fields except id have defaults — you only need to set fields that differ:

{
  "id":                    "string — unique model name (required)",
  "hf_repo":               "string — HuggingFace repo (default: empty = autodiscover)",
  "revision":              "string — git ref (default: main)",
  "dtype":                 "string — f16 | bf16 | f32 (default: f16)",
  "device":                "string — metal | cpu | cuda:N (default: metal on macOS, cpu elsewhere)",
  "dim":                   "integer — hidden_size from config.json (default: 0; unused by real engine)",
  "simulated_latency_ms":  "integer — artificial delay, ms (default: 0)",
  "max_rps":               "integer — rate cap, 0 = unlimited (default: 0)"
}

`HealthResponse`

{
  "status":           "ok",
  "models_loaded":    2,
  "requests_total":   1024,
  "requests_active":  3
}

Streaming response format (SSE)

data: Once

data:  upon

data:  a

data:  time

event: done
data:

Each data: line is one decoded token string. The final event: done signals end of generation.
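
A client has to split this stream into frames on blank lines and strip the single space that follows `data:`. The sketch below parses a fully received body under those rules; `SseEvent` and `parse_sse` are illustrative names, and a production client would parse incrementally as bytes arrive.

```rust
#[derive(Debug, PartialEq)]
enum SseEvent {
    Token(String),
    Done,
}

/// Parse an SSE body into token events and the final `done` event.
fn parse_sse(body: &str) -> Vec<SseEvent> {
    let mut events = Vec::new();
    let mut event_name: Option<String> = None; // set by an "event:" line
    let mut data: Option<String> = None; // set by a "data:" line
    // A trailing empty line guarantees the last frame is dispatched.
    for line in body.lines().chain(std::iter::once("")) {
        if line.is_empty() {
            // A blank line ends the current frame.
            match (event_name.take(), data.take()) {
                (Some(name), _) if name == "done" => events.push(SseEvent::Done),
                (None, Some(d)) => events.push(SseEvent::Token(d)),
                _ => {}
            }
        } else if let Some(v) = line.strip_prefix("event:") {
            event_name = Some(v.trim().to_string());
        } else if let Some(v) = line.strip_prefix("data:") {
            // Only one leading space is protocol; any further spaces belong to the token.
            data = Some(v.strip_prefix(' ').unwrap_or(v).to_string());
        }
    }
    events
}
```

Concatenating the `Token` payloads of the example stream yields "Once upon a", since the leading space of each later token is preserved.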

Error responses

Status	Condition
`404 Not Found`	`model_id` not in registry
`422 Unprocessable Entity`	Request validation failed (prompt too long, max_tokens out of range, etc.)
`429 Too Many Requests`	`max_rps` exceeded for this model
`503 Service Unavailable`	Internal dispatch channel full — server overloaded
`504 Gateway Timeout`	Inference did not complete within `request_timeout_ms`
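
The 422 conditions correspond to the field limits listed under GenerateRequest. A sketch of those checks as a plain function; the error enum and `validate` are hypothetical stand-ins for whatever the server's validation derive generates.

```rust
#[derive(Debug, PartialEq)]
enum ValidationError {
    PromptLength,
    MaxTokens,
    Temperature,
}

/// Check the documented limits: prompt 1 to 32,768 characters,
/// max_tokens 1 to 4096, temperature 0.0 to 2.0.
fn validate(prompt: &str, max_tokens: u32, temperature: f32) -> Result<(), ValidationError> {
    // Count characters, not bytes, so multi-byte UTF-8 text is measured fairly.
    let chars = prompt.chars().count();
    if !(1..=32_768).contains(&chars) {
        return Err(ValidationError::PromptLength);
    }
    if !(1..=4096).contains(&max_tokens) {
        return Err(ValidationError::MaxTokens);
    }
    // NaN fails this range check and is rejected as well.
    if !(0.0..=2.0).contains(&temperature) {
        return Err(ValidationError::Temperature);
    }
    Ok(())
}
```

Any `Err` here would map to a 422 response; a request that passes can still receive 404, 429, 503 or 504 later in the pipeline.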

Runtime model loading

In addition to pre-configuring models in config/default.toml, you can load models dynamically at runtime via the API. The engine initialisation (including any weight download) runs on a background thread so the POST returns immediately with 201 Created.

# Load SmolLM2 at runtime (no token needed)
curl -X POST http://localhost:3000/v1/models/load \
  -H 'Content-Type: application/json' \
  -d '{
    "id": "smollm2",
    "hf_repo": "HuggingFaceTB/SmolLM2-1.7B-Instruct",
    "dtype": "f16",
    "device": "metal"
  }'
# 201 Created

# Load a simulation model (no download, instant)
curl -X POST http://localhost:3000/v1/models/load \
  -H 'Content-Type: application/json' \
  -d '{
    "id": "sim-large",
    "dim": 512,
    "simulated_latency_ms": 100
  }'
# 201 Created

# Load a gated model (HF_TOKEN must be in the server's environment)
curl -X POST http://localhost:3000/v1/models/load \
  -H 'Content-Type: application/json' \
  -d '{
    "id": "llama3-3b",
    "hf_repo": "meta-llama/Llama-3.2-3B-Instruct",
    "dtype": "f16",
    "device": "metal"
  }'
# 201 Created (download starts in background)

# List all loaded models
curl -s http://localhost:3000/v1/models | jq '.[].id'

Note: after POST /v1/models/load returns, the engine download may still be in progress in the background. Requests to that model will be served by Tier 3 simulation until the real engine is ready.

Use cases

1. Batch inference for high-concurrency workloads

axon coalesces concurrent requests into batches automatically. A batch flushes when it reaches max_batch_size or max_wait_ms elapses. Under heavy load, full batches dispatch immediately. Under light load, no request waits longer than max_wait_ms (5 ms default).

# Ten services fire concurrently; axon coalesces them into batches of up to max_batch_size
for i in $(seq 1 10); do
  curl -s -X POST http://localhost:3000/v1/generate \
    -H 'Content-Type: application/json' \
    -d "{\"id\":\"$(uuidgen | tr '[:upper:]' '[:lower:]')\",\"model_id\":\"smollm2\",\"prompt\":\"Summarize: $i\",\"max_tokens\":50,\"temperature\":0.5,\"stream\":false}" &
done
wait
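
The flush rule (size reached or deadline elapsed, whichever comes first) can be sketched with a blocking std channel. axon's BatchAssembler is described elsewhere in this README as an async `select!` loop, so the code below is a simplified synchronous analogue with a hypothetical `next_batch` function, not the real implementation.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::{Duration, Instant};

/// Collect one batch: return when `max_batch_size` items have arrived or
/// `max_wait` has elapsed since the first item, whichever comes first.
/// Returns None once the channel is closed and drained.
fn next_batch<T>(rx: &Receiver<T>, max_batch_size: usize, max_wait: Duration) -> Option<Vec<T>> {
    let first = rx.recv().ok()?; // block until there is work
    let deadline = Instant::now() + max_wait;
    let mut batch = vec![first];
    while batch.len() < max_batch_size {
        let remaining = deadline.saturating_duration_since(Instant::now());
        match rx.recv_timeout(remaining) {
            Ok(item) => batch.push(item),
            // Deadline hit or all senders gone: flush what we have.
            Err(RecvTimeoutError::Timeout) | Err(RecvTimeoutError::Disconnected) => break,
        }
    }
    Some(batch)
}
```

With ten requests already queued and max_batch_size = 4, successive calls return batches of 4, 4 and 2, which is why the ten concurrent curl calls above end up in several batches rather than one.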

2. Real-time token streaming for chat UIs

curl -N -X POST http://localhost:3000/v1/generate/stream \
  -H 'Content-Type: application/json' \
  -d '{
    "id": "00000000-0000-0000-0000-000000000010",
    "model_id": "smollm2",
    "prompt": "Write a short poem about Rust:",
    "max_tokens": 60,
    "temperature": 0.8,
    "stream": true
  }'

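Browser integration (EventSource only issues GET requests, so read the POST stream with fetch; `request` is a GenerateRequest object you supply):

const res = await fetch('/v1/generate/stream', {
  method: 'POST', headers: { 'Content-Type': 'application/json' }, body: JSON.stringify(request),
});
const reader = res.body.pipeThrough(new TextDecoderStream()).getReader();
for (let r = await reader.read(); !r.done; r = await reader.read()) console.log(r.value); // raw SSE frames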

3. Multi-model serving

# config/default.toml — serve two models simultaneously
[[models]]
id = "fast"
hf_repo = "HuggingFaceTB/SmolLM2-1.7B-Instruct"
dtype = "f16"
device = "metal"

[[models]]
id = "quality"
hf_repo = "mistralai/Mistral-7B-Instruct-v0.3"
dtype = "f16"
device = "metal"
max_rps = 5    # protect the larger model from overload

4. Prometheus monitoring

curl -s http://localhost:3000/v1/metrics
# axon_requests_total 4200
# axon_requests_active 12
# axon_tokens_generated_total 86400
# axon_batches_dispatched_total 318
# axon_mean_batch_size 13.2
# axon_arena_utilization_ratio 0.32

Use axon_requests_active as an autoscaling signal: scale out when > 50, scale in when < 5.
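
As an illustration of that autoscaling rule, the sketch below reads a gauge out of the Prometheus text output and applies the two thresholds. `parse_gauge`, `ScaleAction` and `scale_action` are hypothetical names; the parser handles only unlabelled `name value` lines like the ones shown above.

```rust
#[derive(Debug, PartialEq)]
enum ScaleAction {
    Out,
    In,
    Hold,
}

/// Find an unlabelled metric by name in Prometheus text format.
fn parse_gauge(metrics: &str, name: &str) -> Option<f64> {
    metrics
        .lines()
        .filter(|l| !l.starts_with('#')) // skip HELP/TYPE comment lines
        .find_map(|l| {
            let mut parts = l.split_whitespace();
            if parts.next()? == name {
                parts.next()?.parse().ok()
            } else {
                None
            }
        })
}

/// Thresholds from the text: scale out above 50, scale in below 5.
fn scale_action(requests_active: f64) -> ScaleAction {
    if requests_active > 50.0 {
        ScaleAction::Out
    } else if requests_active < 5.0 {
        ScaleAction::In
    } else {
        ScaleAction::Hold
    }
}
```

In practice the decision would usually be based on a value averaged over a window rather than a single scrape, to avoid flapping.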

5. Per-model rate limiting

# Load a model capped at 10 requests per second
curl -X POST http://localhost:3000/v1/models/load \
  -H 'Content-Type: application/json' \
  -d '{"id":"expensive","hf_repo":"mistralai/Mistral-7B-Instruct-v0.3","max_rps":10}'
# Further requests within the same second → 429 Too Many Requests
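
One simple way to enforce a requests-per-second cap is a fixed one-second window counter. The README does not say which algorithm axon uses, so the sketch below is an assumption for illustration only; `RateLimiter` is a hypothetical type, and the clock is passed in so the behaviour is easy to test.

```rust
use std::time::{Duration, Instant};

/// Fixed-window limiter: at most `max_rps` admissions per one-second window.
struct RateLimiter {
    max_rps: u32,
    window_start: Instant,
    count: u32,
}

impl RateLimiter {
    fn new(max_rps: u32, now: Instant) -> Self {
        Self { max_rps, window_start: now, count: 0 }
    }

    /// Returns true if the request is admitted, false if it should get a 429.
    fn try_acquire(&mut self, now: Instant) -> bool {
        if self.max_rps == 0 {
            return true; // 0 means unlimited, as in the config reference
        }
        if now.duration_since(self.window_start) >= Duration::from_secs(1) {
            // A new window begins: reset the counter.
            self.window_start = now;
            self.count = 0;
        }
        if self.count < self.max_rps {
            self.count += 1;
            true
        } else {
            false
        }
    }
}
```

A fixed window allows short bursts at window boundaries; a token bucket smooths those out at the cost of slightly more state.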

Workspace layout

axon/
├── Cargo.toml           workspace root — all shared dependency versions declared here
├── config/
│   └── default.toml     server, batching, kv_cache, [[models]] configuration
├── axon-core/           shared domain types (InferenceRequest, InferenceResponse,
│                        ModelConfig), typed errors, lock-free atomic metrics,
│                        JSON and bincode codecs
├── axon-batch/          BatchAssembler — deadline-or-size flush loop
│                        biased select! races size-trigger vs deadline-trigger
├── axon-worker/         WorkerPool — spawn_blocking → Rayon CPU bridge
│                        InferenceEngine — three-tier Candle/simulation engine
├── axon-cache/          KvArena — 64-byte-aligned bump allocator
│                        QuantizedKvArena — byte-level bump allocator for compressed KV slots
│                        ALL unsafe code in the repo is isolated here
├── axon-quant/          TurboQuant algorithm — rotation, codebooks, MSE/Prod encode/decode,
│                        bit packing, slot serialisation (no server deps; pure algorithm crate)
├── axon-macros/         proc macros: #[derive(Validated)], #[inference_route], schema!{}
└── axon-server/         axum 0.7 HTTP server
    ├── src/
    │   ├── main.rs      startup sequence (config → arena → worker pool → routes)
    │   ├── config.rs    AppConfig with [[models]] array support
    │   ├── state.rs     AppState — all shared handles (registry, assembler, pool, arena)
    │   ├── dispatcher.rs dispatch loop: batch → worker pool → response routing
    │   └── routes/
    │       ├── generate.rs   POST /v1/generate (non-streaming, with timeout)
    │       ├── stream.rs     POST /v1/generate/stream (SSE)
    │       ├── models.rs     GET/POST /v1/models and /v1/models/load
    │       ├── health.rs     GET /v1/health
    │       └── metrics.rs    GET /v1/metrics (Prometheus text format)
    └── benches/         Criterion benchmarks: batch throughput, serialization, KV cache

Dependency graph (no cycles):

axon-server → axon-batch  → axon-core
           → axon-worker → axon-cache
                         → axon-quant
                         → axon-core
           → axon-macros
           → axon-cache
           → axon-core

TurboQuant KV cache compression

The problem it solves

Without compression, every generated token costs ~128 KB of KV cache memory (for a 128-dimensional head). A 256 MB arena (capacity_f32_slots = 67108864; the 4 MB dev default fills far sooner) is exhausted after roughly 2,000 tokens (256 MB ÷ 128 KB = 2,048), after which inference degrades to heap allocation. Long-context tasks such as summarising a long document, multi-turn chat, and needle-in-haystack retrieval are effectively out of reach.

TurboQuant compresses the KV cache on the fly, with no training and no accuracy loss at 4-bit, letting the same 256 MB arena hold 14,000+ tokens instead.

How it works (briefly)

TurboQuant is a two-stage vector quantizer from Google Research (ICLR 2026):

Random rotation — multiply the KV vector by a random orthogonal matrix Π. This spreads the energy uniformly across coordinates, so each one looks like an independent Gaussian. That makes scalar quantization provably near-optimal.
Per-coordinate quantization — apply a precomputed Lloyd-Max codebook (1–4 bits per coordinate). The codebook is fixed and tiny; there is nothing to learn or fine-tune.

The optional TurboQuantProd variant adds a third step: a 1-bit QJL sketch of the quantization residual. This makes attention dot-products unbiased in expectation — important for long-context needle-in-haystack tasks where a slightly wrong attention score can cause the model to miss the relevant token.
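
The per-coordinate quantization step can be sketched with a 2-bit codebook. The centroids below are the approximate Lloyd-Max levels for a unit-variance Gaussian; the tables in `codebook.rs` may be scaled or laid out differently, and the rotation step is omitted here, so treat this as an illustration of nearest-centroid lookup by binary search rather than the crate's code.

```rust
/// Approximate 2-bit Lloyd-Max centroids for a unit-variance Gaussian
/// (illustrative values; the crate's real tables are an unknown here).
const CENTROIDS_2BIT: [f32; 4] = [-1.510, -0.4528, 0.4528, 1.510];

/// Map a coordinate to the index of its nearest centroid.
fn quantize(x: f32) -> u8 {
    // Decision boundaries are the midpoints between adjacent centroids.
    let boundaries: Vec<f32> = CENTROIDS_2BIT
        .windows(2)
        .map(|w| (w[0] + w[1]) / 2.0)
        .collect();
    // partition_point binary-searches for the number of boundaries below x.
    boundaries.partition_point(|&b| b < x) as u8
}

/// Reconstruct the coordinate from its 2-bit index.
fn dequantize(index: u8) -> f32 {
    CENTROIDS_2BIT[index as usize]
}
```

Each coordinate of the rotated vector is encoded independently this way, which is why the codebook can stay fixed and small.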

What was implemented — `axon-quant`

axon-quant is a standalone Rust crate (no HTTP or server dependencies) containing the full algorithm:

Module	What it does
`codebook.rs`	Lloyd-Max centroid tables for b=1,2,3,4 bits; `Codebook` trait with binary-search `quantize` / `dequantize`
`rotation.rs`	`TurboQuantState` — holds the rotation matrix Π (built once at engine init via QR factorisation); `rotate` / `unrotate`; `next_bits()` for 3.5-bit alternation
`mse.rs`	`mse_encode` / `mse_decode` — MSE-optimal path; returns `MseEncoded { packed_indices, bits, dim }`
`prod.rs`	`prod_encode` / `prod_decode` — inner-product-optimal path; stores MSE part + QJL sketch + residual norm; attention scores are unbiased
`pack.rs`	`pack_bits` / `unpack_bits` — sub-byte bit packing for arbitrary bit-widths 1–8
`slot.rs`	Slot header serialisation — method byte, bits, dim, residual norm, packed indices, optional QJL bytes
`arena.rs`	`QuantizedKvArena` — same bump-allocator pattern as `KvArena` but over raw `u8`; `allocate(bytes)` → `QuantizedKvSlot`
`error.rs`	`QuantError` — `DimMismatch`, `UnsupportedBits`, `PackingError`
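
Sub-byte packing is what turns b-bit indices into the `ceil(dim × bits / 8)` bytes stored per slot. The sketch below shows one straightforward layout (least-significant bit first); the function names match the table, but the signatures and bit order in `pack.rs` are assumptions.

```rust
/// Pack `values` (each must fit in `bits` bits) into bytes, LSB first.
fn pack_bits(values: &[u8], bits: u32) -> Vec<u8> {
    assert!((1..=8).contains(&bits));
    let bits = bits as usize;
    let mut out = vec![0u8; (values.len() * bits + 7) / 8];
    for (i, &v) in values.iter().enumerate() {
        for b in 0..bits {
            if (v >> b) & 1 == 1 {
                let pos = i * bits + b; // absolute bit position in the output
                out[pos / 8] |= 1 << (pos % 8);
            }
        }
    }
    out
}

/// Inverse of `pack_bits`: recover `count` values of `bits` bits each.
fn unpack_bits(packed: &[u8], bits: u32, count: usize) -> Vec<u8> {
    let bits = bits as usize;
    (0..count)
        .map(|i| {
            let mut v = 0u8;
            for b in 0..bits {
                let pos = i * bits + b;
                if (packed[pos / 8] >> (pos % 8)) & 1 == 1 {
                    v |= 1 << b;
                }
            }
            v
        })
        .collect()
}
```

For dim = 128 this yields 64 bytes at 4 bits and 48 bytes at 3 bits. A bit-at-a-time loop keeps the sketch short; a tuned implementation would move whole words.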

Compression ratios and context window gains

Mode	Bits/coord	KV compression	256 MB arena context	Quality
None (f32)	32	1×	~2,000 tokens	Baseline
TurboQuantMse b=4	4	7.1×	~14,000 tokens	Zero loss (paper Table 1)
TurboQuantMse b=3.5 (alternating)	3.5	8×	~16,000 tokens	Zero loss (paper result)
TurboQuantMse b=3	3	9.1×	~18,000 tokens	Minor degradation on long tasks
TurboQuantProd b=4	~4	2.8×	~5,600 tokens	Unbiased attention — use for NIAH tasks

The 3.5-bit target is achieved by TurboQuantState::next_bits() alternating between returning 3 and 4 via an atomic counter. Exact per-vector alternation order is not load-bearing for quality.
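
An atomic counter is enough to alternate bit-widths without a lock. The sketch below shows the idea with a hypothetical `BitAlternator` type; whether the real `next_bits()` starts at 3 or 4, or which memory ordering it uses, is not stated in this README.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Alternates between 3 and 4 bits so the long-run average is 3.5.
struct BitAlternator {
    counter: AtomicUsize,
}

impl BitAlternator {
    fn new() -> Self {
        Self { counter: AtomicUsize::new(0) }
    }

    /// Returns 3 on even-numbered calls and 4 on odd-numbered calls.
    /// Relaxed ordering suffices: only the counter value matters, not
    /// ordering relative to other memory operations.
    fn next_bits(&self) -> u8 {
        if self.counter.fetch_add(1, Ordering::Relaxed) % 2 == 0 {
            3
        } else {
            4
        }
    }
}
```

Under concurrent callers the per-thread sequence is not strictly alternating, but the overall split stays even, which is consistent with the note that exact order is not load-bearing.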

Enabling TurboQuant

The feature is disabled by default. To enable it, add to config/default.toml:

[kv_quantization]
enabled = true
bits    = 4       # 4 = zero quality loss; alternate 3/4 by setting to 3 (see next_bits)

Per-model override inside a [[models]] block:

[[models]]
id      = "llama3-8b"
hf_repo = "meta-llama/Meta-Llama-3-8B-Instruct"
dtype   = "f16"
device  = "metal"
dim     = 4096

[models.kv_quant]
method         = "turbo_quant_mse"   # or "turbo_quant_prod" for unbiased attention
bits           = 4
rotation_seed  = 0xDEADBEEFCAFE      # must fit a signed 64-bit TOML integer; same seed → same rotation matrix across restarts

When enabled, three new Prometheus metrics appear:

axon_kv_quant_encodes_total           # cumulative encode operations
axon_kv_quant_encode_latency_us_total # cumulative µs spent encoding
axon_quant_arena_utilization_ratio    # quant arena fill fraction (0.0–1.0)

Slot memory layout

Each compressed KV vector is stored in a QuantizedKvSlot with this binary layout:

Offset  Size   Field
0       1 B    method  (0 = MSE, 1 = Prod)
1       1 B    bits
2       2 B    dim     (u16 little-endian)
4       4 B    residual_norm (f32; 0.0 for MSE)
8       N B    packed_indices    N = ceil(dim × bits / 8)
8+N     dim B  qjl_sketch        one i8 per coord; only present for Prod variant

Example slot sizes at d=128 (one KV head):

Variant	Bytes/slot	vs f16 (256 B)
MSE b=4	72 B	3.6× smaller
MSE b=3	56 B	4.6× smaller
Prod b=4	184 B	1.4× smaller + unbiased
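
The slot sizes follow from the layout: an 8-byte header, the packed indices, and for Prod one sketch byte per coordinate. The sketch below reproduces the table; note that the 184 B Prod figure only works out if the MSE part of a b-bit Prod slot uses b − 1 bits, which is an inference from the numbers rather than something this README states. All function names are illustrative.

```rust
/// Header: method (1) + bits (1) + dim (2) + residual_norm (4).
const HEADER_BYTES: usize = 8;

/// Bytes needed for `dim` indices of `bits` bits each: ceil(dim * bits / 8).
fn packed_len(dim: usize, bits: usize) -> usize {
    (dim * bits + 7) / 8
}

/// Size of an MSE slot.
fn mse_slot_bytes(dim: usize, bits: usize) -> usize {
    HEADER_BYTES + packed_len(dim, bits)
}

/// Size of a Prod slot. ASSUMPTION: the MSE part uses `bits - 1` bits and
/// the QJL sketch takes one byte per coordinate.
fn prod_slot_bytes(dim: usize, bits: usize) -> usize {
    HEADER_BYTES + packed_len(dim, bits - 1) + dim
}

/// Compression ratio of a slot against an uncompressed vector.
fn ratio(dim: usize, bytes_per_coord: usize, slot_bytes: usize) -> f64 {
    (dim * bytes_per_coord) as f64 / slot_bytes as f64
}
```

At d = 128 this gives 72, 56 and 184 bytes. Against a 256 B f16 vector the MSE b=4 ratio is about 3.6×; against a 512 B f32 vector it is about 7.1×, which matches the arena-level figure quoted earlier.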

Running the quantization benchmarks

# Encode/decode latency and arena throughput
cargo bench -p axon-server quant_throughput

# Unit tests for the algorithm crate
cargo test -p axon-quant

# Round-trip quality gate (debug builds only — checks ‖x - decode(encode(x))‖² ≤ 1.1 × theoretical bound)
cargo test -p axon-quant -- --nocapture

Inference engine internals

Three-tier fallback chain

axon-worker/src/engine.rs is the only file that was modified to add real inference. Every other layer (batching, pool, dispatcher, all routes) is unchanged.

InferenceEngine::new(config, kv_arena) → always returns Self (infallible)
  │
  ├─ try_build_candle(config)
  │    │
  │    ├─ if config.hf_repo is non-empty:
  │    │    download config.json, tokenizer.json, safetensors shards via hf-hub
  │    │    parse LlamaConfig → Config (must be Llama-compatible architecture)
  │    │    load model weights via memory-mapped safetensors
  │    │    → Ok(CandleState) → EngineInner::Candle
  │    │
  │    ├─ if config.hf_repo is empty (autodiscovery):
  │    │    try HuggingFaceTB/SmolLM2-1.7B-Instruct
  │    │    try mistralai/Mistral-7B-Instruct-v0.3
  │    │    first reachable model wins
  │    │    → Ok(CandleState) → EngineInner::Candle
  │    │
  │    └─ any error (no network, no token, non-Llama arch, OOM, missing GPU)
  │         → Err(e)
  │
  └─ on Err: log WARN, build ndarray weight matrix
       → EngineInner::Simulation { weight_matrix }

compute(&self, req) → InferenceResponse (infallible)
  ├─ Candle path:  tokenize → prefill KV cache → autoregressive generation → decode
  └─ Simulation:   seed hidden state from prompt length → dim² matmul × max_tokens
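The shape of this chain, a fallible builder wrapped by an infallible constructor, can be sketched without Candle. Everything below is a simplified stand-in: the `reachable` callback replaces the real download and load steps, and the variant payloads are placeholders rather than the actual `CandleState` or weight matrix.

```rust
#[derive(Debug)]
pub enum EngineInner {
    Candle { repo: String },
    Simulation { dim: usize },
}

/// Stand-in for `try_build_candle`: picks the configured repo, or tries the
/// autodiscovery candidates in order when `hf_repo` is empty.
fn try_build_candle(
    hf_repo: &str,
    reachable: &dyn Fn(&str) -> bool,
) -> Result<EngineInner, String> {
    let candidates: Vec<&str> = if hf_repo.is_empty() {
        vec![
            "HuggingFaceTB/SmolLM2-1.7B-Instruct",
            "mistralai/Mistral-7B-Instruct-v0.3",
        ]
    } else {
        vec![hf_repo]
    };
    candidates
        .into_iter()
        .find(|&repo| reachable(repo)) // first reachable model wins
        .map(|repo| EngineInner::Candle { repo: repo.to_string() })
        .ok_or_else(|| "no candidate model could be loaded".to_string())
}

/// Infallible constructor: any error degrades to the simulation tier.
pub fn build_engine(
    hf_repo: &str,
    dim: usize,
    reachable: &dyn Fn(&str) -> bool,
) -> EngineInner {
    match try_build_candle(hf_repo, reachable) {
        Ok(engine) => engine,
        Err(e) => {
            eprintln!("WARN: candle unavailable ({e}); falling back to simulation");
            EngineInner::Simulation { dim }
        }
    }
}
```

Keeping the error handling in one `match` is what lets every caller treat engine construction as infallible.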

`EngineInner` enum (private to `engine.rs`)

enum EngineInner {
    Candle(CandleState),
    Simulation { weight_matrix: Array2<f32> },
}

struct CandleState {
    device: candle_core::Device,          // Metal / CUDA / CPU
    dtype: candle_core::DType,            // F16 / BF16 / F32
    tokenizer: tokenizers::Tokenizer,     // HF fast tokenizer
    eos_token_id: u32,                    // token that stops generation
    model: Mutex<Llama>,                  // model weights — Mutex serialises concurrent requests
    llama_config: llama::Config,          // kept to create per-request Cache
}

Supported model architectures

The current engine uses candle_transformers::models::llama::Llama for real inference. This covers:

Architecture	Example models	Support
Llama 2	`meta-llama/Llama-2-*-hf`	Full
Llama 3 / 3.1 / 3.2	`meta-llama/Meta-Llama-3-*`, `meta-llama/Llama-3.*`	Full
Mistral	`mistralai/Mistral-7B-*`	Full (Llama-compatible)
SmolLM2	`HuggingFaceTB/SmolLM2-*`	Full (Llama-based)
Phi-3	`microsoft/Phi-3-*`	Falls back to simulation (needs `candle_transformers::models::phi3` code path)
Gemma	`google/gemma-*`	Falls back to simulation (needs `candle_transformers::models::gemma` code path)
Falcon	`tiiuae/falcon-*`	Falls back to simulation

Extending axon — adding new architectures

To add Phi-3 support, edit axon-worker/src/engine.rs:

Add a new variant to EngineInner:

Phi3(Phi3State),  // similar to CandleState but wrapping candle_transformers::models::phi3::Model

In try_build_candle, read config.json's "model_type" field and dispatch:

let model_type = hf_config["model_type"].as_str().unwrap_or("");
match model_type {
    "llama" | "mistral" => build_llama_engine(...)?,
    "phi3" => build_phi3_engine(...)?,
    _ => anyhow::bail!("unsupported model_type: {model_type}"),
}

Implement compute_phi3() analogously to compute_candle().

The WorkerPool, BatchAssembler, dispatcher, and all HTTP routes remain unchanged.

Architecture decisions

ADR	Decision	Reason
001	Tokio multi-threaded runtime	I/O-bound HTTP and CPU-bound inference require separate thread pools
002	axum 0.7	Extractor model, first-class SSE, Tower middleware composability
003	Three-tier inference engine	Graceful degradation: always starts; Candle when available, ndarray otherwise
004	Dynamic batching (deadline-or-size)	`biased select!` coalesces concurrent requests; prevents starvation under load
005	`tokio::sync::RwLock<ModelRegistry>`	Read-heavy; async-aware, so a guard held across `.await` does not block a runtime thread
006	`spawn_blocking` + Rayon	Two thread pools; Tokio threads never blocked by CPU/GPU-bound inference
007	SSE + unbounded per-request channel	Worker progress not gated on client read speed
008	KV cache memory arena (`unsafe`)	Eliminates per-allocation overhead; isolated in one crate
009	`Mutex<Llama>` for concurrent inference	Metal serialises GPU work anyway; Mutex is simpler than RwLock with `&self` forward
010	`LlamaConfig` → `Config` two-step	HF JSON format (`LlamaConfig`) is serde-able; internal `Config` is what `Llama::load` and `Cache::new` take
011	`hf-hub` for weight management	Handles auth, sharding, caching, retries; no manual download logic
012	Memory-mapped safetensors (`unsafe`)	Avoids loading entire model into heap; OS pages in only accessed weights
013	`anyhow` for engine errors → simulation fallback	Library errors (candle) are opaque; we only need the message string to log and degrade
014	TurboQuant as a standalone `axon-quant` crate	Pure algorithm with no server deps; independently testable; can be benchmarked or reused without pulling in axum/Candle
015	One `TurboQuantState` per engine (not per layer)	Rotation matrix is 128² × 4 B = 65,536 B (64 KiB); shared via `Arc` across Rayon threads, so cloning copies a pointer rather than the matrix
016	Disabled by default (`kv_quantization.enabled = false`)	Allows users to opt in after verifying the baseline; avoids surprising behaviour changes on existing deployments
017	`QuantizedKvArena` separate from `KvArena`	Keeps the existing unsafe bump allocator unchanged; separation of concerns between f32 slots and raw byte slots
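ADR 004's deadline-or-size rule can be illustrated with the standard library alone. The real assembler uses Tokio's `select!` with `biased`; the sketch below swaps in a blocking `std::sync::mpsc` channel, so it shows the policy rather than the implementation. `next_batch` is a hypothetical name, and `max_batch` is assumed to be at least 1.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::{Duration, Instant};

/// Returns a batch once `max_batch` items have arrived or `max_wait` has
/// elapsed since the first item, whichever comes first. An empty batch
/// means the channel is closed.
pub fn next_batch<T>(rx: &Receiver<T>, max_batch: usize, max_wait: Duration) -> Vec<T> {
    let mut batch = Vec::with_capacity(max_batch);
    // Block until the first request; the deadline starts only then, so an
    // idle server does not spin on empty batches.
    match rx.recv() {
        Ok(item) => batch.push(item),
        Err(_) => return batch,
    }
    let deadline = Instant::now() + max_wait;
    while batch.len() < max_batch {
        let remaining = deadline.saturating_duration_since(Instant::now());
        match rx.recv_timeout(remaining) {
            Ok(item) => batch.push(item),
            // Deadline hit or all senders dropped: ship what we have.
            Err(RecvTimeoutError::Timeout) | Err(RecvTimeoutError::Disconnected) => break,
        }
    }
    batch
}
```

Under load the size limit closes batches quickly. Under light traffic the deadline bounds how long a lone request waits.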

Key concepts by location

Concept	Location
Three-tier engine fallback	`axon-worker/src/engine.rs:try_build_candle()`
Candle LLM generation loop	`axon-worker/src/engine.rs:run_generation()`
Open-model autodiscovery	`axon-worker/src/engine.rs:autodiscover_repo()`
Safetensors shard loading	`axon-worker/src/engine.rs:download_weight_shards()`
Device selection (Metal / CUDA / CPU)	`axon-worker/src/engine.rs:build_device()`
`RwLock<T>` — read-heavy concurrency	`axon-server/src/routes/models.rs`
`AtomicU64` — lock-free metrics	`axon-core/src/metrics.rs`
`select!` with `biased` — deadline batching	`axon-batch/src/assembler.rs`
`tokio_stream` + axum SSE	`axon-server/src/routes/stream.rs`
`spawn_blocking` → Rayon bridge	`axon-worker/src/pool.rs`
`par_iter` CPU parallelism	`axon-worker/src/pool.rs`
`unsafe` bump allocator + `PhantomData` lifetime	`axon-cache/src/arena.rs`
`#[derive(Validated)]` proc macro	`axon-macros/src/lib.rs`
Graceful shutdown	`axon-server/src/main.rs`
Config hierarchy (TOML + env)	`axon-server/src/config.rs:AppConfig::load()`
`[[models]]` array deserialization	`axon-server/src/config.rs:AppConfig.models`
TurboQuant rotation matrix init	`axon-quant/src/rotation.rs:TurboQuantState::new()`
TurboQuant MSE encode/decode	`axon-quant/src/mse.rs:mse_encode()` / `mse_decode()`
TurboQuant Prod (unbiased attention)	`axon-quant/src/prod.rs:prod_encode()` / `prod_decode()`
Lloyd-Max codebooks (b=1..4)	`axon-quant/src/codebook.rs`
Sub-byte bit packing	`axon-quant/src/pack.rs:pack_bits()` / `unpack_bits()`
Quantized KV slot layout	`axon-quant/src/slot.rs`
Quantized bump allocator	`axon-quant/src/arena.rs:QuantizedKvArena`

Running tests and benchmarks

# Unit + integration tests across the workspace
cargo test --workspace

# Run with verbose output
cargo test --workspace -- --nocapture

# Criterion benchmarks (HTML report written to target/criterion/)
cargo bench -p axon-server

# Verify unsafe surface area (workspace crates should report unsafe only in axon-cache, plus the single mmap call in axon-worker described under Safety)
cargo install cargo-geiger
cargo geiger --workspace

# Check for unused dependencies
cargo install cargo-udeps
cargo +nightly udeps --workspace

Docker

The current Dockerfile uses gcr.io/distroless/cc-debian12 as the runtime image, which works for the simulation tier. Metal is not available inside Docker: containers on macOS run in a Linux VM with no access to the Apple GPU, so Metal inference has to run natively on the host.

For Linux CPU inference:

docker build -t axon .

docker run -p 3000:3000 \
  -e AXON__BATCHING__MAX_BATCH_SIZE=4 \
  -e AXON__SERVER__REQUEST_TIMEOUT_MS=120000 \
  -e HF_TOKEN=hf_xxx \
  -v $HOME/.cache/huggingface:/root/.cache/huggingface \
  axon

The -v mount reuses the host's HuggingFace cache so models aren't re-downloaded on every docker run.

For NVIDIA GPU (CUDA):

The runtime image needs CUDA libraries. Modify Dockerfile:

# Builder stage stays the same; change the runtime base:
FROM nvcr.io/nvidia/cuda:12.4.0-runtime-ubuntu22.04

Also change candle-core in Cargo.toml from features = ["metal"] to features = ["cuda"].

docker run --gpus all -p 3000:3000 -e HF_TOKEN=hf_xxx axon

AWS deployment

Internet → ALB (:443) → ECS Fargate (axon-server)
                       → CloudWatch Logs (tracing JSON)
                       → EFS mount (~/.cache/huggingface — shared across tasks)

Autoscale: axon_requests_active > 50 for 2 min → scale out
           axon_requests_active < 5  for 5 min → scale in

Fargate does not offer GPUs, so real GPU inference on AWS needs ECS on EC2 (or plain EC2) with GPU instances:

Instance	GPU	VRAM	Recommended model
`g4dn.xlarge`	T4	16 GB	SmolLM2, Llama 3.2 3B
`g5.xlarge`	A10G	24 GB	Llama 2 7B, Llama 3 8B
`p3.2xlarge`	V100	16 GB	SmolLM2, Llama 3.2 3B
`p4d.24xlarge`	8x A100	8×40 GB	Multi-model serving

Recommended task definition env vars:

AXON__SERVER__PORT=3000
AXON__SERVER__REQUEST_TIMEOUT_MS=30000
AXON__BATCHING__MAX_BATCH_SIZE=8
AXON__BATCHING__MAX_WAIT_MS=10
AXON__KV_CACHE__CAPACITY_F32_SLOTS=67108864
RUST_LOG=axon_server=info,axon_worker=info
HF_TOKEN=<from AWS Secrets Manager>
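As a sanity check on the arena size, the slot count converts to bytes as follows, assuming each slot is a single 4-byte `f32` as the variable name suggests. `arena_bytes` is a hypothetical helper.

```rust
/// Bytes reserved by a KV arena of `f32_slots` slots, one f32 per slot.
pub const fn arena_bytes(f32_slots: u64) -> u64 {
    f32_slots * std::mem::size_of::<f32>() as u64
}
```

For `AXON__KV_CACHE__CAPACITY_F32_SLOTS=67108864` this gives 268,435,456 bytes, or 256 MiB, which should fit within the task's memory limit alongside the model weights.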

Safety

Hand-written unsafe code is isolated to axon-cache/src/arena.rs, and every unsafe block there has a // SAFETY: comment. The rest of the workspace is safe Rust with one exception: the Candle integration in axon-worker/src/engine.rs calls VarBuilder::from_mmaped_safetensors, which is an unsafe fn because the memory-mapped weight files must not be modified or truncated while the mapping is alive.

Arena invariants:

buf is non-null, 64-byte aligned, valid for capacity f32 reads/writes
AtomicUsize bump allocation guarantees non-overlapping ranges across threads
unsafe impl Send + Sync sound: shared state is only the atomic bump pointer
Drop deallocates with the exact layout used at construction
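The non-overlap invariant can be demonstrated in safe Rust by tracking offsets only. This is a sketch of the bump-pointer logic under stated assumptions, not the arena itself: `BumpOffsets` is a hypothetical type that hands out start offsets instead of raw pointers, and `reset` takes `&mut self` to model the requirement that no reservation is still in use when the pointer rewinds.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

pub struct BumpOffsets {
    next: AtomicUsize,
    capacity: usize,
}

impl BumpOffsets {
    pub fn new(capacity: usize) -> Self {
        Self { next: AtomicUsize::new(0), capacity }
    }

    /// Reserves `len` slots and returns the start offset, or None when the
    /// arena is full. The atomic update guarantees that two threads never
    /// receive overlapping [start, start + len) ranges.
    pub fn reserve(&self, len: usize) -> Option<usize> {
        self.next
            .fetch_update(Ordering::Relaxed, Ordering::Relaxed, |cur| {
                let end = cur.checked_add(len)?;
                if end <= self.capacity { Some(end) } else { None }
            })
            .ok() // Ok carries the previous value, which is the start offset
    }

    /// Rewinds the bump pointer. Exclusive access stands in for "no
    /// outstanding reservations".
    pub fn reset(&mut self) {
        *self.next.get_mut() = 0;
    }
}
```

A failed reservation leaves the pointer untouched, so a large request that does not fit cannot block smaller ones that still do.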

cargo install cargo-geiger
cargo geiger --workspace
# expected: unsafe in axon-cache, plus the single mmap call in axon-worker; no other workspace crate

Known limitations

1. Streaming delivers all tokens after full generation (not true per-token)

The dispatcher loop receives the complete InferenceResponse from the worker pool, then simulates per-token delivery by splitting response.text. Real token-by-token streaming (first token arriving in ~100 ms) requires passing the token_tx sender into the engine's generation loop. This is a dispatcher.rs + pool.rs follow-on change and does not affect any public API.
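The follow-on change amounts to sending each token as soon as it is decoded. A minimal sketch, assuming a closure in place of the real decode step and a `std::sync::mpsc` sender in place of the per-request Tokio channel, is shown below. `generate_streaming` and `next_token` are hypothetical names.

```rust
use std::sync::mpsc::Sender;

/// Hypothetical generation loop. `next_token` stands in for one decode step
/// and returns None at end-of-sequence.
pub fn generate_streaming(
    mut next_token: impl FnMut() -> Option<String>,
    max_tokens: usize,
    token_tx: &Sender<String>,
) -> String {
    let mut text = String::new();
    for _ in 0..max_tokens {
        let Some(tok) = next_token() else { break };
        text.push_str(&tok);
        // Forward the token immediately instead of after the loop. A send
        // error means the receiver is gone, so stop generating early.
        if token_tx.send(tok).is_err() {
            break;
        }
    }
    text
}
```

The function still returns the full text, so a non-streaming caller could keep consuming a complete response while the SSE route reads from the channel.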

2. Rate limiter is a lifetime counter, not a per-second window

rate_counters is a DashMap<ModelId, AtomicU64> that monotonically increments. Once a model has served max_rps total requests since startup, it is permanently rate-limited. A production fix replaces the lifetime counter with a per-second sliding window or a token bucket, or at minimum resets the counters from a background task once per second.
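One possible shape for the token-bucket option is sketched below. It is not axon's code: `TokenBucket` is a hypothetical type, the burst size is assumed equal to `max_rps`, and time is passed in explicitly so the behaviour is easy to test. In the server, each model's bucket would need to sit behind a lock or be rewritten with atomics.

```rust
use std::time::Instant;

pub struct TokenBucket {
    capacity: f64,
    tokens: f64,
    refill_per_sec: f64,
    last: Instant,
}

impl TokenBucket {
    /// Starts full, allowing a burst of up to `max_rps` requests.
    pub fn new(max_rps: u32, now: Instant) -> Self {
        let cap = max_rps as f64;
        Self { capacity: cap, tokens: cap, refill_per_sec: cap, last: now }
    }

    /// Returns true if a request arriving at `now` is admitted.
    pub fn try_acquire(&mut self, now: Instant) -> bool {
        // Refill lazily from elapsed time; no background task is needed.
        let elapsed = now.saturating_duration_since(self.last).as_secs_f64();
        self.tokens = (self.tokens + elapsed * self.refill_per_sec).min(self.capacity);
        self.last = now;
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}
```

Unlike the lifetime counter, a rejected model recovers as soon as enough time has passed to refill one token.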

3. KV arena not reset between batches (simulation only)

The KvArena bump pointer only grows. After enough requests the arena fills and engine.compute() falls back to heap allocation silently. Fix: call kv_arena.reset() in dispatcher.rs after each batch. This affects Tier 3 simulation only — Candle manages its own memory.

4. Phi-3, Gemma, Falcon architectures fall back to simulation

The current Candle engine path handles Llama-family models only (config.json's model_type: llama | mistral). Other architectures fail to deserialize LlamaConfig and trigger Tier 3 fallback with a WARN log. Extend engine.rs with additional candle_transformers model modules to support them.

5. No OpenAI-compatible API

Request and response schemas are axon-native. To use axon as a drop-in replacement for the OpenAI chat completions API, wrap the routes in an adapter at the axum router level — the batching pipeline below the routes is unchanged.
