Security gateway for Large Language Models. Routes requests across 15 providers with automatic fallback, cost-aware smart routing, and a 6-layer defense pipeline. Drop-in replacement for the OpenAI API.
- One endpoint, 15 providers -- Send OpenAI-compatible requests and let the proxy handle translation, failover, and cost optimization across OpenAI, Anthropic, Google, Azure, Ollama, Groq, Together, Mistral, DeepSeek, xAI, Perplexity, Fireworks, OpenRouter, and SambaNova.
- Security by default -- Byte-level ASGI firewall, injection scoring, PII masking, cross-session threat intelligence, immutable audit ledger, HMAC response signing. Fail-closed auth middleware denies all admin paths unless explicitly whitelisted.
- Cost control -- Per-model pricing for 30+ models, daily budget limits with automatic downgrade to local models, per-session spend tracking, cost-efficiency analytics.
- Extensible -- 18 marketplace plugins (budget guard, A/B routing, schema enforcement, canary detection, ...) with a ring-based pipeline. Write your own in Python or WASM.
docker run --rm -p 8090:8090 \
-e LLM_PROXY_API_KEYS=sk-proxy-test \
ghcr.io/fabriziosalmi/llmproxy:latest

That's it. Open http://localhost:8090/ui and the first-run wizard walks you through adding a provider (OpenAI, Anthropic, Ollama, etc.). The proxy boots in onboarding mode with zero endpoints — inference returns 503 until you add one.
Drop-in OpenAI replacement, once an endpoint is configured:
curl http://localhost:8090/v1/chat/completions \
-H "Authorization: Bearer sk-proxy-test" \
-H "Content-Type: application/json" \
-d '{"model": "gpt-4o", "messages": [{"role": "user", "content": "Hello"}]}'For persistent state (budget tracking, audit log, registered endpoints) across container restarts, mount a volume and pin the version:
docker run -d --name llmproxy -p 8090:8090 \
-e LLM_PROXY_API_KEYS=sk-proxy-test \
-e OPENAI_API_KEY=$OPENAI_API_KEY \
-v llmproxy-data:/app/data \
ghcr.io/fabriziosalmi/llmproxy:1.21.52

Each release publishes :latest, the full semver (:X.Y.Z), the minor (:X.Y), plus a per-commit short SHA tag for reproducible deploys.
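If you already use the official OpenAI Python SDK, pointing it at the proxy is a one-line change. A minimal sketch: the key must be one of the values in LLM_PROXY_API_KEYS, and the model must be served by an endpoint you have configured.

```python
from openai import OpenAI

# Point the standard OpenAI client at the proxy instead of api.openai.com.
# The api_key is a proxy key (LLM_PROXY_API_KEYS), not an upstream provider key.
client = OpenAI(base_url="http://localhost:8090/v1", api_key="sk-proxy-test")

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)
```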
git clone https://github.com/fabriziosalmi/llmproxy && cd llmproxy
./install.sh   # Interactive — checks Python/Docker, creates .env, starts the proxy

The installer detects your platform, verifies prerequisites, generates a proxy auth key, and boots the service via Docker Compose v2 (preferred) or a local Python 3.12+ virtualenv. Use ./install.sh --docker, ./install.sh --local, or ./install.sh --check for non-interactive flows. Choose this path if you want to modify plugins, contribute, or run without an internet connection to GHCR.
- Docker path: Docker Engine + the Docker Compose v2 plugin (`docker compose`). The legacy `docker-compose` v1 (Debian/Ubuntu apt) is NOT supported — it's incompatible with modern urllib3. On Debian/Ubuntu: `sudo apt install docker-compose-plugin`.
- Local path: Python 3.12+ (Ubuntu 22.04 only ships 3.10 — install from the deadsnakes PPA or use the Docker path).
Declare LM Studio, vLLM, TGI, Ollama, or any OpenAI-compatible endpoint directly in .env — no YAML editing required:
LLM_PROXY_ENDPOINT_LMSTUDIO_URL=http://192.168.1.50:1234/v1
LLM_PROXY_ENDPOINT_LMSTUDIO_MODELS=llama-3.3-70b,qwen-2.5-coder-32b
# LLM_PROXY_ENDPOINT_LMSTUDIO_KEY=        # leave blank for no-auth local servers

The byte-level ASGI firewall is on by default. Disable via env or config when fronting the proxy with another WAF or debugging a false positive:
LLM_PROXY_FIREWALL_ENABLED=0 # in .env, or
# config.yaml:
# security:
#   firewall:
#     enabled: false

The admin UI reflects the live WAF state and the reason it's off. The switch is env/config-only by design — a one-click UI toggle would make L1 injection defense trivially removable.
Client Request
|
+-- RateLimitMiddleware Token bucket per IP/key (O(1) LRU, 50k max)
+-- ByteLevelFirewall 178 signatures, 8 encoding layers, iterative chain decoding
+-- CORSMiddleware
+-- Global Auth (fail-closed) Deny-all for /api/v1/*, /admin/*, /metrics
+-- SecurityShield Injection scoring, PII masking, trajectory analysis
| +-- ThreatLedger Cross-session IP + key aggregation
| +-- SemanticAnalyzer 157 patterns, 20+ languages, leetspeak normalization
|
+-- Ring 1: INGRESS Auth, Zero-Trust, rate limiting
+-- Ring 2: PRE-FLIGHT PII masking, budget guard, cache, complexity scoring
+-- Ring 3: ROUTING Model selection, load balancing, A/B routing
+-- Upstream Provider Automatic format translation + fallback chain
+-- Ring 4: POST-FLIGHT Response sanitization, quality gate, schema enforcement
+-- Ring 5: BACKGROUND Telemetry, export, shadow traffic
|
Client Response
OpenAI, Anthropic, Google (Gemini), Azure OpenAI, Ollama, Groq, Together, Mistral, DeepSeek, xAI (Grok), Perplexity, Fireworks, OpenRouter, SambaNova. Each with a dedicated adapter that handles request/response format translation, streaming, and error mapping.
Endpoints are scored using an EMA-weighted formula: score = (success^2 / latency) * cost_factor^w. The proxy automatically routes to the best-scoring endpoint, with configurable fallback chains (e.g., GPT-4o fails -> Claude Sonnet -> Gemini Pro). When daily budget is exhausted, requests are auto-downgraded to a local model (Ollama).
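As an illustration of how that formula plays out (a sketch only; the variable names and sample numbers are assumptions, not the proxy's implementation):

```python
def endpoint_score(success: float, latency_ms: float, cost_factor: float, w: float = 1.0) -> float:
    """Illustrative EMA-weighted score: (success^2 / latency) * cost_factor^w."""
    return (success ** 2 / max(latency_ms, 1e-6)) * (cost_factor ** w)

# Higher is better: favor reliable, fast, cheap endpoints.
candidates = {
    "openai/gpt-4o":    endpoint_score(success=0.99, latency_ms=850.0, cost_factor=0.4),
    "anthropic/claude": endpoint_score(success=0.98, latency_ms=900.0, cost_factor=0.5),
    "ollama/local":     endpoint_score(success=0.95, latency_ms=1500.0, cost_factor=1.0),
}
best = max(candidates, key=candidates.get)  # routed first; the rest form the fallback chain
```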
| Layer | What it does |
|---|---|
| ASGI Firewall | 178 injection signatures (162 banned + 16 ROT13) across 8 encoding layers (URL, Unicode, Base64, hex, ROT13) with iterative chain decoding. Loaded from data/signatures.yaml (hot-reloadable). |
| SecurityShield | Threat scoring (8 regex patterns, threshold 0.7), multi-turn trajectory detection, cross-session ThreatLedger. |
| Semantic Analyzer | 157-pattern trigram Jaccard corpus across 20+ languages. Leetspeak normalization, Cyrillic/Greek confusable mapping. Bounded executor with 5s timeout. |
| PII Detection | Dual-mode: Presidio NLP (18 entity types) or regex fallback (email, phone, SSN, credit card, IBAN, IP, API keys). Vault-based mask/demask roundtrip. |
| Response Sanitization | Entropy guard, steganography detection (bidi overrides, zero-width chars, homoglyphs), prompt leak detection. |
| Audit Ledger | SHA256 hash-chained audit log with tamper detection (see the verification sketch below the table). GDPR compliance: right to erasure, DSAR export, configurable retention. |
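A conceptual sketch of hash-chain verification in general (the record layout and genesis value here are assumptions, not the proxy's actual audit schema):

```python
import hashlib
import json

def chain_hash(prev_hash: str, record: dict) -> str:
    """Hash the previous entry's hash together with the canonicalized record."""
    payload = prev_hash + json.dumps(record, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def verify_chain(entries: list[dict]) -> bool:
    """Recompute every link; any edited or deleted entry breaks all later hashes."""
    prev = "0" * 64  # assumed genesis value
    for entry in entries:
        if entry["hash"] != chain_hash(prev, entry["record"]):
            return False  # tampering detected
        prev = entry["hash"]
    return True
```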
Auth: API keys, OIDC/JWT (Google, Microsoft, Apple), mTLS, Tailscale Zero-Trust. RBAC with four roles (admin, operator, user, viewer).
HMAC-SHA256 response signing proves the response was not modified after leaving the proxy.
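Verifying that signature on the client side takes a few lines; a minimal sketch, assuming the signature arrives hex-encoded in a response header (the header name and key exchange are assumptions, check the docs for the actual wire format):

```python
import hashlib
import hmac

def verify_response(body: bytes, signature_hex: str, shared_secret: bytes) -> bool:
    """Recompute HMAC-SHA256 over the raw response body and compare in constant time."""
    expected = hmac.new(shared_secret, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_hex)

# Hypothetical usage with an assumed header name:
# ok = verify_response(resp.content, resp.headers["X-LLMProxy-Signature"], secret)
```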
See SECURITY.md for the full security architecture and vulnerability disclosure policy.
A curated adversarial corpus runs as a regression test on every build. Current per-category pass rate against tests/corpus/owasp_llm_top10.yaml:
| Category | Coverage | Notes |
|---|---|---|
| LLM01 — Prompt Injection | 100 % | All 12 corpus variants caught: direct, base64/hex/zero-width-encoded, leetspeak, role-play, suffix-injection, chain-of-thought, indirect tool-use |
| LLM02 — Sensitive Info (PII) | 100 % | Email · SSN · Visa · Amex · IBAN · phones · API keys |
| LLM07 — System Prompt Leakage | 100 % | Direct + indirect + continuation + translation + meta-instruction + persona-rebase |
| Benign false-positive rate | 10 % | Meta-discussion of attacks ("explain how prompt injection works") is flagged intentionally |
LLM03/04/06/08/09/10 are out-of-scope for the proxy itself (build-time, training-time, caller-side, model-side) — documented as N/A in the report.
Full per-entry results + known gaps + reproduction steps: docs/OWASP_LLM_COVERAGE.md. Re-generate with pytest tests/test_owasp_corpus.py.
The regression figures above come from deterministic checks only; the corpus deliberately bypasses AI judgment. The ai_analyze_threat gray-zone escalation (when configured) catches a fraction of the listed gaps in real deployments, but it depends on an upstream model being available, so it is not counted in the regression number.
Single-process throughput on Apple Silicon (M-series, dev mode, no upstream call — proxy stack only):
| Endpoint | Req/s | p50 latency | p99 latency | Conditions |
|---|---|---|---|---|
| `/health` (cold path, no upstream) | 1,313 | 7 ms | 28 ms | wrk · 2t · 10c · 20s |
| `/health` (saturated) | 1,176 | 82 ms | 149 ms | wrk · 4t · 100c · 30s |
| `/api/v1/registry` (light DB read) | 1,158 | 81 ms | 188 ms | wrk · 4t · 100c · 30s |
These numbers measure the proxy stack overhead — the auth middleware, ASGI firewall, route dispatch, and JSON serialization — not the cost of a real LLM call (which is dominated by upstream provider latency).
Honest read: ~1.2k req/s on a single process is a moderate-load number. For higher throughput, run multiple uvicorn workers behind a load balancer or scale horizontally. The proxy is stateless except for the SQLite store (which can be swapped for Postgres) and the in-memory rate-limit/circuit-breaker state (which is per-process by design).
Reproduce: python main.py then wrk -t4 -c100 -d30s --latency http://localhost:8090/health.
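To push past a single process, a multi-worker launch might look like the following (a sketch that assumes the ASGI app is importable as main:app; adjust the import string to the actual module path):

```python
import uvicorn

# Several worker processes on one host; put a load balancer in front for multiple hosts.
# Rate-limit and circuit-breaker state is per-process by design, as noted above.
uvicorn.run("main:app", host="0.0.0.0", port=8090, workers=4)
```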
LLMProxy exposes an OpenAI-compatible API on port 8090.
| Endpoint | Method | Description |
|---|---|---|
| `/v1/chat/completions` | POST | Chat completion (streaming + non-streaming). 15 providers. |
| `/v1/completions` | POST | Legacy text completion. |
| `/v1/embeddings` | POST | Embeddings (OpenAI, Google, Ollama, Azure). |
| `/v1/models` | GET | Model discovery (aggregated from all providers). |
| `/health` | GET | Liveness probe. |
| `/metrics` | GET | Prometheus metrics. |
| Endpoint | Method | Description |
|---|---|---|
| `/api/v1/registry` | GET | Endpoint pool state. |
| `/api/v1/registry/{id}/toggle` | POST | Enable/disable an endpoint. |
| `/api/v1/proxy/toggle` | POST | Enable/disable the proxy. |
| `/api/v1/panic` | POST | Emergency kill switch. |
| `/api/v1/features` | GET | Security guard feature flags. |
| `/api/v1/features/toggle` | POST | Toggle a guard. |
| `/api/v1/analytics/spend` | GET | Spend breakdown by model/provider/key/date. |
| `/api/v1/audit` | GET | Audit log query with filters. |
| `/api/v1/audit/verify` | GET | Verify audit chain integrity. |
| `/api/v1/plugins` | GET | List installed plugins. |
| `/api/v1/plugins/install` | POST | Install a plugin (AST-scanned, hot-swapped). |
| `/api/v1/gdpr/erase/{subject}` | POST | Right to erasure (Article 17). |
| `/api/v1/gdpr/export/{subject}` | GET | Data subject access request (Article 15). |
Full API reference in the docs.
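For example, reading the endpoint pool and checking audit chain integrity from a script could look like this (a stdlib-only sketch; authentication mirrors the curl example above, and the response shapes are documented in the API reference):

```python
import json
import urllib.request

BASE = "http://localhost:8090"
HEADERS = {"Authorization": "Bearer sk-proxy-test"}

def admin_get(path: str) -> dict:
    """GET an admin endpoint with the proxy API key and decode the JSON body."""
    req = urllib.request.Request(BASE + path, headers=HEADERS)
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

registry = admin_get("/api/v1/registry")      # endpoint pool state
audit_ok = admin_get("/api/v1/audit/verify")  # audit chain integrity
```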
Ring-based pipeline with 18 marketplace plugins and 10 built-in defaults.
| Plugin | Ring | Description |
|---|---|---|
| Smart Budget Guard | Pre-Flight | Per-session/team budget with SQLite persistence. |
| Agentic Loop Breaker | Pre-Flight | Detects AI agents stuck in retry loops. |
| Model Downgrader | Pre-Flight | Auto-downgrades expensive models for simple prompts. |
| Context Window Guard | Pre-Flight | Blocks requests exceeding model context limit. |
| Topic Blocklist | Pre-Flight | Keyword/regex topic filtering. |
| Tool Guard | Pre-Flight | Strips restricted tools from agentic requests. |
| A/B Model Router | Routing | Routes traffic percentage to variant model. |
| Tenant QoS Router | Routing | Routes by tenant tier (free/basic/premium). |
| Response Quality Gate | Post-Flight | Detects empty, refused, or truncated responses. |
| Canary Detector | Post-Flight | Detects system prompt leakage. |
| Schema Enforcer | Post-Flight | Validates JSON responses against schema. |
| Shadow Traffic | Background | Dark-launch to shadow model for comparison. |
Write your own:
from core.plugin_sdk import BasePlugin, PluginResponse, PluginHook
class MyPlugin(BasePlugin):
    name = "my_plugin"
    hook = PluginHook.PRE_FLIGHT
    version = "1.0.0"

    async def execute(self, ctx):
        # A no-op plugin: let the request continue down the pipeline unchanged.
        return PluginResponse.passthrough()

WASM plugins (Rust/Go/C) are supported via Extism for untrusted code execution. See plugins/ for the full development guide.
server:
  host: 0.0.0.0
  port: 8090
  auth: { enabled: true, api_keys_env: "LLM_PROXY_API_KEYS" }
endpoints:
  openai:
    provider: "openai"
    base_url: "https://api.openai.com/v1"
    api_key_env: "OPENAI_API_KEY"
    models: ["gpt-4o", "gpt-4o-mini"]
  anthropic:
    provider: "anthropic"
    base_url: "https://api.anthropic.com/v1"
    api_key_env: "ANTHROPIC_API_KEY"
    models: ["claude-sonnet-4-20250514"]
fallback_chains:
  "gpt-4o":
    - { provider: anthropic, model: "claude-sonnet-4-20250514" }
    - { provider: google, model: "gemini-2.5-pro" }
budget:
  daily_limit: 50.0
  fallback_to_local_on_limit: true
rate_limiting:
  enabled: true
  requests_per_minute: 60

All secrets are loaded from environment variables (Infisical SDK supported). See config.yaml for the full reference.
Real-time Security Operations Center UI at /ui.
| View | What it shows |
|---|---|
| Threats | KPI cards, threat timeline chart, ring latency (P50/P95/P99), live SSE event feed |
| Guards | Master proxy toggle, per-guard enable/disable with descriptions |
| Plugins | Pipeline grid with per-plugin stats, install/uninstall/hot-swap |
| Models | Aggregated model registry with search/filter |
| Analytics | Spend breakdown by model and provider |
| Security | Audit chain verification, GDPR controls, semantic corpus stats |
| Endpoints | Registry table with circuit breaker state, priority, toggle/delete |
| Live Logs | xterm.js terminal with WebGL rendering and JSON syntax highlighting |
| Settings | Identity, RBAC matrix, webhooks, data export |
Keyboard shortcuts: Cmd+K (command palette), F (cinema mode). URL hash routing (#/guards, #/logs, ...).
- Prometheus -- 10 metrics (requests, errors, latency percentiles, TTFT, tokens, cost, budget, circuit state, injection blocks, auth failures). Pre-built Grafana dashboard and alert rules in monitoring/.
- OpenTelemetry -- Distributed tracing via OTLP. Graceful degradation when not installed.
- Sentry -- Exception tracking with PII filtering and sampling.
- Webhooks -- Slack, Teams, Discord, Generic (JSON). HMAC-SHA256 signed. SSRF-protected.
- Dataset Export -- Async JSONL with PII scrubbing, gzip rotation, optional Parquet conversion.
make test # 1183 tests, ~22s
make bench # 22 performance benchmarks
make lint # ruff
make typecheck  # mypy

1183 tests across 50+ modules: unit, HTTP integration, pipeline E2E, property-based fuzz (Hypothesis), 31 mathematical invariant proofs, concurrency stress tests, and performance benchmarks.
The invariant suite proves correctness properties (Jaccard axioms, normalize idempotence, token conservation, budget accounting, adapter determinism) and blocks merge on violation.
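As an illustration of the style (a standalone sketch, not one of the repository's actual tests), a Hypothesis property for trigram Jaccard similarity can assert boundedness, symmetry, and identity:

```python
from hypothesis import given, strategies as st

def trigrams(s: str) -> set[str]:
    """Character trigrams; short strings fall back to the string itself."""
    return {s[i:i + 3] for i in range(len(s) - 2)} if len(s) >= 3 else {s}

def jaccard(a: str, b: str) -> float:
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb) if (ta | tb) else 1.0

@given(st.text(min_size=1), st.text(min_size=1))
def test_jaccard_axioms(a, b):
    assert 0.0 <= jaccard(a, b) <= 1.0      # bounded
    assert jaccard(a, b) == jaccard(b, a)   # symmetric
    assert jaccard(a, a) == 1.0             # identity
```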
| Setting | Default | Production |
|---|---|---|
| TLS | Disabled | Enable or use a reverse proxy (Traefik, Caddy, nginx) |
| CORS | `["*"]` | Restrict to your frontend origin(s) |
| Auth | Enabled | Keep enabled, rotate API keys |
| API keys | Placeholder | Replace with strong keys |
| Presidio | Not installed | pip install presidio-analyzer presidio-anonymizer for NLP PII |
| tiktoken | Not installed | pip install tiktoken for accurate token counting |
The proxy logs warnings at startup when TLS is disabled or CORS is unrestricted.
For hardened deployments, pair with secure-proxy-manager for network-level egress filtering (domain whitelisting, direct IP blocking, IMDS protection).
GitHub Actions runs 8 jobs on every push: lint (ruff), type check (mypy), dependency audit (pip-audit), supply chain scan (.pth malware + blocked packages), syntax check, test suite with coverage gate (65%), mathematical invariants, and Docker image size check.
MIT. See LICENSE.
