LLMProxy

Security gateway for Large Language Models. Routes requests across 15 providers with automatic fallback, cost-aware smart routing, and a 6-layer defense pipeline. Drop-in replacement for the OpenAI API.


(Screenshot: LLMProxy dashboard)


Why LLMProxy

  • One endpoint, 15 providers -- Send OpenAI-compatible requests and let the proxy handle translation, failover, and cost optimization across OpenAI, Anthropic, Google, Azure, Ollama, Groq, Together, Mistral, DeepSeek, xAI, Perplexity, Fireworks, OpenRouter, and SambaNova.
  • Security by default -- Byte-level ASGI firewall, injection scoring, PII masking, cross-session threat intelligence, immutable audit ledger, HMAC response signing. Fail-closed auth middleware denies all admin paths unless explicitly whitelisted.
  • Cost control -- Per-model pricing for 30+ models, daily budget limits with automatic downgrade to local models, per-session spend tracking, cost-efficiency analytics.
  • Extensible -- 18 marketplace plugins (budget guard, A/B routing, schema enforcement, canary detection, ...) with a ring-based pipeline. Write your own in Python or WASM.

Quick Start

30 seconds with Docker (no clone, no install)

docker run --rm -p 8090:8090 \
  -e LLM_PROXY_API_KEYS=sk-proxy-test \
  ghcr.io/fabriziosalmi/llmproxy:latest

That's it. Open http://localhost:8090/ui and the first-run wizard walks you through adding a provider (OpenAI, Anthropic, Ollama, etc.). The proxy boots in onboarding mode with zero endpoints — inference returns 503 until you add one.

Drop-in OpenAI replacement, once an endpoint is configured:

curl http://localhost:8090/v1/chat/completions \
  -H "Authorization: Bearer sk-proxy-test" \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-4o", "messages": [{"role": "user", "content": "Hello"}]}'

For persistent state (budget tracking, audit log, registered endpoints) across container restarts, mount a volume and pin the version:

docker run -d --name llmproxy -p 8090:8090 \
  -e LLM_PROXY_API_KEYS=sk-proxy-test \
  -e OPENAI_API_KEY=$OPENAI_API_KEY \
  -v llmproxy-data:/app/data \
  ghcr.io/fabriziosalmi/llmproxy:1.21.52

Each release publishes :latest, the full semver (:X.Y.Z), the minor (:X.Y), plus a per-commit short SHA tag for reproducible deploys.

Or, build from source

git clone https://github.com/fabriziosalmi/llmproxy && cd llmproxy
./install.sh                        # Interactive — checks Python/Docker, creates .env, starts the proxy

The installer detects your platform, verifies prerequisites, generates a proxy auth key, and boots the service via Docker Compose v2 (preferred) or a local Python 3.12+ virtualenv. Use ./install.sh --docker, ./install.sh --local, or ./install.sh --check for non-interactive flows. Choose this path if you want to modify plugins, contribute, or run without an internet connection to GHCR.

Prerequisites

  • Docker path: Docker Engine + Docker Compose v2 plugin (docker compose). The legacy docker-compose v1 (Debian/Ubuntu apt) is NOT supported — it's incompatible with modern urllib3. On Debian/Ubuntu: sudo apt install docker-compose-plugin.
  • Local path: Python 3.12+ (Ubuntu 22.04 only ships 3.10 — install from the deadsnakes PPA or use the Docker path).

Local / self-hosted OpenAI-compatible endpoints via .env

Declare LM Studio, vLLM, TGI, Ollama, or any OpenAI-compatible endpoint directly in .env — no YAML editing required:

LLM_PROXY_ENDPOINT_LMSTUDIO_URL=http://192.168.1.50:1234/v1
LLM_PROXY_ENDPOINT_LMSTUDIO_MODELS=llama-3.3-70b,qwen-2.5-coder-32b
# LLM_PROXY_ENDPOINT_LMSTUDIO_KEY=  # leave blank for no-auth local servers

Disabling the WAF (dev / integration tests)

The byte-level ASGI firewall is on by default. Disable via env or config when fronting the proxy with another WAF or debugging a false positive:

LLM_PROXY_FIREWALL_ENABLED=0        # in .env, or
# config.yaml:
#   security:
#     firewall:
#       enabled: false

The admin UI reflects the live WAF state and the reason it's off. The switch is env/config-only by design — a one-click UI toggle would make L1 injection defense trivially removable.



Architecture

Client Request
  |
  +-- RateLimitMiddleware         Token bucket per IP/key (O(1) LRU, 50k max)
  +-- ByteLevelFirewall           178 signatures, 8 encoding layers, iterative chain decoding
  +-- CORSMiddleware
  +-- Global Auth (fail-closed)   Deny-all for /api/v1/*, /admin/*, /metrics
  +-- SecurityShield              Injection scoring, PII masking, trajectory analysis
  |     +-- ThreatLedger          Cross-session IP + key aggregation
  |     +-- SemanticAnalyzer      157 patterns, 20+ languages, leetspeak normalization
  |
  +-- Ring 1: INGRESS             Auth, Zero-Trust, rate limiting
  +-- Ring 2: PRE-FLIGHT          PII masking, budget guard, cache, complexity scoring
  +-- Ring 3: ROUTING             Model selection, load balancing, A/B routing
  +-- Upstream Provider           Automatic format translation + fallback chain
  +-- Ring 4: POST-FLIGHT         Response sanitization, quality gate, schema enforcement
  +-- Ring 5: BACKGROUND          Telemetry, export, shadow traffic
  |
Client Response

Providers

OpenAI, Anthropic, Google (Gemini), Azure OpenAI, Ollama, Groq, Together, Mistral, DeepSeek, xAI (Grok), Perplexity, Fireworks, OpenRouter, SambaNova. Each with a dedicated adapter that handles request/response format translation, streaming, and error mapping.

Smart Routing

Endpoints are scored using an EMA-weighted formula: score = (success^2 / latency) * cost_factor^w. The proxy automatically routes to the best-scoring endpoint, with configurable fallback chains (e.g., GPT-4o fails -> Claude Sonnet -> Gemini Pro). When the daily budget is exhausted, requests are automatically downgraded to a local model (Ollama).
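A minimal sketch of that scoring idea in Python (the EMA helper, the smoothing factor, and the convention that cheaper endpoints get a larger cost_factor are assumptions for illustration, not the proxy's actual internals):

def ema(previous: float, sample: float, alpha: float = 0.2) -> float:
    """Exponentially weighted moving average used to smooth success rate and latency."""
    return alpha * sample + (1 - alpha) * previous

def endpoint_score(success: float, latency_s: float, cost_factor: float, w: float = 1.0) -> float:
    """score = (success^2 / latency) * cost_factor^w -- higher is better."""
    return (success ** 2 / max(latency_s, 1e-6)) * (cost_factor ** w)

# A cheaper endpoint with slightly worse latency can still win the route:
print(endpoint_score(0.99, 0.40, cost_factor=1.0))   # ~2.45
print(endpoint_score(0.97, 0.55, cost_factor=1.6))   # ~2.74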


Security

  • ASGI Firewall -- 178 injection signatures (162 banned + 16 ROT13) across 8 encoding layers (URL, Unicode, Base64, hex, ROT13) with iterative chain decoding. Loaded from data/signatures.yaml (hot-reloadable).
  • SecurityShield -- Threat scoring (8 regex patterns, threshold 0.7), multi-turn trajectory detection, cross-session ThreatLedger.
  • Semantic Analyzer -- 157-pattern trigram Jaccard corpus across 20+ languages. Leetspeak normalization, Cyrillic/Greek confusable mapping. Bounded executor with 5s timeout.
  • PII Detection -- Dual-mode: Presidio NLP (18 entity types) or regex fallback (email, phone, SSN, credit card, IBAN, IP, API keys). Vault-based mask/demask roundtrip.
  • Response Sanitization -- Entropy guard, steganography detection (bidi overrides, zero-width chars, homoglyphs), prompt leak detection.
  • Audit Ledger -- SHA256 hash-chained audit log with tamper detection (see the sketch below). GDPR compliance: right to erasure, DSAR export, configurable retention.
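The hash-chain idea behind the audit ledger, in miniature (field names, serialization, and the genesis value are assumptions, not LLMProxy's actual schema):

import hashlib
import json

def chain_hash(prev_hash: str, record: dict) -> str:
    """Each entry commits to its predecessor, so editing any entry breaks every later hash."""
    payload = json.dumps(record, sort_keys=True).encode()
    return hashlib.sha256(prev_hash.encode() + payload).hexdigest()

def verify_chain(entries: list[dict]) -> bool:
    prev = "0" * 64                                   # genesis value (assumption)
    for entry in entries:
        if entry["hash"] != chain_hash(prev, entry["record"]):
            return False                              # tampering detected
        prev = entry["hash"]
    return True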

Auth: API keys, OIDC/JWT (Google, Microsoft, Apple), mTLS, Tailscale Zero-Trust. RBAC with four roles (admin, operator, user, viewer).

HMAC-SHA256 response signing proves the response was not modified after leaving the proxy.
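How a client could check that signature, as a sketch (the header name and key source are assumptions; the HMAC-SHA256 construction itself is standard):

import hashlib
import hmac

def response_is_authentic(body: bytes, signature_hex: str, signing_key: bytes) -> bool:
    """Recompute HMAC-SHA256 over the raw response body and compare in constant time."""
    expected = hmac.new(signing_key, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_hex)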

See SECURITY.md for the full security architecture and vulnerability disclosure policy.

OWASP LLM Top 10 coverage

A curated adversarial corpus runs as a regression test on every build. Current per-category pass rate against tests/corpus/owasp_llm_top10.yaml:

  • LLM01 — Prompt Injection -- 100 %. All 12 corpus variants caught: direct, base64/hex/zero-width-encoded, leetspeak, role-play, suffix-injection, chain-of-thought, indirect tool-use.
  • LLM02 — Sensitive Info (PII) -- 100 %. Email · SSN · Visa · Amex · IBAN · phones · API keys.
  • LLM07 — System Prompt Leakage -- 100 %. Direct + indirect + continuation + translation + meta-instruction + persona-rebase.
  • Benign false-positive rate -- 10 %. Meta-discussion of attacks ("explain how prompt injection works") trips by design.

LLM03/04/06/08/09/10 are out-of-scope for the proxy itself (build-time, training-time, caller-side, model-side) — documented as N/A in the report.

Full per-entry results + known gaps + reproduction steps: docs/OWASP_LLM_COVERAGE.md. Re-generate with pytest tests/test_owasp_corpus.py.

The regression deliberately exercises only the deterministic checks, bypassing AI judgment. The ai_analyze_threat gray-zone escalation (when configured) catches a fraction of the listed gaps in real deployments, but it depends on an upstream model being available, so it is excluded from the regression number.


Performance

Single-process throughput on Apple Silicon (M-series, dev mode, no upstream call — proxy stack only):

  • /health (cold path, no upstream) -- 1,313 req/s, p50 7 ms, p99 28 ms (wrk · 2t · 10c · 20s)
  • /health (saturated) -- 1,176 req/s, p50 82 ms, p99 149 ms (wrk · 4t · 100c · 30s)
  • /api/v1/registry (light DB read) -- 1,158 req/s, p50 81 ms, p99 188 ms (wrk · 4t · 100c · 30s)

These numbers measure the proxy stack overhead — the auth middleware, ASGI firewall, route dispatch, and JSON serialization — not the cost of a real LLM call (which is dominated by upstream provider latency).

Honest read: ~1.2k req/s on a single process is a moderate-load number. For higher throughput, run multiple uvicorn workers behind a load balancer or scale horizontally. The proxy is stateless except for the SQLite store (which can be swapped for Postgres) and the in-memory rate-limit/circuit-breaker state (which is per-process by design).
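A sketch of the multi-worker option (the "main:app" import string is an assumption -- check main.py for the actual ASGI app path):

import uvicorn

if __name__ == "__main__":
    # Each worker is a separate process, so rate-limit and circuit-breaker
    # state is per-worker, as noted above.
    uvicorn.run("main:app", host="0.0.0.0", port=8090, workers=4)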

Reproduce: python main.py then wrk -t4 -c100 -d30s --latency http://localhost:8090/health.


API

LLMProxy exposes an OpenAI-compatible API on port 8090.

Inference

  • POST /v1/chat/completions -- Chat completion (streaming + non-streaming). 15 providers.
  • POST /v1/completions -- Legacy text completion.
  • POST /v1/embeddings -- Embeddings (OpenAI, Google, Ollama, Azure).
  • GET /v1/models -- Model discovery (aggregated from all providers).
  • GET /health -- Liveness probe.
  • GET /metrics -- Prometheus metrics.
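Because the surface is OpenAI-compatible, the official openai Python client can point straight at the proxy (a sketch; the key and model are the placeholders from the Quick Start):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8090/v1", api_key="sk-proxy-test")

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)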

Administration

  • GET /api/v1/registry -- Endpoint pool state.
  • POST /api/v1/registry/{id}/toggle -- Enable/disable an endpoint.
  • POST /api/v1/proxy/toggle -- Enable/disable the proxy.
  • POST /api/v1/panic -- Emergency kill switch.
  • GET /api/v1/features -- Security guard feature flags.
  • POST /api/v1/features/toggle -- Toggle a guard.
  • GET /api/v1/analytics/spend -- Spend breakdown by model/provider/key/date.
  • GET /api/v1/audit -- Audit log query with filters.
  • GET /api/v1/audit/verify -- Verify audit chain integrity.
  • GET /api/v1/plugins -- List installed plugins.
  • POST /api/v1/plugins/install -- Install a plugin (AST-scanned, hot-swapped).
  • POST /api/v1/gdpr/erase/{subject} -- Right to erasure (Article 17).
  • GET /api/v1/gdpr/export/{subject} -- Data subject access request (Article 15).
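A minimal sketch of calling the admin API with requests, assuming the same Bearer key used for inference (query parameters and response shapes are not shown here -- see the API docs):

import requests

BASE = "http://localhost:8090"
HEADERS = {"Authorization": "Bearer sk-proxy-test"}

# Fetch the audit log, then verify the hash chain end to end.
audit = requests.get(f"{BASE}/api/v1/audit", headers=HEADERS, timeout=10)
verify = requests.get(f"{BASE}/api/v1/audit/verify", headers=HEADERS, timeout=10)
print(audit.status_code, verify.json())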

Full API reference in the docs.


Plugins

Ring-based pipeline with 18 marketplace plugins and 10 built-in defaults.

  • Smart Budget Guard (Pre-Flight) -- Per-session/team budget with SQLite persistence.
  • Agentic Loop Breaker (Pre-Flight) -- Detects AI agents stuck in retry loops.
  • Model Downgrader (Pre-Flight) -- Auto-downgrades expensive models for simple prompts.
  • Context Window Guard (Pre-Flight) -- Blocks requests exceeding the model context limit.
  • Topic Blocklist (Pre-Flight) -- Keyword/regex topic filtering.
  • Tool Guard (Pre-Flight) -- Strips restricted tools from agentic requests.
  • A/B Model Router (Routing) -- Routes a traffic percentage to a variant model.
  • Tenant QoS Router (Routing) -- Routes by tenant tier (free/basic/premium).
  • Response Quality Gate (Post-Flight) -- Detects empty, refused, or truncated responses.
  • Canary Detector (Post-Flight) -- Detects system prompt leakage.
  • Schema Enforcer (Post-Flight) -- Validates JSON responses against a schema.
  • Shadow Traffic (Background) -- Dark-launch to a shadow model for comparison.

Write your own:

from core.plugin_sdk import BasePlugin, PluginResponse, PluginHook

class MyPlugin(BasePlugin):
    name = "my_plugin"
    hook = PluginHook.PRE_FLIGHT
    version = "1.0.0"

    async def execute(self, ctx):
        return PluginResponse.passthrough()

WASM plugins (Rust/Go/C) are supported via Extism for untrusted code execution. See plugins/ for the full development guide.


Configuration

server:
  host: 0.0.0.0
  port: 8090
  auth: { enabled: true, api_keys_env: "LLM_PROXY_API_KEYS" }

endpoints:
  openai:
    provider: "openai"
    base_url: "https://api.openai.com/v1"
    api_key_env: "OPENAI_API_KEY"
    models: ["gpt-4o", "gpt-4o-mini"]
  anthropic:
    provider: "anthropic"
    base_url: "https://api.anthropic.com/v1"
    api_key_env: "ANTHROPIC_API_KEY"
    models: ["claude-sonnet-4-20250514"]

fallback_chains:
  "gpt-4o":
    - { provider: anthropic, model: "claude-sonnet-4-20250514" }
    - { provider: google, model: "gemini-2.5-pro" }

budget:
  daily_limit: 50.0
  fallback_to_local_on_limit: true

rate_limiting:
  enabled: true
  requests_per_minute: 60

All secrets are loaded from environment variables (Infisical SDK supported). See config.yaml for the full reference.
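The fallback_chains block above maps a primary model to an ordered list of alternates. As a rough sketch of the behavior (client_for is a hypothetical provider-client factory, not part of LLMProxy):

def complete_with_fallback(client_for, chain, messages):
    """Try each (provider, model) in order until one call succeeds."""
    last_error = None
    for provider, model in chain:
        try:
            return client_for(provider).chat.completions.create(model=model, messages=messages)
        except Exception as exc:          # timeout, 5xx, rate limit, ...
            last_error = exc
    raise last_error

chain = [("openai", "gpt-4o"),
         ("anthropic", "claude-sonnet-4-20250514"),
         ("google", "gemini-2.5-pro")]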


Frontend

Real-time Security Operations Center UI at /ui.

  • Threats -- KPI cards, threat timeline chart, ring latency (P50/P95/P99), live SSE event feed
  • Guards -- Master proxy toggle, per-guard enable/disable with descriptions
  • Plugins -- Pipeline grid with per-plugin stats, install/uninstall/hot-swap
  • Models -- Aggregated model registry with search/filter
  • Analytics -- Spend breakdown by model and provider
  • Security -- Audit chain verification, GDPR controls, semantic corpus stats
  • Endpoints -- Registry table with circuit breaker state, priority, toggle/delete
  • Live Logs -- xterm.js terminal with WebGL rendering and JSON syntax highlighting
  • Settings -- Identity, RBAC matrix, webhooks, data export

Keyboard shortcuts: Cmd+K (command palette), F (cinema mode). URL hash routing (#/guards, #/logs, ...).


Observability

  • Prometheus -- 10 metrics (requests, errors, latency percentiles, TTFT, tokens, cost, budget, circuit state, injection blocks, auth failures). Pre-built Grafana dashboard and alert rules in monitoring/.
  • OpenTelemetry -- Distributed tracing via OTLP. Graceful degradation when not installed.
  • Sentry -- Exception tracking with PII filtering and sampling.
  • Webhooks -- Slack, Teams, Discord, Generic (JSON). HMAC-SHA256 signed. SSRF-protected.
  • Dataset Export -- Async JSONL with PII scrubbing, gzip rotation, optional Parquet conversion.

Testing

make test       # 1183 tests, ~22s
make bench      # 22 performance benchmarks
make lint       # ruff
make typecheck  # mypy

1183 tests across 50+ modules: unit, HTTP integration, pipeline E2E, property-based fuzz (Hypothesis), 31 mathematical invariant proofs, concurrency stress tests, and performance benchmarks.

The invariant suite proves correctness properties (Jaccard axioms, normalize idempotence, token conservation, budget accounting, adapter determinism) and blocks merge on violation.
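In the spirit of those invariants, a property-based check for the Jaccard axioms might look like this (a sketch using Hypothesis; the jaccard implementation here is illustrative, not the proxy's own):

from hypothesis import given, strategies as st

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if (a | b) else 1.0

@given(st.sets(st.text()), st.sets(st.text()))
def test_jaccard_axioms(a, b):
    s = jaccard(a, b)
    assert 0.0 <= s <= 1.0              # bounded
    assert s == jaccard(b, a)           # symmetric
    assert jaccard(a, a) == 1.0         # identity on equal sets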


Production Checklist

  • TLS -- default: disabled. Production: enable it or use a reverse proxy (Traefik, Caddy, nginx).
  • CORS -- default: ["*"]. Production: restrict to your frontend origin(s).
  • Auth -- default: enabled. Production: keep it enabled and rotate API keys.
  • API keys -- default: placeholder. Production: replace with strong keys.
  • Presidio -- not installed by default. Production: pip install presidio-analyzer presidio-anonymizer for NLP PII detection.
  • tiktoken -- not installed by default. Production: pip install tiktoken for accurate token counting.

The proxy logs warnings at startup when TLS is disabled or CORS is unrestricted.

For hardened deployments, pair with secure-proxy-manager for network-level egress filtering (domain whitelisting, direct IP blocking, IMDS protection).


CI/CD

GitHub Actions runs 8 jobs on every push: lint (ruff), type check (mypy), dependency audit (pip-audit), supply chain scan (.pth malware + blocked packages), syntax check, test suite with coverage gate (65%), mathematical invariants, and Docker image size check.


License

MIT. See LICENSE.