Skip to content

Latest commit

 

History

History
354 lines (250 loc) · 18.7 KB

File metadata and controls

354 lines (250 loc) · 18.7 KB

Stop Overpaying for Claude: How ClawRouter Cuts Your Anthropic Bill by 70%

You love Claude. Your wallet doesn't. Here's how to keep frontier-quality answers — at a fraction of the cost.


The Problem: Claude Is Brilliant, But Expensive

If you're building with the Anthropic API, you already know Claude is the best reasoning model available. Opus 4.6 runs $5/$25 per million tokens. Sonnet at $3/$15. Even Haiku costs $1/$5.

But here's what most developers won't admit: the majority of your API calls don't need Claude.

Think about your typical workload. You're building a SaaS app. Some requests need Claude's reasoning — debugging complex code, analyzing long documents, orchestrating multi-step agent workflows. But most requests are mundane: extracting JSON from text, answering simple user questions, translating a string, summarizing a paragraph.

You're paying $3-25 per million tokens for work that a $0.10 model handles identically.

The problem is simple: you're paying Claude rates on 100% of your requests, but only ~30% of them need Claude.


What Does a Typical Developer Workload Look Like?

The Everyday Tasks (~70% of requests)

These are the requests you fire off constantly and barely think about:

  • "Extract the name and email from this text and return JSON" — Any model can do this. You're paying Claude $15/M output tokens for structured extraction that a $0.40 model handles perfectly.

  • "Summarize this customer support ticket in 2 sentences" — Summarization is a solved problem. You don't need frontier reasoning here.

  • "Translate this error message to Spanish" — Translation is a commodity task. Paying Claude rates for it is like taking a Lamborghini to the grocery store.

  • "What's the difference between useEffect and useLayoutEffect?" — Factual Q&A. Every model gets this right.

  • "Convert this CSV data to a markdown table" — Pure formatting. A free model does this identically.

The Tasks That Actually Need Claude (~30% of requests)

This is where you're paying for real value:

  • Complex code generation — "Refactor this authentication module to support OAuth2 + PKCE, handle token refresh, and add rate limiting." Multi-file, multi-constraint reasoning. Claude earns its price here.

  • Long-document analysis — "Read this 50-page contract and identify all clauses that could expose us to liability over $1M." Context window + reasoning quality matter.

  • Multi-step agent orchestration — "Scan these 5 APIs, cross-reference the data, and generate a report with recommendations." Agentic workflows where the model needs to maintain a plan across many steps.

  • Advanced reasoning — "Debug this race condition in our distributed system" or "Prove this algorithm is O(n log n)." Tasks where cheaper models lose the thread.


The Solution: ClawRouter

ClawRouter is an open-source local proxy that sits between your app and 41+ AI models. It saves you money in three ways: smart routing, token optimization, and response caching.

┌─────────────┐     ┌──────────────────────────────┐     ┌──────────────────┐
│  Your App    │────▶│       ClawRouter              │────▶│  41+ AI Models   │
│  (OpenAI     │     │       (local proxy)           │     │                  │
│   SDK)       │     │                               │     │  FREE  (11 free) │
│              │     │  1. Route to cheapest model    │     │  $0.10 (gemini)  │
│  model:      │     │  2. Compress tokens            │     │  $3.00 (sonnet)  │
│  "auto"      │     │  3. Cache repeated requests    │     │  $0.20 (grok)    │
└─────────────┘     └──────────────────────────────┘     └──────────────────┘

How You Save: Three Layers

Layer 1: Smart Routing (the biggest win)

ClawRouter scores every prompt against 14 dimensions in <1ms and routes it to the cheapest model that can handle the task.

"What is the capital of France?"
  → SIMPLE → nvidia/gpt-oss-120b (FREE)

"Extract JSON from this text"
  → SIMPLE → nvidia/gpt-oss-120b (FREE)

"Refactor this auth module with OAuth2 + PKCE"
  → COMPLEX → anthropic/claude-sonnet-4.6 ($3/$15)

"Prove sqrt(2) is irrational, show every step"
  → REASONING → xai/grok-4-1-fast-reasoning ($0.20/$0.50)

From real production data across 20,000+ paying user requests:

Model % of Requests Price (input/output per M)
gemini-2.5-flash-lite 34.5% $0.10 / $0.40
claude-sonnet-4.6 22.7% $3.00 / $15.00
kimi-k2.5 16.2% $0.60 / $3.00
minimax-m2.5 6.5% $0.30 / $1.20
grok-code-fast 6.1% $0.20 / $1.50
claude-haiku-4.5 2.7% $1.00 / $5.00
nvidia/gpt-oss-120b 2.1% FREE
grok-reasoning 2.9% $0.20 / $0.50
Others 6.3% varies

Result: 77% of requests go to models that cost 5-150x less than Sonnet. Only the ~23% that genuinely need Claude still go to Claude.

Layer 2: Token Compression (saves on every request)

Even when a request does go to Claude, ClawRouter reduces the tokens you pay for. The proxy runs a multi-layer compression pipeline on your request before sending it to the provider — and you pay based on the compressed token count, not the original.

How it works:

Compression Layer What It Does Savings
Deduplication Removes duplicate messages in conversation history 2-5%
Whitespace normalization Strips excess whitespace, trailing spaces, empty lines 3-8%
JSON compaction Minifies JSON in tool calls and results 2-4%

These three layers are enabled by default and are completely safe — they don't change semantic meaning. The compression triggers automatically on requests larger than 180KB (common in agent workflows and long conversations).

For agent-heavy workloads (long tool outputs, multi-turn conversations), the savings are even larger. An optional observation compression layer can reduce massive tool outputs by up to 97% — turning 10KB of verbose log output into 300 characters of essential information.

Typical combined savings: 7-15% fewer tokens per request. On long-context agent workloads: 20-40%.

This matters most on expensive models. If you're sending a 50K-token agent conversation to Claude Sonnet, 15% compression saves ~$0.03 per request — that adds up to real money at scale.

Layer 3: Response Cache + Request Deduplication (saves 100%)

ClawRouter caches responses locally. If your app sends the same request within 10 minutes, you get an instant response at zero cost — no API call, no tokens billed.

This is more common than you'd think:

  • Retry logic — Your app retries on timeout. Without dedup, you pay twice. With ClawRouter, the retry resolves from cache instantly.
  • Redundant requests — Multiple users or processes asking the same thing? One API call, multiple responses.
  • Agent loops — Agentic frameworks often re-query with identical context. Cache catches these.
Request 1: "Summarize this document" → API call → $0.02 → cached
Request 2: "Summarize this document" → cache hit → $0.00 → instant
Request 3: "Summarize this document" → cache hit → $0.00 → instant

The deduplicator also catches in-flight duplicates: if two identical requests arrive simultaneously, only one goes to the provider. Both callers get the same response.


The Cost Math (Honest Numbers)

10,000 mixed requests per month, averaging 1,000 input tokens and 500 output tokens each.

Direct Anthropic API

Approach Input (10M tokens) Output (5M tokens) Monthly Total
All Claude Sonnet $30.00 $75.00 $105.00
All Claude Opus $50.00 $125.00 $175.00

ClawRouter (real paying-user distribution)

Tier % Requests Routed To Cost
Cheap models 34.5% gemini-flash-lite $0.76
Mid-tier 16.2% kimi-k2.5 $2.43
Claude (complex) 22.7% claude-sonnet-4.6 $17.44
Code models 6.1% grok-code-fast $0.52
Reasoning 2.9% grok-reasoning $0.03
Haiku 2.7% claude-haiku-4.5 $0.76
Free 2.1% nvidia/gpt-oss-120b $0.00
Other 12.8% various $1.18
Subtotal (routing) $23.12
Token compression (~10%) -$2.31
Cache hits (~5% est.) -$1.16
Final Total ~$19.65

The Bottom Line

Approach Monthly Cost Savings
Direct Claude Sonnet $105.00
Direct Claude Opus $175.00
ClawRouter ~$20 ~81% vs Sonnet, ~89% vs Opus

Breaking down where the savings come from:

Savings Source Estimated Impact How
Smart routing ~68% cost reduction 77% of requests → cheaper models
Token compression ~7-15% on remaining cost Fewer tokens billed per request
Response cache ~3-5% additional Repeat requests cost $0
Request dedup Prevents overcharges Retries don't double-bill

How the 14-Dimension Router Works

ClawRouter runs a weighted scoring algorithm on every prompt — entirely locally, in under 1 millisecond, zero external API calls.

Dimension Weight Detects
Reasoning Markers 0.18 "prove," "step by step," "analyze"
Code Presence 0.15 function, class, import, code blocks
Multi-Step Patterns 0.12 "first...then," numbered steps
Technical Terms 0.10 Domain-specific vocabulary
Token Count 0.08 Short vs. long context
Question Complexity 0.05 Nested or compound questions
Creative Markers 0.05 Creative writing indicators
Constraint Count 0.04 "max," "minimum," "at most"
Imperative Verbs 0.03 "create," "generate," "build"
Output Format 0.03 JSON, YAML, table, markdown
Simple Indicators 0.02 "what is," "define," "translate"
Reference Complexity 0.02 "the code above," "the docs"
Domain Specificity 0.02 Quantum, genomics, etc.
Negation Complexity 0.01 "don't," "never," "avoid"

The weighted score maps to four tiers:

Score < 0.0   →  SIMPLE     →  Free or ultra-cheap models
Score 0.0–0.3 →  MEDIUM     →  Mid-tier (Kimi K2.5, DeepSeek)
Score 0.3–0.5 →  COMPLEX    →  Frontier (Claude Sonnet, Gemini Pro)
Score > 0.5   →  REASONING  →  Specialized (Grok Reasoning, DeepSeek-R)

Multilingual support across 9 languages. Tool-calling and vision requests automatically filter for compatible models. If the primary model fails, a fallback chain tries alternatives before returning an error.


Getting Started: 3 Minutes

Step 1: Install

npx @blockrun/clawrouter

Starts a local proxy on port 8402. Auto-generates a crypto wallet. Done.

Step 2: Update Your Code

Python — change 2 lines:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8402/v1",  # ← was: https://api.anthropic.com
    api_key="unused"                       # ← ClawRouter handles auth
)

response = client.chat.completions.create(
    model="blockrun/auto",                 # ← was: claude-sonnet-4.6
    messages=[{"role": "user", "content": "Your prompt here"}]
)

TypeScript — same idea:

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:8402/v1",
  apiKey: "unused",
});

const response = await client.chat.completions.create({
  model: "blockrun/auto", // or "eco" for max savings, "premium" for best quality
  messages: [{ role: "user", content: "Your prompt here" }],
});

Routing profiles:

  • blockrun/auto — Balanced cost/quality (default)
  • blockrun/eco — Maximum savings (free tier aggressively)
  • blockrun/premium — Best quality (Opus/Sonnet/GPT-5)
  • blockrun/free — Free tier only (gpt-oss-120b)

Step 3: Fund (optional)

# Your wallet address is shown on startup.
# Send any amount of USDC on Base chain to that address.
# $1 is enough for hundreds of requests.
# Or start with $0 — the free tier model works immediately.

That's it. Your existing code works. Your output quality on complex tasks stays the same.

Check Your Savings

$ /stats 7

╔═══════════════════════════════════════════════════════╗
║        ClawRouter v0.12.12 — Usage Statistics         ║
╠═══════════════════════════════════════════════════════╣
║  Period: last 7 days                                  ║
║  Total Requests: 1,523                                ║
║  Actual Cost:    $12.35                               ║
║  Baseline Cost:  $156.23 (if all went to Opus 4.6)   ║
║  Saved: $143.89 (92.1%)                               ║
╠═══════════════════════════════════════════════════════╣
║  SIMPLE     ████████████████     50.2%  (765 reqs)   ║
║  MEDIUM     ██████████           28.5%  (434 reqs)   ║
║  COMPLEX    ██████               15.0%  (228 reqs)   ║
║  REASONING  ██                    6.3%   (96 reqs)   ║
╚═══════════════════════════════════════════════════════╝

Why ClawRouter Instead of OpenRouter?

ClawRouter OpenRouter
Smart routing Automatic — 14-dimension scorer picks the model Manual — you pick the model
Token optimization Built-in compression (7-15% savings) None
Response caching Local cache, repeat requests = $0 None
Request dedup Retries don't double-bill None
Routing latency <1ms (local, on your machine) Additional network hop
Payments Non-custodial USDC on Base (your wallet, your keys) Prepaid credit balance (custodial)
Free tier GPT-OSS-120B (always available) No free models
API keys Zero — proxy handles all auth You manage keys per provider
Algorithm Open-source, MIT license, modify it yourself Proprietary

The fundamental difference: OpenRouter is a model marketplace where you choose. ClawRouter is an intelligent proxy that chooses for you, compresses your tokens, caches your responses, and pays per-request with crypto from your own wallet.


TL;DR

What Details
Problem You pay Claude $3-25/M tokens on every request, but ~70% don't need Claude
Solution ClawRouter auto-routes + compresses + caches
Savings ~81% vs Sonnet, ~89% vs Opus
How Routing (68%) + token compression (7-15%) + caching (3-5%)
Code change 2 lines (base_url + model name)
Setup time 3 minutes
Quality tradeoff None — complex tasks still go to Claude
Open source MIT license, local proxy, non-custodial payments
# Start saving now:
npx @blockrun/clawrouter

Links:


Cost data based on real production traffic from paying users across 20,000+ requests, March 2026. Savings vary by workload — agent-heavy and long-context workloads see larger compression benefits. ClawRouter is open-source and part of the BlockRun ecosystem.