The AI industry defaults to "bigger is better" - GPT-4, Claude Opus, Llama 70B. But in most production workloads, roughly 80% of LLM calls don't need a 100B+ parameter model. They need a function routed, a tool selected, a query classified, or a simple response generated.
Small LLMs (under 4B parameters) solve this by running locally, for free, in milliseconds.
Consider an agent that makes 10 tool-selection calls per user request:
| Approach | Cost per call | 10 calls | 1M requests/month |
|---|---|---|---|
| Claude Opus (cloud) | ~$0.015 | $0.15 | $150,000 |
| GPT-4o (cloud) | ~$0.005 | $0.05 | $50,000 |
| GPT-4o-mini (cloud) | ~$0.0003 | $0.003 | $3,000 |
| Qwen3 4B (local GPU) | $0 (infra only) | $0 | ~$200 (GPU server) |
| Qwen3 0.6B (local CPU) | $0 (infra only) | $0 | ~$30 (CPU server) |
| FunctionGemma (local CPU) | $0 (infra only) | $0 | ~$30 (CPU server) |
A $30/month CPU server replaces $50,000+/month in API costs for tool-routing workloads.
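The monthly figures in the table follow from simple arithmetic. A quick sanity check, using the approximate per-call prices above:

```python
# Approximate per-call prices from the table above (USD).
PRICE_PER_CALL = {
    "claude-opus": 0.015,
    "gpt-4o": 0.005,
    "gpt-4o-mini": 0.0003,
}

CALLS_PER_REQUEST = 10
REQUESTS_PER_MONTH = 1_000_000

def monthly_cost(model: str) -> float:
    """Monthly API spend for the tool-selection calls alone."""
    return round(PRICE_PER_CALL[model] * CALLS_PER_REQUEST * REQUESTS_PER_MONTH, 2)

print(monthly_cost("claude-opus"))  # 150000.0
print(monthly_cost("gpt-4o"))       # 50000.0
print(monthly_cost("gpt-4o-mini"))  # 3000.0
```

Local models drop the per-call term to zero, leaving only the fixed infrastructure line.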
Where small LLMs shine:
- Tool/function calling - Mapping user intent to a function name + arguments
- Intent classification - Routing queries to the right service
- Query parsing - Extracting structured data from natural language
- Simple Q&A with tools - Answering questions by calling APIs
- Edge/IoT deployment - Running AI on devices with no internet
- High-throughput pipelines - Processing thousands of requests per second
- Privacy-sensitive workloads - Data never leaves your infrastructure
- Development and testing - Fast iteration on AI pipelines without API costs
Where they fall short:
- Complex multi-step reasoning across long documents
- High-quality creative writing or content generation
- Tasks requiring deep world knowledge
- Nuanced understanding of ambiguous instructions
- Multi-modal tasks (vision + text + audio)
The real power is not choosing one model - it's layering them:
User Request
|
v
+------------------------+
| Tier 1: Router |
| FunctionGemma 270M |
| Cost: $0 | <500ms |
+------------------------+
| |
Simple tool Needs reasoning
| |
v v
+-------------+ +------------------------+
| Execute tool| | Tier 2: Thinker |
| Return | | Qwen3 0.6B |
+-------------+ | Cost: $0 | ~1.4s |
+------------------------+
| |
Handled Too complex
| |
v v
+-------------+ +------------------------+
| Return | | Tier 3: Reasoner |
+-------------+ | Qwen3 4B / Cloud |
| Cost: $0-$$ | ~5s |
+------------------------+
This tiered approach means:
- 60-70% of requests are handled by Tier 1 (free, sub-second)
- 20-25% of requests are handled by Tier 2 (free, ~1.4 seconds)
- 5-10% of requests hit Tier 3 (local GPU or paid API)
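The dispatch logic behind the diagram is straightforward: try each tier in order and escalate only when a tier punts. A minimal sketch with the model calls stubbed out (the function names here are illustrative, not from this repo):

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class TierResult:
    handled: bool                 # did this tier produce a final answer?
    answer: Optional[str] = None

def route(request: str, tiers: List[Callable[[str], TierResult]]) -> str:
    """Try each tier in order; escalate only when a tier punts."""
    for tier in tiers:
        result = tier(request)
        if result.handled:
            return result.answer
    return "escalate-to-human"    # nothing could handle it

# Stubs standing in for FunctionGemma 270M / Qwen3 0.6B / Qwen3 4B.
def tier1_router(req: str) -> TierResult:
    if "weather" in req.lower():                  # simple tool intent
        return TierResult(True, "call get_weather")
    return TierResult(False)

def tier2_thinker(req: str) -> TierResult:
    if len(req.split()) < 30:                     # short enough to reason about
        return TierResult(True, "reasoned answer")
    return TierResult(False)

def tier3_reasoner(req: str) -> TierResult:
    return TierResult(True, "deep answer")        # catch-all

answer = route("What is the weather in Tokyo?",
               [tier1_router, tier2_thinker, tier3_reasoner])
```

In production the stubs would call the actual models via Ollama, and the "punt" signal would come from the router's confidence or an explicit escalation token.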
| Model | Type | Size | Parameters | Context Window |
|---|---|---|---|---|
| FunctionGemma 270M | Function calling specialist | 300 MB | 268M | 32K |
| Qwen3 0.6B | Smallest thinking agent | 522 MB | 752M | 40K |
| Qwen3 4B | Full agent brain | 2.5 GB | 4.0B | 256K |
| Tool | Description | Tests Intent |
|---|---|---|
| `get_weather(city)` | Simulated weather lookup | Entity extraction from natural language |
| `add_numbers(a, b)` | Calculator | Numeric parameter extraction |
| `search_contacts(name)` | Contact directory lookup | Name extraction and matching |
| Query | Expected Tool | Tests |
|---|---|---|
| "What is the weather in Tokyo?" | get_weather | Simple single-tool routing |
| "Add 25 and 17" | add_numbers | Numeric intent parsing |
| "Find contact info for Alice" | search_contacts | Name entity extraction |
| "Weather in London and find Bob's contact" | get_weather + search_contacts | Multi-tool in single query |
| Query | FunctionGemma | Qwen3 0.6B | Qwen3 4B |
|---|---|---|---|
| Weather query | Pass | Pass | Pass |
| Calculator query | Pass | Pass | Pass |
| Contact lookup | Pass | Pass | Pass |
| Multi-tool query | Pass | Pass | Pass |
| Accuracy | 4/4 | 4/4 | 4/4 |
| Metric | FunctionGemma | Qwen3 0.6B | Qwen3 4B |
|---|---|---|---|
| Model size | 300 MB | 522 MB | 2.5 GB |
| Tokens/sec | N/A | ~107 | ~39 |
| Avg response (warm) | ~500 ms | ~1,400 ms | ~5,000 ms |
| Multi-tool response | ~600 ms | ~2,000 ms | ~10,000 ms |
| Eval tokens/query | 10-15 | 90-186 | 170-400 |
| Can think/reason | No | Yes | Yes |
| Can chat about results | No | Yes | Yes |
| CPU viable | Yes | Yes | Slow |
| GPU required | No | No | Recommended |
| Advantage | Impact |
|---|---|
| Zero API cost | No per-token billing. Pay only for infrastructure. |
| Data privacy | Nothing leaves your network. No vendor data retention policies. |
| Low latency | Sub-second to few-second responses. No network roundtrip. |
| Predictable performance | No rate limits, no API outages, no quota exhaustion. |
| Offline capable | Works without internet. Edge, IoT, air-gapped environments. |
| Simple infrastructure | Single binary (Ollama) + model file. No complex deployments. |
| Fine-tunable | Customize for your exact domain and tool set. |
| No vendor lock-in | Open weights. Switch providers, frameworks, or hardware freely. |
| Fast iteration | Test agent pipelines locally without burning API credits. |
| Horizontal scaling | Add more cheap CPU servers instead of buying expensive GPU clusters. |
| Limitation | Mitigation |
|---|---|
| Lower accuracy on complex tasks | Use tiered architecture - route complex tasks to larger models |
| Limited reasoning depth | Enable thinking mode (Qwen3) or fine-tune for your domain |
| Smaller context windows | 32K-40K is enough for tool calling; use Qwen3 4B (256K) for long docs |
| Weaker text generation | Don't use for user-facing prose; use for structured tool calls |
| Fine-tuning needed for production | Budget 1-2 days for data prep + training on your tool set |
| Model management overhead | Use Ollama for simple model lifecycle management |
| No multi-modal support | Use cloud APIs for vision/audio tasks (these are rare in tool-calling) |
| Hallucination risk on edge cases | Validate tool call outputs; don't blindly trust parameter extraction |
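Validating extracted parameters before executing a tool is cheap insurance against that last risk. A sketch using Pydantic v2 (already in this stack); the schemas shown are hypothetical examples, not from this repo:

```python
from pydantic import BaseModel, ValidationError, field_validator

class AddNumbersArgs(BaseModel):
    """Expected arguments for the add_numbers tool call."""
    a: float
    b: float

class GetWeatherArgs(BaseModel):
    """Expected arguments for the get_weather tool call."""
    city: str

    @field_validator("city")
    @classmethod
    def non_empty(cls, v: str) -> str:
        if not v.strip():
            raise ValueError("city must be non-empty")
        return v

def safe_parse(schema, raw_args: dict):
    """Return validated args, or None if the model hallucinated parameters."""
    try:
        return schema(**raw_args)
    except ValidationError:
        return None

ok = safe_parse(GetWeatherArgs, {"city": "Tokyo"})  # validated model instance
bad = safe_parse(GetWeatherArgs, {"city": "  "})    # rejected -> None
```

A `None` result is a natural escalation trigger: re-prompt the small model, or hand the request to the next tier.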
1. Models are getting smaller, not bigger
The trend in 2025-2026 is clear: Qwen3 0.6B matches Qwen2.5-3B. Qwen3 4B rivals Qwen2.5-72B. Every generation, the same capability fits in a smaller package. Building infrastructure for small models today means you benefit from every future improvement automatically.
2. Edge AI is the next frontier
As AI moves from cloud to device (phones, cars, IoT, wearables), small models become the default. Teams that understand small model deployment today will lead when edge AI becomes mainstream.
3. Regulation favors local deployment
GDPR, HIPAA, and emerging AI regulations increasingly require data locality. Small LLMs running on-premise solve compliance by design - no data ever leaves your infrastructure.
4. Cost pressure only increases
As AI adoption grows, API costs scale linearly with usage. Infrastructure costs for local models scale sub-linearly. The gap widens with every user you add.
5. Specialization beats generalization
A fine-tuned 0.6B model on your specific tool set will outperform a general-purpose 70B model for that narrow task. Domain specialization is the real competitive advantage, not model size.
6. Hybrid architectures will dominate
The future isn't "local OR cloud" - it's local for the fast/cheap/private tier, cloud for the complex/rare/expensive tier. Building this architecture now gives you a structural advantage.
- Cost lock-in: 100% dependency on API pricing decisions by OpenAI/Anthropic/Google
- Latency ceiling: Network-bound response times for every LLM call
- Privacy liability: All user data processed by third-party APIs
- Single point of failure: API outage = your product is down
- No differentiation: Same model as every competitor; no domain specialization
TinyLLM-usecases/
|
+-- README.md # This file
|
+-- functionGemma/ # Google FunctionGemma 270M
| +-- README.md # Model details, specs, benchmarks
| +-- tools.py # Tool definitions (shared)
| +-- main.py # Standalone demo
| +-- server.py # FastAPI server (port 8000)
| +-- client.py # API client + response storage
| +-- requirements.txt # Pinned dependencies
| +-- .venv/ # Python virtual environment
|
+-- qwen3-nano/ # Alibaba Qwen3 0.6B
| +-- README.md # Model details, specs, benchmarks
| +-- tools.py # Tool definitions (shared)
| +-- main.py # Standalone demo with telemetry
| +-- server.py # FastAPI server (port 8002)
| +-- client.py # API client + telemetry + storage
| +-- requirements.txt # Pinned dependencies
| +-- responses.json # Stored responses with telemetry
| +-- .venv/ # Python virtual environment
|
+-- qwen3/ # Alibaba Qwen3 4B
+-- README.md # Model details, specs, benchmarks
+-- tools.py # Tool definitions (shared)
+-- main.py # Standalone demo with telemetry
+-- server.py # FastAPI server (port 8001)
+-- client.py # API client + telemetry + storage
+-- requirements.txt # Pinned dependencies
+-- responses.json # Stored responses with telemetry
+-- .venv/ # Python virtual environment
# Prerequisites
# 1. Install Ollama: https://ollama.com
# 2. Pull models:
ollama pull functiongemma:270m
ollama pull qwen3:0.6b
ollama pull qwen3:4b
# 3. Run any project:
cd functionGemma # or qwen3-nano, or qwen3
pip install -r requirements.txt
# Standalone test
python main.py
# Or start API server, then run client
python server.py # Terminal 1
python client.py # Terminal 2

| Component | Spec |
|---|---|
| GPU | NVIDIA GeForce RTX 3050 Ti Laptop (4 GB VRAM) |
| CUDA | v13.1 |
| OS | Windows 11 |
| Python | 3.12.7 |
| Ollama | 0.15.6 |
| Layer | Technology |
|---|---|
| LLM Runtime | Ollama |
| LLM Integration | LangChain (langchain-ollama) |
| API Framework | FastAPI + Uvicorn |
| HTTP Client | httpx |
| Data Models | Pydantic v2 |
| Python | 3.12 |
Each model has its own license:
- FunctionGemma: Gemma Terms of Use (commercial allowed)
- Qwen3 0.6B: Apache 2.0 (fully open)
- Qwen3 4B: Apache 2.0 (fully open)
Application code in this repository is free to use.