Ashfaqbs/TinyLLM-usecases

# TinyLLM Use Cases - Small LLMs for Production AI

## Why Small LLMs Matter

The AI industry defaults to "bigger is better" - GPT-4, Claude Opus, Llama 70B. But for most production workloads, 80% of LLM calls don't need a 100B+ parameter model. They need a function routed, a tool selected, a query classified, or a simple response generated.

Small LLMs (under 4B parameters) solve this by running locally, for free, in milliseconds.

## The Math That Changes Everything

Consider an agent that makes 10 tool-selection calls per user request:

| Approach | Cost per call | 10 calls | 1M requests/month |
|---|---|---|---|
| Claude Opus (cloud) | ~$0.015 | $0.15 | $150,000 |
| GPT-4o (cloud) | ~$0.005 | $0.05 | $50,000 |
| GPT-4o-mini (cloud) | ~$0.0003 | $0.003 | $3,000 |
| Qwen3 4B (local GPU) | $0 (infra only) | $0 | ~$200 (GPU server) |
| Qwen3 0.6B (local CPU) | $0 (infra only) | $0 | ~$30 (CPU server) |
| FunctionGemma (local CPU) | $0 (infra only) | $0 | ~$30 (CPU server) |

A $30/month CPU server replaces $50,000+/month in API costs for tool-routing workloads.
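The arithmetic behind the table is simple enough to sketch in a few lines of Python (the per-call figures are the approximate numbers from the table above, not official rate cards):

```python
# Approximate monthly API spend for tool-routing calls, using the
# illustrative per-call prices from the table above.
CALLS_PER_REQUEST = 10
REQUESTS_PER_MONTH = 1_000_000

def monthly_api_cost(cost_per_call: float) -> float:
    """Total API spend for one month of tool-selection calls."""
    return cost_per_call * CALLS_PER_REQUEST * REQUESTS_PER_MONTH

print(f"Claude Opus:  ${monthly_api_cost(0.015):>10,.0f}")   # 150,000
print(f"GPT-4o:       ${monthly_api_cost(0.005):>10,.0f}")   # 50,000
print(f"GPT-4o-mini:  ${monthly_api_cost(0.0003):>10,.0f}")  # 3,000
# A local model has no per-call cost: only the ~$30-200/month server.
```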

## When Small LLMs Win

  • Tool/function calling - Mapping user intent to a function name + arguments
  • Intent classification - Routing queries to the right service
  • Query parsing - Extracting structured data from natural language
  • Simple Q&A with tools - Answering questions by calling APIs
  • Edge/IoT deployment - Running AI on devices with no internet
  • High-throughput pipelines - Processing thousands of requests per second
  • Privacy-sensitive workloads - Data never leaves your infrastructure
  • Development and testing - Fast iteration on AI pipelines without API costs

## When You Still Need Large LLMs

  • Complex multi-step reasoning across long documents
  • High-quality creative writing or content generation
  • Tasks requiring deep world knowledge
  • Nuanced understanding of ambiguous instructions
  • Multi-modal tasks (vision + text + audio)

## The Tiered Architecture

The real power is not choosing one model - it's layering them:

```
                User Request
                     |
                     v
        +------------------------+
        |  Tier 1: Router        |
        |  FunctionGemma 270M    |
        |  Cost: $0 | <500ms     |
        +------------------------+
              |            |
         Simple tool   Needs reasoning
              |            |
              v            v
     +--------------+   +------------------------+
     | Execute tool |   |  Tier 2: Thinker       |
     | Return       |   |  Qwen3 0.6B            |
     +--------------+   |  Cost: $0 | ~1.4s      |
                        +------------------------+
                              |            |
                          Handled     Too complex
                              |            |
                              v            v
                     +--------------+   +------------------------+
                     | Return       |   |  Tier 3: Reasoner      |
                     +--------------+   |  Qwen3 4B / Cloud      |
                                        |  Cost: $0-$$ | ~5s     |
                                        +------------------------+
```

This tiered approach means:

  • 60-70% of requests are handled by Tier 1 (free, sub-second)
  • 20-25% of requests are handled by Tier 2 (free, ~1.4 seconds)
  • 5-10% of requests hit Tier 3 (local GPU or paid API)
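The escalation logic itself is small. A minimal sketch of the pattern (the handler signatures and the "return `None` to escalate" convention are illustrative assumptions, not this repository's actual routing code):

```python
from dataclasses import dataclass

@dataclass
class TierResult:
    tier: str      # which tier produced the answer
    answer: str

def route(query: str, tier1, tier2, tier3) -> TierResult:
    """Escalate a query through the tiers. Each tier callable returns
    either an answer string or None, meaning 'too complex for me'."""
    for name, handler in (("router", tier1), ("thinker", tier2)):
        answer = handler(query)
        if answer is not None:
            return TierResult(name, answer)
    # Tier 3 is the fallback and must always answer.
    return TierResult("reasoner", tier3(query))

# Toy handlers: tier 1 only recognizes weather, tier 2 only arithmetic.
r = route("weather in Tokyo",
          tier1=lambda q: "get_weather(Tokyo)" if "weather" in q else None,
          tier2=lambda q: "add_numbers" if "add" in q else None,
          tier3=lambda q: "escalated to 4B/cloud")
print(r.tier, "->", r.answer)  # router -> get_weather(Tokyo)
```

In production the lambdas become calls to the local models, and the "escalate" signal comes from confidence heuristics (e.g. the small model emitting no valid tool call).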

## What We Tested

### Models

| Model | Type | Size | Parameters | Context Window |
|---|---|---|---|---|
| FunctionGemma 270M | Function calling specialist | 300 MB | 268M | 32K |
| Qwen3 0.6B | Smallest thinking agent | 522 MB | 752M | 40K |
| Qwen3 4B | Full agent brain | 2.5 GB | 4.0B | 256K |

### Tools Tested (Same Across All Models)

| Tool | Description | Tests Intent |
|---|---|---|
| `get_weather(city)` | Simulated weather lookup | Entity extraction from natural language |
| `add_numbers(a, b)` | Calculator | Numeric parameter extraction |
| `search_contacts(name)` | Contact directory lookup | Name extraction and matching |
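Tools of this shape are a few lines each. A minimal sketch with simulated data (illustrative, not the repository's exact `tools.py`):

```python
def get_weather(city: str) -> str:
    """Simulated weather lookup (no real API call)."""
    fake_data = {"Tokyo": "22C, clear", "London": "14C, rain"}
    return fake_data.get(city, "no data")

def add_numbers(a: float, b: float) -> float:
    """Calculator tool: exercises numeric parameter extraction."""
    return a + b

def search_contacts(name: str) -> str:
    """Simulated contact directory lookup."""
    directory = {"Alice": "alice@example.com", "Bob": "bob@example.com"}
    return directory.get(name, "not found")

# The model's only job is to pick the tool and fill its arguments;
# execution stays in plain Python.
print(add_numbers(25, 17))  # 42
```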

### Test Queries

| Query | Expected Tool | Tests |
|---|---|---|
| "What is the weather in Tokyo?" | `get_weather` | Simple single-tool routing |
| "Add 25 and 17" | `add_numbers` | Numeric intent parsing |
| "Find contact info for Alice" | `search_contacts` | Name entity extraction |
| "Weather in London and find Bob's contact" | `get_weather` + `search_contacts` | Multi-tool in single query |

### Results: All Models Passed

| Query | FunctionGemma | Qwen3 0.6B | Qwen3 4B |
|---|---|---|---|
| Weather query | Pass | Pass | Pass |
| Calculator query | Pass | Pass | Pass |
| Contact lookup | Pass | Pass | Pass |
| Multi-tool query | Pass | Pass | Pass |
| Accuracy | 4/4 | 4/4 | 4/4 |

## Performance Comparison (RTX 3050 Ti 4GB VRAM)

| Metric | FunctionGemma | Qwen3 0.6B | Qwen3 4B |
|---|---|---|---|
| Model size | 300 MB | 522 MB | 2.5 GB |
| Tokens/sec | N/A | ~107 | ~39 |
| Avg response (warm) | ~500 ms | ~1,400 ms | ~5,000 ms |
| Multi-tool response | ~600 ms | ~2,000 ms | ~10,000 ms |
| Eval tokens/query | 10-15 | 90-186 | 170-400 |
| Can think/reason | No | Yes | Yes |
| Can chat about results | No | Yes | Yes |
| CPU viable | Yes | Yes | Slow |
| GPU required | No | No | Recommended |

## Pros and Cons

### Small LLMs - Pros

| Advantage | Impact |
|---|---|
| Zero API cost | No per-token billing. Pay only for infrastructure. |
| Data privacy | Nothing leaves your network. No vendor data retention policies. |
| Low latency | Sub-second to few-second responses. No network roundtrip. |
| Predictable performance | No rate limits, no API outages, no quota exhaustion. |
| Offline capable | Works without internet. Edge, IoT, air-gapped environments. |
| Simple infrastructure | Single binary (Ollama) + model file. No complex deployments. |
| Fine-tunable | Customize for your exact domain and tool set. |
| No vendor lock-in | Open weights. Switch providers, frameworks, or hardware freely. |
| Fast iteration | Test agent pipelines locally without burning API credits. |
| Horizontal scaling | Add more cheap CPU servers instead of buying expensive GPU clusters. |

### Small LLMs - Cons

| Limitation | Mitigation |
|---|---|
| Lower accuracy on complex tasks | Use tiered architecture - route complex tasks to larger models |
| Limited reasoning depth | Enable thinking mode (Qwen3) or fine-tune for your domain |
| Smaller context windows | 32K-40K is enough for tool calling; use Qwen3 4B (256K) for long docs |
| Weaker text generation | Don't use for user-facing prose; use for structured tool calls |
| Fine-tuning needed for production | Budget 1-2 days for data prep + training on your tool set |
| Model management overhead | Use Ollama for simple model lifecycle management |
| No multi-modal support | Use cloud APIs for vision/audio tasks (these are rare in tool-calling) |
| Hallucination risk on edge cases | Validate tool call outputs; don't blindly trust parameter extraction |
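The last mitigation deserves a concrete shape: validate every model-proposed tool call against a schema before executing it. A minimal standard-library sketch (the `tool name -> {arg: type}` schema format is an illustrative assumption, not this repository's code):

```python
# Validate a model-proposed tool call before executing it.
# Schema format is illustrative: tool name -> {argument: allowed type(s)}.
TOOL_SCHEMAS = {
    "get_weather": {"city": str},
    "add_numbers": {"a": (int, float), "b": (int, float)},
    "search_contacts": {"name": str},
}

def validate_call(name: str, args: dict) -> bool:
    """Reject unknown tools, missing/extra arguments, and wrong types."""
    schema = TOOL_SCHEMAS.get(name)
    if schema is None or set(args) != set(schema):
        return False
    return all(isinstance(args[k], t) for k, t in schema.items())

assert validate_call("add_numbers", {"a": 25, "b": 17})
assert not validate_call("add_numbers", {"a": "25"})     # missing arg, wrong type
assert not validate_call("delete_files", {"path": "/"})  # hallucinated tool
```

Rejected calls can be retried with an error message in the prompt, or escalated to a larger tier.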

## Long-Term Perspective

### Why Investing in Small LLMs Now Pays Off Later

#### 1. Models are getting smaller, not bigger

The trend in 2025-2026 is clear: Qwen3 0.6B matches Qwen2.5-3B. Qwen3 4B rivals Qwen2.5-72B. Every generation, the same capability fits in a smaller package. Building infrastructure for small models today means you benefit from every future improvement automatically.

#### 2. Edge AI is the next frontier

As AI moves from cloud to device (phones, cars, IoT, wearables), small models become the default. Teams that understand small model deployment today will lead when edge AI becomes mainstream.

#### 3. Regulation favors local deployment

GDPR, HIPAA, and emerging AI regulations increasingly require data locality. Small LLMs running on-premise solve compliance by design - no data ever leaves your infrastructure.

#### 4. Cost pressure only increases

As AI adoption grows, API costs scale linearly with usage. Infrastructure costs for local models scale sub-linearly. The gap widens with every user you add.

#### 5. Specialization beats generalization

A fine-tuned 0.6B model on your specific tool set will outperform a general-purpose 70B model for that narrow task. Domain specialization is the real competitive advantage, not model size.

#### 6. Hybrid architectures will dominate

The future isn't "local OR cloud" - it's local for the fast/cheap/private tier, cloud for the complex/rare/expensive tier. Building this architecture now gives you a structural advantage.

### Risk of NOT Investing

  • Cost lock-in: 100% dependency on API pricing decisions by OpenAI/Anthropic/Google
  • Latency ceiling: Network-bound response times for every LLM call
  • Privacy liability: All user data processed by third-party APIs
  • Single point of failure: API outage = your product is down
  • No differentiation: Same model as every competitor; no domain specialization

## Project Structure

```
TinyLLM-usecases/
|
+-- README.md                    # This file
|
+-- functionGemma/               # Google FunctionGemma 270M
|   +-- README.md                # Model details, specs, benchmarks
|   +-- tools.py                 # Tool definitions (shared)
|   +-- main.py                  # Standalone demo
|   +-- server.py                # FastAPI server (port 8000)
|   +-- client.py                # API client + response storage
|   +-- requirements.txt         # Pinned dependencies
|   +-- .venv/                   # Python virtual environment
|
+-- qwen3-nano/                  # Alibaba Qwen3 0.6B
|   +-- README.md                # Model details, specs, benchmarks
|   +-- tools.py                 # Tool definitions (shared)
|   +-- main.py                  # Standalone demo with telemetry
|   +-- server.py                # FastAPI server (port 8002)
|   +-- client.py                # API client + telemetry + storage
|   +-- requirements.txt         # Pinned dependencies
|   +-- responses.json           # Stored responses with telemetry
|   +-- .venv/                   # Python virtual environment
|
+-- qwen3/                       # Alibaba Qwen3 4B
    +-- README.md                # Model details, specs, benchmarks
    +-- tools.py                 # Tool definitions (shared)
    +-- main.py                  # Standalone demo with telemetry
    +-- server.py                # FastAPI server (port 8001)
    +-- client.py                # API client + telemetry + storage
    +-- requirements.txt         # Pinned dependencies
    +-- responses.json           # Stored responses with telemetry
    +-- .venv/                   # Python virtual environment
```

## Quick Start

```shell
# Prerequisites
# 1. Install Ollama: https://ollama.com
# 2. Pull models:
ollama pull functiongemma:270m
ollama pull qwen3:0.6b
ollama pull qwen3:4b

# 3. Run any project:
cd functionGemma   # or qwen3-nano, or qwen3
pip install -r requirements.txt

# Standalone test
python main.py

# Or start the API server, then run the client
python server.py          # Terminal 1
python client.py          # Terminal 2
```

## Hardware Used for Testing

| Component | Spec |
|---|---|
| GPU | NVIDIA GeForce RTX 3050 Ti Laptop (4 GB VRAM) |
| CUDA | v13.1 |
| OS | Windows 11 |
| Python | 3.12.7 |
| Ollama | 0.15.6 |

## Tech Stack

| Layer | Technology |
|---|---|
| LLM Runtime | Ollama |
| LLM Integration | LangChain (langchain-ollama) |
| API Framework | FastAPI + Uvicorn |
| HTTP Client | httpx |
| Data Models | Pydantic v2 |
| Python | 3.12 |

## License

Each model has its own license:

  • FunctionGemma: Gemma Terms of Use (commercial allowed)
  • Qwen3 0.6B: Apache 2.0 (fully open)
  • Qwen3 4B: Apache 2.0 (fully open)

Application code in this repository is free to use.
