Containerized Local RAG API

A completely localized, offline Retrieval-Augmented Generation (RAG) pipeline. This microservice architecture separates a FastAPI backend from a Gradio frontend, completely containerized via Docker for consistent local deployment — no external APIs or cloud services required.

Architecture & Tech Stack

Framework: FastAPI, Python 3.10
AI Pipeline: LangChain, Hugging Face
Local LLM: HuggingFaceTB/SmolLM2-135M-Instruct (Local CPU execution)
Vector Database: DuckDB (persistent, file-backed)
Containerization: Docker & Docker Compose
Frontend: Gradio (tabbed UI)

Key Features

100% Local Inference: No OpenAI API keys or internet connection required after the initial model download.
Runtime PDF Uploads: Upload any PDF at runtime through the UI or API — no container restart needed. Multiple PDFs are supported simultaneously.
Auto-Ingestion on Boot: PDFs placed in the data/ folder are automatically embedded when the container starts.
Persistent Vector Store: Embeddings are saved to data/vectorstore.duckdb via a Docker volume. On restart, the vector store is restored from disk instantly — no re-embedding.
Ingestion Manifest: A data/manifest.json file tracks every document's state (processing → ready) across container restarts. Crash recovery is handled automatically.
Non-Blocking Ingestion: PDF embedding runs in a background thread so the /chat endpoint stays responsive during uploads.
Upload Safety: Duplicate detection, 50 MB file size limit, and PDF-only validation on the upload endpoint.
Live Status Tracking: Poll GET /api/v1/documents to monitor ingestion progress in real time.
Docker Volume Caching: HuggingFace model weights are cached to hf_cache/ to prevent re-downloading on container restarts.
CPU Optimized: Configured with low_cpu_mem_usage and .safetensors to run reliably on consumer laptops without a GPU.
Modular UI: The Gradio frontend is fully decoupled from the API backend.

Challenges & Engineering Workarounds

Building a local RAG pipeline without cloud GPUs presented several hardware and containerization limits that required specific workarounds:

Docker/WSL Storage Crashes: Downloading massive PyTorch and Hugging Face weights inside an ephemeral Docker container repeatedly corrupted the WSL .vhdx virtual drive. Fix: Migrated the Docker engine to a secondary drive and implemented local volume mapping (./hf_cache:/root/.cache/huggingface). This safely stores the model on the host OS and eliminates 300MB+ downloads on every container restart.
CPU Bottlenecks & FastAPI Freezes: Running a local LLM blocks the main execution thread, which originally caused the asynchronous FastAPI server to freeze. Fix: Changed the inference endpoint from async def to a synchronous def to utilize FastAPI's background thread pooling, and pre-warmed the AI model in memory during the app's startup sequence to cut the "cold start" delay.
Small Model Hallucinations: The 135M parameter model lacks the reasoning capacity of larger models and frequently hallucinated facts outside the PDF context. Fix: Bound the model to strict ChatML formatting (<|im_start|>) in the prompt template, creating an explicit "escape hatch" that forces the model to surrender with "I do not have info on that" instead of guessing.
Blocking PDF Ingestion: Embedding a new PDF is CPU-intensive and would have stalled all chat requests. Fix: Ingestion runs in a threading.Thread with a lock on the vector store writer, keeping the /chat endpoint available throughout.
In-Memory Vector Store Lost on Restart: The default LangChain DuckDB integration creates an in-memory database that is destroyed on every container restart, forcing a full re-embedding of all documents. Fix: Replaced the in-memory connection with a persistent duckdb.connect("data/vectorstore.duckdb") handle backed by a Docker volume mount. A companion manifest.json tracks ingestion state, allowing the engine to skip re-embedding on normal restarts and gracefully handle crash recovery.

Setup & Installation

Prerequisites

Docker Desktop (running)
Python 3.10+ (for the Gradio UI, runs on the host)
A Hugging Face account and access token

1. Clone & Configure

git clone https://github.com/shahzaibali-edu/Containerized-Local-RAG-API.git
cd Containerized-Local-RAG-API

Copy the environment template and fill in your token:

cp .env.example .env
# Open .env and set:  HF_TOKEN=your_token_here

2. (Optional) Pre-load PDFs

Drop any PDF files into the data/ folder. They will be automatically ingested when the container starts and will persist across future restarts.

3. Start the Backend

docker-compose up --build

First boot: Docker downloads model weights to hf_cache/ and embeds any PDFs in data/. Subsequent boots restore from the persistent vector store instantly — no re-embedding.

Watch for these messages in the logs:

✓ 'YourFile.pdf' ingested — XX chunks     ← first boot
✓ Restored vectorstore: N document(s) loaded from disk  ← subsequent boots

4. Launch the Frontend

In a separate terminal:

pip install gradio requests
python ui.py

Open http://127.0.0.1:7860 to access the chat interface.

API Endpoints

Method	Endpoint	Description
`POST`	`/api/v1/chat`	Ask a question. Body: `{"query": "your question"}`
`POST`	`/api/v1/upload`	Upload a PDF for ingestion. Multipart form: `file=<pdf>`
`GET`	`/api/v1/documents`	List all ingested documents and their current status
`GET`	`/docs`	Auto-generated Swagger UI for API testing

Performance Notes

This pipeline is configured for local CPU-only inference. Response times range from 20–50 seconds depending on hardware. To upgrade, swap the model_id in app/engine.py for a larger model (e.g., Qwen/Qwen2.5-1.5B-Instruct).

License

This project is licensed under the MIT License.

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
app		app
data		data
.dockerignore		.dockerignore
.env.example		.env.example
.gitignore		.gitignore
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md
docker-compose.yml		docker-compose.yml
requirements.txt		requirements.txt
test_api.py		test_api.py
ui.py		ui.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Containerized Local RAG API

Architecture & Tech Stack

Key Features

Challenges & Engineering Workarounds

Setup & Installation

Prerequisites

1. Clone & Configure

2. (Optional) Pre-load PDFs

3. Start the Backend

4. Launch the Frontend

API Endpoints

Performance Notes

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Containerized Local RAG API

Architecture & Tech Stack

Key Features

Challenges & Engineering Workarounds

Setup & Installation

Prerequisites

1. Clone & Configure

2. (Optional) Pre-load PDFs

3. Start the Backend

4. Launch the Frontend

API Endpoints

Performance Notes

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages