arjun7579/mindoc

🧠 Mindoc: Private Offline RAG Assistant

Mindoc is a privacy-focused, fully offline AI assistant that allows you to search and chat with your PDFs and PPTs locally. No cloud, no API keys: everything runs 100% on your device using efficient Small Language Models (SLMs).


🚀 Key Features

🔒 100% Offline

  • No OpenAI
  • No cloud dependencies
  • No data leaves your device
  • All models stored locally

📄 Multi-Document Ingestion

  • Upload multiple PDFs and PPTs
  • Fully private, local processing
  • Fast and accurate extraction

🧠 Local LLM

Powered by LaMini-Flan-T5 (248M), optimized for CPU inference.

🔎 Hybrid Search Engine

Semantic Vector Search + Cross-Encoder Reranking

  • Vector Model: all-MiniLM-L6-v2
  • Reranker: ms-marco-MiniLM-L-12-v2

⚡ Dual Search Modes

  • Quick Mode: Fast, short answers (Top-2 docs)
  • Deep Research: Multi-doc reasoning using Map-Reduce

🔗 Smart Citations

  • Evidence-based answers
  • Click on a citation → open PDF → auto-scroll to exact page

🛠️ Technical Architecture

1. Ingestion Pipeline

  • Loader: PyMuPDFLoader
  • Chunking: RecursiveCharacterTextSplitter
    • Chunk size: 1000 chars
    • Overlap: 200 chars
  • Embeddings: SentenceTransformer (384-dim)
  • Storage: ChromaDB (Local persistence)
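
To make the chunk size and overlap concrete, here is a minimal sketch of fixed-size chunking with overlap. It is a simplified stand-in, not RecursiveCharacterTextSplitter itself (which also tries to break at separators such as paragraph and sentence boundaries). The function name `chunk_text` is hypothetical.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into windows of chunk_size chars that share `overlap` chars."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # with the defaults, a new chunk starts every 800 chars
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(2500))
chunks = chunk_text(sample)
print([len(c) for c in chunks])          # [1000, 1000, 900]
print(chunks[0][-200:] == chunks[1][:200])  # True: neighbours share 200 chars
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.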

2. Retrieval & Generation

  1. Retrieve top-10 chunks with vector search
  2. Re-rank with cross-encoder, keep best 3
  3. Feed context → LaMini LLM → generate answer
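
The first two steps can be sketched as a two-stage ranking function. This is an illustration under assumptions, not Mindoc's actual code: the cross-encoder is represented by a plain callable, 2-dim toy vectors stand in for the 384-dim embeddings, and the names `retrieve_and_rerank` and `word_overlap` are hypothetical.

```python
import math
from typing import Callable, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_and_rerank(
    query: str,
    query_vec: Sequence[float],
    chunks: Sequence[tuple[str, Sequence[float]]],
    rerank_score: Callable[[str, str], float],
    k_retrieve: int = 10,
    k_final: int = 3,
) -> list[str]:
    """Stage 1: vector search for k_retrieve candidates. Stage 2: rerank, keep k_final."""
    by_similarity = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    candidates = [text for text, _ in by_similarity[:k_retrieve]]
    # A cross-encoder scores each (query, chunk) pair jointly; here it is any callable.
    reranked = sorted(candidates, key=lambda t: rerank_score(query, t), reverse=True)
    return reranked[:k_final]


def word_overlap(query: str, text: str) -> float:
    """Toy reranker (hypothetical): number of words shared by query and chunk."""
    return len(set(query.split()) & set(text.split()))


docs = [
    ("cats purr", (1.0, 0.0)),
    ("cats and dogs", (0.7, 0.7)),
    ("stock prices", (0.0, 1.0)),
]
top = retrieve_and_rerank("dogs and cats", (0.9, 0.1), docs, word_overlap,
                          k_retrieve=2, k_final=1)
print(top)  # ['cats and dogs']
```

In the toy run, vector search alone ranks "cats purr" first, and the reranker promotes "cats and dogs" because it compares the query and chunk text directly. That is the usual motivation for a second stage: the fast stage narrows the pool, the slower stage orders it.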

🏗️ System Architecture

       ┌────────────┐
       │   Files    │
       │ PDF / PPTX │
       └──────┬─────┘
              │
              ▼
      ┌────────────────┐
      │ Document Loader│
      └───────┬────────┘
              │Chunks
              ▼
  ┌────────────────────────┐
  │Embeddings (Local Model)│
  └───────────┬────────────┘
              │Vectors
              ▼
    ┌──────────────────┐
    │ Vector Store     │
    │ FAISS / Chroma   │
    └─────────┬────────┘
              │
              ▼
    ┌───────────────────┐
    │ RAG Pipeline      │
    │ (Retrieve + LLM)  │
    └─────────┬─────────┘
              │
              ▼
       ┌─────────────┐
       │  FastAPI    │
       │  /query     │
       └─────────────┘

📦 Installation Guide

Prerequisites

  • Python 3.10+ (3.12 recommended)
  • Node.js & npm

🔧 Backend Setup (FastAPI)

cd backend

# Virtual environment
python -m venv venv
source venv/bin/activate    # Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Download models (run once)
python download_model.py
python download_reranker.py

# Start backend server
uvicorn app.main:app --reload
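
With the server running, the /query route from the architecture diagram can be exercised from Python. The sketch below only builds the request. The HTTP method, the JSON field names (`question`, `mode`), and the helper name are assumptions for illustration, since the request schema is not documented here. Port 8000 is uvicorn's default.

```python
import json
import urllib.request


def build_query_request(question: str, mode: str = "quick",
                        base_url: str = "http://127.0.0.1:8000") -> urllib.request.Request:
    """Build (but do not send) a JSON POST to the backend's /query route."""
    # "question" and "mode" are hypothetical field names, not a confirmed schema.
    body = json.dumps({"question": question, "mode": mode}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_query_request("What does chapter 2 conclude?")
print(req.get_method(), req.full_url)
# To send it against a running backend:
#     with urllib.request.urlopen(req) as resp:
#         print(json.load(resp))
```

FastAPI also serves interactive API docs at /docs by default, which is the quickest way to check the real request shape.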

🎨 Frontend Setup (React + Vite)

cd frontend

npm install
npm run dev

🖥️ Usage

Upload Documents

  • Drag & drop PDFs
  • Supports batch uploads
  • Wait for “✅ Indexed” confirmation

Chat with Your Documents

⚡ Quick Mode

  • Fast
  • Lightweight
  • Best for direct questions

🧠 Deep Research

  • Reads many chunks
  • Map-Reduce summarization
  • Great for reports & summaries
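
As a rough illustration of the Map-Reduce idea, the sketch below answers the question from each chunk separately (map) and then merges the partial answers in one final call (reduce). The prompt wording and the names `map_reduce_answer` and `fake_generate` are hypothetical. A stub replaces the LLM so the example runs without loading a model.

```python
from typing import Callable


def map_reduce_answer(question: str, chunks: list[str],
                      generate: Callable[[str], str]) -> str:
    """Map: answer from each chunk alone. Reduce: merge the partial answers."""
    partials = [
        generate(f"Context: {chunk}\nQuestion: {question}\nAnswer:")
        for chunk in chunks
    ]
    combined = "\n".join(f"- {p}" for p in partials)
    return generate(f"Notes:\n{combined}\nQuestion: {question}\nFinal answer:")


# Stub model (hypothetical): records each prompt and returns a numbered placeholder.
calls: list[str] = []


def fake_generate(prompt: str) -> str:
    calls.append(prompt)
    return f"reply {len(calls)}"


result = map_reduce_answer("What changed?", ["chunk A", "chunk B", "chunk C"], fake_generate)
print(len(calls), result)  # 4 reply 4
```

Each map prompt contains a single chunk, so no individual call has to fit all retrieved text into a small model's context window. The trade-off is one extra model call per chunk.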

Verify Sources

  • Each answer includes clickable citations
  • Opens full PDF and auto-scrolls to correct page
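
One common way to open a PDF at a given page is the `#page=N` URL fragment, which most browser PDF viewers honor. The sketch below builds such a link from a filename and page number. It assumes the stored page index is 0-based and that PDFs are served under a `/files` route; both are assumptions, and the helper name is hypothetical. Mindoc's own mechanism may differ.

```python
from urllib.parse import quote


def citation_link(filename: str, page_index: int, base_url: str = "/files") -> str:
    """Return a URL that opens the PDF at the cited page."""
    # "#page=N" counts pages from 1, so a 0-based index is shifted by one.
    return f"{base_url}/{quote(filename)}#page={page_index + 1}"


link = citation_link("annual report.pdf", page_index=4)
print(link)  # /files/annual%20report.pdf#page=5
```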

📂 Project Structure

mindoc/
├── backend/
│   ├── app/
│   │   ├── api/
│   │   ├── rag/
│   │   ├── services/
│   │   └── main.py
│   ├── data/
│   │   ├── chroma/
│   │   ├── models/
│   │   └── uploads/
│   ├── download_model.py
│   ├── download_reranker.py
│   └── requirements.txt
│
└── frontend/
    ├── src/
    │   ├── App.jsx
    │   ├── App.css
    │   └── main.jsx
    └── package.json

❗ Troubleshooting

sqlite3 errors (Python 3.12)

Install the SQLite shim:

pip install pysqlite3-binary
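
Installing the package does not by itself make other libraries use it. A widely used workaround for ChromaDB is to alias `pysqlite3` as `sqlite3` before `chromadb` is imported. The sketch below shows that swap with a fallback when the shim is absent. Whether Mindoc's backend already does this is not shown here, and the helper name is hypothetical.

```python
import importlib
import sys


def use_pysqlite3_if_available() -> bool:
    """Alias pysqlite3 as sqlite3; return False if pysqlite3 is not installed."""
    try:
        shim = importlib.import_module("pysqlite3")
    except ImportError:
        return False
    sys.modules["sqlite3"] = shim
    return True


# Call this before anything imports chromadb.
swapped = use_pysqlite3_if_available()
import sqlite3

print(swapped, sqlite3.sqlite_version)
```

The printed version shows which SQLite build is actually in use after the swap.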

Context Window Errors

Inputs longer than the model's context window crash generation. This is fixed by enabling truncation:

truncation=True
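
For context, `truncation=True` is an argument accepted by Hugging Face tokenizers, usually paired with `max_length`. The sketch below shows where it would go, assuming a 512-token limit (typical for T5-family models; check the model's own configuration). The helper and the stand-in tokenizer are hypothetical, so the example runs without downloading a model.

```python
def encode_prompt(tokenizer, prompt: str, max_length: int = 512):
    """Tokenize a prompt, cutting it off at max_length tokens instead of overflowing."""
    return tokenizer(
        prompt,
        truncation=True,        # drop tokens beyond max_length
        max_length=max_length,  # assumed limit; check the model's config
        return_tensors="pt",
    )


# Stand-in tokenizer (hypothetical) that echoes its keyword arguments.
# With transformers installed, pass a real tokenizer loaded via
# AutoTokenizer.from_pretrained(...) instead.
def echo_tokenizer(text, **kwargs):
    return kwargs


settings = encode_prompt(echo_tokenizer, "a very long chunk " * 500)
print(settings)
```

Truncation silently discards the tail of the prompt, so if the question is placed after the context it can be cut off. Putting the question first, or trimming the context before building the prompt, avoids that.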

500 Search Errors

Usually caused by a missing reranker model. Re-run the download script:

python download_reranker.py

🔮 Future Roadmap

  • OCR for scanned documents
  • Model switching (LaMini ↔ Phi-2 ↔ Qwen 0.5B)
  • Persistent conversation history
  • Voice mode (offline ASR)
