A comprehensive AI-powered tool for collecting research papers from arXiv, processing their full text, and enabling intelligent question-answering through semantic search and large language models.
## Table of Contents

- Features
- Architecture
- Installation
- Usage
- How It Works
- Dependencies
- License
- Contributing
- Acknowledgments
## Features

- Automated Paper Collection: Fetch research papers from arXiv by topic, with a configurable limit on the number of results
- Full-Text Extraction: Extract complete text content from PDF papers using PyMuPDF
- Intelligent Text Chunking: Use semantic chunkers to split text into meaningful segments
- Vector Database Storage: Store text chunks with embeddings in Qdrant for efficient retrieval
- Semantic Search: Perform similarity-based search on stored chunks
- AI-Powered QA: Answer questions using Google's Gemini LLM with retrieved context
- Web Interface: User-friendly Streamlit application with separate pages for paper collection and chat
- Persistent Storage: Local Qdrant database for data persistence across sessions
## Architecture

The system follows a Retrieval-Augmented Generation (RAG) pattern:

- Data Ingestion Pipeline:
  - arXiv API integration for paper discovery
  - PDF download and text extraction
  - Semantic text chunking
- Vector Storage Layer:
  - Sentence Transformers for embedding generation
  - Qdrant vector database for storage and search
- QA Engine:
  - Semantic search for relevant context retrieval
  - Gemini LLM for answer generation
- User Interface:
  - Streamlit web application
  - Multi-page interface (Collection + Chat)
## Installation

### Prerequisites

- Python 3.8 or higher
- A Google Gemini API key (free tier available)

### Setup

1. Clone the repository:

   ```bash
   git clone https://github.com/suriyasureshok/Research_Paper_QA_Copilot.git
   cd Research_Paper_QA_Copilot
   ```

2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Set up environment variables by creating a `.env` file in the root directory:

   ```
   GEMINI_API_KEY=your_gemini_api_key_here
   ```

   Get your API key from Google AI Studio.
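At runtime the app can pick up this key with python-dotenv (listed under Dependencies); a minimal sketch:

```python
import os

from dotenv import load_dotenv

load_dotenv()  # reads GEMINI_API_KEY from the .env file created above
api_key = os.getenv("GEMINI_API_KEY")
if not api_key:
    raise RuntimeError("GEMINI_API_KEY is not set - see Installation")
```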
## Usage

1. Start the Streamlit app:

   ```bash
   streamlit run src/app.py
   ```

2. Access the web interface: open your browser and navigate to http://localhost:8501
### Collecting Papers

1. Enter a research topic (e.g., "machine learning", "quantum computing")
2. Specify the maximum number of papers to collect
3. Click "Collect Papers" to start the ingestion process

The system will then:
- Search arXiv for relevant papers
- Download the PDF files
- Extract the full text content
- Chunk the text semantically
- Generate embeddings and store them in Qdrant
### Asking Questions

Enter your question about the collected research papers. The system will:
- Search the vector database for relevant text chunks
- Provide the retrieved context to the Gemini LLM
- Generate an answer based on the research content
You can also test individual components from a Python shell:

```python
# Test paper collection
from src.data_ingestion.paper_collector import ArxivPaperCollection

collector = ArxivPaperCollection()
papers = collector.fetch_papers("machine learning", max_results=5)

# Test text chunking
from src.data_ingestion.text_chunker import TextChunker

chunker = TextChunker()
chunks = chunker.chunk_text("Your text here...")

# Test vector storage
from src.data_ingestion.store_data import DataStore

store = DataStore()
store.add_chunks(chunks)

# Test QA
from src.chat.gemini_client import GeminiQAHandler

handler = GeminiQAHandler("your_api_key", store)
answer = handler.answer_question("What is machine learning?")
```

## How It Works

### Paper Collection

The `ArxivPaperCollection` class handles paper discovery and processing:
- Uses arXiv's Atom feed API to search for papers by topic
- Downloads PDF files from arXiv
- Extracts text using PyMuPDF (fitz)
- Filters out non-text elements and formatting artifacts
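A condensed sketch of that pipeline using the public arXiv Atom API, feedparser, and PyMuPDF; the query string and result handling here are illustrative, and the class internals may differ:

```python
import urllib.request

import feedparser
import fitz  # PyMuPDF

# Query arXiv's public Atom feed API for papers on a topic.
feed = feedparser.parse(
    "http://export.arxiv.org/api/query"
    "?search_query=all:machine+learning&max_results=5"
)

for entry in feed.entries:
    # Each entry's id links to the abstract page; the PDF lives under /pdf/.
    pdf_url = entry.id.replace("/abs/", "/pdf/")
    path, _ = urllib.request.urlretrieve(pdf_url)

    # Extract plain text page by page with PyMuPDF.
    with fitz.open(path) as doc:
        text = "".join(page.get_text() for page in doc)
    print(entry.title, len(text))
```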
### Text Chunking

The `TextChunker` class uses semantic chunking:
- Employs the `semantic_chunkers` library with its `StatisticalChunker`
- Splits text based on semantic meaning rather than fixed sizes
- Preserves context within each chunk for better retrieval
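To illustrate the idea behind semantic chunking, here is a from-scratch stand-in built on sentence-transformers (not the `semantic_chunkers` API): embed consecutive sentences and start a new chunk wherever adjacent similarity drops below a threshold. The threshold value is an arbitrary assumption:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def naive_semantic_chunks(sentences, threshold=0.5):
    """Group consecutive sentences, splitting wherever the similarity
    between neighbors drops below `threshold` (illustrative only)."""
    if not sentences:
        return []
    model = SentenceTransformer("all-MiniLM-L6-v2")
    emb = model.encode(sentences, normalize_embeddings=True)

    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        # Cosine similarity of normalized vectors is just a dot product.
        if float(np.dot(emb[i - 1], emb[i])) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks
```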
### Vector Storage

The `DataStore` class manages vector operations:
- Uses Sentence Transformers (`all-MiniLM-L6-v2`) to generate 384-dimensional embeddings
- Stores chunks with metadata in a local Qdrant database
- Performs cosine similarity search to retrieve relevant chunks
- Supports batch operations for efficiency
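A minimal sketch of that store-and-search loop with qdrant-client and sentence-transformers; the collection name and payload shape are illustrative assumptions, not the repo's actual schema:

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dim embeddings
client = QdrantClient(path="./qdrant_data")      # local, persistent storage

# Create a collection configured for cosine similarity.
client.recreate_collection(
    collection_name="papers",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)

# Embed and upsert chunks in one batch.
chunks = ["Transformers use self-attention.", "Qubits can be entangled."]
client.upsert(
    collection_name="papers",
    points=[
        PointStruct(id=i, vector=vec.tolist(), payload={"text": chunk})
        for i, (chunk, vec) in enumerate(zip(chunks, model.encode(chunks)))
    ],
)

# Retrieve the chunks most similar to a query.
hits = client.search(
    collection_name="papers",
    query_vector=model.encode("What is self-attention?").tolist(),
    limit=3,
)
print([hit.payload["text"] for hit in hits])
```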
### QA Engine

The `GeminiQAHandler` class orchestrates question answering:
- Searches the vector database for relevant context
- Constructs prompts from the retrieved chunks
- Uses Gemini 2.0 Flash (free tier) for answer generation
- Returns concise, context-aware responses
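The retrieval-augmented prompting pattern, sketched with the google-generativeai SDK; the prompt wording and the `store.search` helper are assumptions, not the repo's exact code:

```python
import google.generativeai as genai

genai.configure(api_key="your_gemini_api_key_here")
model = genai.GenerativeModel("gemini-2.0-flash")

def answer_question(question, store, top_k=3):
    # Retrieve the most relevant chunks from the vector store
    # (store.search is assumed to return a list of chunk texts).
    context = "\n\n".join(store.search(question, limit=top_k))
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return model.generate_content(prompt).text
```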
### Web Interface

The Streamlit app provides two main pages:
- Paper Collection: a form-based interface for data ingestion
- Chat Interface: conversational QA over the research database
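One way to get a two-page Streamlit app is the built-in `pages/` convention, sketched below; the file layout is a hypothetical assumption about how the repo wires its pages:

```python
# src/app.py (hypothetical layout) -- Streamlit renders every script in
# a pages/ directory next to the entry point as a separate sidebar page:
#
#   src/app.py             -> landing page
#   src/pages/1_Collect.py -> paper collection form
#   src/pages/2_Chat.py    -> chat interface
import streamlit as st

st.set_page_config(page_title="Research Paper QA Copilot")
st.title("Research Paper QA Copilot")
st.write("Use the sidebar to collect papers or chat with the database.")
```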
## Dependencies

Key dependencies include:

- `streamlit`: web application framework
- `qdrant-client`: vector database client
- `sentence-transformers`: embedding generation
- `google-generativeai`: Gemini LLM integration
- `semantic-chunkers`: intelligent text chunking
- `feedparser`: arXiv API integration
- `PyMuPDF`: PDF text extraction
- `python-dotenv`: environment variable management

See `requirements.txt` for the complete list.
## License

This project is licensed under the MIT License - see the LICENSE file for details.
## Contributing

Contributions are welcome! Please feel free to submit a Pull Request:

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request
## Acknowledgments

- arXiv for providing open access to research papers
- Google for the Gemini AI models
- Qdrant for the vector database
- All contributors to the open-source libraries used
Built with ❤️ for researchers and AI enthusiasts.