Problem Statement
There is currently no standardized benchmarking framework to evaluate the effectiveness and performance of OpenViking's RAG (Retrieval-Augmented Generation) capabilities. Users and developers lack a way to:
- Measure retrieval quality across different datasets and domains
- Evaluate end-to-end RAG pipeline performance (ingestion → retrieval → generation)
- Compare OpenViking's RAG performance with other systems in a reproducible manner
- Track improvements to RAG functionality over time
Proposed Solution
Add a comprehensive RAG benchmark framework under benchmark/RAG/ that provides:
1. Diverse and Extensible Dataset Support
- Diverse Scenarios: Support for varied use cases, including multi-turn conversations, educational syllabi, academic papers, and financial reports
- Easy Expansion: Adapter-based architecture allows seamless addition of new datasets
- Flexible QA Selection: Users can choose between the full QA set and a sampled subset
Currently supported datasets
- Locomo: Real user chat conversations with 4 question categories: factual, temporal, reasoning, and understanding.
- SyllabusQA: University course syllabi with 6 question types, ranging from factual extraction to summarization.
- Qasper: NLP research papers with questions whose answers are extractive, free-form, or yes/no.
- FinanceBench: SEC financial reports (10-K, 8-K, earnings calls) with domain-specific financial questions.
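To give a concrete picture of how these heterogeneous datasets could be handled uniformly, here is a minimal sketch of a normalized QA record that an adapter might emit; all field names are illustrative assumptions, not a final schema:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class QARecord:
    """One normalized question-answer item emitted by a dataset adapter (illustrative)."""
    question_id: str                  # unique id within the dataset
    question: str                     # question text
    reference_answer: str             # gold answer used for F1 / LLM-as-judge scoring
    source_doc_ids: list[str]         # documents that should be retrieved to answer
    category: Optional[str] = None    # e.g. "temporal" (Locomo) or "yes/no" (Qasper)
    metadata: dict = field(default_factory=dict)  # dataset-specific extras
```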
2. Modular Architecture
- Adapter Pattern: Dataset-specific adapters for easy extension (a base-class sketch follows this list)
- Configuration-Driven: YAML-based configuration per dataset
- Pipeline Design: Clear separation of ingestion, generation, and evaluation stages
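As a rough illustration of the adapter pattern, the base class in `benchmark/RAG/src/adapters/base.py` could look like the sketch below; the method names and signatures are assumptions for illustration only:

```python
from abc import ABC, abstractmethod
from typing import Iterable

class DatasetAdapter(ABC):
    """Base interface so the pipeline treats Locomo, SyllabusQA, Qasper,
    FinanceBench, and future datasets uniformly."""

    def __init__(self, config: dict):
        self.config = config  # parsed from the dataset's YAML config file

    @abstractmethod
    def load_documents(self) -> Iterable[dict]:
        """Yield raw documents to ingest (chat logs, syllabi, papers, filings)."""

    @abstractmethod
    def load_qa(self, sampled: bool = False) -> Iterable[dict]:
        """Yield QA items; sampled=True selects the reproducible subset."""
```

Under this design, contributing a new dataset would mainly mean adding a `*_adapter.py` subclass plus a matching YAML file under `config/`.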
3. Comprehensive Evaluation
- Retrieval Metrics: Recall@k (a metric sketch follows this list)
- Answer Quality: F1 score
- LLM-as-Judge: 0-4 scale accuracy rating
- Performance Metrics: Latency, token usage, ingestion time
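For concreteness, Recall@k and the token-level F1 could be computed along these lines; the helper names are illustrative, and the real implementations would live under `src/core/`:

```python
import re
from collections import Counter

def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the gold documents that appear in the top-k retrieved results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(retrieved_ids[:k]) & relevant_ids)
    return hits / len(relevant_ids)

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a generated answer and the reference answer."""
    pred_tokens = re.findall(r"\w+", prediction.lower())
    ref_tokens = re.findall(r"\w+", reference.lower())
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```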
4. Data Pipeline Improvements
- Download Script: Automated dataset download from public sources
- Sampling Script: Seed-based random sampling for reproducible subsets (see the sketch after this list)
- Preprocessing Pipeline: Standardized data preparation workflow
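A minimal sketch of the seed-based sampling idea; the script name, file format, and arguments are assumptions for illustration:

```python
# benchmark/RAG/scripts/sample_qa.py (hypothetical) — reproducible QA subsetting
import argparse
import json
import random

def sample_qa(records: list[dict], n: int, seed: int) -> list[dict]:
    """Draw a reproducible subset: the same seed always yields the same items."""
    rng = random.Random(seed)
    return rng.sample(records, min(n, len(records)))

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Sample a reproducible QA subset.")
    parser.add_argument("--input", required=True, help="full QA set as JSONL")
    parser.add_argument("--output", required=True, help="where to write the subset")
    parser.add_argument("--n", type=int, default=100, help="subset size")
    parser.add_argument("--seed", type=int, default=42, help="sampling seed")
    args = parser.parse_args()

    with open(args.input) as f:
        records = [json.loads(line) for line in f]
    subset = sample_qa(records, args.n, args.seed)
    with open(args.output, "w") as f:
        for rec in subset:
            f.write(json.dumps(rec) + "\n")
```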
Alternatives Considered
No response
Feature Area
Other
Use Case
When users and developers want to:
- Evaluate OpenViking's RAG performance on their specific use cases
- Compare different configurations (embedding models, retrieval strategies, etc.)
- Verify that RAG improvements don't regress performance
- Contribute new datasets or evaluation methods to the OpenViking ecosystem
- Reproduce benchmark results from the community
Implementation Plan
Files to create/modify
- New directory: `benchmark/RAG/`
- Dataset preparation scripts: `benchmark/RAG/scripts/*.py`
- Core pipeline: `benchmark/RAG/src/pipeline.py`
- Adapter base: `benchmark/RAG/src/adapters/base.py`
- Dataset adapters: `benchmark/RAG/src/adapters/*_adapter.py`
- Core utilities: `benchmark/RAG/src/core/*.py`
- Configuration files: `benchmark/RAG/config/*.yaml` (a sample config is sketched after this list)
- Main script: `benchmark/RAG/run.py`
- Documentation: `benchmark/RAG/README.md`
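To illustrate the configuration-driven design, a per-dataset YAML config might look like the following; every key here is hypothetical and only meant to show the shape of such a file:

```yaml
# benchmark/RAG/config/locomo.yaml — illustrative sketch, not the final schema
dataset:
  name: locomo
  adapter: locomo_adapter
  qa_mode: sampled          # "full" or "sampled"
  sample_size: 100
  seed: 42
retrieval:
  top_k: 5                  # evaluated as Recall@k
evaluation:
  metrics: [recall_at_k, f1, llm_judge]
  llm_judge_scale: 4        # 0-4 accuracy rating
```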
Phases
- Phase 1: Core framework with extensible dataset architecture (completed)
- Phase 2: Add `scripts/` directory with download, sampling, and dataset preparation scripts (pending)
  - Download script: Automated dataset download from public sources
  - Sampling script: Seed-based random sampling with full/sampled QA options
  - Unified entry script: Orchestrate the end-to-end data preparation workflow
- Phase 3: Feature Expansion and Architecture Optimization (future)
  - Add more datasets and evaluation metrics
  - Reserve architecture design for agentic RAG integration
  - Enhance evaluation capabilities and flexibility
- Phase 4: Long-term Planning (future)
  - Explore agentic RAG integration with VikingBot
Example API (Optional)
Additional Context
No response
Contribution