[Feature]: Add RAG (Retrieval-Augmented Generation) Benchmark System for OpenViking #885

@sponge225

Description

Problem Statement

There is currently no standardized benchmarking framework to evaluate the effectiveness and performance of OpenViking's RAG (Retrieval-Augmented Generation) capabilities. Users and developers lack a way to:

  • Measure retrieval quality across different datasets and domains
  • Evaluate end-to-end RAG pipeline performance (ingestion → retrieval → generation)
  • Compare OpenViking's RAG performance with other systems in a reproducible manner
  • Track improvements to RAG functionality over time

Proposed Solution

Add a comprehensive RAG benchmark framework under benchmark/RAG/ that provides:

1. Diverse and Extensible Dataset Support

  • Diverse Scenarios: Support for various use cases including multi-turn conversations, educational syllabi, academic papers, financial reports, etc.
  • Easy Expansion: Adapter-based architecture allows seamless addition of new datasets
  • Flexible QA Selection: Users can choose between using full dataset QA or sampled subsets

Currently supported datasets

  • Locomo: Real user chat conversations with 4 question categories: factual, temporal, reasoning, and understanding.
  • SyllabusQA: University course syllabi with 6 question types ranging from factual extraction to summarization.
  • Qasper: NLP research papers with questions whose answers are extractive, free-form, or yes/no.
  • FinanceBench: SEC financial reports (10-K, 8-K, earnings calls) with domain-specific financial questions.

2. Modular Architecture

  • Adapter Pattern: Dataset-specific adapters for easy extension
  • Configuration-Driven: YAML-based configuration per dataset
  • Pipeline Design: Clear separation of ingestion, generation, and evaluation stages
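To make the adapter pattern concrete, the base could look roughly like the sketch below. `DatasetAdapter`, `QAPair`, and the method names are illustrative assumptions, not the final API:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Iterator

@dataclass
class QAPair:
    """One benchmark question with its gold answer and supporting documents."""
    question: str
    reference_answer: str
    supporting_doc_ids: list[str] = field(default_factory=list)

class DatasetAdapter(ABC):
    """Normalizes one dataset into documents (for ingestion) and QA pairs
    (for generation/evaluation), so the pipeline never sees raw formats."""

    @abstractmethod
    def iter_documents(self) -> Iterator[tuple[str, str]]:
        """Yield (doc_id, text) pairs to ingest."""

    @abstractmethod
    def iter_qa(self) -> Iterator[QAPair]:
        """Yield the dataset's QA pairs."""
```

Adding a new dataset would then mean implementing these two methods plus a YAML config, with no changes to the pipeline itself.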

3. Comprehensive Evaluation

  • Retrieval Metrics: Recall@k
  • Answer Quality: F1 score
  • LLM-as-Judge: 0-4 scale accuracy rating
  • Performance Metrics: Latency, token usage, ingestion time
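As a sketch of the first two metrics (the LLM-as-judge rating requires a model call and is omitted), Recall@k and a SQuAD-style token-overlap F1 could be computed as follows; the function names are illustrative:

```python
from collections import Counter

def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of relevant documents that appear in the top-k retrieved."""
    if not relevant_ids:
        return 0.0
    hits = len(set(retrieved_ids[:k]) & relevant_ids)
    return hits / len(relevant_ids)

def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a generated answer and the gold answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Counter intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```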

4. Data Pipeline Improvements

  • Download Script: Automated dataset download from public sources
  • Sampling Script: Seed-based random sampling for reproducible subsets
  • Preprocessing Pipeline: Standardized data preparation workflow

Alternatives Considered

No response

Feature Area

Other

Use Case

When users and developers want to:

  • Evaluate OpenViking's RAG performance on their specific use cases
  • Compare different configurations (embedding models, retrieval strategies, etc.)
  • Verify that RAG improvements don't regress performance
  • Contribute new datasets or evaluation methods to the OpenViking ecosystem
  • Reproduce benchmark results from the community

Implementation Plan

Files to create/modify

  • New directory: benchmark/RAG/
  • Dataset preparation scripts: benchmark/RAG/scripts/*.py
  • Core pipeline: benchmark/RAG/src/pipeline.py
  • Adapter base: benchmark/RAG/src/adapters/base.py
  • Dataset adapters: benchmark/RAG/src/adapters/*_adapter.py
  • Core utilities: benchmark/RAG/src/core/*.py
  • Configuration files: benchmark/RAG/config/*.yaml
  • Main script: benchmark/RAG/run.py
  • Documentation: benchmark/RAG/README.md
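A per-dataset file under benchmark/RAG/config/ might look like the fragment below; every key shown is a guess at what the framework could expose, not a confirmed schema:

```yaml
# benchmark/RAG/config/qasper.yaml  (hypothetical schema)
dataset:
  name: qasper
  adapter: qasper_adapter
  qa_mode: sampled        # full | sampled
  sample_size: 200
  seed: 42
retrieval:
  top_k: 5
evaluation:
  metrics: [recall_at_k, f1, llm_judge]
```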

Phases

  • Phase 1: Core framework with extensible dataset architecture (completed)
  • Phase 2: Add scripts/ directory with download, sampling, and dataset preparation scripts (pending)
    • Download script: Automated dataset download from public sources
    • Sampling script: Seed-based random sampling with full/sampled QA options
    • Unified entry script: Orchestrate end-to-end data preparation workflow
  • Phase 3: Feature Expansion and Architecture Optimization (future)
    • Add more datasets and evaluation metrics
    • Reserve architecture design for agentic RAG integration
    • Enhance evaluation capabilities and flexibility
  • Phase 4: Long-term Planning (future)
    • Explore agentic RAG integration with VikingBot

Example API (Optional)
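A possible top-level API, sketched with stub callables so the ingestion → retrieval → generation flow is concrete; all names here (`run_benchmark`, `QuestionResult`, the callable signatures) are hypothetical:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class QuestionResult:
    """Raw per-question output; metrics are computed downstream."""
    question: str
    answer: str
    retrieved_ids: list[str]

def run_benchmark(
    questions: list[str],
    retrieve: Callable[[str], list[str]],       # question -> ranked doc ids
    generate: Callable[[str, list[str]], str],  # (question, doc ids) -> answer
) -> list[QuestionResult]:
    """Minimal retrieval -> generation loop over a QA set."""
    results = []
    for q in questions:
        doc_ids = retrieve(q)
        answer = generate(q, doc_ids)
        results.append(QuestionResult(q, answer, doc_ids))
    return results
```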

Additional Context

No response

Contribution

  • I am willing to contribute to implementing this feature

Metadata

Assignees

No one assigned

    Labels

    enhancement (New feature or request)

    Type

    No type

    Projects

    Status

    Done

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests