[Feature]: Add RAG (Retrieval-Augmented Generation) Benchmark System for OpenViking #885

@sponge225

Description

Problem Statement

There is currently no standardized benchmarking framework to evaluate the effectiveness and performance of OpenViking's RAG (Retrieval-Augmented Generation) capabilities. Users and developers lack a way to:

  • Measure retrieval quality across different datasets and domains
  • Evaluate end-to-end RAG pipeline performance (ingestion → retrieval → generation)
  • Compare OpenViking's RAG performance with other systems in a reproducible manner
  • Track improvements to RAG functionality over time

Proposed Solution

Add a comprehensive RAG benchmark framework under benchmark/RAG/ that provides:

1. Diverse and Extensible Dataset Support

  • Diverse Scenarios: Support for various use cases including multi-turn conversations, educational syllabi, academic papers, financial reports, etc.
  • Easy Expansion: Adapter-based architecture allows seamless addition of new datasets
  • Flexible QA Selection: Users can choose between using full dataset QA or sampled subsets

Currently supported datasets

  • Locomo: Real user chat conversations with 4 question categories: factual, temporal, reasoning, and understanding.
  • SyllabusQA: University course syllabi with 6 question types ranging from factual extraction to summarization.
  • Qasper: NLP research papers with questions whose answers are extractive, free-form, or yes/no.
  • FinanceBench: SEC financial reports (10-K, 8-K, earnings calls) with domain-specific financial questions.

2. Modular Architecture

  • Adapter Pattern: Dataset-specific adapters for easy extension
  • Configuration-Driven: YAML-based configuration per dataset
  • Pipeline Design: Clear separation of ingestion, generation, and evaluation stages
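To make the adapter pattern concrete, the base could look roughly like the sketch below. `DatasetAdapter`, `QAPair`, and the method names are illustrative assumptions, not the final API:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Iterator

@dataclass
class QAPair:
    """One benchmark question with its gold answer and supporting documents."""
    question: str
    reference_answer: str
    supporting_doc_ids: list[str] = field(default_factory=list)

class DatasetAdapter(ABC):
    """Normalizes one dataset into documents (for ingestion) and QA pairs
    (for generation/evaluation), so the pipeline never sees raw formats."""

    @abstractmethod
    def iter_documents(self) -> Iterator[tuple[str, str]]:
        """Yield (doc_id, text) pairs to ingest."""

    @abstractmethod
    def iter_qa(self) -> Iterator[QAPair]:
        """Yield the dataset's QA pairs."""
```

Adding a new dataset would then mean implementing these two methods plus a YAML config, with no changes to the pipeline itself.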

3. Comprehensive Evaluation

  • Retrieval Metrics: Recall@k
  • Answer Quality: F1 score
  • LLM-as-Judge: 0-4 scale accuracy rating
  • Performance Metrics: Latency, token usage, ingestion time
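As a sketch of the first two metrics (the LLM-as-judge rating requires a model call and is omitted), Recall@k and a SQuAD-style token-overlap F1 could be computed as follows; the function names are illustrative:

```python
from collections import Counter

def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of relevant documents that appear in the top-k retrieved."""
    if not relevant_ids:
        return 0.0
    hits = len(set(retrieved_ids[:k]) & relevant_ids)
    return hits / len(relevant_ids)

def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a generated answer and the gold answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Counter intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```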

4. Data Pipeline Improvements

  • Download Script: Automated dataset download from public sources
  • Sampling Script: Seed-based random sampling for reproducible subsets
  • Preprocessing Pipeline: Standardized data preparation workflow

Alternatives Considered

No response

Feature Area

Other

Use Case

When users and developers want to:

  • Evaluate OpenViking's RAG performance on their specific use cases
  • Compare different configurations (embedding models, retrieval strategies, etc.)
  • Verify that RAG improvements don't regress performance
  • Contribute new datasets or evaluation methods to the OpenViking ecosystem
  • Reproduce benchmark results from the community

Implementation Plan

Files to create/modify

  • New directory: benchmark/RAG/
  • Dataset preparation scripts: benchmark/RAG/scripts/*.py
  • Core pipeline: benchmark/RAG/src/pipeline.py
  • Adapter base: benchmark/RAG/src/adapters/base.py
  • Dataset adapters: benchmark/RAG/src/adapters/*_adapter.py
  • Core utilities: benchmark/RAG/src/core/*.py
  • Configuration files: benchmark/RAG/config/*.yaml
  • Main script: benchmark/RAG/run.py
  • Documentation: benchmark/RAG/README.md
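A per-dataset file under benchmark/RAG/config/ might look like the fragment below; every key shown is a guess at what the framework could expose, not a confirmed schema:

```yaml
# benchmark/RAG/config/qasper.yaml  (hypothetical schema)
dataset:
  name: qasper
  adapter: qasper_adapter
  qa_mode: sampled        # full | sampled
  sample_size: 200
  seed: 42
retrieval:
  top_k: 5
evaluation:
  metrics: [recall_at_k, f1, llm_judge]
```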

Phases

  • Phase 1: Core framework with extensible dataset architecture (completed)
  • Phase 2: Add scripts/ directory with download, sampling, and dataset preparation scripts (pending)
    • Download script: Automated dataset download from public sources
    • Sampling script: Seed-based random sampling with full/sampled QA options
    • Unified entry script: Orchestrate end-to-end data preparation workflow
  • Phase 3: Feature Expansion and Architecture Optimization (future)
    • Add more datasets and evaluation metrics
    • Reserve architecture design for agentic RAG integration
    • Enhance evaluation capabilities and flexibility
  • Phase 4: Long-term Planning (future)
    • Explore agentic RAG integration with VikingBot

Example API (Optional)
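A possible top-level API, sketched with stub callables so the ingestion → retrieval → generation flow is concrete; all names here (`run_benchmark`, `QuestionResult`, the callable signatures) are hypothetical:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class QuestionResult:
    """Raw per-question output; metrics are computed downstream."""
    question: str
    answer: str
    retrieved_ids: list[str]

def run_benchmark(
    questions: list[str],
    retrieve: Callable[[str], list[str]],       # question -> ranked doc ids
    generate: Callable[[str, list[str]], str],  # (question, doc ids) -> answer
) -> list[QuestionResult]:
    """Minimal retrieval -> generation loop over a QA set."""
    results = []
    for q in questions:
        doc_ids = retrieve(q)
        answer = generate(q, doc_ids)
        results.append(QuestionResult(q, answer, doc_ids))
    return results
```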

Additional Context

No response

Contribution

  • I am willing to contribute to implementing this feature

Metadata

Assignees

No one assigned

    Labels

    enhancement (New feature or request)

    Type

    No type

    Projects

    Status

    Done

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests