Feat(benchmark): Add benchmark/RAG: RAG system evaluation framework #825
sponge225 wants to merge 24 commits into volcengine:main from
Conversation
…pdates
- Add complete dataset sampling scripts with document-level sampling
- Implement filtering logic consistent with adapters (exclude category 5 for Locomo, no-answer questions for SyllabusQA, unanswerable questions for Qasper)
- Update configuration from raw_data/dataset_dir to dataset_path for clarity
- Enhance adapters with improved path handling and data loading
- Add gitignore entries for data and output directories
- Add dependencies (datasets, pandas, tavily-python)
- Add test files and documentation
- Implement stratified sampling for Locomo (by category 1-4)
- Implement stratified sampling for SyllabusQA (by question_type)
- Implement stratified sampling for Qasper (by answer type: extractive/free_form/yes_no)
- Implement stratified sampling for FinanceBench (by question_type)
- Add proper handling when the sample size cannot be evenly split:
  - Display a warning message
  - Distribute remaining QAs to the first N categories
  - Fall back to random sampling if the sample size is too small
- Update prepare_dataset.py to support both 'random' and 'stratified' modes
- Set default sampling mode to 'random'
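The reallocation behavior described above (even split per category, remainder given to the first N categories, fallback to random sampling when the sample size is too small) can be sketched as follows. This is a minimal illustration, not the PR's actual implementation; the function name and the dict-of-lists input shape are assumptions.

```python
import random

def stratified_sample(qas_by_category, sample_size, seed=0):
    """Stratified sampling with remainder reallocation (illustrative sketch).

    qas_by_category: dict mapping category label -> list of QA items.
    """
    rng = random.Random(seed)
    categories = sorted(qas_by_category)
    total = sum(len(v) for v in qas_by_category.values())
    sample_size = min(sample_size, total)

    # Fall back to plain random sampling if we cannot give
    # every category at least one QA.
    if sample_size < len(categories):
        pool = [qa for cat in categories for qa in qas_by_category[cat]]
        return rng.sample(pool, sample_size)

    per_cat, remainder = divmod(sample_size, len(categories))
    sampled = []
    for i, cat in enumerate(categories):
        # Distribute the remainder to the first N categories.
        target = per_cat + (1 if i < remainder else 0)
        items = qas_by_category[cat]
        sampled.extend(rng.sample(items, min(target, len(items))))
    return sampled
```

With three categories and `sample_size=7`, the targets become 3/2/2; with `sample_size=2` the function falls back to random sampling over the pooled QAs.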
# Conflicts:
#   uv.lock
qin-ctx
left a comment
Review summary (translated from Chinese): This PR introduces a standalone RAG benchmark framework. The module split is generally reasonable, and the README and the range of supported datasets are fairly complete. However, there are currently three blocking runtime issues: 1) the structure of the default configuration file config/config.yaml is inconsistent with the runtime code, so the default entry point fails immediately because the execution section is missing; 2) in the num_docs + sample_size + random path, sample_locomo() puts dicts into a set, which raises TypeError: unhashable type: 'dict'; 3) in the pure stratified path, sample_locomo() adds an int to a list, which also fails immediately. Since all of these land on entry points covered by the README's main workflow, I am requesting changes.
- Fix two bugs:
  1. num_docs + sample_size + random path: use int indices instead of dict tuples
  2. pure stratified path: use len() for list length calculation
- Extract common sampling utilities:
  - calculate_category_targets()
  - stratified_sample_with_reallocation()
  - random_sample_qas()
  - sample_docs_stratified()
  - sample_docs_random()
- Reduce code duplication by ~60-70%
- Improve maintainability and readability
- Keep full backward compatibility
- Add FinanceBench to the supported datasets list
- Change to template configuration format
- Add execution: section for better organization
- Remove duplicate monitor.worker_end(success=False) call in run_generation()
- _process_generation_task() already calls worker_end() in its exception handler
- This prevents double-counting of failed tasks and distorted statistics
- Add JSON file support to _get_required_syllabi()
- Extract syllabus names from JSON keys (same format as _load_from_json())
- This ensures data_prepare() processes the correct docx files when using JSON input
- Replace 'raise e' with bare 'raise' to preserve the original traceback
- Also remove the now-unused 'e' variable
- This makes debugging easier by showing where the exception actually occurred
- In the Locomo prompt, use gold_answer_str instead of gold_answer
- This ensures consistent formatting when gold_answer is a list
- Both Locomo and Generic prompts now use the same ' | '-separated format
- Replace the manual common-ancestor calculation with os.path.commonpath()
- os.path.commonpath() handles all OS path separators correctly
- Add try/except to handle ValueError when no common path exists
- More robust than the manual split(os.sep) approach
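The pattern described above is straightforward: `os.path.commonpath()` raises `ValueError` when the inputs mix absolute and relative paths (or live on different drives on Windows), so that case is caught explicitly. A minimal sketch, with an assumed `None` fallback:

```python
import os

def common_ancestor(paths):
    """Return the common ancestor directory of the given paths,
    or None when no common path exists (e.g. mixed absolute and
    relative paths, or different drives on Windows)."""
    try:
        return os.path.commonpath(paths)
    except ValueError:
        return None
```

Unlike a manual `split(os.sep)` loop, this also normalizes separators per platform.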
qin-ctx
left a comment
This PR moves the RAG benchmark effort in the right direction: the adapter/pipeline split is sensible, and the documentation is much more complete than a typical first feature drop. I also verified that the three blocking issues from the previous review have been addressed in the current head.
That said, there are still three blocking correctness issues on the main benchmark path: skip_ingestion does not actually skip the preprocessing dependency on raw source documents, Qasper mixed-annotator questions can still be graded as fully correct for a refusal answer, and LLM invocation failures are converted into normal answer strings instead of failing the run. Those issues affect either promised behavior or benchmark result integrity, so I am requesting changes.
Non-blocking note: CI is still red, including ruff format --check, and there is no benchmark-specific test coverage yet for these newly added paths.
```python
answer_type = self._get_answer_type(answer_obj)

# Process different answer types
if answer_obj.get("unanswerable", False):
```
[Bug] (blocking) This still keeps "Not mentioned" as a valid gold answer for mixed-annotator questions. The question is skipped only when all annotators mark it unanswerable, but inside the per-annotator loop any single unanswerable=true answer is still appended to gold_answers. That means a question with one real answer and one unanswerable annotation can become ['Not mentioned', '42'], and the later refusal heuristic in pipeline.py will score a model refusal as fully correct. In other words, benchmark metrics can be inflated by accepting a refusal on answerable questions. For non-fully-unanswerable questions, unanswerable annotations should be ignored rather than preserved as a gold answer.
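The fix the reviewer is asking for can be sketched as follows. This is an illustrative reconstruction, not the PR's code: the helper name is hypothetical, while the per-annotator fields (unanswerable, free_form_answer, extractive_spans, yes_no) follow the Qasper answer format.

```python
def extract_gold_answers(answers):
    """Collect gold answers from a Qasper-style annotator list.

    A question counts as unanswerable only when EVERY annotator marks
    it so (caller then skips it). Otherwise, unanswerable annotations
    are ignored rather than preserved as a 'Not mentioned' gold answer,
    so a model refusal cannot be graded as correct on an answerable
    question.
    """
    if all(a.get("unanswerable", False) for a in answers):
        return None  # fully unanswerable: caller skips the question
    gold = []
    for a in answers:
        if a.get("unanswerable", False):
            continue  # drop refusals on answerable questions
        if a.get("free_form_answer"):
            gold.append(a["free_form_answer"])
        elif a.get("extractive_spans"):
            gold.append(" | ".join(a["extractive_spans"]))
        elif a.get("yes_no") is not None:
            gold.append("Yes" if a["yes_no"] else "No")
    return gold
```

With one real answer and one unanswerable annotation, the result contains only the real answer, so the refusal heuristic in pipeline.py can no longer match against 'Not mentioned'.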
Description
RAG benchmark
RAG benchmark is a framework for evaluating Openviking's RAG (Retrieval-Augmented Generation) system performance, supporting multiple datasets and multiple evaluation metrics.
Features
See benchmark/RAG/README.md for detailed documentation.
Related Issue
Summary
This PR adds the RAG benchmark framework.
Closes #885
Type of Change
Changes Made
Testing
Checklist
Screenshots (if applicable)
Additional Notes