
Commit 1b3a8f2

Feat(benchmark): Add benchmark/RAG : RAG system evaluation framework (#825)
* Add RAGbenchmark: RAG system evaluation framework
* Update README.md
* Update README.md
* Update README.md
* Code structure refactoring
* feat: improve RAG benchmark with dataset sampling and configuration updates
  - Add complete dataset sampling scripts with document-level sampling
  - Implement filtering logic consistent with adapters (exclude category 5 for Locomo, no answer for SyllabusQA, unanswerable for Qasper)
  - Update configuration from raw_data/dataset_dir to dataset_path for clarity
  - Enhance adapters with improved path handling and data loading
  - Add gitignore for data and output directories
  - Add dependencies (datasets, pandas, tavily-python)
  - Add test files and documentation
* feat: add stratified sampling support to all datasets
  - Implement stratified sampling for Locomo (by category 1-4)
  - Implement stratified sampling for SyllabusQA (by question_type)
  - Implement stratified sampling for Qasper (by answer type: extractive/free_form/yes_no)
  - Implement stratified sampling for FinanceBench (by question_type)
  - Add proper handling when sample size cannot be evenly split:
    - Display warning message
    - Distribute remaining QAs to first N categories
    - Fall back to random sampling if sample size too small
  - Update prepare_dataset.py to support both 'random' and 'stratified' modes
  - Set default sampling mode to 'random'
* Update locomo adapter to support image attachments and other improvements
* Update dataset documentation with actual document counts
* Add benchmark results reference and reproduction steps
* Improve sampling scripts for benchmark reproducibility
* Refactor sample_dataset.py: extract common sampling logic
  - Fix two bugs:
    1. num_docs + sample_size + random path: use int indices instead of dict tuples
    2. pure stratified path: use len() for list length calculation
  - Extract common sampling utilities:
    - calculate_category_targets()
    - stratified_sample_with_reallocation()
    - random_sample_qas()
    - sample_docs_stratified()
    - sample_docs_random()
  - Reduce code duplication by ~60-70%
  - Improve maintainability and readability
  - Keep full backward compatibility
* Update config.yaml: improve configuration structure
  - Add FinanceBench to supported datasets list
  - Change to template configuration format
  - Add execution: section for better organization
* Fix bug: duplicate worker_end() call in generation failure path
  - Remove duplicate monitor.worker_end(success=False) call in run_generation()
  - The _process_generation_task() already calls worker_end() in its exception handler
  - This prevents double-counting of failed tasks and distorted statistics
* Fix bug: _get_required_syllabi() doesn't support JSON input
  - Add JSON file support to _get_required_syllabi()
  - Extract syllabus names from JSON keys (same format as _load_from_json())
  - This ensures data_prepare() processes correct docx files when using JSON input
* Improve exception re-raising: use bare raise to preserve traceback
  - Replace 'raise e' with bare 'raise' to preserve original traceback
  - Also remove unused 'e' variable since we don't need it
  - This makes debugging easier by showing where the exception actually occurred
* Fix bug: Locomo prompt uses raw gold_answer instead of gold_answer_str
  - In Locomo prompt, use gold_answer_str instead of gold_answer
  - This ensures consistent formatting when gold_answer is a list
  - Both Locomo and Generic prompts now use the same ' | ' separated format
* Improve directory ingest: use os.path.commonpath() for robustness
  - Replace manual common ancestor calculation with os.path.commonpath()
  - os.path.commonpath() handles all OS path separators correctly
  - Add try-except to handle ValueError when no common path exists
  - More robust than manual split(os.sep) approach
* benchmark: honor skip_ingestion and fail on LLM retry exhaustion
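
For reviewers, the uneven-split rule described in the stratified sampling entry can be sketched roughly as follows. This is an illustration only, not the code added by this commit: the names `category_targets_sketch` and `stratified_sample_sketch` are hypothetical, the fallback condition (sample size smaller than the number of categories) is an assumption about what "too small" means, and no reallocation between categories is attempted.

```python
import random
import warnings
from collections import defaultdict


def category_targets_sketch(categories, sample_size):
    """Split sample_size evenly; leftover items go to the first N categories."""
    base, remainder = divmod(sample_size, len(categories))
    if remainder:
        # Mirrors the "display warning message" behaviour from the commit message.
        warnings.warn(
            f"sample_size={sample_size} is not divisible by {len(categories)} "
            f"categories; the first {remainder} categories get one extra item"
        )
    return {
        cat: base + (1 if index < remainder else 0)
        for index, cat in enumerate(categories)
    }


def stratified_sample_sketch(items, key, sample_size, seed=0):
    """Sample items so that each category contributes its target count."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)
    categories = sorted(groups)

    # Assumed meaning of "too small": not even one item per category.
    if sample_size < len(categories):
        return rng.sample(items, min(sample_size, len(items)))

    targets = category_targets_sketch(categories, sample_size)
    picked = []
    for cat in categories:
        pool = groups[cat]
        # A category smaller than its target simply contributes everything it has.
        picked.extend(rng.sample(pool, min(targets[cat], len(pool))))
    return picked
```

With four categories and a sample size of 10, this sketch assigns 3, 3, 2 and 2 items, which is one reading of "distribute remaining QAs to first N categories".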
1 parent 27f6188 commit 1b3a8f2

32 files changed (+5971, -0 lines)

.gitignore

Lines changed: 9 additions & 0 deletions
@@ -124,6 +124,15 @@ tests/api_test/api-test-report.html
 tests/api_test/openviking-server.log
 tests/api_test/openviking-server.pid
 
+# Benchmark outputs
+examples/benchmark/outputs/
+examples/benchmark/datasets/full/
+examples/benchmark/*.log
+RAGbenchmark/datasets/*/
+!RAGbenchmark/datasets/Benchmark_Lite/
+RAGbenchmark/Output/
+RAGbenchmark/*.log
+
 # AI Coding
 CLAUDE.md
 *.so

benchmark/RAG/.gitignore

Lines changed: 53 additions & 0 deletions
@@ -0,0 +1,53 @@
+# Raw datasets (downloaded from external sources)
+raw_data/
+
+# Processed datasets (sampled subsets)
+datasets/
+data/
+
+# Processed documents and vector storage
+ov_storage/
+
+# Evaluation output results
+Output/
+
+# Python
+__pycache__/
+*.py[cod]
+*$py.class
+*.so
+.Python
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+
+# Virtual Environment
+.venv/
+env/
+ENV/
+
+# IDE
+.vscode/
+.idea/
+*.swp
+*.swo
+*~
+
+# Logs
+*.log
+
+# Temporary files
+*.tmp
+*.temp
