Commit 1b3a8f2
Feat(benchmark): Add benchmark/RAG : RAG system evaluation framework (#825)
* Add RAGbenchmark: RAG system evaluation framework
* Update README.md
* Update README.md
* Update README.md
* Code structure refactoring
* feat: improve RAG benchmark with dataset sampling and configuration updates
- Add complete dataset sampling scripts with document-level sampling
- Implement filtering logic consistent with adapters (exclude category 5 for Locomo, no answer for SyllabusQA, unanswerable for Qasper)
- Update configuration from raw_data/dataset_dir to dataset_path for clarity
- Enhance adapters with improved path handling and data loading
- Add gitignore for data and output directories
- Add dependencies (datasets, pandas, tavily-python)
- Add test files and documentation
* feat: add stratified sampling support to all datasets
- Implement stratified sampling for Locomo (by category 1-4)
- Implement stratified sampling for SyllabusQA (by question_type)
- Implement stratified sampling for Qasper (by answer type: extractive/free_form/yes_no)
- Implement stratified sampling for FinanceBench (by question_type)
- Add proper handling when sample size cannot be evenly split:
  - Display warning message
  - Distribute remaining QAs to first N categories
  - Fall back to random sampling if sample size too small
- Update prepare_dataset.py to support both 'random' and 'stratified' modes
- Set default sampling mode to 'random'
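
The stratified mode described above can be illustrated with a minimal sketch. It is not the code in sample_dataset.py: the function names `split_targets` and `stratified_sample`, the `key` parameter, and the rule that "too small" means fewer samples than categories are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def split_targets(categories, sample_size):
    """Split sample_size across categories; leftovers go to the first N categories."""
    base, remainder = divmod(sample_size, len(categories))
    if remainder:
        # Assumed wording; the commit only says a warning is displayed.
        print(f"Warning: {sample_size} QAs cannot be split evenly "
              f"across {len(categories)} categories")
    return {cat: base + (1 if i < remainder else 0)
            for i, cat in enumerate(categories)}


def stratified_sample(qas, key, sample_size, seed=0):
    """Sample QAs per category; fall back to random sampling for tiny sizes."""
    rng = random.Random(seed)
    by_cat = defaultdict(list)
    for qa in qas:
        by_cat[qa[key]].append(qa)
    categories = sorted(by_cat)

    # Assumption: "too small" means fewer samples than categories.
    if sample_size < len(categories):
        return rng.sample(qas, min(sample_size, len(qas)))

    targets = split_targets(categories, sample_size)
    picked = []
    for cat in categories:
        pool = by_cat[cat]
        # Never ask for more items than the category actually holds.
        picked.extend(rng.sample(pool, min(targets[cat], len(pool))))
    return picked
```

With four categories and a sample size of 10, this sketch assigns 3 QAs to each of the first two categories and 2 to each of the remaining two.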
* Update locomo adapter to support image attachments and other improvements
* Update dataset documentation with actual document counts
* Add benchmark results reference and reproduction steps
* Improve sampling scripts for benchmark reproducibility
* Refactor sample_dataset.py: extract common sampling logic
- Fix two bugs:
  1. num_docs + sample_size + random path: use int indices instead of dict tuples
  2. pure stratified path: use len() for list length calculation
- Extract common sampling utilities:
  - calculate_category_targets()
  - stratified_sample_with_reallocation()
  - random_sample_qas()
  - sample_docs_stratified()
  - sample_docs_random()
- Reduce code duplication by ~60-70%
- Improve maintainability and readability
- Keep full backward compatibility
* Update config.yaml: improve configuration structure
- Add FinanceBench to supported datasets list
- Change to template configuration format
- Add execution: section for better organization
* Fix bug: duplicate worker_end() call in generation failure path
- Remove duplicate monitor.worker_end(success=False) call in run_generation()
- The _process_generation_task() already calls worker_end() in its exception handler
- This prevents double-counting of failed tasks and distorted statistics
* Fix bug: _get_required_syllabi() doesn't support JSON input
- Add JSON file support to _get_required_syllabi()
- Extract syllabus names from JSON keys (same format as _load_from_json())
- This ensures data_prepare() processes correct docx files when using JSON input
* Improve exception re-raising: use bare raise to preserve traceback
- Replace 'raise e' with bare 'raise' to preserve original traceback
- Also remove unused 'e' variable since we don't need it
- This makes debugging easier by showing where the exception actually occurred
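
A small sketch of the bare `raise` pattern, using hypothetical functions `load_item` and `process` that are not part of the benchmark code. The bare `raise` re-raises the active exception unchanged, so the traceback still ends in the function that actually failed and no unused exception variable is needed.

```python
def load_item(raw):
    # Raises ValueError for non-numeric input.
    return int(raw)


def process(raw):
    try:
        return load_item(raw)
    except ValueError:
        # Cleanup or logging could go here.
        # Bare raise: re-raise the active exception as-is, no 'as e' needed.
        raise
```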
* Fix bug: Locomo prompt uses raw gold_answer instead of gold_answer_str
- In Locomo prompt, use gold_answer_str instead of gold_answer
- This ensures consistent formatting when gold_answer is a list
- Both Locomo and Generic prompts now use the same ' | ' separated format
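
For illustration, the ' | ' separated format could be produced by a helper like the one below. The name `format_gold_answer` is hypothetical; the commit only states that the prompts use `gold_answer_str` and that list answers are joined with ' | '.

```python
def format_gold_answer(gold_answer):
    """Return a single string; list answers are joined with ' | '."""
    if isinstance(gold_answer, (list, tuple)):
        return " | ".join(str(answer) for answer in gold_answer)
    return str(gold_answer)


# The resulting string would then be used in both prompt templates.
gold_answer_str = format_gold_answer(["Paris", "the capital of France"])
```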
* Improve directory ingest: use os.path.commonpath() for robustness
- Replace manual common ancestor calculation with os.path.commonpath()
- os.path.commonpath() handles all OS path separators correctly
- Add try-except to handle ValueError when no common path exists
- More robust than manual split(os.sep) approach
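
A minimal sketch of this approach, assuming a hypothetical helper `common_ancestor` that returns `None` when no common path exists (the fallback used in the benchmark code is not shown in this commit). `os.path.commonpath()` raises `ValueError` for an empty sequence, for a mix of absolute and relative paths, and on Windows for paths on different drives.

```python
import os


def common_ancestor(paths):
    """Return the longest common directory of paths, or None if there is none."""
    try:
        return os.path.commonpath(paths)
    except ValueError:
        # Empty input, mixed absolute/relative paths, or different drives.
        return None
```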
* benchmark: honor skip_ingestion and fail on LLM retry exhaustion

1 parent 27f6188 commit 1b3a8f2
File tree
32 files changed, +5971 -0 lines changed
- benchmark/RAG
- config
- scripts
- src
- adapters
- core