Feat(benchmark): Add benchmark/RAG: RAG system evaluation framework #825
sponge225 wants to merge 24 commits into volcengine:main from
Conversation
…pdates
- Add complete dataset sampling scripts with document-level sampling
- Implement filtering logic consistent with adapters (exclude category 5 for Locomo, no-answer questions for SyllabusQA, unanswerable questions for Qasper)
- Update configuration from raw_data/dataset_dir to dataset_path for clarity
- Enhance adapters with improved path handling and data loading
- Add gitignore entries for data and output directories
- Add dependencies (datasets, pandas, tavily-python)
- Add test files and documentation
- Implement stratified sampling for Locomo (by category 1-4)
- Implement stratified sampling for SyllabusQA (by question_type)
- Implement stratified sampling for Qasper (by answer type: extractive/free_form/yes_no)
- Implement stratified sampling for FinanceBench (by question_type)
- Add proper handling when the sample size cannot be evenly split:
  - Display a warning message
  - Distribute remaining QAs to the first N categories
  - Fall back to random sampling if the sample size is too small
- Update prepare_dataset.py to support both 'random' and 'stratified' modes
- Set default sampling mode to 'random'
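The reallocation behavior described above (even split per category, remainder given to the first N categories, fallback to random sampling when the sample size is too small) can be sketched as follows. This is a minimal illustration, not the PR's actual implementation; the function name and the dict-of-lists input shape are assumptions.

```python
import random

def stratified_sample(qas_by_category, sample_size, seed=0):
    """Stratified sampling with remainder reallocation (illustrative sketch).

    qas_by_category: dict mapping category label -> list of QA items.
    """
    rng = random.Random(seed)
    categories = sorted(qas_by_category)
    total = sum(len(v) for v in qas_by_category.values())
    sample_size = min(sample_size, total)

    # Fall back to plain random sampling if we cannot give
    # every category at least one QA.
    if sample_size < len(categories):
        pool = [qa for cat in categories for qa in qas_by_category[cat]]
        return rng.sample(pool, sample_size)

    per_cat, remainder = divmod(sample_size, len(categories))
    sampled = []
    for i, cat in enumerate(categories):
        # Distribute the remainder to the first N categories.
        target = per_cat + (1 if i < remainder else 0)
        items = qas_by_category[cat]
        sampled.extend(rng.sample(items, min(target, len(items))))
    return sampled
```

With three categories and `sample_size=7`, the targets become 3/2/2; with `sample_size=2` the function falls back to random sampling over the pooled QAs.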
# Conflicts:
#   uv.lock
qin-ctx
left a comment
Review summary (translated from Chinese): This PR introduces a standalone RAG benchmark framework. The module split is generally reasonable, and the README and the range of supported datasets are fairly complete. However, there are currently three blocking runtime issues: 1) the structure of the default configuration file config/config.yaml is inconsistent with the runtime code, so the default entry point fails immediately because the execution section is missing; 2) in the num_docs + sample_size + random path, sample_locomo() puts dicts into a set, which raises TypeError: unhashable type: 'dict'; 3) in the pure stratified path, sample_locomo() adds an int to a list, which also fails immediately. Since all of these land on entry points covered by the README's main workflow, I am requesting changes.
- Fix two bugs:
  1. num_docs + sample_size + random path: use int indices instead of dict tuples
  2. pure stratified path: use len() for list length calculation
- Extract common sampling utilities:
  - calculate_category_targets()
  - stratified_sample_with_reallocation()
  - random_sample_qas()
  - sample_docs_stratified()
  - sample_docs_random()
- Reduce code duplication by ~60-70%
- Improve maintainability and readability
- Keep full backward compatibility
- Add FinanceBench to the supported datasets list
- Change to template configuration format
- Add execution: section for better organization
- Remove duplicate monitor.worker_end(success=False) call in run_generation()
- _process_generation_task() already calls worker_end() in its exception handler
- This prevents double-counting of failed tasks and distorted statistics
- Add JSON file support to _get_required_syllabi()
- Extract syllabus names from JSON keys (same format as _load_from_json())
- This ensures data_prepare() processes the correct docx files when using JSON input
- Replace 'raise e' with bare 'raise' to preserve the original traceback
- Also remove the now-unused 'e' variable
- This makes debugging easier by showing where the exception actually occurred
- In the Locomo prompt, use gold_answer_str instead of gold_answer
- This ensures consistent formatting when gold_answer is a list
- Both Locomo and Generic prompts now use the same ' | '-separated format
- Replace the manual common-ancestor calculation with os.path.commonpath()
- os.path.commonpath() handles all OS path separators correctly
- Add try/except to handle ValueError when no common path exists
- More robust than the manual split(os.sep) approach
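The pattern described above is straightforward: `os.path.commonpath()` raises `ValueError` when the inputs mix absolute and relative paths (or live on different drives on Windows), so that case is caught explicitly. A minimal sketch, with an assumed `None` fallback:

```python
import os

def common_ancestor(paths):
    """Return the common ancestor directory of the given paths,
    or None when no common path exists (e.g. mixed absolute and
    relative paths, or different drives on Windows)."""
    try:
        return os.path.commonpath(paths)
    except ValueError:
        return None
```

Unlike a manual `split(os.sep)` loop, this also normalizes separators per platform.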
qin-ctx
left a comment
This PR moves the RAG benchmark effort in the right direction: the adapter/pipeline split is sensible, and the documentation is much more complete than a typical first feature drop. I also verified that the three blocking issues from the previous review have been addressed in the current head.
That said, there are still three blocking correctness issues on the main benchmark path: skip_ingestion does not actually skip the preprocessing dependency on raw source documents, Qasper mixed-annotator questions can still be graded as fully correct for a refusal answer, and LLM invocation failures are converted into normal answer strings instead of failing the run. Those issues affect either promised behavior or benchmark result integrity, so I am requesting changes.
Non-blocking note: CI is still red, including ruff format --check, and there is no benchmark-specific test coverage yet for these newly added paths.
```python
answer_type = self._get_answer_type(answer_obj)

# Process different answer types
if answer_obj.get("unanswerable", False):
```
[Bug] (blocking) This still keeps "Not mentioned" as a valid gold answer for mixed-annotator questions. The question is skipped only when all annotators mark it unanswerable, but inside the per-annotator loop any single unanswerable=true answer is still appended to gold_answers. That means a question with one real answer and one unanswerable annotation can become ['Not mentioned', '42'], and the later refusal heuristic in pipeline.py will score a model refusal as fully correct. In other words, benchmark metrics can be inflated by accepting a refusal on answerable questions. For non-fully-unanswerable questions, unanswerable annotations should be ignored rather than preserved as a gold answer.
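The fix the reviewer is asking for can be sketched as follows. This is an illustrative reconstruction, not the PR's code: the helper name is hypothetical, while the per-annotator fields (unanswerable, free_form_answer, extractive_spans, yes_no) follow the Qasper answer format.

```python
def extract_gold_answers(answers):
    """Collect gold answers from a Qasper-style annotator list.

    A question counts as unanswerable only when EVERY annotator marks
    it so (caller then skips it). Otherwise, unanswerable annotations
    are ignored rather than preserved as a 'Not mentioned' gold answer,
    so a model refusal cannot be graded as correct on an answerable
    question.
    """
    if all(a.get("unanswerable", False) for a in answers):
        return None  # fully unanswerable: caller skips the question
    gold = []
    for a in answers:
        if a.get("unanswerable", False):
            continue  # drop refusals on answerable questions
        if a.get("free_form_answer"):
            gold.append(a["free_form_answer"])
        elif a.get("extractive_spans"):
            gold.append(" | ".join(a["extractive_spans"]))
        elif a.get("yes_no") is not None:
            gold.append("Yes" if a["yes_no"] else "No")
    return gold
```

With one real answer and one unanswerable annotation, the result contains only the real answer, so the refusal heuristic in pipeline.py can no longer match against 'Not mentioned'.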
Description
RAG benchmark
RAG benchmark is a framework for evaluating Openviking's RAG (Retrieval-Augmented Generation) system performance, supporting multiple datasets and multiple evaluation metrics.
Features
See benchmark/RAG/README.md for detailed documentation.
Related Issue
Summary
This PR adds the RAG benchmark framework.
Closes #885
Type of Change
Changes Made
Testing
Checklist
Screenshots (if applicable)
Additional Notes