UltraRAG has introduced an intent recognition feature in this module. For questions that do not require retrieval, a response can be generated directly, thereby improving response efficiency. For questions that require retrieval, the system will perform a single retrieval and re-ranking, then generate an accurate response based on the retrieved content.
The returned result includes the retrieved content; if intent recognition determines that retrieval is not needed, a response is generated directly.
Direct Preference Optimization (DPO) and Supervised Fine-Tuning (SFT) are two core techniques for enhancing the performance and alignment of large language models (LLMs). Approaching the problem from the perspectives of preference optimization and supervised learning respectively, they provide different solutions for complex tasks and play complementary roles in building high-quality generative models.
The core idea of DPO is to optimize the model directly on preference data, making its output better aligned with users' actual needs. The method structures user feedback into preference pairs and formalizes generation as optimizing a preference distribution. Unlike reinforcement-learning-based reward modeling (e.g., RLHF), DPO optimizes the preference objective directly to adjust model behavior, avoiding the complexity of policy optimization while significantly improving the alignment and user satisfaction of generated results.
SFT follows the classical supervised learning paradigm, fine-tuning the model on high-quality input-output pairs (such as task-labeled data) with the aim of accurately learning task-specific mappings and providing strong initial performance. SFT training typically relies on large-scale annotated data and improves generation quality by minimizing training error. However, because it depends on labeled data, its ability to generalize across diverse scenarios is relatively limited.
We provide DPO and SFT fine-tuning training schemes based on the trl library. Users can convert their data into the corresponding format and then proceed with the training process as needed.
Parameter Name | Required | Type | Description | Example/Default Value |
---|---|---|---|---|
pipeline_type | Yes | str | Specify method | DPO (choices: DPO, SFT) |
task_type | Yes | str | Specify task type | DPO (choices: DPO, SFT) |
use_lora | No | bool(flag) | Specify whether to use LoRA fine-tuning during training | -
model_name_or_path | Yes | str | Path to the model for training | your_training_model_path |
train_data_path | Yes | str | Path to the training dataset (Note: If providing an external dataset, place the data under ~/resource/dataset/train_dataset/ for selection) | ~/resource/dataset/train_dataset/dpos_train.jsonl |
eval_data_path | Yes | str | Path to the validation dataset (Note: If providing an external validation dataset, place the data under ~/resource/dataset/train_dataset/ for selection) | ~/resource/dataset/train_dataset/dpos_dev.jsonl |
output_dir | Yes | str | Path to save the trained model | ~/output/ddr |
logging_dir | Yes | str | Path to save training logs | ~/output/logs/ddr |
deepspeed_config_file | Yes | str | Path to the deepspeed settings file | ~/config/ds_config_zero2.json |
config_file | Yes | str | YAML configuration file path | ~/config/pipeline/finetune.yaml |
log_file | Yes | str | Path to save training logs | ~/output/logs/ddr/finetune_run.log |
To simplify the user experience, only the necessary parameters are exposed; other parameters are preset in the YAML configuration file, where users can keep the default values or personalize them to specific needs. Training parameters are built on the transformers.TrainingArguments class, offering high flexibility and allowing users to customize and extend them to suit various training scenarios.
Parameter Name | Required | Type | Description | Example/Default Value |
---|---|---|---|---|
Augment_template | Yes | str | Data augmentation template | Background{}Question:{}Answer: |
QA_template | Yes | str | Question-answering template | Question:{}Answer: |
passage_separator | Yes | str | Separator between different documents | |
model_type | Yes | str | Specify the model type | minicpm3 (choices: minicpm3, minicpm2, llama_style) |
use_template | Yes | bool | Specify whether to use a template in the model input stage | True |
max_length | Yes | int | Only for DPO training, maximum length of input sequence (including prompt and completion) | 2200 |
max_prompt_length | Yes | int | Only for DPO training, maximum length of prompt (should be less than max_length) | 2100 |
max_seq_length | Yes | int | Only for SFT training, maximum length of input sequence (including prompt and completion) | 2200 |
max_passage_length | Yes | int | Maximum length of retrieval document (should be less than max_prompt_length or max_seq_length) | 2000 |
top_n | Yes | int | Number of documents to return during retrieval | 5 |
optim | Yes | str | Type of optimizer | adamw_torch |
save_steps | Yes | int | Interval steps for saving the model | 100 |
eval_steps | Yes | int | Interval steps for evaluation | 100 |
per_device_train_batch_size | Yes | int | Training batch size per device | 1 |
per_device_eval_batch_size | Yes | int | Evaluation batch size per device | 2 |
learning_rate | Yes | float | Learning rate | 5e-5 |
eval_strategy | Yes | str | Evaluation strategy | steps |
logging_steps | Yes | int | Interval steps for logging | 10 |
bf16 | Yes | bool | Whether to enable BF16 (bfloat16) mixed-precision training | True
num_train_epochs | Yes | int | Total number of training epochs | 1 |
{"query": "xxx", "retrieval_result": ["xxx", "xxx", "xxx", "xxx", "xxx"],
"chosen": {"text": "xxx"},
"rejected": {"text": "xxx"}}
{"messages": [{"role": "system", "content": "You are helpful"}, {"role": "user", "content": "What's the capital of France?"}, {"role": "assistant", "content": "..."}]}
{"messages": [{"role": "system", "content": "You are helpful"}, {"role": "user", "content": "Who wrote 'Romeo and Juliet'?"}, {"role": "assistant", "content": "..."}]}
{"messages": [{"role": "system", "content": "You are helpful"}, {"role": "user", "content": "How far is the Moon from Earth?"}, {"role": "assistant", "content": "..."}]}
If LoRA fine-tuning is used during training, the LoRA fine-tuned parameters need to be merged with the original model parameters after training to generate the complete model weights.
Parameter Name | Required | Type | Description | Example/Default Value |
---|---|---|---|---|
model_name_or_path | Yes | str | Path to the trained model | your_training_model_path |
lora_name_or_path | Yes | str | Path to the LoRA fine-tuned parameters to be merged | your_lora_model_path |
save_path | Yes | str | Path to save the merged model | your_save_model_path |
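A minimal sketch of the merge step using the peft library, with the placeholder paths from the table above; the actual UltraRAG merge tool may differ.

```python
# Merge LoRA adapter weights back into the base model (peft-based sketch).
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("your_training_model_path")
model = PeftModel.from_pretrained(base, "your_lora_model_path")
merged = model.merge_and_unload()   # fold the LoRA deltas into the base weights
merged.save_pretrained("your_save_model_path")
```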
Parameter Name | Required | Type | Description | Example/Default Value |
---|---|---|---|---|
pipeline_type | Yes | str | Specify the method | vanilla |
embedding_model_path | No | str | Path to the embedding model | your_embedding_model_path |
Parameter Name | Required | Type | Description | Example/Default Value |
---|---|---|---|---|
selected_retrieval_metrics | No | str (list) | List of retrieval metrics to evaluate | [] |
pooling | No | str | Pooling strategy for text representation | Default: "mean" |
query_instruction | No | str | Instruction prepended to the query text | Default: None
queries_path | Yes | str | Path to the query file | Example: "path/to/queries.txt" |
corpus_path | Yes | str | Path to the corpus file | Example: "path/to/corpus.txt" |
qrels_path | Yes | str | Path to the qrels (query-relevance file) | Example: "path/to/qrels.txt" |
retrieval_output_path | Yes | str | Path to save the retrieval output | Example: "path/to/output.txt" |
log_path | No | str | Path to save the log file | Default: None |
topk | No | int | The number of top-k documents to retrieve | Default: 10 |
cutoffs | No | str | Cutoff values for evaluation metrics | Default: None |
Supported retrieval metrics: MRR, NDCG, Recall.
The last line of the output file will contain the average scores for all metrics.
Input Data:
Three files are required: query.jsonl (query data), corpus.jsonl (document data), and qrels.tsv (triplet file).
- Query data format (query.jsonl) and document data format (corpus.jsonl):
{"_id": "aaa", "text": "This is query 1"}
{"_id": "bbb", "text": "This is document 1"}
- Triplet file format (qrels.tsv) (Note: tab-separated):
query-id corpus-id score
aaa bbb 1
Output Data:
result.trec (Note: tab-separated):
aaa Q0 bbb 1 0.1 1
Meaning of each field:
<query_id> Q0 <doc_id> <rank> <score> <run_id>
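For illustration only (not UltraRAG's implementation), MRR and Recall@k can be computed from these two file formats like this:

```python
# Illustrative scorer for the qrels.tsv / result.trec formats above.
from collections import defaultdict

def load_qrels(path):
    relevant = defaultdict(set)
    with open(path) as f:
        next(f)  # skip the "query-id corpus-id score" header line
        for line in f:
            qid, did, score = line.rstrip("\n").split("\t")
            if int(score) > 0:
                relevant[qid].add(did)
    return relevant

def load_run(path):
    run = defaultdict(list)
    with open(path) as f:
        for line in f:
            qid, _, did, rank, _, _ = line.rstrip("\n").split("\t")
            run[qid].append((int(rank), did))
    return {q: [d for _, d in sorted(pairs)] for q, pairs in run.items()}

def evaluate(qrels, run, k=10):
    mrr, recall = [], []
    for qid, rel in qrels.items():
        ranked = run.get(qid, [])
        mrr.append(next((1.0 / (i + 1) for i, d in enumerate(ranked) if d in rel), 0.0))
        recall.append(len(set(ranked[:k]) & rel) / len(rel))
    return sum(mrr) / len(mrr), sum(recall) / len(recall)

print(evaluate(load_qrels("qrels.tsv"), load_run("result.trec"), k=10))
```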
The LLM parameters need to be configured in UltraRAG/config/pipeline/eval/eval.yaml.
Parameter Name | Required | Parameter Type | Description | Example/Default Value |
---|---|---|---|---|
selected_generated_metrics | No | str (list) | List of generation metrics to evaluate | Default: [] |
test_dataset | Yes | str (list) | List of dataset files (json or jsonl) | dataset1.json dataset2.jsonl |
output_path | Yes | str | Path to save results | results/output.json |
knowledge_id | No | str (list) | List of knowledge bases | collection1 collection2 |
knowledge_stat_tab_path | No | str | Path to the knowledge management table | your_knowledge_stat_tab_path |
evaluate_only | No | bool (flag) | If set, skip generation and directly evaluate the dataset (must meet the retrieval or generation evaluation input format) | False |
metric_api_key | No | str | API key for the model used in metric evaluation | your_api_key |
metric_base_url | No | str | Base URL for the model used in metric evaluation | your_base_url |
metric_model_name | No | str | Model name for the model used in metric evaluation | your_model_name |
api_key | No | str | API key for the model being evaluated | your_api_key |
base_url | No | str | Base URL for the model being evaluated | your_base_url |
model_name | No | str | Model name for the model being evaluated | your_model_name |
reranker_model_path | No | str | Path to the reranker model | your_reranker_model_path |
Supported generation metrics: Completeness (RAGEval), ROUGE, EM, Accuracy, F1, BLEU, METEOR, BERTScore.
The last line of the output file contains the average score for each metric.
Input Data
The input must include query and answer. If a system prompt is needed that should not participate in retrieval, pass it in via the optional instruction field.
{"id": 0, "query": "xxx?", "answer": "xxx", "prediction": "xxxyyy", "instruction":"this is optional key"}
{"id": 0, "query": "aaa?", "answer": "bbb", "prediction": "bbb"}
Output Data
{"id": 0, "query": "xxx?", "answer": "xxx", "prediction": "xxxyyy", "xxx_score": 20.12, "xxx_score": 20.26}
{"id": 0, "query": "aaa?", "answer": "bbb", "prediction": "bbb", "x_score": 100.00, "xx_score": 100.00}
{"average_scores": {"x": 60.06, "xx": 60.13}
Paper URL: Retriever-and-Memory: Towards Adaptive Note-Enhanced Retrieval-Augmented Generation
GitHub URL: UltraRAG-Adaptive-Note
To address the challenges of lack of information and poor interactivity faced by current Retrieval-Augmented Generation (RAG) systems in complex questioning tasks, we propose a novel end-to-end approach called UltraRAG-Adaptive-Note. This method consists of three core modules:
- Iterative Information Collector (IIC): IIC uses notes as a knowledge carrier to systematically integrate and dynamically update retrieved information. Initially, large language models (LLMs) generate initial notes from the retrieved references and store them as optimal memory. During iterations, IIC predicts new retrieval queries based on the existing optimal memory and continuously updates the notes to achieve dynamic knowledge expansion.
- Adaptive Memory Reviewer (AMR): AMR dynamically evaluates the quality of the updated notes relative to the current optimal memory, deciding whether to replace the existing notes. Additionally, AMR implements a note-based exploration stop strategy to prevent excessive searching: information collection terminates promptly once the information gain becomes insignificant, enhancing system efficiency.
- Task-Oriented Generator: This module extracts key information from the optimal memory to generate high-quality answers, and supports various question-answering formats, ensuring answer specificity and accuracy.
Through the synergistic effect of the above modules, UltraRAG-Adaptive-Note achieves efficient solutions to complex problems from a knowledge growth perspective, demonstrating significant performance advantages in multi-hop Q&A and long-text generation tasks.
Paper URL: KBAlign: Efficient Self Adaptation on Specific Knowledge Bases
GitHub URL: KBAlign GitHub
UltraRAG-KBAlign aims to enhance large language models' (LLMs) ability to adapt knowledge efficiently when handling tasks involving knowledge bases. Unlike traditional methods that rely on external signals (such as human preference data or annotations from more powerful LLMs), KBAlign employs self-supervised learning to achieve knowledge adaptation efficiently and cost-effectively. The method mainly includes three key components:
- Self-Annotated Training Data Combining Long and Short Dependencies: By combining long and short dependencies, KBAlign automatically generates high-quality training data to improve the model's understanding of, and adaptability to, the information in knowledge bases.
- Self-Verification and Iterative Training: A self-verification mechanism continuously optimizes the model during iterative training, enabling it to gradually improve knowledge alignment without supervision.
- Inference Optimization: During inference, the generation process is optimized using the aligned knowledge representations to ensure the accuracy and consistency of answers.
UltraRAG-KBAlign uses a self-annotation method combining long and short dependencies to construct training data, enhancing the model's ability to adapt knowledge.
- Short Dependency Annotation focuses only on the local information of a single chunk, ensuring the model's precise understanding of fine-grained knowledge.
- Long Dependency Annotation is divided into homogeneous data and heterogeneous data to build richer Q&A pairs:
- Homogeneous data creates ambiguous questions by integrating multiple related paragraphs to obtain the final answer;
- Heterogeneous data uses methods like clustering to extract more global Q&A pairs from different sections, enhancing the model's cross-sectional inference ability.
Parameter Name | Required | Parameter Type | Description | Example/Default Value |
---|---|---|---|---|
model_name_or_path | Yes | str | Path to the model to be fine-tuned | |
config_path | Yes | str | Path to the YAML configuration file | |
embedding_model_path | Yes | str | Path to the embedding model | |
knowledge_id | Yes | str | ID of the knowledge set in Qdrant | |
knowledge_stat_tab_path | Yes | str | Path to the knowledge statistics table | |
clustering | No | store_true | Whether data needs clustering (heterogeneous data) | False |
output_dir | Yes | str | Path to the output directory | |
language | Yes | str | Language type (Chinese/English) | Chinese or English |
functions_to_run | Yes | str | Name of functions to execute (e.g., function_q or function_qr) | function_q function_qr |
file_list | Yes | list | List of JSON or JSONL files to be merged | |
ratios | Yes | list | Ratio for each file, such as 1:1 | [1, 1] |
fixed_steps | No | int | Fixed number of merge steps | Default is None, a user-provided integer |
random_merge | No | store_true | Whether to randomly shuffle data before merging | Default is False |
output_file | Yes | str | Path to the merged output file | |
output_format | Yes | str | Output format (json or jsonl) | json or jsonl |
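A hedged sketch of how the file_list/ratios merge described above could work; this is one interpretation of the parameters, not UltraRAG's exact implementation (fixed_steps is omitted for brevity).

```python
# Ratio-based data merging sketch: interleave ratios[i] records at a time
# from each file; with random_merge, shuffle each file first.
import json
import random

def merge(file_list, ratios, random_merge=False):
    pools = [[json.loads(l) for l in open(p)] for p in file_list]
    if random_merge:
        for pool in pools:
            random.shuffle(pool)
    merged, idx = [], [0] * len(pools)
    while any(i < len(p) for i, p in zip(idx, pools)):
        for k, r in enumerate(ratios):
            merged.extend(pools[k][idx[k]:idx[k] + r])
            idx[k] += r
    return merged

# e.g. merge(["a.jsonl", "b.jsonl"], ratios=[1, 1], random_merge=True)
```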
UltraRAG-KBAlign uses iterative self-verification for training, requiring configuration of fixed data volume, iteration counts, and other key parameters. During each training round, the model answers a portion of the data and self-verifies based on the generated answers. Verification results are incorporated into the next round of training data, promoting continuous optimization and gradual improvement in performance and verification ability.
During the inference phase, UltraRAG-KBAlign enhances performance through methods such as Query Expansion and Confidence Check. Training with the KBAlign method enhances the model's mastery of knowledge bases and self-verification capabilities, enabling further optimization in the accuracy of expanded query retrieval and confidence assessment.
Paper URL: RAG-DDR: Optimizing Retrieval-Augmented Generation Using Differentiable Data Rewards
Project URL: RAG-DDR
Existing RAG systems face two major challenges: first, the retrieved documents may contain a large amount of noisy information; second, the retrieved external knowledge may conflict with the inherent knowledge in the model parameters. These issues significantly affect the accuracy and reliability of large language models (LLMs) during the generation process. To enhance the retrieval-augmented capabilities of LLMs, a common approach is to conduct supervised fine-tuning based on knowledge-intensive tasks. However, due to a heavy reliance on labeled data during training, these models often have limited generalization capabilities and struggle to adapt to complex and diverse real-world application scenarios.
To address the above issues, we propose an end-to-end optimization scheme for RAG systems—UltraRAG-DDR. This method utilizes rollout techniques to systematically evaluate the reward scores of the RAG module, and optimizes the model to better align with data preferences. By targeted data sampling for specific task scenarios, preference data pairs that meet the Direct Preference Optimization (DPO) method requirements are generated, and efficient training based on DPO is conducted, significantly enhancing the system's performance in specific tasks.
In UltraRAG-DDR, we use preprocessed document information from the knowledge base to construct triplet data containing Query, Ground-truth, and Keypoints using a high-performance model, and generate a Reference via a retrieval model, thus creating a standardized raw dataset. Additionally, the DDR data sampling strategy is employed to enable the model to generate diverse responses for each query in both inherent and external knowledge scenarios by adjusting temperature parameters and using a repetitive sampling mechanism. Supervised labels are utilized to naturally obtain reward scores for each response, with the highest-scoring response selected as a positive example and the lowest as a negative example, thereby constructing high-quality preference training data pairs.
This data construction strategy not only significantly improves the model's adaptability and generation quality across different knowledge scenarios, but also achieves highly integrated one-click data construction functionality. Users simply need to upload documents and select a target model to automatically complete the entire process of generating training data, with support for various training methods like DPO and SFT. This scheme significantly lowers operational barriers and data construction costs, providing a more efficient and convenient solution for RAG research and practice.
Parameter Name | Required | Type | Description | Example/Default Value |
---|---|---|---|---|
pipeline_type | Yes | str | Specify the method | ddr |
Train Model Name or Path | Yes | str | Path to the model for training | your_training_model_path |
Data Model Name or Path | Yes | str | Path to the model for data construction (stronger performance than training model) | your_data_constructing_model_path |
Embedding Model Path | Yes | str | Path to the embedding model | your_embedding_model_path |
Config Path | Yes | str | Path to the yaml configuration file | ~/config/pipeline/ddr/datasets.yaml |
Train Output Path | Yes | str | Output path for the training set | ~/resource/dataset/train_dataset/dpos_train.jsonl |
Dev Output Path | Yes | str | Output path for the validation set | ~/resource/dataset/train_dataset/dpos_dev.jsonl |
current_kb_config_id | Yes | str | Knowledge base configuration ID (automatically inputted after knowledge base configuration) | your_current_kb_config_id |
knowledge_id | Yes | str | Knowledge base ID (automatically inputted after knowledge base configuration) | your_knowledge_id |
knowledge_stat_tab_path | Yes | str | Path to the knowledge base management table (automatically inputted after knowledge base configuration) | your_knowledge_stat_tab_path |
To simplify the user operation experience, we have only listed the necessary parameters. Some optimization parameters have been preset in the YAML configuration file, and users can directly use the default values or adjust them according to special requirements:
Parameter Name | Required | Parameter Type | Description | Example/Default Value |
---|---|---|---|---|
VllmServer_params | Yes | - | vLLM service configuration | -
sampling_params | Yes | - | Generation control parameters (for the data construction model) | -
max_data_nums | Yes | int | Maximum number of constructed data | 5000 |
top_k | Yes | int | Number of documents returned during retrieval | 5 |
method | Yes | str | Retrieval method, e.g., "dense" indicates using dense retrieval | dense |
Augment_template | Yes | str | Data augmentation template | Background{}Question:{}Answer: |
QA_template | Yes | str | Q&A template | Question:{}Answer: |
max_prompt_length | Yes | int | Maximum length of the prompt | 4096 |
max_passage_length | Yes | int | Maximum length of the retrieved document (should be less than max_prompt_length) | 2000 |
passage_separator | Yes | str | Separator between different documents | |
model_type | Yes | str | Specify the type of model | minicpm3 (options: minicpm3, minicpm2, llama_style) |
use_template | Yes | bool | Specify whether to use a template during the model input stage | True |
batch_size | Yes | int | Batch size during data processing | 64 |
dpo_sampling_params | Yes | - | Generation control parameters for DPO sampling (for the model being trained) | -
metric | Yes | str | Sampling evaluation metric | rouge (options: rouge, em, accuracy, f1) |
ratio | Yes | float | Division ratio of training and test data, e.g., "0.1" means 10% of the data is used for testing. | 0.1 |
{"file_index": 1, "chunk_index": 1, "chunk": "xxx","query": "xxx", "ground_truth": "xxx", "keypoints": "1. xxx\n2. xxx" , "retrieval_result": ["xxx", "xxx", "xxx", "xxx", "xxx"], "id": 1, "raw_input": "xxx", "augment_input": "xxx",
"context": [
{"text": "xxx", "temperature": 0.5, "type": "raw", "x_score": 0.85},
{"text": "xxx", "temperature": 0.5, "type": "aug", "x_score": 0.62},
{"text": "xxx", "temperature": 0.6, "type": "raw", "x_score": 0.59},
{"text": "xxx", "temperature": 0.6, "type": "aug", "x_score": 0.43,
{"text": "xxx", "temperature": 0.7, "type": "raw", "x_score": 0.58},
{"text": "xxx", "temperature": 0.7, "type": "aug", "x_score": 0.69},
{"text": "xxx", "temperature": 0.8, "type": "raw", "x_score": 0.25},
{"text": "xxx", "temperature": 0.8, "type": "aug", "x_score": 0.74},
{"text": "xxx", "temperature": 0.9, "type": "raw", "x_score": 0.55},
{"text": "xxx", "temperature": 0.9, "type": "aug", "x_score": 0.91}
],
"chosen": {"text": "xxx", "temperature": 0.9, "type": "aug", "x_score": 0.91},
"rejected": {"text": "xxx", "temperature": 0.8, "type": "raw", "x_score": 0.25}}
Paper link: https://arxiv.org/abs/2410.10594
Model link: https://huggingface.co/openbmb/VisRAG-Ret
Repository link: https://github.com/OpenBMB/VisRAG
UltraRAG-Vis is a novel Retrieval-Augmented Generation (RAG) pipeline based on visual language models (VLM). Unlike traditional text parsing methods, UltraRAG-Vis directly embeds documents as images and uses VLM for retrieval and generation. This approach maximizes the retention and utilization of data in the original documents, avoiding potential information loss that may occur during traditional text parsing.
The workflow of UltraRAG-Vis is primarily divided into two modules: Retrieval Module (VisRAG-Ret) and Generation Module (VisRAG-Gen). Unlike traditional text parsing methods, UltraRAG-Vis directly utilizes image embeddings for information retrieval, avoiding losses that may be introduced in traditional document parsing processes (such as OCR or text extraction).
The core task of the VisRAG-Ret module is to convert the input query and document (image) into embedding vectors. This module uses the MiniCPM-V 2.0 model, integrating SigLIP as a visual encoder and MiniCPM-2B as an LLM Backbone, enabling it to handle both visual and text information simultaneously. When processing queries and documents, VisRAG-Ret first embeds them into a shared vector space and matches the most relevant documents through similarity calculations (e.g., dot product or cosine similarity).
- Input: Text query or image document.
- Processing: Perform visual encoding and language encoding on the input to generate corresponding embedding vectors.
- Output: Embedding vectors for queries and documents, used for subsequent retrieval and generation tasks.
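A conceptual sketch of this retrieval step: the encode_* calls are placeholders rather than the actual VisRAG-Ret API, but the similarity matching proceeds as shown.

```python
# Queries and page images are embedded into a shared space and matched by
# cosine similarity; top-k document indices and scores are returned.
import numpy as np

def cosine_top_k(query_vec, doc_vecs, k=5):
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q
    top = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in top]

# query_vec = encode_text("your question")                  # placeholder
# doc_vecs = np.stack([encode_image(img) for img in pages])  # placeholder
# print(cosine_top_k(query_vec, doc_vecs, k=3))
```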
The VisRAG-Gen module uses the documents retrieved by the VisRAG-Ret module, in combination with queries, to generate corresponding text content. Unlike traditional RAG methods, VisRAG-Gen directly uses existing visual language models (such as MiniCPM-V 2.0, MiniCPM-V 2.6, and GPT-4o) for generation tasks.
- Input: Query and documents obtained from the retrieval module.
- Processing: Pass the query and document as inputs to the generation model to generate text via the generation model.
- Output: Text generated based on the input query and related documents.
- No Document Parsing Required: UltraRAG-Vis directly accepts documents as image inputs, avoiding information loss during traditional parsing processes.
- Multimodal Processing: Handles both visual and linguistic information simultaneously, adapting to various types of documents (such as academic articles, images, or mixed text and image documents).
- Flexible Generation Capability: Directly utilizes existing visual language models for generation, providing high flexibility.
- Enhanced Information Utilization: Compared with traditional text-parsing RAG, UltraRAG-Vis maximizes information utilization by preserving the original visual information of the document.
Model Address: https://huggingface.co/openbmb/MiniCPM-Embedding-Light
MiniCPM-Embedding-Light is a bilingual Chinese-English text embedding model jointly developed by ModelBest Inc., the Natural Language Processing Laboratory of Tsinghua University (THUNLP), and the Information Retrieval Group of Northeastern University (NEUIR). It delivers excellent performance on Chinese and English retrieval tasks as well as cross-lingual retrieval between Chinese and English. The model supports long texts (up to 8192 tokens), produces both dense vectors and token-level sparse vectors, and allows flexible dense-vector dimensions (Matryoshka representation).
Structurally, MiniCPM-Embedding-Light adopts bidirectional attention and Weighted Mean Pooling. It employs a multi-stage training approach, leveraging approximately 260 million training samples from open-source, machine-generated, and proprietary datasets. Thanks to a meticulously designed domain-adaptive data synthesis method (integrated into UltraRAG), MiniCPM-Embedding-Light demonstrates exceptional performance in retrieval tasks.
Model | C-MTEB/Retrieval(NDCG@10) | BEIR(NDCG@10) |
---|---|---|
bge-large-zh-v1.5 | 70.46 | - |
gte-large-zh | 72.49 | - |
Conan-embedding-v1 | 76.67 | - |
bge-large-en-v1.5 | - | 54.29 |
modernbert-embed-large | - | 54.36 |
snowflake-arctic-embed-l | - | 55.98 |
gte-en-large-v1.5 | - | 57.91 |
me5-large | 63.66 | 51.43 |
bge-m3(Dense) | 65.43 | 48.82 |
gte-multilingual-base(Dense) | 71.95 | 51.08 |
jina-embeddings-v3 | 68.60 | 53.88 |
gte-Qwen2-1.5B-instruct | 71.86 | 58.29 |
MiniCPM-Embedding | 76.76 | 58.56 |
MiniCPM-Embedding-Light(Dense) | 72.71 | 55.27 |
MiniCPM-Embedding-Light(Dense+Sparse) | 73.13 | 56.31 |
MiniCPM-Embedding-light(Dense+Sparse)+MiniCPM-Reranker-Light | 76.34 | 61.49 |
Model | MKQA En-Zh_CN (Recall@20) | NeuCLIR22 (NDCG@10) | NeuCLIR23 (NDCG@10) |
---|---|---|---|
me5-large | 44.3 | 9.01 | 25.33 |
bge-m3(Dense) | 66.4 | 30.49 | 41.09 |
gte-multilingual-base(Dense) | 68.2 | 39.46 | 45.86 |
MiniCPM-Embedding | 72.95 | 52.65 | 49.95 |
MiniCPM-Embedding-Light(Dense) | 68.29 | 41.17 | 45.83 |
MiniCPM-Embedding-Light(Dense)+MiniCPM-Reranker-Light | 71.86 | 54.32 | 56.50 |
In typical RAG (Retrieval-Augmented Generation) scenarios, the document repository provided by users is often highly specialized in a specific domain, and retrieval models that have not been fine-tuned on that domain generally perform poorly. Fine-tuning the retrieval model on domain-specific data usually improves retrieval results significantly; the challenge lies in collecting the query-doc pairs needed for fine-tuning. To address this, we use a large language model (LLM) to automatically generate queries for the user's documents, perform negative example mining, and clean the data for fine-tuning the retrieval and reranking models, thereby enhancing retrieval performance in the RAG pipeline.
This module consists of four parts: Data Preprocessing, Query Synthesis, Negative Example Mining, and Data Cleaning.
In this module, we process the user documents to extract a certain number of semantically similar documents for each document, which will be used for subsequent data synthesis. The method involves using embeddings to compute the vector representation of the documents and calculating the cosine similarity between each document and other documents in the repository, selecting the top-ranked similar documents based on similarity.
Document data format corpus.jsonl:
{"contents": "This is document 1"}
Input Parameters:
Parameter Name | Required | Type | Description | Example/Default Value |
---|---|---|---|---|
embed | Yes | str | Path to the embedding model used for preprocessing | ~/UltraRAG-Vec |
pooling | No | str | Pooling method for the embedding model | mean |
corpus_path | Yes | str | Path to user document slices | ~/dataset/corpus.jsonl |
output_path | Yes | str | Output path | ~/dataset/preprocessed.jsonl |
search_start_index | No | int | Start index of the document extraction range | 1 |
search_end_index | No | int | End index of the document extraction range | 30 |
Output: Preprocessed data format synthesis_qd.jsonl:
{"doc": "doc", "sims": ["doc", "doc2"]}
In this module, we synthesize queries for user documents, supporting both Chinese and English. Synthesis can use few-shot examples provided by the user, or be performed zero-shot. We provide the generation model with the target document and a randomly selected similar document (obtained in the previous stage) serving as a negative example; the model then generates a query that is related to the target document but unrelated to the negative document.
Input file format input.jsonl:
{"doc": "doc", "sims": ["doc", "doc2"]}
Example data format shot.jsonl and output format output.jsonl:
{"query": "This is query1", "pos": ["This is the correct document1"]}
The output can also be written in a three-file format (aligned with the evaluation and BEIR formats), consisting of query data query.jsonl, document data corpus.jsonl, and the triplet file qrels.tsv.
Query data format query.jsonl and document data format corpus.jsonl:
{"_id": "aaa", "text": "This is query1"}
{"_id": "bbb", "text": "This is document1"}
Triplet file format qrels.tsv (Note: tab-separated):
query-id corpus-id score
aaa bbb 1
Input Parameters:
Parameter Name | Required | Type | Description | Example/Default Value |
---|---|---|---|---|
api_key | Yes | str | API key for an OpenAI-compatible generation model | sk-114514NYNICG
base_url | Yes | str | Base URL of the OpenAI-compatible generation service | your_base_url
model_name | Yes | str | Name of the OpenAI-compatible generation model | gpt-4o
language | Yes | str | User document language; currently supports Chinese and English ('zh', 'en') | zh |
input_pair_path | Yes | str | File address of the pre-processed document | ~/dataset/preprocessed.jsonl |
output_path | Yes | str | Output file address | ~/dataset/synthesis_train.jsonl ~/dataset/qrels.tsv (three-file format) |
query_num_per_corpus | Yes | int | Number of queries to synthesize per document | 5 |
query_path | No | str | Output query file address (if provided, outputs in three-file format) | ~/dataset/synthesis_query.jsonl |
corpus_path | No | str | Output document file address (if provided, outputs in three-file format) | ~/dataset/synthesis_corpus.jsonl |
corpus_sample_num | No | int | Number of documents to create queries for | -1 (default is all) |
neg_start_index | No | int | Start index for extracting negative document examples | 1 |
negs_end_index | No | int | End index for extracting negative document examples | 30 |
shot_num | No | int | Number of shots provided during Few-shot synthesis; set to 0 for Zero-shot | 0 (Zero-shot) |
shot_file | No | str | Example file provided by the user | ~/dataset/shot.jsonl |
input_prompt_path | No | str | Custom data prompt provided by the user | ~/prompt.txt |
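As a hedged sketch, zero-shot synthesis against an OpenAI-compatible endpoint could look like this; the prompt wording and file names are illustrative, not UltraRAG's built-in prompt.

```python
# Zero-shot query synthesis: ask the model for a question answerable by the
# target document but not by the sampled negative document.
import json
import random
from openai import OpenAI

client = OpenAI(api_key="your_api_key", base_url="your_base_url")

def synthesize_query(doc, negative_doc):
    prompt = (
        "Write one question that is answered by Document A but NOT by "
        f"Document B.\nDocument A: {doc}\nDocument B: {negative_doc}\nQuestion:"
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()

with open("synthesis_train.jsonl", "w") as out:
    for line in open("preprocessed.jsonl"):
        rec = json.loads(line)
        query = synthesize_query(rec["doc"], random.choice(rec["sims"]))
        out.write(json.dumps({"query": query, "pos": [rec["doc"]]}) + "\n")
```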
For both the embedding model and the reranker model, positive and negative examples need to be provided during training. In the previous step, we synthesized a query from the document, and for this query, the document is considered a positive example. Afterward, we need to mine semantically similar documents as negative examples for the query by examining the similarity between the query and documents in the document library. (Of course, the mined negative examples might also be related to the query, i.e., they might be false negatives; in the next step, we will perform data cleaning to try and filter out these false negatives.)
Training data format train.jsonl:
{"query": "This is query1", "pos": ["This is the correct document1"]}
Document data format corpus.jsonl:
{"id": "aaa", "contents": "This is document1"}
Input parameters:
Parameter Name | Required | Parameter Type | Description | Example/Default Value |
---|---|---|---|---|
embed | Yes | str | Path for the embedding model used for data preprocessing | ~/UltraRAG-Vec |
pooling | No | str | Pooling method for the embedding model used for data preprocessing | mean |
query_instruction | No | str | Instruction to be added before the query in the embedding model for data preprocessing | None (no Instruction) |
corpus_path | Yes | str | Path to the user document slices | ~/dataset/corpus.jsonl |
qrel_path | Yes | str | Path to the training data | ~/dataset/train.jsonl |
output_path | Yes | str | Output path | ~/dataset/diged.jsonl |
search_start_index | No | int | Starting index for document mining range | 1 |
search_end_index | No | int | Ending index for document mining range | 30 |
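An illustrative sketch of the mining step, under the same sentence-transformers assumption as above; the index range mirrors search_start_index/search_end_index, and the output field name neg is an assumption.

```python
# For each query, take the most similar corpus documents (excluding known
# positives) as candidate negatives.
import json
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("your_embedding_model_path")
corpus = [json.loads(l)["contents"] for l in open("corpus.jsonl")]
doc_emb = model.encode(corpus, normalize_embeddings=True)

with open("diged.jsonl", "w") as out:
    for line in open("train.jsonl"):
        rec = json.loads(line)
        q_emb = model.encode(rec["query"], normalize_embeddings=True)
        order = np.argsort(-(doc_emb @ q_emb))  # most similar documents first
        # take ranks [1, 30) as candidate negatives, skipping known positives
        rec["neg"] = [corpus[j] for j in order[1:30] if corpus[j] not in rec["pos"]]
        out.write(json.dumps(rec) + "\n")
```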
In this stage, we aim to filter out false negative and false positive samples to improve the quality of the training data. Common methods include filtering false negatives based on the ratio of, or difference between, the similarity scores of the query with the positive and negative examples, and filtering false positives based on how highly the positive example ranks for its query (in practice, entries that end up with too few negatives after filtering are dropped, which effectively removes samples whose positive examples rank low).
Data Format: train.jsonl
{"query": "This is query1", "pos": ["This is the correct document1"]}
Input Parameters:
Parameter Name | Required | Type | Description | Example/Default Value |
---|---|---|---|---|
embed/reranker | Yes | str | Path to the embedding/reranker model used for data preprocessing | ~/UltraRAG-Vec ~/UltraRAG-Reranker |
pooling | No | str | Pooling method for the embedding model used in data preprocessing | mean |
query_instruction | No | str | Instruction to be added before the query in the embedding model during data preprocessing | None (do not add Instruction) |
qrel_path | Yes | str | Path to the training data | ~/dataset/diged.jsonl |
output_path | Yes | str | Output path | ~/dataset/clean.jsonl |
search_start_index | No | int | Start index of the document negatives range | 1 |
search_end_index | No | int | End index of the document negatives range | 30 |
keep_neg_num | No | int | Number of negatives to retain per entry | 7 |
score_ratio | No | float | Maximum ratio of negative score to positive score | 1.0 |
score_margin | No | float | Minimum value of positive score minus negative score | 0.0 |
min_pos_score | No | float | Minimum positive score | 0.0 |
max_neg_score | No | float | Maximum negative score | 0.0 |
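A hedged sketch of the cleaning rule, assuming precomputed query-document similarity scores; the thresholds mirror the table above, and this is an interpretation rather than UltraRAG's exact implementation.

```python
# Filter a query's candidate negatives against its positive's score.
def clean_negatives(pos_score, scored_negs, score_ratio=1.0,
                    score_margin=0.0, keep_neg_num=7):
    kept = [(doc, s) for doc, s in scored_negs
            if s <= pos_score * score_ratio       # drop likely false negatives
            and pos_score - s >= score_margin]
    kept.sort(key=lambda x: x[1], reverse=True)   # keep the hardest negatives
    return [doc for doc, _ in kept[:keep_neg_num]]

# Entries whose positive score falls below min_pos_score, or which retain too
# few negatives after filtering, would be dropped entirely as likely false
# positives.
```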