
Commit b599898

Feature: add train-eval loop (#3569)
* update args
* remove sh
* update eval loop
* update doc
1 parent 83b9e33

File tree

9 files changed: +254 −0 lines changed


docs/source/Instruction/命令行参数.md

Lines changed: 5 additions & 0 deletions
@@ -329,6 +329,11 @@ Vera uses the three parameters `target_modules`, `target_regex`, and `modules_to_save`.

- temperature: Overrides the generation parameters; the temperature used when predict_with_generate=True. Defaults to 0.
- optimizer: Custom optimizer name for the plugin. Defaults to None.
- metric: Custom metric name for the plugin. Defaults to None, i.e. set to 'acc' when predict_with_generate=False and to 'nlg' when predict_with_generate=True.
- eval_use_evalscope: Whether to use EvalScope for evaluation during training; this parameter must be set to enable evaluation. Defaults to False. See the [example](../Instruction/评测.md#训练中评测) for usage.
- eval_datasets: Evaluation datasets; multiple datasets can be given, separated by spaces.
- eval_datasets_args: Arguments for the evaluation datasets, in JSON format; arguments for multiple datasets can be set (see the example below).
- eval_limit: Number of samples drawn from each evaluation dataset.
- eval_generation_config: Model inference configuration used during evaluation, in JSON format. Defaults to `{'max_tokens': 512}`.
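
For example, per-dataset arguments for several benchmarks can be packed into a single JSON object. A hypothetical pairing of flags (the `arc` key is illustrative; any dataset name supported by EvalScope works) that could be appended to a `swift sft` command:

```shell
--eval_use_evalscope \
--eval_datasets gsm8k arc \
--eval_datasets_args '{"gsm8k": {"few_shot_num": 0}, "arc": {"few_shot_num": 0}}'
```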

### RLHF Parameters

RLHF parameters inherit from the [training parameters](#训练参数).

docs/source/Instruction/评测.md

Lines changed: 43 additions & 0 deletions
@@ -88,6 +88,49 @@ swift eval \

For the full list of evaluation parameters, see [here](命令行参数.md#评测参数).

## Evaluation During Training

SWIFT supports using EvalScope to evaluate the current model during training, so that you can track the model's training progress in a timely way.

**Basic Example**

```shell
CUDA_VISIBLE_DEVICES=0 \
swift sft \
--model "Qwen/Qwen2.5-0.5B-Instruct" \
--train_type "lora" \
--dataset "AI-ModelScope/alpaca-gpt4-data-zh#100" \
--torch_dtype "bfloat16" \
--num_train_epochs "1" \
--per_device_train_batch_size "1" \
--learning_rate "1e-4" \
--lora_rank "8" \
--lora_alpha "32" \
--target_modules "all-linear" \
--gradient_accumulation_steps "16" \
--save_steps "50" \
--save_total_limit "5" \
--logging_steps "5" \
--max_length "2048" \
--eval_strategy "steps" \
--eval_steps "5" \
--per_device_eval_batch_size "5" \
--eval_use_evalscope \
--eval_datasets "gsm8k" \
--eval_datasets_args '{"gsm8k": {"few_shot_num": 0}}' \
--eval_limit "10"
```

Note that the launch command is `sft`; the eval-related parameters are:
- eval_strategy: Evaluation strategy. Defaults to None, following the `save_strategy` setting.
- eval_steps: Defaults to None; if an evaluation dataset exists, follows the `save_steps` setting.
- eval_use_evalscope: Whether to use EvalScope for evaluation; must be set to enable evaluation.
- eval_datasets: Evaluation datasets; multiple datasets can be given, separated by spaces.
- eval_datasets_args: Arguments for the evaluation datasets, in JSON format; arguments for multiple datasets can be set.
- eval_limit: Number of samples drawn from each evaluation dataset.
- eval_generation_config: Model inference configuration used during evaluation, in JSON format. Defaults to `{'max_tokens': 512}` (see the override example below).
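
For instance, to allow longer completions with greedy decoding during in-training evaluation, the default could be overridden as follows (the values are illustrative):

```shell
--eval_generation_config '{"max_tokens": 1024, "temperature": 0}'
```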

More evaluation examples can be found in [examples](https://github.com/modelscope/ms-swift/tree/main/examples/eval).

## Custom Evaluation Datasets

docs/source_en/Instruction/Command-line-parameters.md

Lines changed: 5 additions & 0 deletions
@@ -337,6 +337,11 @@ Training arguments include the [base arguments](#base-arguments), [Seq2SeqTraine

- temperature: Generation parameter override; the temperature used when `predict_with_generate=True`. Defaults to 0.
- optimizer: Custom optimizer name for the plugin. Defaults to None.
- metric: Custom metric name for the plugin. Defaults to None, i.e. set to 'acc' when `predict_with_generate=False` and to 'nlg' when `predict_with_generate=True`.
- eval_use_evalscope: Whether to use EvalScope for evaluation during training; this parameter must be set to enable evaluation. Defaults to False. Refer to the [example](../Instruction/Evaluation.md#evaluation-during-training).
- eval_datasets: Evaluation datasets; multiple datasets can be set, separated by spaces.
- eval_datasets_args: Evaluation dataset arguments in JSON format; arguments for multiple datasets can be set (see the sketch after this list).
- eval_limit: Number of samples drawn from each evaluation dataset.
- eval_generation_config: Model inference configuration used during evaluation, in JSON format. Defaults to `{'max_tokens': 512}`.
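
As a sketch, per-dataset arguments for several benchmarks can be combined in one JSON object (the `arc` key is illustrative; any dataset name supported by EvalScope can appear):

```shell
--eval_use_evalscope \
--eval_datasets gsm8k arc \
--eval_datasets_args '{"gsm8k": {"few_shot_num": 0}, "arc": {"few_shot_num": 0}}'
```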

### RLHF Arguments

docs/source_en/Instruction/Evaluation.md

Lines changed: 42 additions & 0 deletions
@@ -88,6 +88,48 @@ Where:

For a specific list of evaluation parameters, please refer to [here](./Command-line-parameters.md#evaluation-arguments).

## Evaluation During Training

SWIFT supports using EvalScope to evaluate the current model during training, allowing you to track the model's training effectiveness as it happens.

**Basic Example**

```shell
CUDA_VISIBLE_DEVICES=0 \
swift sft \
--model "Qwen/Qwen2.5-0.5B-Instruct" \
--train_type "lora" \
--dataset "AI-ModelScope/alpaca-gpt4-data-zh#100" \
--torch_dtype "bfloat16" \
--num_train_epochs "1" \
--per_device_train_batch_size "1" \
--learning_rate "1e-4" \
--lora_rank "8" \
--lora_alpha "32" \
--target_modules "all-linear" \
--gradient_accumulation_steps "16" \
--save_steps "50" \
--save_total_limit "5" \
--logging_steps "5" \
--max_length "2048" \
--eval_strategy "steps" \
--eval_steps "5" \
--per_device_eval_batch_size "5" \
--eval_use_evalscope \
--eval_datasets "gsm8k" \
--eval_datasets_args '{"gsm8k": {"few_shot_num": 0}}' \
--eval_limit "10"
```

Note that the launch command is `sft`, and the evaluation-related parameters include:
- eval_strategy: Evaluation strategy. Defaults to None, following the `save_strategy` setting.
- eval_steps: Defaults to None; if an evaluation dataset exists, it follows the `save_steps` setting.
- eval_use_evalscope: Whether to use EvalScope for evaluation; must be set to enable evaluation.
- eval_datasets: Evaluation datasets; multiple datasets can be set, separated by spaces.
- eval_datasets_args: Evaluation dataset arguments in JSON format; arguments for multiple datasets can be set.
- eval_limit: Number of samples drawn from each evaluation dataset.
- eval_generation_config: Model inference configuration during evaluation, in JSON format. Defaults to `{'max_tokens': 512}` (see the override example below).
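
For instance, to allow longer completions with greedy decoding during in-training evaluation, the default could be overridden like this (values are illustrative):

```shell
--eval_generation_config '{"max_tokens": 1024, "temperature": 0}'
```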

More evaluation examples can be found in [examples](https://github.com/modelscope/ms-swift/tree/main/examples/eval).

## Custom Evaluation Datasets

examples/eval/train_eval/train.sh

Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@

CUDA_VISIBLE_DEVICES=0 \
swift sft \
--model "Qwen/Qwen2.5-0.5B-Instruct" \
--train_type "lora" \
--dataset "AI-ModelScope/alpaca-gpt4-data-zh#100" \
--torch_dtype "bfloat16" \
--num_train_epochs "1" \
--per_device_train_batch_size "1" \
--learning_rate "1e-4" \
--lora_rank "8" \
--lora_alpha "32" \
--target_modules "all-linear" \
--gradient_accumulation_steps "16" \
--save_steps "50" \
--save_total_limit "5" \
--logging_steps "5" \
--max_length "2048" \
--eval_strategy "steps" \
--eval_steps "5" \
--per_device_eval_batch_size "5" \
--eval_use_evalscope \
--eval_datasets "gsm8k" \
--eval_datasets_args '{"gsm8k": {"few_shot_num": 0}}' \
--eval_limit "10"

swift/llm/eval/utils.py

Lines changed: 53 additions & 0 deletions
@@ -0,0 +1,53 @@

from dataclasses import asdict
from typing import Any, Dict, List, Union

import torch.nn as nn
from evalscope.models.custom import CustomModel
from transformers import PreTrainedModel

from ..infer import PtEngine, RequestConfig
from ..template import InferRequest


class EvalModel(CustomModel):

    def __init__(self, model: Union[PreTrainedModel, nn.Module], template, max_batch_size, model_name: str,
                 **kwargs) -> None:
        super().__init__(config={'model_id': model_name}, **kwargs)
        self.model_name = model_name
        self.model = model
        self.template = template
        self.engine = PtEngine.from_model_template(model, template, max_batch_size=max_batch_size)

    def predict(self, prompts: List[dict], **kwargs) -> List[Dict[str, Any]]:
        # use origin inputs
        infer_requests = self.prepare_inputs(kwargs.get('origin_inputs', prompts))

        infer_cfg = kwargs['infer_cfg'].copy()
        generation_config = RequestConfig(**infer_cfg)

        response = self.engine.infer(infer_requests=infer_requests, request_config=generation_config, use_tqdm=False)
        dict_response = [asdict(item) for item in response]
        return dict_response

    def prepare_inputs(self, prompts: Union[List[dict], List[str]]) -> List[InferRequest]:
        infer_requests = []
        for input_item in prompts:
            if isinstance(input_item, str):
                query = input_item
                system_prompt = None
            else:
                data: list = input_item['data']
                if isinstance(data[0], tuple):  # for truthful_qa and hellaswag
                    query = '\n'.join(''.join(item) for item in data)
                    system_prompt = input_item.get('system_prompt', None)
                else:
                    query = data[0]
                    system_prompt = input_item.get('system_prompt', None)
            # prepare messages
            messages = []
            if system_prompt:
                messages.append({'role': 'system', 'content': system_prompt})
            messages.append({'role': 'user', 'content': query})
            infer_requests.append(InferRequest(messages=messages))
        return infer_requests
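
As a rough illustration of how `prepare_inputs` maps EvalScope prompts onto chat messages, here is a self-contained sketch that mirrors its logic (the sample prompts are invented):

```python
# Simplified mirror of EvalModel.prepare_inputs: turn an evalscope prompt
# (bare string, or a dict with 'data' and optional 'system_prompt') into
# the chat-message list that InferRequest expects.
def to_messages(item):
    if isinstance(item, str):
        query, system_prompt = item, None
    else:
        data = item['data']
        if isinstance(data[0], tuple):  # truthful_qa / hellaswag style pairs
            query = '\n'.join(''.join(pair) for pair in data)
        else:
            query = data[0]
        system_prompt = item.get('system_prompt')
    messages = []
    if system_prompt:
        messages.append({'role': 'system', 'content': system_prompt})
    messages.append({'role': 'user', 'content': query})
    return messages


print(to_messages('What is 1 + 1?'))
print(to_messages({'data': ['Solve 2 + 2.'], 'system_prompt': 'Be brief.'}))
```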

swift/trainers/arguments.py

Lines changed: 16 additions & 0 deletions
@@ -43,6 +43,13 @@ class TrainArgumentsMixin:

    fsdp_num: int = 1
    acc_steps: int = 1

    # train-eval loop args
    eval_use_evalscope: bool = False
    eval_datasets: List[str] = field(default_factory=list)
    eval_limit: Optional[int] = None
    eval_datasets_args: Optional[Union[str, dict]] = None
    eval_generation_config: Optional[Union[str, dict]] = None

    def _fix_gradient_checkpointing(self):
        # fix use_reentrant
        if hasattr(torch.utils.checkpoint, '_old_checkpoint'):  # avoid double patching

@@ -75,6 +82,15 @@ def __post_init__(self):

        if getattr(self, 'gradient_checkpointing_kwargs', None):
            self.gradient_checkpointing_kwargs = ModelArguments.parse_to_dict(self.gradient_checkpointing_kwargs)
        self._fix_gradient_checkpointing()

        if self.eval_use_evalscope:
            try:
                import evalscope
            except ImportError:
                raise ImportError('evalscope is not installed, please install it by `pip install evalscope`')
            self.eval_datasets_args = ModelArguments.parse_to_dict(self.eval_datasets_args)
            self.eval_generation_config = ModelArguments.parse_to_dict(self.eval_generation_config)

        super().__post_init__()
swift/trainers/mixin.py

Lines changed: 32 additions & 0 deletions
@@ -340,6 +340,9 @@ def _maybe_log_save_evaluate(self, tr_loss, *args, **kwargs):

            self._globalstep_last_logged = self.state.global_step
            self.store_flos()
            self.log(logs)

            if self.args.eval_use_evalscope and self.control.should_evaluate:
                self._evalscope_eval()
        super()._maybe_log_save_evaluate(tr_loss, *args, **kwargs)

    def create_optimizer_and_scheduler(self, num_training_steps: int):

@@ -382,3 +385,32 @@ def _compute_acc(self, outputs, labels) -> None:

            if k not in self._custom_metrics:
                self._custom_metrics[k] = MeanMetric(nan_value=None)
            self._custom_metrics[k].update(v)

    @torch.no_grad()
    def _evalscope_eval(self):
        from ..llm.eval.utils import EvalModel
        from evalscope import TaskConfig, run_task
        from evalscope.constants import EvalType

        self.model.eval()
        max_batch_size = self.args.per_device_eval_batch_size
        custom_model = EvalModel(
            self.model, self.template, max_batch_size=max_batch_size, model_name=f'model-step{self.state.global_step}')
        task_config = TaskConfig(
            model=custom_model,
            eval_type=EvalType.CUSTOM,
            datasets=self.args.eval_datasets,
            dataset_args=self.args.eval_datasets_args,
            limit=self.args.eval_limit,
            work_dir=os.path.join(self.args.output_dir, 'eval'),
            eval_batch_size=max_batch_size,
            generation_config=self.args.eval_generation_config or {'max_tokens': 512},
        )
        # start evaluation
        eval_report = run_task(task_config)
        # convert to dict
        eval_dict = {f'test_{k}': v.score for k, v in eval_report.items()}
        self.log(eval_dict)

        self.model.train()
        return eval_dict
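
The metrics logged by `_evalscope_eval` take a `test_` prefix, one entry per evaluated dataset. A small sketch of the resulting dict shape (the report entry and score are invented stand-ins):

```python
class _Entry:
    """Stand-in for an evalscope report entry exposing a .score attribute."""

    def __init__(self, score):
        self.score = score


eval_report = {'gsm8k': _Entry(0.42)}
eval_dict = {f'test_{k}': v.score for k, v in eval_report.items()}
print(eval_dict)  # {'test_gsm8k': 0.42}
```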

tests/train/test_train_eval.py

Lines changed: 34 additions & 0 deletions
@@ -0,0 +1,34 @@

import os

kwargs = {
    'per_device_train_batch_size': 5,
    'save_steps': 5,
    'gradient_accumulation_steps': 1,
    'num_train_epochs': 1,
}


def test_train_eval_loop():
    os.environ['CUDA_VISIBLE_DEVICES'] = '0,2'
    from swift.llm import sft_main, TrainArguments
    sft_main(
        TrainArguments(
            model='Qwen/Qwen2.5-0.5B-Instruct',
            dataset=['AI-ModelScope/alpaca-gpt4-data-zh#100'],
            target_modules=['all-linear', 'all-embedding'],
            modules_to_save=['all-embedding', 'all-norm'],
            eval_strategy='steps',
            eval_steps=5,
            per_device_eval_batch_size=5,
            eval_use_evalscope=True,
            eval_datasets=['gsm8k'],
            eval_datasets_args={'gsm8k': {
                'few_shot_num': 0
            }},
            eval_limit=10,
            report_to=['wandb'],
            **kwargs))


if __name__ == '__main__':
    test_train_eval_loop()
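
The test can also be run directly as a script (assuming evalscope and wandb are installed and the listed ModelScope dataset is reachable):

```shell
python tests/train/test_train_eval.py
```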
