
Commit b599898

Feature: add train-eval loop (#3569)
* update args
* remove sh
* update eval loop
* update doc
1 parent 83b9e33

File tree

9 files changed: +254 −0 lines changed


docs/source/Instruction/命令行参数.md

Lines changed: 5 additions & 0 deletions
@@ -329,6 +329,11 @@ Vera uses the three parameters `target_modules`, `target_regex`, and `modules_to_save`.

- temperature: Overrides the generation parameters; the temperature used when predict_with_generate=True. Defaults to 0.
- optimizer: Custom optimizer name for the plugin. Defaults to None.
- metric: Custom metric name for the plugin. Defaults to None, i.e. set to 'acc' when predict_with_generate=False and to 'nlg' when predict_with_generate=True.
- eval_use_evalscope: Whether to use EvalScope for evaluation during training; this parameter must be set to enable evaluation. Defaults to False. See the [example](../Instruction/评测.md#训练中评测) for usage.
- eval_datasets: Evaluation datasets; multiple datasets can be given, separated by spaces.
- eval_datasets_args: Arguments for the evaluation datasets, in JSON format; arguments for multiple datasets can be set (see the example below).
- eval_limit: Number of samples drawn from each evaluation dataset.
- eval_generation_config: Model inference configuration used during evaluation, in JSON format. Defaults to `{'max_tokens': 512}`.
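
For example, per-dataset arguments for several benchmarks can be packed into a single JSON object. A hypothetical pairing of flags (the `arc` key is illustrative; any dataset name supported by EvalScope works) that could be appended to a `swift sft` command:

```shell
--eval_use_evalscope \
--eval_datasets gsm8k arc \
--eval_datasets_args '{"gsm8k": {"few_shot_num": 0}, "arc": {"few_shot_num": 0}}'
```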

### RLHF Parameters

RLHF parameters inherit from the [training parameters](#训练参数).

docs/source/Instruction/评测.md

Lines changed: 43 additions & 0 deletions
@@ -88,6 +88,49 @@ swift eval \

For the full list of evaluation parameters, see [here](命令行参数.md#评测参数).

## Evaluation During Training

SWIFT supports using EvalScope to evaluate the current model during training, so that you can track the model's training progress in a timely way.

**Basic Example**

```shell
CUDA_VISIBLE_DEVICES=0 \
swift sft \
--model "Qwen/Qwen2.5-0.5B-Instruct" \
--train_type "lora" \
--dataset "AI-ModelScope/alpaca-gpt4-data-zh#100" \
--torch_dtype "bfloat16" \
--num_train_epochs "1" \
--per_device_train_batch_size "1" \
--learning_rate "1e-4" \
--lora_rank "8" \
--lora_alpha "32" \
--target_modules "all-linear" \
--gradient_accumulation_steps "16" \
--save_steps "50" \
--save_total_limit "5" \
--logging_steps "5" \
--max_length "2048" \
--eval_strategy "steps" \
--eval_steps "5" \
--per_device_eval_batch_size "5" \
--eval_use_evalscope \
--eval_datasets "gsm8k" \
--eval_datasets_args '{"gsm8k": {"few_shot_num": 0}}' \
--eval_limit "10"
```

Note that the launch command is `sft`; the eval-related parameters are:
- eval_strategy: Evaluation strategy. Defaults to None, following the `save_strategy` setting.
- eval_steps: Defaults to None; if an evaluation dataset exists, follows the `save_steps` setting.
- eval_use_evalscope: Whether to use EvalScope for evaluation; must be set to enable evaluation.
- eval_datasets: Evaluation datasets; multiple datasets can be given, separated by spaces.
- eval_datasets_args: Arguments for the evaluation datasets, in JSON format; arguments for multiple datasets can be set.
- eval_limit: Number of samples drawn from each evaluation dataset.
- eval_generation_config: Model inference configuration used during evaluation, in JSON format. Defaults to `{'max_tokens': 512}` (see the override example below).
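
For instance, to allow longer completions with greedy decoding during in-training evaluation, the default could be overridden as follows (the values are illustrative):

```shell
--eval_generation_config '{"max_tokens": 1024, "temperature": 0}'
```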

More evaluation examples can be found in [examples](https://github.com/modelscope/ms-swift/tree/main/examples/eval).

## Custom Evaluation Datasets

docs/source_en/Instruction/Command-line-parameters.md

Lines changed: 5 additions & 0 deletions
@@ -337,6 +337,11 @@ Training arguments include the [base arguments](#base-arguments), [Seq2SeqTraine

- temperature: Generation parameter override; the temperature used when `predict_with_generate=True`. Defaults to 0.
- optimizer: Custom optimizer name for the plugin. Defaults to None.
- metric: Custom metric name for the plugin. Defaults to None, i.e. set to 'acc' when `predict_with_generate=False` and to 'nlg' when `predict_with_generate=True`.
- eval_use_evalscope: Whether to use EvalScope for evaluation during training; this parameter must be set to enable evaluation. Defaults to False. Refer to the [example](../Instruction/Evaluation.md#evaluation-during-training).
- eval_datasets: Evaluation datasets; multiple datasets can be set, separated by spaces.
- eval_datasets_args: Evaluation dataset arguments in JSON format; arguments for multiple datasets can be set (see the sketch after this list).
- eval_limit: Number of samples drawn from each evaluation dataset.
- eval_generation_config: Model inference configuration used during evaluation, in JSON format. Defaults to `{'max_tokens': 512}`.
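
As a sketch, per-dataset arguments for several benchmarks can be combined in one JSON object (the `arc` key is illustrative; any dataset name supported by EvalScope can appear):

```shell
--eval_use_evalscope \
--eval_datasets gsm8k arc \
--eval_datasets_args '{"gsm8k": {"few_shot_num": 0}, "arc": {"few_shot_num": 0}}'
```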

### RLHF Arguments

docs/source_en/Instruction/Evaluation.md

Lines changed: 42 additions & 0 deletions
@@ -88,6 +88,48 @@ Where:

For a specific list of evaluation parameters, please refer to [here](./Command-line-parameters.md#evaluation-arguments).

## Evaluation During Training

SWIFT supports using EvalScope to evaluate the current model during training, allowing you to track the model's training effectiveness as it happens.

**Basic Example**

```shell
CUDA_VISIBLE_DEVICES=0 \
swift sft \
--model "Qwen/Qwen2.5-0.5B-Instruct" \
--train_type "lora" \
--dataset "AI-ModelScope/alpaca-gpt4-data-zh#100" \
--torch_dtype "bfloat16" \
--num_train_epochs "1" \
--per_device_train_batch_size "1" \
--learning_rate "1e-4" \
--lora_rank "8" \
--lora_alpha "32" \
--target_modules "all-linear" \
--gradient_accumulation_steps "16" \
--save_steps "50" \
--save_total_limit "5" \
--logging_steps "5" \
--max_length "2048" \
--eval_strategy "steps" \
--eval_steps "5" \
--per_device_eval_batch_size "5" \
--eval_use_evalscope \
--eval_datasets "gsm8k" \
--eval_datasets_args '{"gsm8k": {"few_shot_num": 0}}' \
--eval_limit "10"
```

Note that the launch command is `sft`, and the evaluation-related parameters include:
- eval_strategy: Evaluation strategy. Defaults to None, following the `save_strategy` setting.
- eval_steps: Defaults to None; if an evaluation dataset exists, it follows the `save_steps` setting.
- eval_use_evalscope: Whether to use EvalScope for evaluation; must be set to enable evaluation.
- eval_datasets: Evaluation datasets; multiple datasets can be set, separated by spaces.
- eval_datasets_args: Evaluation dataset arguments in JSON format; arguments for multiple datasets can be set.
- eval_limit: Number of samples drawn from each evaluation dataset.
- eval_generation_config: Model inference configuration during evaluation, in JSON format. Defaults to `{'max_tokens': 512}` (see the override example below).
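
For instance, to allow longer completions with greedy decoding during in-training evaluation, the default could be overridden like this (values are illustrative):

```shell
--eval_generation_config '{"max_tokens": 1024, "temperature": 0}'
```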

More evaluation examples can be found in [examples](https://github.com/modelscope/ms-swift/tree/main/examples/eval).

## Custom Evaluation Datasets

examples/eval/train_eval/train.sh

Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@

CUDA_VISIBLE_DEVICES=0 \
swift sft \
--model "Qwen/Qwen2.5-0.5B-Instruct" \
--train_type "lora" \
--dataset "AI-ModelScope/alpaca-gpt4-data-zh#100" \
--torch_dtype "bfloat16" \
--num_train_epochs "1" \
--per_device_train_batch_size "1" \
--learning_rate "1e-4" \
--lora_rank "8" \
--lora_alpha "32" \
--target_modules "all-linear" \
--gradient_accumulation_steps "16" \
--save_steps "50" \
--save_total_limit "5" \
--logging_steps "5" \
--max_length "2048" \
--eval_strategy "steps" \
--eval_steps "5" \
--per_device_eval_batch_size "5" \
--eval_use_evalscope \
--eval_datasets "gsm8k" \
--eval_datasets_args '{"gsm8k": {"few_shot_num": 0}}' \
--eval_limit "10"

swift/llm/eval/utils.py

Lines changed: 53 additions & 0 deletions
@@ -0,0 +1,53 @@

from dataclasses import asdict
from typing import Any, Dict, List, Union

import torch.nn as nn
from evalscope.models.custom import CustomModel
from transformers import PreTrainedModel

from ..infer import PtEngine, RequestConfig
from ..template import InferRequest


class EvalModel(CustomModel):

    def __init__(self, model: Union[PreTrainedModel, nn.Module], template, max_batch_size, model_name: str,
                 **kwargs) -> None:
        super().__init__(config={'model_id': model_name}, **kwargs)
        self.model_name = model_name
        self.model = model
        self.template = template
        self.engine = PtEngine.from_model_template(model, template, max_batch_size=max_batch_size)

    def predict(self, prompts: List[dict], **kwargs) -> List[Dict[str, Any]]:
        # use origin inputs
        infer_requests = self.prepare_inputs(kwargs.get('origin_inputs', prompts))

        infer_cfg = kwargs['infer_cfg'].copy()
        generation_config = RequestConfig(**infer_cfg)

        response = self.engine.infer(infer_requests=infer_requests, request_config=generation_config, use_tqdm=False)
        dict_response = [asdict(item) for item in response]
        return dict_response

    def prepare_inputs(self, prompts: Union[List[dict], List[str]]) -> List[InferRequest]:
        infer_requests = []
        for input_item in prompts:
            if isinstance(input_item, str):
                query = input_item
                system_prompt = None
            else:
                data: list = input_item['data']
                if isinstance(data[0], tuple):  # for truthful_qa and hellaswag
                    query = '\n'.join(''.join(item) for item in data)
                    system_prompt = input_item.get('system_prompt', None)
                else:
                    query = data[0]
                    system_prompt = input_item.get('system_prompt', None)
            # prepare messages
            messages = []
            if system_prompt:
                messages.append({'role': 'system', 'content': system_prompt})
            messages.append({'role': 'user', 'content': query})
            infer_requests.append(InferRequest(messages=messages))
        return infer_requests
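
As a rough illustration of how `prepare_inputs` maps EvalScope prompts onto chat messages, here is a self-contained sketch that mirrors its logic (the sample prompts are invented):

```python
# Simplified mirror of EvalModel.prepare_inputs: turn an evalscope prompt
# (bare string, or a dict with 'data' and optional 'system_prompt') into
# the chat-message list that InferRequest expects.
def to_messages(item):
    if isinstance(item, str):
        query, system_prompt = item, None
    else:
        data = item['data']
        if isinstance(data[0], tuple):  # truthful_qa / hellaswag style pairs
            query = '\n'.join(''.join(pair) for pair in data)
        else:
            query = data[0]
        system_prompt = item.get('system_prompt')
    messages = []
    if system_prompt:
        messages.append({'role': 'system', 'content': system_prompt})
    messages.append({'role': 'user', 'content': query})
    return messages


print(to_messages('What is 1 + 1?'))
print(to_messages({'data': ['Solve 2 + 2.'], 'system_prompt': 'Be brief.'}))
```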

swift/trainers/arguments.py

Lines changed: 16 additions & 0 deletions
@@ -43,6 +43,13 @@ class TrainArgumentsMixin:

    fsdp_num: int = 1
    acc_steps: int = 1

    # train-eval loop args
    eval_use_evalscope: bool = False
    eval_datasets: List[str] = field(default_factory=list)
    eval_limit: Optional[int] = None
    eval_datasets_args: Optional[Union[str, dict]] = None
    eval_generation_config: Optional[Union[str, dict]] = None

    def _fix_gradient_checkpointing(self):
        # fix use_reentrant
        if hasattr(torch.utils.checkpoint, '_old_checkpoint'):  # avoid double patching

@@ -75,6 +82,15 @@ def __post_init__(self):

        if getattr(self, 'gradient_checkpointing_kwargs', None):
            self.gradient_checkpointing_kwargs = ModelArguments.parse_to_dict(self.gradient_checkpointing_kwargs)
        self._fix_gradient_checkpointing()

        if self.eval_use_evalscope:
            try:
                import evalscope
            except ImportError:
                raise ImportError('evalscope is not installed, please install it by `pip install evalscope`')
            self.eval_datasets_args = ModelArguments.parse_to_dict(self.eval_datasets_args)
            self.eval_generation_config = ModelArguments.parse_to_dict(self.eval_generation_config)

        super().__post_init__()
swift/trainers/mixin.py

Lines changed: 32 additions & 0 deletions
@@ -340,6 +340,9 @@ def _maybe_log_save_evaluate(self, tr_loss, *args, **kwargs):

            self._globalstep_last_logged = self.state.global_step
            self.store_flos()
            self.log(logs)

            if self.args.eval_use_evalscope and self.control.should_evaluate:
                self._evalscope_eval()
        super()._maybe_log_save_evaluate(tr_loss, *args, **kwargs)

    def create_optimizer_and_scheduler(self, num_training_steps: int):

@@ -382,3 +385,32 @@ def _compute_acc(self, outputs, labels) -> None:

            if k not in self._custom_metrics:
                self._custom_metrics[k] = MeanMetric(nan_value=None)
            self._custom_metrics[k].update(v)

    @torch.no_grad()
    def _evalscope_eval(self):
        from ..llm.eval.utils import EvalModel
        from evalscope import TaskConfig, run_task
        from evalscope.constants import EvalType

        self.model.eval()
        max_batch_size = self.args.per_device_eval_batch_size
        custom_model = EvalModel(
            self.model, self.template, max_batch_size=max_batch_size, model_name=f'model-step{self.state.global_step}')
        task_config = TaskConfig(
            model=custom_model,
            eval_type=EvalType.CUSTOM,
            datasets=self.args.eval_datasets,
            dataset_args=self.args.eval_datasets_args,
            limit=self.args.eval_limit,
            work_dir=os.path.join(self.args.output_dir, 'eval'),
            eval_batch_size=max_batch_size,
            generation_config=self.args.eval_generation_config or {'max_tokens': 512},
        )
        # start evaluation
        eval_report = run_task(task_config)
        # convert to dict
        eval_dict = {f'test_{k}': v.score for k, v in eval_report.items()}
        self.log(eval_dict)

        self.model.train()
        return eval_dict
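
The metrics logged by `_evalscope_eval` take a `test_` prefix, one entry per evaluated dataset. A small sketch of the resulting dict shape (the report entry and score are invented stand-ins):

```python
class _Entry:
    """Stand-in for an evalscope report entry exposing a .score attribute."""

    def __init__(self, score):
        self.score = score


eval_report = {'gsm8k': _Entry(0.42)}
eval_dict = {f'test_{k}': v.score for k, v in eval_report.items()}
print(eval_dict)  # {'test_gsm8k': 0.42}
```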

tests/train/test_train_eval.py

Lines changed: 34 additions & 0 deletions
@@ -0,0 +1,34 @@

import os

kwargs = {
    'per_device_train_batch_size': 5,
    'save_steps': 5,
    'gradient_accumulation_steps': 1,
    'num_train_epochs': 1,
}


def test_train_eval_loop():
    os.environ['CUDA_VISIBLE_DEVICES'] = '0,2'
    from swift.llm import sft_main, TrainArguments
    sft_main(
        TrainArguments(
            model='Qwen/Qwen2.5-0.5B-Instruct',
            dataset=['AI-ModelScope/alpaca-gpt4-data-zh#100'],
            target_modules=['all-linear', 'all-embedding'],
            modules_to_save=['all-embedding', 'all-norm'],
            eval_strategy='steps',
            eval_steps=5,
            per_device_eval_batch_size=5,
            eval_use_evalscope=True,
            eval_datasets=['gsm8k'],
            eval_datasets_args={'gsm8k': {
                'few_shot_num': 0
            }},
            eval_limit=10,
            report_to=['wandb'],
            **kwargs))


if __name__ == '__main__':
    test_train_eval_loop()
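
The test can also be run directly as a script (assuming evalscope and wandb are installed and the listed ModelScope dataset is reachable):

```shell
python tests/train/test_train_eval.py
```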
