Open
Labels
bug (Something isn't working), pending (This problem is yet to be addressed)
Description
Reminder
- I have read the above rules and searched the existing issues.
System Info
- `llamafactory` version: 0.9.5.dev0
- Platform: Linux-5.15.0-136-generic-x86_64-with-glibc2.35
- Python version: 3.11.11
- PyTorch version: 2.6.0+cu124 (GPU)
- Transformers version: 5.2.0
- Datasets version: 4.0.0
- Accelerate version: 1.11.0
- PEFT version: 0.18.1
- GPU type: NVIDIA H100 80GB HBM3
- GPU number: 8
- GPU memory: 79.10GB
- TRL version: 0.24.0
- DeepSpeed version: 0.18.4
- Default data directory: detected
Reproduction
### model
model_name_or_path: /app/models/DeepSeek-R1-Distill-Qwen-1.5B
trust_remote_code: True
flash_attn: auto
packing: False
enable_thinking: True
### method
stage: sft
do_train: True
finetuning_type: lora
lora_rank: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target: all
### dataset
dataset_dir: XXX
dataset: XXX
template: deepseekr1
cutoff_len: 65536
max_samples: 100000
preprocessing_num_workers: 16
### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 16
learning_rate: 0.0005
num_train_epochs: 1.0
lr_scheduler_type: cosine
max_grad_norm: 1.0
warmup_steps: 20
bf16: True
ddp_timeout: 180000000
gradient_checkpointing: true
include_num_input_tokens_seen: True
optim: adamw_torch
### eval
val_size: 0.05
eval_strategy: steps
eval_steps: 10
per_device_eval_batch_size: 1
deepspeed: /app/examples/deepspeed/ds_z3_offload_config.json
Contents of the referenced ds_z3_offload_config.json:
{
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"gradient_accumulation_steps": "auto",
"gradient_clipping": "auto",
"zero_allow_untested_optimizer": true,
"fp16": {
"enabled": "auto",
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
},
"bf16": {
"enabled": "auto"
},
"zero_optimization": {
"stage": 3,
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
},
"offload_param": {
"device": "cpu",
"pin_memory": true
},
"overlap_comm": false,
"contiguous_gradients": true,
"sub_group_size": 1e9,
"reduce_bucket_size": "auto",
"stage3_prefetch_bucket_size": "auto",
"stage3_param_persistence_threshold": "auto",
"stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9,
"stage3_gather_16bit_weights_on_model_save": true
}
}
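Worth noting: with ZeRO stage 3 plus `offload_optimizer.device: cpu`, DeepSpeed typically replaces the trainer's optimizer with its own `DeepSpeedCPUAdam`, regardless of `optim: adamw_torch`, which matches the traceback landing in `deepspeed/ops/adam/cpu_adam.py`. A minimal sketch of that check (the inline `config_text` is a trimmed copy of the config above, not the full file):

```python
import json

# Sketch: inspect the ZeRO config to confirm CPU optimizer offload is active.
# When offload_optimizer.device == "cpu" under stage 3, DeepSpeed steps the
# optimizer through its CPUAdam kernel, which is where the KeyError originates.
config_text = """
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {"device": "cpu", "pin_memory": true}
  }
}
"""
cfg = json.loads(config_text)
zero = cfg["zero_optimization"]
uses_cpu_adam = (
    zero["stage"] == 3
    and zero.get("offload_optimizer", {}).get("device") == "cpu"
)
print(uses_cpu_adam)  # True
```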
Error log:
[rank4]: File "/app/src/llamafactory/launcher.py", line 185, in <module>
[rank4]: run_exp()
[rank4]: File "/app/src/llamafactory/train/tuner.py", line 125, in run_exp
[rank4]: _training_function(config={"args": args, "callbacks": callbacks})
[rank4]: File "/app/src/llamafactory/train/tuner.py", line 93, in _training_function
[rank4]: run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
[rank4]: File "/app/src/llamafactory/train/sft/workflow.py", line 139, in run_sft
[rank4]: train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
[rank4]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank4]: File "/opt/conda/lib/python3.11/site-packages/transformers/trainer.py", line 1412, in train
[rank4]: return inner_training_loop(
[rank4]: ^^^^^^^^^^^^^^^^^^^^
[rank4]: File "/opt/conda/lib/python3.11/site-packages/transformers/trainer.py", line 1742, in _inner_training_loop
[rank4]: tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank4]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank4]: File "/opt/conda/lib/python3.11/site-packages/transformers/trainer.py", line 1979, in training_step
[rank4]: self.accelerator.backward(loss, **kwargs)
[rank4]: File "/opt/conda/lib/python3.11/site-packages/accelerate/accelerator.py", line 2732, in backward
[rank4]: self.deepspeed_engine_wrapped.backward(loss, sync_gradients=self.sync_gradients, **kwargs)
[rank4]: File "/opt/conda/lib/python3.11/site-packages/accelerate/utils/deepspeed.py", line 281, in backward
[rank4]: self.engine.step()
[rank4]: File "/opt/conda/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 2690, in step
[rank4]: self._take_model_step(lr_kwargs)
[rank4]: File "/opt/conda/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 2585, in _take_model_step
[rank4]: self.optimizer.step()
[rank4]: File "/opt/conda/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 20, in wrapped_fn
[rank4]: ret_val = func(*args, **kwargs)
[rank4]: ^^^^^^^^^^^^^^^^^^^^^
[rank4]: File "/opt/conda/lib/python3.11/site-packages/deepspeed/runtime/zero/stage3.py", line 2217, in step
[rank4]: self._optimizer_step(sub_group_id)
[rank4]: File "/opt/conda/lib/python3.11/site-packages/deepspeed/runtime/zero/stage3.py", line 1070, in _optimizer_step
[rank4]: step_with_gradscaler(self.optimizer)
[rank4]: File "/opt/conda/lib/python3.11/site-packages/deepspeed/runtime/zero/stage3.py", line 1062, in step_with_gradscaler
[rank4]: optimizer.step()
[rank4]: File "/opt/conda/lib/python3.11/site-packages/torch/optim/optimizer.py", line 493, in wrapper
[rank4]: out = func(*args, **kwargs)
[rank4]: ^^^^^^^^^^^^^^^^^^^^^
[rank4]: File "/opt/conda/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank4]: return func(*args, **kwargs)
[rank4]: ^^^^^^^^^^^^^^^^^^^^^
[rank4]: File "/opt/conda/lib/python3.11/site-packages/deepspeed/ops/adam/cpu_adam.py", line 164, in step
[rank4]: group['weight_decay'], group['bias_correction'], p.data, p.grad.data,
[rank4]: ~~~~~^^^^^^^^^^^^^^^^^^^
[rank4]: KeyError: 'bias_correction'
swanlab: Error happened while training
[ranks 1, 2, 3, 5, 6, and 7 raised the same traceback ending in KeyError: 'bias_correction'; omitted for brevity]
swanlab: 🏠 View project at https://swanlab.cn/@Ming_yr/wenyi
swanlab: 🚀 View run at https://swanlab.cn/@Ming_yr/wenyi/runs/c6tdll02q3h51aqcocobw
0%| | 0/359 [00:00<?, ?it/s]
[the remaining rank raised the same traceback, again ending in KeyError: 'bias_correction']
0%| | 0/359 [01:13<?, ?it/s]
W0316 06:57:24.069000 164636 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 164765 closing signal SIGTERM
W0316 06:57:24.072000 164636 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 164766 closing signal SIGTERM
W0316 06:57:24.072000 164636 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 164767 closing signal SIGTERM
W0316 06:57:24.073000 164636 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 164768 closing signal SIGTERM
W0316 06:57:24.073000 164636 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 164769 closing signal SIGTERM
W0316 06:57:24.073000 164636 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 164770 closing signal SIGTERM
W0316 06:57:24.074000 164636 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 164771 closing signal SIGTERM
E0316 06:57:33.066000 164636 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 7 (pid: 164772) of binary: /opt/conda/bin/python3.11
Traceback (most recent call last):
File "/opt/conda/bin/torchrun", line 8, in <module>
sys.exit(main())
^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/distributed/run.py", line 918, in main
run(args)
File "/opt/conda/lib/python3.11/site-packages/torch/distributed/run.py", line 909, in run
elastic_launch(
File "/opt/conda/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/app/src/llamafactory/launcher.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2026-03-16_06:57:24
host : h1001
rank : 7 (local_rank: 7)
exitcode : 1 (pid: 164772)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Traceback (most recent call last):
File "/opt/conda/bin/llamafactory-cli", line 6, in <module>
sys.exit(main())
^^^^^^
File "/app/src/llamafactory/cli.py", line 24, in main
launcher.launch()
File "/app/src/llamafactory/launcher.py", line 115, in launch
process = subprocess.run(
^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/subprocess.py", line 571, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['torchrun', '--nnodes', '1', '--node_rank', '0', '--nproc_per_node', '8', '--master_addr', '127.0.0.1', '--master_port', '59997', '/app/src/llamafactory/launcher.py', '/app/run_scripts/xxx.yaml']' returned non-zero exit status 1.
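A minimal sketch of the likely failure mode and a defensive workaround, assuming the root cause is a param group built without the `'bias_correction'` key (plausibly a version mismatch between Transformers 5.2.0 and DeepSpeed 0.18.4). `ensure_bias_correction` is a hypothetical helper for illustration, not part of LLaMA-Factory or DeepSpeed:

```python
# DeepSpeed's CPUAdam indexes group['bias_correction'] directly, so any param
# group created without that key raises KeyError on the first optimizer step.

def ensure_bias_correction(param_groups, default=True):
    """Backfill the 'bias_correction' flag CPUAdam expects on each group.
    (Hypothetical helper; default=True matches standard Adam behavior.)"""
    for group in param_groups:
        group.setdefault("bias_correction", default)
    return param_groups

# A param group shaped like the one that triggers the crash: no 'bias_correction'.
groups = [{"lr": 5e-4, "weight_decay": 0.0}]

try:
    groups[0]["bias_correction"]  # the failing access pattern in cpu_adam.py
except KeyError as exc:
    print(f"KeyError: {exc}")  # reproduces the reported error

ensure_bias_correction(groups)
print(groups[0]["bias_correction"])  # True
```

As a quicker test of whether the optimizer is the trigger, dropping `offload_optimizer` from the ZeRO-3 config (so DeepSpeed does not substitute CPUAdam) should make the step succeed, at the cost of more GPU memory.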
Others
No response