
DeepSpeed ZeRO-3 offload raises KeyError: 'bias_correction' #10286

@Ming-526

Description

Reminder

  • I have read the above rules and searched the existing issues.

System Info

- `llamafactory` version: 0.9.5.dev0
- Platform: Linux-5.15.0-136-generic-x86_64-with-glibc2.35
- Python version: 3.11.11
- PyTorch version: 2.6.0+cu124 (GPU)
- Transformers version: 5.2.0
- Datasets version: 4.0.0
- Accelerate version: 1.11.0
- PEFT version: 0.18.1
- GPU type: NVIDIA H100 80GB HBM3
- GPU number: 8
- GPU memory: 79.10GB
- TRL version: 0.24.0
- DeepSpeed version: 0.18.4
- Default data directory: detected

Reproduction

### model
model_name_or_path: /app/models/DeepSeek-R1-Distill-Qwen-1.5B
trust_remote_code: True
flash_attn: auto
packing: False 
enable_thinking: True

### method
stage: sft
do_train: True
finetuning_type: lora
lora_rank: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target: all

### dataset
dataset_dir: XXX
dataset: XXX
template: deepseekr1
cutoff_len: 65536
max_samples: 100000
preprocessing_num_workers: 16

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 16
learning_rate: 0.0005
num_train_epochs: 1.0
lr_scheduler_type: cosine
max_grad_norm: 1.0
warmup_steps: 20
bf16: True
ddp_timeout: 180000000
gradient_checkpointing: true

include_num_input_tokens_seen: True
optim: adamw_torch

### eval

val_size: 0.05
eval_strategy: steps
eval_steps: 10
per_device_eval_batch_size: 1
deepspeed: /app/examples/deepspeed/ds_z3_offload_config.json

Contents of the referenced DeepSpeed config (`ds_z3_offload_config.json`):

{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "bf16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "overlap_comm": false,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  }
}
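A possible workaround (untested here) is to declare the optimizer explicitly in the DeepSpeed config, so DeepSpeed constructs its own `DeepSpeedCPUAdam` with complete param groups instead of reusing the groups built by the Trainer's `adamw_torch` optimizer. The fragment below is illustrative only; the `"auto"` values are resolved by the HF Transformers integration:

```json
{
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "betas": "auto",
      "eps": "auto",
      "weight_decay": "auto"
    }
  }
}
```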

Error output:

[rank4]:   File "/app/src/llamafactory/launcher.py", line 185, in <module>
[rank4]:     run_exp()
[rank4]:   File "/app/src/llamafactory/train/tuner.py", line 125, in run_exp
[rank4]:     _training_function(config={"args": args, "callbacks": callbacks})
[rank4]:   File "/app/src/llamafactory/train/tuner.py", line 93, in _training_function
[rank4]:     run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
[rank4]:   File "/app/src/llamafactory/train/sft/workflow.py", line 139, in run_sft
[rank4]:     train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
[rank4]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank4]:   File "/opt/conda/lib/python3.11/site-packages/transformers/trainer.py", line 1412, in train
[rank4]:     return inner_training_loop(
[rank4]:            ^^^^^^^^^^^^^^^^^^^^
[rank4]:   File "/opt/conda/lib/python3.11/site-packages/transformers/trainer.py", line 1742, in _inner_training_loop
[rank4]:     tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank4]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank4]:   File "/opt/conda/lib/python3.11/site-packages/transformers/trainer.py", line 1979, in training_step
[rank4]:     self.accelerator.backward(loss, **kwargs)
[rank4]:   File "/opt/conda/lib/python3.11/site-packages/accelerate/accelerator.py", line 2732, in backward
[rank4]:     self.deepspeed_engine_wrapped.backward(loss, sync_gradients=self.sync_gradients, **kwargs)
[rank4]:   File "/opt/conda/lib/python3.11/site-packages/accelerate/utils/deepspeed.py", line 281, in backward
[rank4]:     self.engine.step()
[rank4]:   File "/opt/conda/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 2690, in step
[rank4]:     self._take_model_step(lr_kwargs)
[rank4]:   File "/opt/conda/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 2585, in _take_model_step
[rank4]:     self.optimizer.step()
[rank4]:   File "/opt/conda/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 20, in wrapped_fn
[rank4]:     ret_val = func(*args, **kwargs)
[rank4]:               ^^^^^^^^^^^^^^^^^^^^^
[rank4]:   File "/opt/conda/lib/python3.11/site-packages/deepspeed/runtime/zero/stage3.py", line 2217, in step
[rank4]:     self._optimizer_step(sub_group_id)
[rank4]:   File "/opt/conda/lib/python3.11/site-packages/deepspeed/runtime/zero/stage3.py", line 1070, in _optimizer_step
[rank4]:     step_with_gradscaler(self.optimizer)
[rank4]:   File "/opt/conda/lib/python3.11/site-packages/deepspeed/runtime/zero/stage3.py", line 1062, in step_with_gradscaler
[rank4]:     optimizer.step()
[rank4]:   File "/opt/conda/lib/python3.11/site-packages/torch/optim/optimizer.py", line 493, in wrapper
[rank4]:     out = func(*args, **kwargs)
[rank4]:           ^^^^^^^^^^^^^^^^^^^^^
[rank4]:   File "/opt/conda/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank4]:     return func(*args, **kwargs)
[rank4]:            ^^^^^^^^^^^^^^^^^^^^^
[rank4]:   File "/opt/conda/lib/python3.11/site-packages/deepspeed/ops/adam/cpu_adam.py", line 164, in step
[rank4]:     group['weight_decay'], group['bias_correction'], p.data, p.grad.data,
[rank4]:                            ~~~~~^^^^^^^^^^^^^^^^^^^
[rank4]: KeyError: 'bias_correction'
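The mismatch can be illustrated without DeepSpeed: `torch.optim.AdamW` initializes its param groups with keys such as `lr`, `betas`, and `weight_decay` but not `bias_correction`, while `DeepSpeedCPUAdam.step()` reads `group['bias_correction']` unguarded. A group built by one optimizer and stepped by the other therefore raises exactly this `KeyError`. A minimal stand-in sketch (plain dicts, not the real optimizer classes):

```python
# Plain-dict stand-ins for optimizer param groups; the real classes are
# torch.optim.AdamW and deepspeed.ops.adam.DeepSpeedCPUAdam.

# Keys torch.optim.AdamW puts in each param group (no 'bias_correction'):
torch_adamw_group = {"lr": 5e-4, "betas": (0.9, 0.999), "eps": 1e-8, "weight_decay": 0.0}

# DeepSpeedCPUAdam adds 'bias_correction' when it builds its own groups:
cpu_adam_group = {**torch_adamw_group, "bias_correction": True}

def step_like_cpu_adam(group):
    # Mirrors cpu_adam.py, which indexes group['bias_correction'] directly.
    return group["weight_decay"], group["bias_correction"]

print(step_like_cpu_adam(cpu_adam_group))   # works: (0.0, True)
try:
    step_like_cpu_adam(torch_adamw_group)   # group lacks the key
except KeyError as err:
    print(f"KeyError: {err}")               # KeyError: 'bias_correction'
```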
swanlab: Error happened while training
(identical tracebacks ending in `KeyError: 'bias_correction'` were raised on ranks 1, 2, 3, 5, 6, and 7; omitted for brevity)
swanlab: 🏠 View project at https://swanlab.cn/@Ming_yr/wenyi
swanlab: 🚀 View run at https://swanlab.cn/@Ming_yr/wenyi/runs/c6tdll02q3h51aqcocobw
  0%|          | 0/359 [00:00<?, ?it/s]
(the same traceback repeated once more without a rank prefix; omitted)
  0%|          | 0/359 [01:13<?, ?it/s]
W0316 06:57:24.069000 164636 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 164765 closing signal SIGTERM
W0316 06:57:24.072000 164636 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 164766 closing signal SIGTERM
W0316 06:57:24.072000 164636 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 164767 closing signal SIGTERM
W0316 06:57:24.073000 164636 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 164768 closing signal SIGTERM
W0316 06:57:24.073000 164636 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 164769 closing signal SIGTERM
W0316 06:57:24.073000 164636 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 164770 closing signal SIGTERM
W0316 06:57:24.074000 164636 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 164771 closing signal SIGTERM
E0316 06:57:33.066000 164636 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 7 (pid: 164772) of binary: /opt/conda/bin/python3.11
Traceback (most recent call last):
  File "/opt/conda/bin/torchrun", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/distributed/run.py", line 918, in main
    run(args)
  File "/opt/conda/lib/python3.11/site-packages/torch/distributed/run.py", line 909, in run
    elastic_launch(
  File "/opt/conda/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/app/src/llamafactory/launcher.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2026-03-16_06:57:24
  host      : h1001
  rank      : 7 (local_rank: 7)
  exitcode  : 1 (pid: 164772)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Traceback (most recent call last):
  File "/opt/conda/bin/llamafactory-cli", line 6, in <module>
    sys.exit(main())
             ^^^^^^
  File "/app/src/llamafactory/cli.py", line 24, in main
    launcher.launch()
  File "/app/src/llamafactory/launcher.py", line 115, in launch
    process = subprocess.run(
              ^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/subprocess.py", line 571, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['torchrun', '--nnodes', '1', '--node_rank', '0', '--nproc_per_node', '8', '--master_addr', '127.0.0.1', '--master_port', '59997', '/app/src/llamafactory/launcher.py', '/app/run_scripts/xxx.yaml']' returned non-zero exit status 1.

Others

No response

Labels: bug (Something isn't working), pending (This problem is yet to be addressed)