RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR #10261

@ImsuperSH

Reminder

  • I have read the above rules and searched the existing issues.

System Info

I am fine-tuning Qwen3-VL-4B with LoRA inside Docker on 4 × A100 (40 GB) GPUs. Training fails with RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR. I reduced max_pixels, but the error still occurs. Any ideas?

yaml:

### model
model_name_or_path: /data/model/qwen3-VL-4B-Instruct
# (2048, 1536, 3)
image_max_pixels: 262144 # 4000000
video_max_pixels: 16384
trust_remote_code: true
use_fast_tokenizer: false

### method
stage: sft
do_train: true
finetuning_type: lora
lora_rank: 8
lora_target: all

### dataset
dataset: shicai_train
eval_dataset: shicai_test
template: qwen3_vl_nothink
cutoff_len: 4096
max_samples: 10000000
overwrite_cache: true
preprocessing_num_workers: 8
dataloader_num_workers: 4

### output
output_dir: saves/qwen3vl-4b/shicai-1
logging_steps: 10
save_steps: 10000000
plot_loss: true
overwrite_output_dir: true
save_only_model: true
report_to: tensorboard # choices: [none, wandb, tensorboard, swanlab, mlflow]

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 4
learning_rate: 1.0e-4
num_train_epochs: 5.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
# fp16: true
ddp_timeout: 180000000
resume_from_checkpoint: null

### eval
#val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: epoch
#eval_steps: 500
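
As a sanity check on the config above, the sketch below reproduces the batch-size and step counts that the training log reports, and estimates the per-image patch budget implied by image_max_pixels. It is not taken from LLaMA-Factory or transformers: the function names are hypothetical, and the 16-pixel patch size and 2x2 patch merge are assumptions about the Qwen3-VL vision config that should be checked against the model's own config.json.

```python
import math


def effective_batch_size(per_device: int, num_gpus: int, grad_accum: int) -> int:
    """Samples consumed per optimizer step across all GPUs."""
    return per_device * num_gpus * grad_accum


def total_optimizer_steps(num_examples: int, batch_size: int, epochs: int) -> int:
    """Optimizer steps, assuming the last partial batch of each epoch still counts."""
    return math.ceil(num_examples / batch_size) * epochs


def patch_budget(max_pixels: int, patch_size: int = 16, merge: int = 2) -> tuple[int, int]:
    """Upper bound on vision patches and merged tokens for one image.

    patch_size and merge are ASSUMED values, not read from the model config.
    """
    patches = max_pixels // (patch_size * patch_size)
    tokens = patches // (merge * merge)
    return patches, tokens


# Values from the config: batch size 1 per device, 4 GPUs, 4 accumulation steps.
batch = effective_batch_size(per_device=1, num_gpus=4, grad_accum=4)
# 7,253 examples and 5 epochs, as reported in the log.
steps = total_optimizer_steps(num_examples=7253, batch_size=batch, epochs=5)
patches, tokens = patch_budget(max_pixels=262144)

print(batch, steps, patches, tokens)  # 16 2270 1024 256
```

Under these assumptions, a 2048 x 1536 image (3,145,728 pixels) is 12 times over the 262,144-pixel budget, so it would be downscaled before patching, and each image would contribute at most about 1,024 patches to the patch embedding.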

Error:
[INFO|trainer.py:1587] 2026-03-09 03:37:20,261 >> ***** Running training *****
[INFO|trainer.py:1588] 2026-03-09 03:37:20,261 >> Num examples = 7,253
[INFO|trainer.py:1589] 2026-03-09 03:37:20,261 >> Num Epochs = 5
[INFO|trainer.py:1590] 2026-03-09 03:37:20,261 >> Instantaneous batch size per device = 1
[INFO|trainer.py:1593] 2026-03-09 03:37:20,261 >> Total train batch size (w. parallel, distributed & accumulation) = 16
[INFO|trainer.py:1594] 2026-03-09 03:37:20,261 >> Gradient Accumulation steps = 4
[INFO|trainer.py:1595] 2026-03-09 03:37:20,261 >> Total optimization steps = 2,270
[INFO|trainer.py:1596] 2026-03-09 03:37:20,270 >> Number of trainable parameters = 16,515,072
0%| | 0/2270 [00:00<?, ?it/s]
[rank3]: Traceback (most recent call last):
[rank3]: File "/app/src/llamafactory/launcher.py", line 185, in <module>
[rank3]: run_exp()
[rank3]: File "/app/src/llamafactory/train/tuner.py", line 125, in run_exp
[rank3]: _training_function(config={"args": args, "callbacks": callbacks})
[rank3]: File "/app/src/llamafactory/train/tuner.py", line 93, in _training_function
[rank3]: run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
[rank3]: File "/app/src/llamafactory/train/sft/workflow.py", line 139, in run_sft
[rank3]: train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/opt/conda/lib/python3.11/site-packages/transformers/trainer.py", line 1412, in train
[rank3]: return inner_training_loop(
[rank3]: ^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/opt/conda/lib/python3.11/site-packages/transformers/trainer.py", line 1742, in _inner_training_loop
[rank3]: tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/opt/conda/lib/python3.11/site-packages/transformers/trainer.py", line 1951, in training_step
[rank3]: loss = self.compute_loss(model, inputs, num_items_in_batch=num_items_in_batch)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/app/src/llamafactory/train/sft/trainer.py", line 162, in compute_loss
[rank3]: return super().compute_loss(model, inputs, *args, **kwargs)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/opt/conda/lib/python3.11/site-packages/transformers/trainer.py", line 2022, in compute_loss
[rank3]: outputs = model(**inputs)
[rank3]: ^^^^^^^^^^^^^^^
[rank3]: File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
[rank3]: return self._call_impl(*args, **kwargs)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
[rank3]: return forward_call(*args, **kwargs)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/opt/conda/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 1643, in forward
[rank3]: else self._run_ddp_forward(*inputs, **kwargs)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/opt/conda/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 1459, in _run_ddp_forward
[rank3]: return self.module(*inputs, **kwargs) # type: ignore[index]
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
[rank3]: return self._call_impl(*args, **kwargs)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
[rank3]: return forward_call(*args, **kwargs)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/opt/conda/lib/python3.11/site-packages/accelerate/utils/operations.py", line 819, in forward
[rank3]: return model_forward(*args, **kwargs)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/opt/conda/lib/python3.11/site-packages/accelerate/utils/operations.py", line 807, in __call__
[rank3]: return convert_to_fp32(self.model_forward(*args, **kwargs))
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/opt/conda/lib/python3.11/site-packages/torch/amp/autocast_mode.py", line 44, in decorate_autocast
[rank3]: return func(*args, **kwargs)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/opt/conda/lib/python3.11/site-packages/peft/peft_model.py", line 1923, in forward
[rank3]: return self.base_model(
[rank3]: ^^^^^^^^^^^^^^^^
[rank3]: File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
[rank3]: return self._call_impl(*args, **kwargs)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
[rank3]: return forward_call(*args, **kwargs)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/opt/conda/lib/python3.11/site-packages/peft/tuners/tuners_utils.py", line 311, in forward
[rank3]: return self.model.forward(*args, **kwargs)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/opt/conda/lib/python3.11/site-packages/transformers/utils/generic.py", line 841, in wrapper
[rank3]: output = func(self, *args, **kwargs)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/opt/conda/lib/python3.11/site-packages/transformers/models/qwen3_vl/modeling_qwen3_vl.py", line 1439, in forward
[rank3]: outputs = self.model(
[rank3]: ^^^^^^^^^^^
[rank3]: File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
[rank3]: return self._call_impl(*args, **kwargs)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
[rank3]: return forward_call(*args, **kwargs)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/opt/conda/lib/python3.11/site-packages/transformers/utils/generic.py", line 841, in wrapper
[rank3]: output = func(self, *args, **kwargs)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/opt/conda/lib/python3.11/site-packages/transformers/models/qwen3_vl/modeling_qwen3_vl.py", line 1219, in forward
[rank3]: image_outputs: BaseModelOutputWithDeepstackFeatures = self.get_image_features(
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/opt/conda/lib/python3.11/site-packages/transformers/utils/generic.py", line 841, in wrapper
[rank3]: output = func(self, *args, **kwargs)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/opt/conda/lib/python3.11/site-packages/transformers/models/qwen3_vl/modeling_qwen3_vl.py", line 1099, in get_image_features
[rank3]: vision_output: BaseModelOutputWithDeepstackFeatures = self.visual(
[rank3]: ^^^^^^^^^^^^
[rank3]: File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
[rank3]: return self._call_impl(*args, **kwargs)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
[rank3]: return forward_call(*args, **kwargs)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/opt/conda/lib/python3.11/site-packages/transformers/utils/generic.py", line 915, in wrapper
[rank3]: output = func(self, *args, **kwargs)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/opt/conda/lib/python3.11/site-packages/transformers/utils/output_capturing.py", line 253, in wrapper
[rank3]: outputs = func(self, *args, **kwargs)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/opt/conda/lib/python3.11/site-packages/transformers/models/qwen3_vl/modeling_qwen3_vl.py", line 776, in forward
[rank3]: hidden_states = self.patch_embed(hidden_states)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
[rank3]: return self._call_impl(*args, **kwargs)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
[rank3]: return forward_call(*args, **kwargs)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/opt/conda/lib/python3.11/site-packages/transformers/models/qwen3_vl/modeling_qwen3_vl.py", line 87, in forward
[rank3]: hidden_states = self.proj(hidden_states.to(dtype=target_dtype)).view(-1, self.embed_dim)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
[rank3]: return self._call_impl(*args, **kwargs)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
[rank3]: return forward_call(*args, **kwargs)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/conv.py", line 725, in forward
[rank3]: return self._conv_forward(input, self.weight, self.bias)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/conv.py", line 720, in _conv_forward
[rank3]: return F.conv3d(
[rank3]: ^^^^^^^^^
[rank3]: RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR
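
The traceback ends in F.conv3d inside the vision patch embedding, so one way to narrow this down might be a standalone probe that runs a similar Conv3d outside the trainer, once with cuDNN enabled and once with it disabled. This is only a sketch: probe_conv3d is a hypothetical helper, and the kernel size (2, 16, 16), the 1024 output channels and the patch count are assumed placeholders rather than values read from the Qwen3-VL config. torch.backends.cudnn.enabled is the standard PyTorch switch for cuDNN.

```python
import importlib.util


def probe_conv3d(num_patches: int = 1024, use_cudnn: bool = True) -> str:
    """Run one bf16 Conv3d forward pass on the GPU and report what happened."""
    if importlib.util.find_spec("torch") is None:
        return "skipped: torch not installed"
    import torch

    if not torch.cuda.is_available():
        return "skipped: no CUDA device"

    previous = torch.backends.cudnn.enabled
    torch.backends.cudnn.enabled = use_cudnn
    try:
        # ASSUMED shapes: 3 channels, temporal patch 2, spatial patch 16x16.
        conv = torch.nn.Conv3d(
            3, 1024, kernel_size=(2, 16, 16), stride=(2, 16, 16), bias=True
        ).to("cuda", torch.bfloat16)
        x = torch.randn(
            num_patches, 3, 2, 16, 16, device="cuda", dtype=torch.bfloat16
        )
        y = conv(x)
        torch.cuda.synchronize()  # surface asynchronous CUDA errors here
        return f"ok: output shape {tuple(y.shape)}"
    except RuntimeError as err:
        return f"failed: {err}"
    finally:
        # Restore the global flag so the probe has no lasting side effect.
        torch.backends.cudnn.enabled = previous


print("cuDNN on :", probe_conv3d(use_cudnn=True))
print("cuDNN off:", probe_conv3d(use_cudnn=False))
```

If the probe fails with cuDNN on but passes with it off, that would point toward the cuDNN/driver/container setup rather than the training config. If both pass, the failure may depend on the actual input shapes or on memory pressure during training.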

Reproduction

Put your message here.

Others

No response

Labels: bug, pending
