
Intern-S1-241B multi-node multi-GPU training communication issue #33

@Rainnnnman

Description

We are running SFT for Intern-S1-241B with LLaMA-Factory on 4 nodes × 8 H200 GPUs (32 GPUs total) using a DeepSpeed ZeRO-3 config, and training fails to start: it hangs at step 0 until the communication timeout error fires. While hung at step 0, per-GPU memory usage is under 70 GB and GPU utilization is pinned at 100%. With the same multi-node multi-GPU setup, both Intern-S1-mini and Qwen 235B train without problems.
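A first step for localizing this kind of hang is usually to rerun with NCCL and torch.distributed debug logging enabled on every node before launching the job. This is a generic diagnostic sketch, not specific to this issue; `eth0` is a placeholder for the actual network interface on these machines:

```shell
# Verbose NCCL logging: the hanging collective and the transport in use
# (IB vs. socket) will show up in the per-rank logs.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET

# Make torch.distributed log its own process-group setup details.
export TORCH_DISTRIBUTED_DEBUG=DETAIL

# Placeholder: pin NCCL to a specific interface in case auto-detection
# picks the wrong one across nodes (replace eth0 with your real device).
export NCCL_SOCKET_IFNAME=eth0

echo "NCCL_DEBUG=$NCCL_DEBUG"
```

If the logs show all ranks entering a collective except a few, the hang is usually a parameter-shape or dtype mismatch across ranks rather than a network fault; if some ranks never complete NCCL init, it points at interface/IB selection.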

```yaml
model_name_or_path: /mnt/shared-storage-user/medeval-share/liuzhiqiang/model_weight/Intern-S1/snapshots/edeb08ec5a413c5cf8acad72e8833e506d4090a0
trust_remote_code: true

### method
stage: sft
do_train: true
finetuning_type: lora
freeze_vision_tower: true
freeze_multi_modal_projector: true
freeze_language_model: false
ddp_find_unused_parameters: false

### dataset
#dataset : 
dataset_dir: 
template: intern_s1
cutoff_len: 16384
max_samples: 100000000
overwrite_cache: false
preprocessing_num_workers: 50
dataloader_num_workers: 48
preprocessing_batch_size: 10000
tokenized_path: ""

### output
output_dir: 
overwrite_output_dir: true
logging_steps: 1
save_strategy: "epoch"
save_total_limit: 4
plot_loss: true
save_only_model: false
report_to: none

### train
deepspeed: /LLaMA-Factory/examples/deepspeed/ds_z3_config.json
per_device_train_batch_size: 1
gradient_accumulation_steps: 64
learning_rate: 2.0e-5
num_train_epochs: 1.0
lr_scheduler_type: cosine
warmup_ratio: 0.03
weight_decay: 0.1
max_grad_norm: 1.0

### mixed precision
bf16: true
fp16: false
tf32: true
ddp_timeout: 180000000
resume_from_checkpoint: null

### memory & speed
gradient_checkpointing: true
flash_attn: fa2

### eval
do_eval: false
val_size: 0.0
```
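For reference, the `ds_z3_config.json` shipped in LLaMA-Factory's examples is a stock ZeRO-3 config roughly along these lines (a sketch from memory, not a verbatim copy; check the file in the repo for the exact contents):

```json
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "bf16": { "enabled": "auto" },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  }
}
```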

(screenshot attached)
