
Intern-S1-241B multi-node multi-GPU training communication issue #33

@Rainnnnman

Description

We are running SFT for Intern-S1-241B with LLaMA-Factory on 4 nodes × 8 H200 GPUs (32 GPUs total) using a DeepSpeed ZeRO-3 config, and training fails to start: it hangs at step 0 until the communication timeout error fires. While hung at step 0, per-GPU memory usage is under 70 GB and GPU utilization is pinned at 100%. With the same multi-node multi-GPU setup, both Intern-S1-mini and Qwen 235B train without problems.
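A first step for localizing this kind of hang is usually to rerun with NCCL and torch.distributed debug logging enabled on every node before launching the job. This is a generic diagnostic sketch, not specific to this issue; `eth0` is a placeholder for the actual network interface on these machines:

```shell
# Verbose NCCL logging: the hanging collective and the transport in use
# (IB vs. socket) will show up in the per-rank logs.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET

# Make torch.distributed log its own process-group setup details.
export TORCH_DISTRIBUTED_DEBUG=DETAIL

# Placeholder: pin NCCL to a specific interface in case auto-detection
# picks the wrong one across nodes (replace eth0 with your real device).
export NCCL_SOCKET_IFNAME=eth0

echo "NCCL_DEBUG=$NCCL_DEBUG"
```

If the logs show all ranks entering a collective except a few, the hang is usually a parameter-shape or dtype mismatch across ranks rather than a network fault; if some ranks never complete NCCL init, it points at interface/IB selection.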

```yaml
model_name_or_path: /mnt/shared-storage-user/medeval-share/liuzhiqiang/model_weight/Intern-S1/snapshots/edeb08ec5a413c5cf8acad72e8833e506d4090a0
trust_remote_code: true

### method
stage: sft
do_train: true
finetuning_type: lora
freeze_vision_tower: true
freeze_multi_modal_projector: true
freeze_language_model: false
ddp_find_unused_parameters: false

### dataset
#dataset : 
dataset_dir: 
template: intern_s1
cutoff_len: 16384
max_samples: 100000000
overwrite_cache: false
preprocessing_num_workers: 50
dataloader_num_workers: 48
preprocessing_batch_size: 10000
tokenized_path: ""

### output
output_dir: 
overwrite_output_dir: true
logging_steps: 1
save_strategy: "epoch"
save_total_limit: 4
plot_loss: true
save_only_model: false
report_to: none

### train
deepspeed: /LLaMA-Factory/examples/deepspeed/ds_z3_config.json
per_device_train_batch_size: 1
gradient_accumulation_steps: 64
learning_rate: 2.0e-5
num_train_epochs: 1.0
lr_scheduler_type: cosine
warmup_ratio: 0.03
weight_decay: 0.1
max_grad_norm: 1.0

### mixed precision
bf16: true
fp16: false
tf32: true
ddp_timeout: 180000000
resume_from_checkpoint: null

### memory & speed
gradient_checkpointing: true
flash_attn: fa2

### eval
do_eval: false
val_size: 0.0
```
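For reference, the `ds_z3_config.json` shipped in LLaMA-Factory's examples is a stock ZeRO-3 config roughly along these lines (a sketch from memory, not a verbatim copy; check the file in the repo for the exact contents):

```json
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "bf16": { "enabled": "auto" },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  }
}
```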

(screenshot attached)
