DDP models are different when training is interrupted #276

@btian

Description

Hi folks, I'm not sure if I'm doing anything wrong. I'm seeing a problem where the final models across ranks end up different when training is interrupted.

To reproduce:

Use the following script to launch train_ddp.py across 3 different nodes, 1 GPU per node (a sketch of what train_ddp.py is assumed to look like follows the launch script).

pip install torchft_nightly-2025.7.27-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl

export TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC=180

# Rank 0 hosts the lighthouse; every other replica points TORCHFT_LIGHTHOUSE at it.
if [ -z "${RANK}" ] || [ "${RANK}" == "0" ]; then
    RUST_BACKTRACE=1 torchft_lighthouse --min_replicas 2 --quorum_tick_ms 100 --join_timeout_ms 10000 --bind 0.0.0.0:${PORT} &
    export TORCHFT_LIGHTHOUSE="http://localhost:${PORT}"
else
    export TORCHFT_LIGHTHOUSE="http://${MASTER_ADDR}:${PORT}"
fi

script_file="train_ddp.py"

cmd=(torchrun
    --nproc_per_node="$SLURM_GPUS_PER_NODE"
    --rdzv_backend c10d
    --rdzv_endpoint="localhost:0"
    "$script_file"
    --
    "$@")

# Print the command
echo "Executing: ${cmd[@]}"

# Execute the command, retrying up to 3 times so a killed replica can restart and rejoin
for ((i=1; i<=3; i++))
do
    "${cmd[@]}"
    if [ $? -eq 0 ]; then
        echo "Command succeeded on attempt $i"
        break
    else
        echo "Command failed on attempt $i"
        if [ $i -eq 3 ]; then
            echo "Command failed after 3 attempts"
            exit 1
        fi
        sleep 1 # Optional: wait before retry
    fi
done
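
For context, train_ddp.py here is assumed to follow the public torchft DDP example: the model is wrapped in torchft's DistributedDataParallel, the optimizer in torchft's Optimizer, and a Manager handles quorum membership and state transfer to recovering replicas. The sketch below is a minimal stand-in under that assumption; the class names and Manager arguments come from the torchft example, not from the exact script used in this report.

# train_ddp.py -- minimal sketch modeled on the torchft DDP example,
# not the exact script used in this report.
import torch
from torch import nn, optim
from torchft import DistributedDataParallel, Manager, Optimizer, ProcessGroupGloo

device = "cuda" if torch.cuda.is_available() else "cpu"
m = nn.Linear(16, 2).to(device)
inner_opt = optim.AdamW(m.parameters())

def state_dict():
    # Served to a recovering replica so it can catch up to the live weights.
    return {"model": m.state_dict(), "optim": inner_opt.state_dict()}

def load_state_dict(sd):
    m.load_state_dict(sd["model"])
    inner_opt.load_state_dict(sd["optim"])

manager = Manager(
    pg=ProcessGroupGloo(),        # Gloo for simplicity in this sketch
    min_replica_size=2,           # matches --min_replicas 2 on the lighthouse
    load_state_dict=load_state_dict,
    state_dict=state_dict,
)

model = DistributedDataParallel(manager, m)
optimizer = Optimizer(manager, inner_opt)  # only commits steps the quorum agrees on

for step in range(1000):
    batch = torch.rand(8, 16, device=device)
    optimizer.zero_grad()
    loss = model(batch).mean()
    loss.backward()
    optimizer.step()
    if step % 100 == 0:
        print(f"step {step} loss {loss.item():.6f}")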

In the middle of the training process, SSH into one of the nodes and kill the torchrun process.

Expected result: the final loss across all ranks is the same.

Actual result: the final loss on the node that experienced the interruption differs from the loss on the other two nodes.

[screenshot: final loss values on the three nodes, with the interrupted node's loss differing]
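
Beyond comparing the printed loss, the divergence can be confirmed on the weights themselves by hashing the final parameters on each node and comparing the digests by hand. A small helper along these lines (param_fingerprint is a name made up for this sketch, not something in train_ddp.py):

import hashlib
import socket

import torch

def param_fingerprint(model: torch.nn.Module) -> str:
    # Deterministic digest over all parameters, ordered by name.
    h = hashlib.sha256()
    for name, p in sorted(model.named_parameters()):
        h.update(name.encode())
        h.update(p.detach().cpu().numpy().tobytes())
    return h.hexdigest()

# At the very end of training in train_ddp.py:
# print(f"{socket.gethostname()} final fingerprint: {param_fingerprint(model)}")

If only the restarted node's digest differs, that would point at the recovery/state-transfer path rather than the regular allreduce.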
