Example train_ddp.py breaks #295

@kasakun

Hi, I was following the guide in the README to run torchft locally.

# start lighthouse
RUST_BACKTRACE=1 torchft_lighthouse --min_replicas 1 --quorum_tick_ms 100 --join_timeout_ms 10000

# start a replica in another shell
export REPLICA_GROUP_ID=0
export NUM_REPLICA_GROUPS=2

CUDA_VISIBLE_DEVICES=0 TORCHFT_LIGHTHOUSE=http://localhost:29510 torchrun --master_port=29600 --nnodes=1 --nproc_per_node=1 -- train_ddp.py

# start another replica
export REPLICA_GROUP_ID=1
export NUM_REPLICA_GROUPS=2

CUDA_VISIBLE_DEVICES=0 TORCHFT_LIGHTHOUSE=http://localhost:29510 torchrun --master_port=29601 --nnodes=1 --nproc_per_node=1 -- train_ddp.py

After I ran

export REPLICA_GROUP_ID=0
export NUM_REPLICA_GROUPS=2

CUDA_VISIBLE_DEVICES=0 TORCHFT_LIGHTHOUSE=http://localhost:29510 torchrun --master_port=29600 --nnodes=1 --nproc_per_node=1 -- train_ddp.py

It failed immediately, and I saw:

------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-12-02_07:33:17
  host      : xxxxxxxxxxxxxxxxxxxx
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 47674)
  error_file: /mnt/tmp/torchelastic_4m_9eon3/none_ywq7poit/attempt_0/0/error.json
  traceback : Traceback (most recent call last):
    File "/miniforge/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 357, in wrapper
      return f(*args, **kwargs)
    File "/mnt/task_runtime/train_ddp.py", line 192, in main
      loss.backward()
    File "/miniforge/lib/python3.10/site-packages/torch/_tensor.py", line 625, in backward
      torch.autograd.backward(
    File "/miniforge/lib/python3.10/site-packages/torch/autograd/__init__.py", line 354, in backward
      _engine_run_backward(
    File "/miniforge/lib/python3.10/site-packages/torch/autograd/graph.py", line 841, in _engine_run_backward
      return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
    File "/mnt/task_runtime/torchft/ddp.py", line 78, in _comm_hook
      assert fut._fut
  AttributeError: 'Future' object has no attribute '_fut'
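
For illustration only, here is a minimal sketch of why this surfaces as an `AttributeError` rather than an `AssertionError`: the attribute lookup in `assert fut._fut` runs before the assert can evaluate anything. `PlainFuture`, `WrappedFuture`, `strict_unwrap`, and `tolerant_unwrap` are hypothetical stand-ins, not torchft or PyTorch code, and the `getattr` fallback is only one possible way to guard the access, not a confirmed fix.

```python
class PlainFuture:
    """Hypothetical future type that has no `_fut` attribute."""

    def __init__(self, value):
        self.value = value


class WrappedFuture:
    """Hypothetical wrapper that keeps an inner future in `_fut`."""

    def __init__(self, inner):
        self._fut = inner


def strict_unwrap(fut):
    # Same shape as the failing line: `fut._fut` is looked up first,
    # so a missing attribute raises AttributeError before assert runs.
    assert fut._fut
    return fut._fut


def tolerant_unwrap(fut):
    # Assumption: falling back to the object itself is acceptable here.
    inner = getattr(fut, "_fut", None)
    return inner if inner is not None else fut


if __name__ == "__main__":
    plain = PlainFuture(1)
    try:
        strict_unwrap(plain)
    except AttributeError as exc:
        print(f"strict_unwrap failed: {exc}")
    print(tolerant_unwrap(plain) is plain)
    print(tolerant_unwrap(WrappedFuture(plain)) is plain)
```

Whether the hook should unwrap the future or accept a different future type depends on what changed between the two commits mentioned below.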

This seems to happen only on the latest main: 024f850
After I reset it to 8ef24c0, I no longer see this issue.

Some extra info, in case it helps:

python -c "import torch; print(torch.__version__)"
2.9.1+cu128
