Pipeline Parallelism Across Nodes Fails with EOFError #1852

@rrutmann

Bug description

We originally filed this issue against PyTorch, but since we are not sure whether it fits better in this repo, we are linking it here for reference:

pytorch/pytorch#165143

Could someone who has successfully run torchtitan with pipeline parallelism across multiple nodes please share a working setup that we could try?

Versions

| Component | Version / Info |
| --- | --- |
| PyTorch | 2.10.0.dev20251008+cu126; 2.10.0.dev20250918+cu128 (nightly); 2.10.0.dev20250924+cu130 |
| TorchTitan | current main branch (as of Oct 2025) |
| CUDA | 12.6; 12.8; 13.0 |
| Python | 3.10.12; 3.11.5; 3.12.3 |
| Cluster | 2× DGX nodes (tested with and without InfiniBand) |
| Launcher | torch.distributed.run |
| Backend | NCCL |
| Environment vars | Default unless otherwise specified |
