Bug description
We originally created this issue for PyTorch, but since we're not sure whether it fits better in this repo, we link it here for reference:
Could someone who successfully ran torchtitan with pipeline parallelism on multiple nodes please share a working setup that we could try?
Versions
| Component | Version / Info |
|---|---|
| PyTorch | 2.10.0.dev20251008+cu126; 2.10.0.dev20250918+cu128 (nightly); 2.10.0.dev20250924+cu130 |
| TorchTitan | current main branch (as of Oct 2025) |
| CUDA | 12.6; 12.8; 13.0 |
| Python | 3.10.12; 3.11.5; 3.12.3 |
| Cluster | 2× DGX nodes (tested with and without InfiniBand) |
| Launcher | `torch.distributed.run` |
| Backend | NCCL |
| Environment vars | Default unless otherwise specified |
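For reference, the kind of two-node launch we are attempting looks roughly like the sketch below. The rendezvous endpoint, config path, and the `--parallelism.pipeline_parallel_degree` override are placeholders/assumptions based on torchtitan's TOML config conventions, not a confirmed-working setup:

```shell
# Run on EACH of the 2 nodes, with NODE_RANK=0 on the master and NODE_RANK=1 on the other.
# MASTER_ADDR must be reachable from both nodes; port and config path are placeholders.
MASTER_ADDR=node0.example.com   # hypothetical hostname
NODE_RANK=0                     # set to 1 on the second node

python -m torch.distributed.run \
  --nnodes=2 \
  --nproc_per_node=8 \
  --node_rank=$NODE_RANK \
  --rdzv_backend=c10d \
  --rdzv_endpoint=$MASTER_ADDR:29500 \
  --rdzv_id=titan_pp_job \
  -m torchtitan.train \
  --job.config_file ./torchtitan/models/llama3/train_configs/debug_model.toml \
  --parallelism.pipeline_parallel_degree 2
```

The `torch.distributed.run` flags (`--nnodes`, `--nproc_per_node`, `--rdzv_*`) are standard; whether the torchtitan entry point and override flags match current main is exactly what we'd like a confirmed setup for.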