Commit d95dff9
Update base for Update on "[WIP][RFC] TorchFT integration"
**Summary**
This is a WIP TorchFT integration PR.
**Current Issues**
This does not work at the moment: groups hang when a new group joins.
**Issue 1:**
~Group 0 and group 1 hang during the first `should_commit` after group 1 applies the pending state_dict from group 0.~
Fixed with: pytorch/torchft#83
**Issue 2:**
~Group 0 and group 1 pass `should_commit`, but group 0 is then reported as needing healing, which is wrong, and the healing process causes another hang.~
Fixed with: pytorch/torchft#83
**Issue 3:**
~A byproduct of issues 1 and 2: group 1 keeps printing~
```
[rank0]:devgpu051:76838:80357 [0] misc/socket.cc:50 NCCL WARN socketProgress: Connection closed by remote peer devgpu051.cln3.svc.fbinfra.net<33618>
```
Fixed with pytorch/torchft#91 and several other fixes.
**Issue 4:**
When there are 3 groups, every group requests the state dict at every step.
***How to reproduce?***
Follow the **Reproduce steps** below to run 2 groups, then add a third group by modifying the command.
This seems to be fixed, but needs more testing.
**Issue 5:**
A hang occurs when functional collectives are used.
***How to reproduce?***
Pull the latest version of this PR, then comment out line 41 and uncomment line 42 in `torchtitan/utils.py`.
**Reproduce steps:**
1. Patch TorchFT with pytorch/torchft#82
2. Start the lighthouse
3. Execute the following command in one terminal:
```
TORCHFT_MANAGER_PORT=29520 REPLICA_GROUP_ID=0 CUDA_VISIBLE_DEVICES=0,1 NGPU=2 ./run_llama_train.sh --training.data_parallel_shard_degree=2 --experimental.enable_torchft --experimental.ft_replica_group_id=0
```
4. Wait 10 seconds, then execute the following command in another terminal:
```
TORCHFT_MANAGER_PORT=29522 REPLICA_GROUP_ID=1 CUDA_VISIBLE_DEVICES=2,3 NGPU=2 ./run_llama_train.sh --training.data_parallel_shard_degree=2 --experimental.enable_torchft --experimental.ft_replica_group_id=1
```
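For Issue 4, a third group has to be added by modifying the command. The sketch below prints (it does not run) a launch command for a given replica group ID. The helper name `replica_cmd` is hypothetical, and the port and GPU numbering are assumptions extrapolated from the group 0 and group 1 commands above (port `29520 + 2 * id`, GPUs `2 * id` and `2 * id + 1`); adjust them to whatever ports and devices are actually free.

```shell
#!/bin/sh
# Hypothetical helper: prints the launch command for one replica group.
# ASSUMPTION: ports step by 2 starting at 29520 and each group takes the
# next two GPUs, following the pattern of the group 0 and group 1 commands.
replica_cmd() {
    gid="$1"
    port=$((29520 + 2 * gid))
    first_gpu=$((2 * gid))
    second_gpu=$((first_gpu + 1))
    printf 'TORCHFT_MANAGER_PORT=%s REPLICA_GROUP_ID=%s CUDA_VISIBLE_DEVICES=%s,%s NGPU=2 ./run_llama_train.sh --training.data_parallel_shard_degree=2 --experimental.enable_torchft --experimental.ft_replica_group_id=%s\n' \
        "$port" "$gid" "$first_gpu" "$second_gpu" "$gid"
}

# Print the command for a third group (ID 2); copy it into a new terminal.
replica_cmd 2
```

With group ID 0 or 1 the helper reproduces the commands in steps 3 and 4, which is a quick way to check that the assumed numbering matches.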
[ghstack-poisoned]

1 parent: dc089a5