Commit dfcb08d
committed
Update on "[WIP][RFC] TorchFT integration"
**Summary**
This is a WIP TorchFT integration PR.
**Current Issues**
This doesn't work at this moment as there are hanged groups when a new group joins.
**Issue 1:**
~Group 0 and group 1 will hang during the first `should_commit` after group 1 applying the pending state_dict from group 0.~
Fixed with: meta-pytorch/torchft#83
**Issue 2:**
~Group 0 and group 1 will pass the `should_commit` but group 0 needs healing which is wrong and the healing process will cause another hang.~
Fixed with: meta-pytorch/torchft#83
**Issue 3:**
~The byproduct of issue 1 and issue 2: group 1 will continue to print out~
```
[rank0]:devgpu051:76838:80357 [0] misc/socket.cc:50 NCCL WARN socketProgress: Connection closed by remote peer devgpu051.cln3.svc.fbinfra.net<33618>
```
Fixed with meta-pytorch/torchft#91 and several other fixes.
**Issue 4:**
When there are 3 groups, everyone requests the state dict every step.
***How to reproduce?***
Using the `Reproduce steps` to run 2 groups, then add another group by modifying the command.
Seems to be fixed, will need more tests.
**Issue 5:**
Hang will happen if using functional collective.
***How to reproduce?***
Pull the latest version of this PR and comment out line 41 and uncomment line 42 in `torchtitan/utils.py`
**Reproduce steps:**
1. Patch TorchFT with meta-pytorch/torchft#82
2. Execute lighthouse
3. Execute the following command in one terminal:
```
TORCHFT_MANAGER_PORT=29520 REPLICA_GROUP_ID=0 CUDA_VISIBLE_DEVICES=0,1 NGPU=2 ./run_llama_train.sh --training.data_parallel_shard_degree=2 --experimental.enable_torchft --experimental.ft_replica_group_id=0
```
4. Wait 10 seconds, execute following command in another terminal:
```
TORCHFT_MANAGER_PORT=29522 REPLICA_GROUP_ID=1 CUDA_VISIBLE_DEVICES=2,3 NGPU=2 ./run_llama_train.sh --training.data_parallel_shard_degree=2 --experimental.enable_torchft --experimental.ft_replica_group_id=1
```
[ghstack-poisoned]2 files changed
+4
-39
lines changed| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
156 | 156 | | |
157 | 157 | | |
158 | 158 | | |
159 | | - | |
160 | | - | |
161 | | - | |
162 | | - | |
163 | | - | |
164 | | - | |
165 | | - | |
166 | | - | |
167 | | - | |
168 | | - | |
169 | | - | |
170 | | - | |
171 | | - | |
172 | | - | |
173 | | - | |
174 | | - | |
175 | | - | |
176 | | - | |
177 | | - | |
178 | | - | |
179 | | - | |
180 | | - | |
181 | | - | |
182 | | - | |
183 | | - | |
184 | | - | |
185 | | - | |
186 | | - | |
187 | | - | |
188 | | - | |
189 | | - | |
190 | | - | |
191 | | - | |
192 | | - | |
193 | | - | |
194 | 159 | | |
195 | 160 | | |
196 | 161 | | |
| |||
201 | 166 | | |
202 | 167 | | |
203 | 168 | | |
204 | | - | |
205 | 169 | | |
206 | 170 | | |
207 | 171 | | |
| |||
305 | 269 | | |
306 | 270 | | |
307 | 271 | | |
| 272 | + | |
308 | 273 | | |
309 | 274 | | |
310 | | - | |
311 | 275 | | |
312 | 276 | | |
313 | 277 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
408 | 408 | | |
409 | 409 | | |
410 | 410 | | |
| 411 | + | |
| 412 | + | |
| 413 | + | |
411 | 414 | | |
412 | | - | |
413 | | - | |
414 | 415 | | |
415 | 416 | | |
416 | 417 | | |
| |||
0 commit comments