Commit 401702a
Fix hardcoded local_world_size in dynamic resharding (meta-pytorch#4000)
Summary:
Pull Request resolved: meta-pytorch#4000
`_prepare_shard_distribution_comm_ops()` hardcoded `_is_intra_comm(src, dst, 8)`
to classify P2P communication as intra-node vs inter-node. This is wrong on
hardware with `local_world_size != 8` (e.g. GB200_HP with `local_world_size=2`),
causing inter-host P2P ops to be misclassified as intra-node.
Added `local_world_size` parameter (default 8 for backward compatibility).
Callers can now pass the actual value from their topology or process group.
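
For illustration only, a minimal sketch of the misclassification. It assumes ranks are
assigned contiguously per host, so two ranks share a node when they fall in the same
block of `local_world_size` ranks. `is_intra_comm` and `resolve_local_world_size` are
hypothetical stand-ins, not the actual helpers in this change. The `LOCAL_WORLD_SIZE`
environment variable is the one set by `torchrun`.

```python
import os
from typing import Mapping


def is_intra_comm(src: int, dst: int, local_world_size: int = 8) -> bool:
    """Hypothetical stand-in for `_is_intra_comm`.

    Assumes contiguous rank placement: ranks [0, lws) are on host 0,
    ranks [lws, 2 * lws) are on host 1, and so on.
    """
    if local_world_size <= 0:
        raise ValueError("local_world_size must be positive")
    return src // local_world_size == dst // local_world_size


def resolve_local_world_size(env: Mapping[str, str] = os.environ, default: int = 8) -> int:
    """Hypothetical helper: read LOCAL_WORLD_SIZE if present, else fall back to `default`."""
    return int(env.get("LOCAL_WORLD_SIZE", default))


# With 2 ranks per host, ranks 1 and 2 live on different hosts.
print(is_intra_comm(1, 2, local_world_size=2))  # False (inter-node)
# The old hardcoded value of 8 would classify the same pair as intra-node.
print(is_intra_comm(1, 2))  # True

# A caller could pass the value taken from its launcher environment.
lws = resolve_local_world_size({"LOCAL_WORLD_SIZE": "2"})
print(is_intra_comm(0, 1, local_world_size=lws))  # True (same host)
```

Keeping the default at 8 means existing callers behave as before, while callers on
other topologies opt in by passing their real value.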
Reviewed By: kausv
Differential Revision: D98986450
fbshipit-source-id: 66e04aea46e643cfbfe764f160a063364408ad10
1 parent 21e7aaf commit 401702a
1 file changed
Lines changed: 5 additions & 2 deletions
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| | 17 | + |
| | 246 | + |
| 297 | | - |
| | 299 | + |
| 384 | | - |
| | 386 | + |
| | 801 | + |