Commit cdc02c8

meg-lm async-tp

1 parent 1c0eaf9 commit cdc02c8
1 file changed: +7 -3 lines
training/model-parallelism/README.md (+7 -3)
@@ -325,8 +325,6 @@ Important: TP requires very fast network, and therefore since typically intra-node
 
 TP can be combined with other parallelization methods.
 
-One of the deficiencies of TP is that it's difficult to overlap the comms with compute. PyTorch is proposing to overcome this with [Async-TP](https://discuss.pytorch.org/t/distributed-w-torchtitan-introducing-async-tensor-parallelism-in-pytorch/209487) which decomposes the dependent sequence of all-gather + matmul into series of cudaMemcpyAsync calls and smaller partial matmuls - and it does it automatically for you using `torch.compile`!
-
 Alternative names:
 - DeepSpeed calls it [tensor slicing](https://www.deepspeed.ai/tutorials/large-models-w-deepspeed/)
 
@@ -340,14 +338,20 @@ Implementations:
 - [torchtitan](https://github.com/pytorch/torchtitan)
 
 
+### Async TP
+
+One of the deficiencies of TP is that it's difficult to overlap its comms with compute. PyTorch is proposing to overcome this with [Async-TP](https://discuss.pytorch.org/t/distributed-w-torchtitan-introducing-async-tensor-parallelism-in-pytorch/209487), which decomposes the dependent sequence of all-gather + matmul into a series of cudaMemcpyAsync calls and smaller partial matmuls - and it does it automatically for you using `torch.compile`!
+
+- [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) has it implemented as well via `--tp-comm-overlap`.
+
 
 ### Related reading
 
 - [Tensor Parallelism and Sequence Parallelism: Detailed Analysis](https://insujang.github.io/2024-01-11/tensor-parallelism-and-sequence-parallelism-detailed-analysis/#sequence-parallelism)
 
 ## TP+SP
 
-TP can be combined with SP in the same process group to minimize communication costs as explained in [Reducing Activation Recomputation in Large Transformer Models](https://arxiv.org/abs/2205.05198) - TP is used for attention and linear layers and when dropout and layer norm is reached SP is used instead.
+TP can be combined with SP in the same process group to minimize communication costs, as explained in [Reducing Activation Recomputation in Large Transformer Models](https://arxiv.org/abs/2205.05198). For example, in LLMs TP is used for the embedding, attention and linear layers, and when dropout and layer norm are reached SP is used instead.
 
 
 
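Likewise for the updated TP+SP paragraph, a toy single-process simulation may help (illustrative names; `chunk`, `cat` and `sum` stand in for the sequence sharding, all-gather and reduce-scatter collectives): a column-parallel + row-parallel MLP whose input and output stay sequence-sharded, which is where dropout and layer norm run under SP:

```python
# Illustrative sketch only: a single-process simulation of the TP+SP data flow
# for one MLP block. chunk/cat/sum stand in for the collectives.
import torch

tp = 4                        # tensor-parallel degree
seq, d, d_ff = 8, 16, 32      # toy sizes; seq must be divisible by tp

torch.manual_seed(0)
x  = torch.randn(seq, d)      # full activation
w1 = torch.randn(d, d_ff)     # first MLP linear  -> column-parallel under TP
w2 = torch.randn(d_ff, d)     # second MLP linear -> row-parallel under TP

# reference: the unsharded forward pass
reference = torch.relu(x @ w1) @ w2

# SP region: each rank holds only its shard of the sequence
x_shards  = x.chunk(tp, dim=0)
w1_shards = w1.chunk(tp, dim=1)   # column shards of w1
w2_shards = w2.chunk(tp, dim=0)   # row shards of w2

# entering the TP region: all-gather the sequence shards -> full x on every rank
x_full = torch.cat(x_shards, dim=0)

# each rank computes a partial result with its own weight shards
partials = [torch.relu(x_full @ w1_i) @ w2_i
            for w1_i, w2_i in zip(w1_shards, w2_shards)]

# leaving the TP region: reduce-scatter = sum the partials, keep your sequence
# shard; dropout and layer norm then run on these shards (the SP region)
z_shards = sum(partials).chunk(tp, dim=0)

torch.testing.assert_close(torch.cat(z_shards, dim=0), reference)
```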