- [PyTorch](https://pytorch.org/docs/stable/fsdp.html) (originally it was implemented in [FairScale](https://github.com/facebookresearch/fairscale/) and later it was upstreamed into the PyTorch core)
- [Pipeline-Parallelism: Distributed Training via Model Partitioning](https://siboehm.com/articles/22/pipeline-parallel-training)

## Tensor Parallelism

Parallelizing the multi-headed attention layers is even simpler, since they are already inherently parallel, due to having multiple independent heads!
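
The same splitting idea is easy to check for the MLP block. Here is a toy, single-process sketch (no distributed runtime - the loop over shards stands in for the TP ranks, and all names and sizes are made up) showing that splitting the first weight column-wise and the second weight row-wise reproduces the unsplit result with a single reduction at the end:

```python
import torch

torch.manual_seed(0)
dim, hidden, tp = 8, 32, 4          # toy sizes; tp = number of TP "ranks"
x = torch.randn(4, dim)             # activations (batch, dim)
A = torch.randn(dim, hidden)        # first linear weight, split column-wise
B = torch.randn(hidden, dim)        # second linear weight, split row-wise

y_ref = torch.relu(x @ A) @ B       # reference: the unsplit computation

# each "rank" holds one column-shard of A and the matching row-shard of B,
# computes its partial result independently, and a single all-reduce
# (here: just a sum) reconstructs the full output
A_shards = A.chunk(tp, dim=1)
B_shards = B.chunk(tp, dim=0)
y_tp = sum(torch.relu(x @ A_shards[r]) @ B_shards[r] for r in range(tp))

assert torch.allclose(y_ref, y_tp, atol=1e-5)
```

The element-wise nonlinearity is what makes the column-then-row ordering work: it can be applied to each column shard independently, so no communication is needed between the two matmuls.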

Important: TP requires a very fast network, and since intra-node networks are typically much faster than inter-node networks, it's not advisable to do TP across nodes. Practically, if a node has 4 GPUs, the highest TP degree is therefore 4. If you need a TP degree of 8, you need to use nodes that have at least 8 GPUs.

TP can be combined with other parallelization methods.

One of the deficiencies of TP is that it's difficult to overlap the comms with compute. PyTorch proposes to overcome this with [Async-TP](https://discuss.pytorch.org/t/distributed-w-torchtitan-introducing-async-tensor-parallelism-in-pytorch/209487), which decomposes the dependent sequence of all-gather + matmul into a series of cudaMemcpyAsync calls and smaller partial matmuls - and it does it automatically for you using `torch.compile`!
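
To make the dependency concrete, here is a minimal sketch of the baseline pattern that Async-TP targets (this is not PyTorch's Async-TP code; `naive_allgather_matmul`, `x_shard` and `w` are made-up names, and an already-initialized process group, e.g. via `torchrun`, is assumed):

```python
import torch
import torch.distributed as dist

def naive_allgather_matmul(x_shard: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """Baseline all-gather + matmul with communication fully serialized before compute."""
    world_size = dist.get_world_size()
    x_full = torch.empty(world_size * x_shard.shape[0], x_shard.shape[1],
                         dtype=x_shard.dtype, device=x_shard.device)
    # 1. blocking collective: every rank materializes the full activation
    dist.all_gather_into_tensor(x_full, x_shard)
    # 2. the matmul cannot start until the all-gather has finished; Async-TP
    #    instead streams the gathered chunks in and launches partial matmuls
    #    as they arrive, overlapping the copies with the compute
    return x_full @ w
```
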
Alternative names:
- DeepSpeed calls it [tensor slicing](https://www.deepspeed.ai/tutorials/large-models-w-deepspeed/)

Implementations:
- [OSLO](https://github.com/eleutherAI/Oslo) has a tensor parallelism implementation based on Transformers.
- [Tensor Parallelism and Sequence Parallelism: Detailed Analysis](https://insujang.github.io/2024-01-11/tensor-parallelism-and-sequence-parallelism-detailed-analysis/#sequence-parallelism)

## TP+SP

TP can be combined with SP in the same process group to minimize communication costs, as explained in [Reducing Activation Recomputation in Large Transformer Models](https://arxiv.org/abs/2205.05198) - TP is used for the attention and linear layers, and when dropout and layer norm are reached, SP is used instead.
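
Here is a minimal sketch of what this looks like with PyTorch's `torch.distributed.tensor.parallel` API (assuming a recent PyTorch, 2.5+; the tiny `Block` module and its submodule names are made-up stand-ins for a real transformer block, and the layout arguments approximate torchtitan-style plans rather than being a drop-in recipe - run it under `torchrun`):

```python
import os

import torch
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import Shard
from torch.distributed.tensor.parallel import (
    ColwiseParallel, RowwiseParallel, SequenceParallel, parallelize_module,
)

class Block(nn.Module):
    """Hypothetical mini block: norm -> up-projection -> relu -> down-projection."""
    def __init__(self, dim, hidden):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.up = nn.Linear(dim, hidden)
        self.down = nn.Linear(hidden, dim)

    def forward(self, x):
        return self.down(torch.relu(self.up(self.norm(x))))

torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))   # one GPU per rank under torchrun
tp = int(os.environ["WORLD_SIZE"])
mesh = init_device_mesh("cuda", (tp,))
block = Block(dim=1024, hidden=4096).cuda()

parallelize_module(block, mesh, {
    # SP: the norm runs on sequence-sharded activations (no weight sharding needed)
    "norm": SequenceParallel(),
    # TP: column-parallel linear; all-gather the sequence-sharded input first
    "up": ColwiseParallel(input_layouts=Shard(1)),
    # TP: row-parallel linear; reduce-scatter back to a sequence-sharded layout
    "down": RowwiseParallel(output_layouts=Shard(1)),
})

# each rank feeds (and gets back) its local shard of the sequence dimension
x = torch.randn(2, 2048 // tp, 1024, device="cuda")
y = block(x)
```

The total communication volume stays the same as with plain TP (one all-reduce becomes a reduce-scatter plus an all-gather), but the activations around the norm/dropout are now sequence-sharded, which is where the memory saving comes from.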

And since we have ZeRO, the other benefit is ZeRO-Offload. Since this is stage 1, optimizer states can be offloaded to CPU.

Implementations:
- [Megatron-DeepSpeed](https://github.com/microsoft/Megatron-DeepSpeed) and [Megatron-DeepSpeed from BigScience](https://github.com/bigscience-workshop/Megatron-DeepSpeed), which is a fork of the former repo.

PyTorch is also working on this feature and calling it Context Parallel (CP).
### DistFlashAttn

[DISTFLASHATTN: Distributed Memory-efficient Attention for Long-context LLMs Training](https://arxiv.org/abs/2310.03294) is reported to be many times faster than Ring Self-Attention because it load-balances the per-token KVQ computation between the workers while performing Sequence Parallelism.
### Related reading
- [Tensor Parallelism and Sequence Parallelism: Detailed Analysis](https://insujang.github.io/2024-01-11/tensor-parallelism-and-sequence-parallelism-detailed-analysis/#sequence-parallelism)