Commit 6b06a5a

new
1 parent c00267b commit 6b06a5a

2 files changed: +34 -2 lines changed

training/model-parallelism/README.md

Lines changed: 34 additions & 2 deletions
@@ -117,6 +117,7 @@ If you pay close attention the way ZeRO partitions the model's weights - it look
 Implementations of ZeRO-DP stages 1+2+3:
 - [DeepSpeed](https://www.deepspeed.ai/tutorials/zero/)
 - [PyTorch](https://pytorch.org/docs/stable/fsdp.html) (originally it was implemented in [FairScale](https://github.com/facebookresearch/fairscale/) and later it was upstreamed into the PyTorch core)
+- [torchtitan](https://github.com/pytorch/torchtitan)
 
 Deepspeed ZeRO Integration:
 - [HF Trainer integration](https://huggingface.co/docs/transformers/main_classes/deepspeed)
@@ -128,6 +129,7 @@ FSDP Integration:
 - [HF Trainer integration](https://huggingface.co/docs/transformers/main/en/fsdp)
 - [Accelerate](https://huggingface.co/docs/accelerate/main/en/usage_guides/fsdp)
 - [PyTorch Lightning](https://lightning.ai/docs/pytorch/stable/advanced/model_parallel/fsdp.html)
+- [torchtitan](https://github.com/pytorch/torchtitan)
 
 Important papers:
 
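A side note on the PyTorch/FSDP entries listed in the hunk above (not part of the commit): wrapping a model with the upstreamed FSDP implementation is essentially a one-liner. The model, optimizer and sizes below are placeholders, and real setups usually also pass an auto-wrap policy and mixed-precision config - treat this as a minimal sketch only.

```python
# Minimal FSDP (ZeRO-3-style full sharding) sketch - run under torchrun so the
# process-group environment variables are set. Model/sizes are placeholders.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = torch.nn.Sequential(               # stand-in for a real transformer
    torch.nn.Linear(1024, 4096),
    torch.nn.ReLU(),
    torch.nn.Linear(4096, 1024),
).cuda()

# each rank keeps only a shard of params/grads/optimizer state (ZeRO-3-like),
# gathering full params just-in-time for forward/backward
model = FSDP(model)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

x = torch.randn(8, 1024, device="cuda")
loss = model(x).pow(2).mean()
loss.backward()
optimizer.step()
```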
@@ -288,8 +290,13 @@ Implementations:
 - [OSLO](https://github.com/eleutherAI/Oslo) - this is implemented based on the Hugging Face Transformers.
 - [PiPPy: Pipeline Parallelism for PyTorch](https://github.com/pytorch/pippy) - automatic PP via `torch.fx`
 - [nanotron](https://github.com/huggingface/nanotron)
+- [torchtitan](https://github.com/pytorch/torchtitan)
 
 
+### Related reading
+
+- [Pipeline-Parallelism: Distributed Training via Model Partitioning](https://siboehm.com/articles/22/pipeline-parallel-training)
+
 
 
 ## Tensor Parallelism
@@ -316,9 +323,9 @@ Parallelizing the multi-headed attention layers is even simpler, since they are
 
 Important: TP requires very fast network, and therefore since typically intra-node networks are much faster than inter-node networks it's not advisable to do TP across nodes. Practically, if a node has 4 GPUs, the highest TP degree is therefore 4. If you need a TP degree of 8, you need to use nodes that have at least 8 GPUs.
 
-Important: TP degree shouldn't span across nodes. For example if the node has 8 gpus, TP degree should be no more than 8.
+TP can be combined with other parallelization methods.
 
-TP can combined with other parallelization methods.
+One of the deficiencies of TP is that it's difficult to overlap the comms with compute. PyTorch is proposing to overcome this with [Async-TP](https://discuss.pytorch.org/t/distributed-w-torchtitan-introducing-async-tensor-parallelism-in-pytorch/209487) which decomposes the dependent sequence of all-gather + matmul into a series of cudaMemcpyAsync calls and smaller partial matmuls - and it does it automatically for you using `torch.compile`!
 
 Alternative names:
 - DeepSpeed calls it [tensor slicing](https://www.deepspeed.ai/tutorials/large-models-w-deepspeed/)
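The Async-TP line added in the hunk above describes decomposing a dependent all-gather + matmul so that communication overlaps with compute. As a rough illustration of that idea only - not part of the commit, and not how PyTorch's Async-TP is actually implemented (the real thing uses cudaMemcpyAsync over symmetric memory and is applied automatically by `torch.compile`) - here is a hand-written sketch that passes the shards around a ring with point-to-point ops and multiplies each shard as soon as it arrives; the function name and shapes are made up:

```python
import torch
import torch.distributed as dist

def overlapped_allgather_matmul(x_shard: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """Compute all_gather(x_shard) @ w without waiting for the full gather:
    shards travel around a ring, and each shard is multiplied by `w` while the
    transfer of the next shard is already in flight. Conceptual sketch only."""
    rank, world = dist.get_rank(), dist.get_world_size()
    send_to, recv_from = (rank + 1) % world, (rank - 1) % world

    partials = [None] * world      # partial matmul results, indexed by shard owner
    current, owner = x_shard.contiguous(), rank

    for step in range(world):
        if step < world - 1:
            recv_buf = torch.empty_like(current)
            # kick off the transfer of the next shard first ...
            reqs = dist.batch_isend_irecv([
                dist.P2POp(dist.isend, current, send_to),
                dist.P2POp(dist.irecv, recv_buf, recv_from),
            ])
        # ... and do the partial matmul while that transfer is in flight
        partials[owner] = current @ w
        if step < world - 1:
            for req in reqs:
                req.wait()
            current, owner = recv_buf, (owner - 1) % world

    # same result as: torch.cat(all_gather(x_shard), dim=0) @ w
    return torch.cat(partials, dim=0)
```

Run under `torchrun` with an initialized process group; the only point is that the matmul for the shard already on hand no longer has to wait for the gather of the next one.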
@@ -330,6 +337,18 @@ Implementations:
 - [OSLO](https://github.com/eleutherAI/Oslo) has the tensor parallelism implementation based on the Transformers.
 - [nanotron](https://github.com/huggingface/nanotron)
 - [parallelformers](https://github.com/tunib-ai/parallelformers) (only inference at the moment)
+- [torchtitan](https://github.com/pytorch/torchtitan)
+
+
+
+### Related reading
+
+- [Tensor Parallelism and Sequence Parallelism: Detailed Analysis](https://insujang.github.io/2024-01-11/tensor-parallelism-and-sequence-parallelism-detailed-analysis/#sequence-parallelism)
+
+## TP+SP
+
+TP can be combined with SP in the same process group to minimize communication costs as explained in [Reducing Activation Recomputation in Large Transformer Models](https://arxiv.org/abs/2205.05198) - TP is used for the attention and linear layers and when dropout and layer norm are reached SP is used instead.
+
 
 
 ## DP+PP
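To make the TP+SP combination added in the hunk above more concrete, here is a sketch using PyTorch's DTensor-based TP API (`parallelize_module` with `ColwiseParallel`/`RowwiseParallel`/`SequenceParallel`, recent PyTorch): the layer norm runs on sequence-sharded activations while the linear pair stays tensor-parallel. The toy block, module names and shapes are made up, and a real model (e.g. torchtitan's plan) needs additional input/output layout annotations - this is a sketch, not a recipe from the commit.

```python
import os
import torch
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import Shard
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    SequenceParallel,
    parallelize_module,
)

class ToyBlock(nn.Module):
    """norm -> w1 -> relu -> w2, standing in for one transformer sub-block."""
    def __init__(self, dim=1024, hidden=4096):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.w1 = nn.Linear(dim, hidden)
        self.w2 = nn.Linear(hidden, dim)

    def forward(self, x):                      # x: [batch, local_seq, dim]
        return self.w2(torch.relu(self.w1(self.norm(x))))

tp = int(os.environ["WORLD_SIZE"])             # launched via torchrun
mesh = init_device_mesh("cuda", (tp,))
torch.manual_seed(0)                           # identical init on all ranks
block = ToyBlock().cuda()

parallelize_module(block, mesh, {
    # SP: the norm operates on activations sharded along the sequence dim
    "norm": SequenceParallel(),
    # TP: Megatron-style column/row parallel pair; w1 gathers the
    # sequence-sharded input, w2 reduce-scatters back to sequence-sharded
    "w1": ColwiseParallel(input_layouts=Shard(1)),
    "w2": RowwiseParallel(output_layouts=Shard(1)),
})

# every rank feeds (and gets back) only its own slice of the sequence
x_local = torch.randn(2, 2048 // tp, 1024, device="cuda")
y_local = block(x_local)                       # [2, 2048 // tp, 1024]
```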
@@ -349,6 +368,7 @@ Implementations:
 - [SageMaker](https://arxiv.org/abs/2111.05972)
 - [OSLO](https://github.com/eleutherAI/Oslo)
 - [nanotron](https://github.com/huggingface/nanotron)
+- [torchtitan](https://github.com/pytorch/torchtitan)
 
 
 
@@ -369,6 +389,7 @@ Implementations:
 - [SageMaker](https://arxiv.org/abs/2111.05972)
 - [OSLO](https://github.com/eleutherAI/Oslo)
 - [nanotron](https://github.com/huggingface/nanotron)
+- [torchtitan](https://github.com/pytorch/torchtitan)
 
 
 ## ZeRO DP+PP+TP
@@ -388,6 +409,7 @@ And since we have ZeRO, the other benefit is ZeRO-Offload. Since this is stage 1
 Implementations:
 - [Megatron-DeepSpeed](https://github.com/microsoft/Megatron-DeepSpeed) and [Megatron-Deepspeed from BigScience](https://github.com/bigscience-workshop/Megatron-DeepSpeed), which is the fork of the former repo.
 - [OSLO](https://github.com/eleutherAI/Oslo)
+- [torchtitan](https://github.com/pytorch/torchtitan)
 
 Important papers:
 
@@ -466,9 +488,19 @@ SP Implementations:
 - [Megatron-LM](https://github.com/NVIDIA/Megatron-LM)
 - [Deepspeed](https://github.com/microsoft/DeepSpeed)
 - [Colossal-AI](https://colossalai.org/)
+- [torchtitan](https://github.com/pytorch/torchtitan)
 
 PyTorch is also working on this feature and calling it Context Parallel (CP).
 
+### DistFlashAttn
+
+[DISTFLASHATTN: Distributed Memory-efficient Attention for Long-context LLMs Training](https://arxiv.org/abs/2310.03294) is reported to be many times faster than Ring Self-Attention, because it load balances the KVQ per token computation between the workers while performing Sequence Parallelism.
+
+![distflashattn](images/dist-flash-attn.png)
+
+### Related reading
+
+- [Tensor Parallelism and Sequence Parallelism: Detailed Analysis](https://insujang.github.io/2024-01-11/tensor-parallelism-and-sequence-parallelism-detailed-analysis/#sequence-parallelism)
 
 
 ## Expert Parallelism