[MoE][PoC] Expert Parallel: dp2ep #732
base: gh/tianyu-l/26/base
Conversation
[ghstack-poisoned]
ghstack-source-id: 17160930f23950b91faca7b822cd3e7f9d075f7d Pull Request resolved: #732
ghstack-source-id: 2a70ed917b742c32118ef5ca02f161f833ce46bc Pull Request resolved: #732
Expert parallelism degree. 1 means disabled.
When expert_parallel_mode is 'tp' or 'tp2ep', it has to be equal to tensor_parallel_degree.
When expert_parallel_mode is 'dp2ep', it has to be k * context_parallel_degree,
where k >= 1 and k | data_parallel_shard_degree.
This comment isn't clear.
What does k | data_parallel_shard_degree mean?
It stands for data_parallel_shard_degree % k == 0
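A minimal sketch of the check this constraint implies (the helper name and the example degrees are hypothetical, not taken from the PR):

```python
# Hypothetical helper illustrating the constraint quoted above:
# expert_parallel_degree == k * context_parallel_degree, with k >= 1
# and data_parallel_shard_degree % k == 0 (i.e. k | data_parallel_shard_degree).
def validate_dp2ep_degrees(
    expert_parallel_degree: int,
    context_parallel_degree: int,
    data_parallel_shard_degree: int,
) -> int:
    k, rem = divmod(expert_parallel_degree, context_parallel_degree)
    if rem != 0 or k < 1:
        raise ValueError(
            "expert_parallel_degree must be k * context_parallel_degree with k >= 1"
        )
    if data_parallel_shard_degree % k != 0:
        raise ValueError("k must divide data_parallel_shard_degree")
    return k

# e.g. dp_shard=8, cp=2, ep=4  ->  k=2
assert validate_dp2ep_degrees(4, 2, 8) == 2
```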
'tp2ep' would use the entire TP mesh to shard non-shared experts on the num_experts dimension.
""",
choices=["none", "tp", "tp2ep", "dp2ep"],
help="Expert Parallel mode",
dp2ep here would be using the DP mesh to shard non-shared experts on the num_experts dimension? If so, could you make it clear in the comments?
dp2ep would use "the entire cp mesh (if existing) + part of the dp_shard mesh (namely dp_shard_2)" to shard non-shared experts.
Sorry for the confusion -- these PRs are not meant to land without changes. We'll definitely polish the descriptions later. Reading parallel_dims.py might be more informative for now.
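A rough numeric sketch of how the degrees could compose under dp2ep, based on the explanation above (the dp_shard_1/dp_shard_2 split and the example values are illustrative; parallel_dims.py in the PR is the authoritative source):

```python
# Illustrative numbers: dp_shard=8, cp=2, ep=4, so k = ep // cp = 2.
data_parallel_shard_degree = 8
context_parallel_degree = 2
expert_parallel_degree = 4

k = expert_parallel_degree // context_parallel_degree
assert expert_parallel_degree == k * context_parallel_degree and k >= 1
assert data_parallel_shard_degree % k == 0

# Conceptually, dp_shard splits into dp_shard_1 x dp_shard_2:
#   - dp_shard_2 (size k) joins the cp mesh to form the EP mesh that shards
#     non-shared experts on the num_experts dimension;
#   - dp_shard_1 is what remains of dp_shard for sharding the expert params.
dp_shard_2 = k                                  # 2
dp_shard_1 = data_parallel_shard_degree // k    # 4
ep_mesh_size = dp_shard_2 * context_parallel_degree
assert ep_mesh_size == expert_parallel_degree   # 4
```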
Stack from ghstack (oldest at bottom):
Temporary changes to unblock exploration
Turn foreach and clip_grad_norm_ off, as not all parameters are DTensors on the same meshes (e.g. (1) MoE non-shared experts and other params are on different FSDP meshes, and (2) moe.router.gate is a replicated torch.Tensor); see the sketch at the end of this section.
Also need to set full_graph=False, because there will be an additional FSDP inside a TransformerBlock at the non-shared experts level.

Things won't work
Not including
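As referenced above, a minimal sketch of the foreach / clip_grad_norm_ workaround, assuming a standard PyTorch training loop (the model, optimizer, and max_norm values are placeholders, not the PR's actual setup):

```python
import torch

# Placeholder model standing in for the torchtitan MoE Transformer.
model = torch.nn.Linear(16, 16)

# Disable the multi-tensor (foreach) optimizer path, since in this PoC
# parameters are DTensors placed on different meshes.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, foreach=False)

def training_step(inputs: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    loss = torch.nn.functional.mse_loss(model(inputs), targets)
    loss.backward()
    # Gradient clipping likewise falls back to the per-tensor implementation.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0, foreach=False)
    optimizer.step()
    optimizer.zero_grad()
    return loss
```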