@willmj I noticed there is some inconsistency in the logic, although the behavior is correct.
When creating the ScatterMoE we use num_experts_per_device. When ep_degree > 1, this results in the router weights having num_experts_per_device outputs.
But the router weights need to be replicated across devices, and this does happen in load_experts_onto_device, because the state_dict sd loaded there will always result in the full-sized router.
So we end up with this inconsistency:
(Pdb) mod
Linear(in_features=1536, out_features=20, bias=False)
(Pdb) mod.weight.shape
torch.Size([40, 1536])
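For reference, here is a minimal sketch of how a mismatch like this can arise (the sizes are taken from the example above, with ep_degree = 2 assumed; this is not the actual ScatterMoE code): the Linear is constructed with the per-device expert count, and then its weight Parameter is swapped for the full-sized router from the loaded state_dict, which changes the tensor shape but not the module's recorded out_features.

```python
import torch
import torch.nn as nn

# Hypothetical sizes for illustration only
num_experts = 40                                    # full router size in the state_dict
ep_degree = 2                                       # expert-parallel degree
num_experts_per_device = num_experts // ep_degree   # 20
hidden_size = 1536

# Router built with the per-device expert count ...
router = nn.Linear(hidden_size, num_experts_per_device, bias=False)

# ... but the loaded state_dict holds the full-sized, replicated router,
# and assigning a new Parameter only swaps the tensor, not the module metadata.
router.weight = nn.Parameter(torch.empty(num_experts, hidden_size))

print(router)               # Linear(in_features=1536, out_features=20, bias=False)
print(router.weight.shape)  # torch.Size([40, 1536])
```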
@willmj in the above example, we expect the out_features of the Linear to equal 40. The problem comes from here: we created the Linear module with the wrong size and then overwrote its parameters with the correctly sized weights, causing this inconsistency.
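As a sketch of the consistent alternative (again with the sizes from the example, not the actual ScatterMoE code): if the router Linear is built with the full, replicated expert count in the first place, its out_features matches the loaded weights, and loading through load_state_dict rather than overwriting .weight directly would also shape-check the assignment.

```python
import torch
import torch.nn as nn

num_experts = 40
hidden_size = 1536

# Build the router with the full, replicated size up front ...
router = nn.Linear(hidden_size, num_experts, bias=False)

# ... then load_state_dict keeps module metadata and weights consistent,
# and raises an error if the shapes do not match.
router.load_state_dict({"weight": torch.empty(num_experts, hidden_size)})

print(router)               # Linear(in_features=1536, out_features=40, bias=False)
print(router.weight.shape)  # torch.Size([40, 1536])
```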