
Logic Inconsistency In ScatterMoE during Expert Parallel #121

Open
fabianlim opened this issue Jan 24, 2025 · 2 comments
Labels: bug, help wanted

Comments

fabianlim (Contributor) commented Jan 24, 2025

@willmj I noticed there is some inconsistency in the logic, although the behavior is correct

  1. When creating the ScatterMoE we use num_experts_per_device. When ep_degree > 1, this results in the router weights having only num_experts_per_device outputs.
  2. But the router weights need to be replicated across devices, and this does happen in load_experts_onto_device, because the state_dict sd loaded there always contains the full-sized router.

So we end up with this inconsistency:

(Pdb) mod
Linear(in_features=1536, out_features=20, bias=False)
(Pdb) mod.weight.shape
torch.Size([40, 1536])
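
A minimal sketch in plain PyTorch of how this mismatch can arise (not the actual ScatterMoE construction path; hidden_size, num_experts, and full_router_weight are illustrative stand-ins):

```python
import torch
import torch.nn as nn

hidden_size = 1536
num_experts = 40                                    # full (global) expert count
ep_degree = 2
num_experts_per_device = num_experts // ep_degree   # 20

# Router built with the per-device expert count ...
router = nn.Linear(hidden_size, num_experts_per_device, bias=False)

# ... then overwritten with the full-sized router weight, standing in for the
# state_dict loaded in load_experts_onto_device.
full_router_weight = torch.randn(num_experts, hidden_size)
router.weight = nn.Parameter(full_router_weight)

print(router)               # Linear(in_features=1536, out_features=20, bias=False)
print(router.weight.shape)  # torch.Size([40, 1536])
```

The out_features attribute is set once in the constructor and is not updated when the weight is swapped, which is why the module repr and the actual weight shape disagree.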
@fabianlim added the bug and help wanted labels on Jan 24, 2025
willmj (Collaborator) commented Jan 24, 2025

Can you share what we should expect from this behavior?

fabianlim (Contributor, Author) commented Jan 24, 2025

@willmj in the above example, we expect the out_features of Linear to equal 40. The problem arises because we created the Linear module with the wrong size and then overwrote its parameters with the correctly sized weights, causing this inconsistency.
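
Continuing the sketch above (again illustrative, not a proposed patch), the inconsistency goes away if the router is constructed with the full expert count up front, since its weights are replicated across devices anyway, or if the module metadata is kept in sync when the weight is replaced:

```python
# Option A: build the router with the full expert count from the start.
router = nn.Linear(hidden_size, num_experts, bias=False)

# Option B: if the weight must be overwritten after construction, also update
# the bookkeeping attribute so the module describes itself correctly.
router.weight = nn.Parameter(full_router_weight)
router.out_features = full_router_weight.shape[0]   # 40
```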
