@willmj I noticed there is some inconsistency in the logic, although the behavior is correct.
When creating the ScatterMoE we use num_experts_per_device. When ep_degree > 1, this results in the router weights having num_experts_per_device outputs.
But the router weights need to be replicated across devices, and this does happen in load_experts_onto_device, because the state_dict sd loaded there will always result in the full-sized router.
So we end up with this inconsistency:
(Pdb) mod
Linear(in_features=1536, out_features=20, bias=False)
(Pdb) mod.weight.shape
torch.Size([40, 1536])
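For reference, here is a minimal sketch of how a mismatch like this can arise (the sizes are taken from the example above, with ep_degree = 2 assumed; this is not the actual ScatterMoE code): the Linear is constructed with the per-device expert count, and then its weight Parameter is swapped for the full-sized router from the loaded state_dict, which changes the tensor shape but not the module's recorded out_features.

```python
import torch
import torch.nn as nn

# Hypothetical sizes for illustration only
num_experts = 40                                    # full router size in the state_dict
ep_degree = 2                                       # expert-parallel degree
num_experts_per_device = num_experts // ep_degree   # 20
hidden_size = 1536

# Router built with the per-device expert count ...
router = nn.Linear(hidden_size, num_experts_per_device, bias=False)

# ... but the loaded state_dict holds the full-sized, replicated router,
# and assigning a new Parameter only swaps the tensor, not the module metadata.
router.weight = nn.Parameter(torch.empty(num_experts, hidden_size))

print(router)               # Linear(in_features=1536, out_features=20, bias=False)
print(router.weight.shape)  # torch.Size([40, 1536])
```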
@willmj in the above example, we expect the out_features of the Linear to equal 40. The problem comes from here: we created the Linear module with the wrong size and then overwrote its parameters with the correctly sized weights, causing this inconsistency.
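As a sketch of the consistent alternative (again with the sizes from the example, not the actual ScatterMoE code): if the router Linear is built with the full, replicated expert count in the first place, its out_features matches the loaded weights, and loading through load_state_dict rather than overwriting .weight directly would also shape-check the assignment.

```python
import torch
import torch.nn as nn

num_experts = 40
hidden_size = 1536

# Build the router with the full, replicated size up front ...
router = nn.Linear(hidden_size, num_experts, bias=False)

# ... then load_state_dict keeps module metadata and weights consistent,
# and raises an error if the shapes do not match.
router.load_state_dict({"weight": torch.empty(num_experts, hidden_size)})

print(router)               # Linear(in_features=1536, out_features=40, bias=False)
print(router.weight.shape)  # torch.Size([40, 1536])
```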