🚀 The feature, motivation and pitch
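MiniMax (and other models) may apply RMSNorm inside the sharded attention region. In that case the RMSNorm weights must be sharded the same way as the rest of the weights in that region; otherwise each rank holds a full-size norm weight that no longer lines up with its slice of the activations.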
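To make the request concrete, here is a minimal pure-Python sketch of what sharding the norm weight could look like. It is an illustration only, not the project's implementation: the helper names (`shard`, `sharded_rms_norm`) are hypothetical, ranks are simulated with plain lists, and it assumes the norm reduces over the whole sharded dimension, so the sum of squares has to be combined across ranks (an all-reduce in a real setup). If the norm is applied per head instead, only the weight slicing would be needed.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Reference (unsharded) RMSNorm over a 1-D list of floats."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale * w for v, w in zip(x, weight)]


def shard(values, rank, world_size):
    """Return the contiguous slice of `values` owned by `rank` (hypothetical helper)."""
    assert len(values) % world_size == 0, "dimension must divide evenly"
    n = len(values) // world_size
    return values[rank * n:(rank + 1) * n]


def sharded_rms_norm(x, weight, world_size, eps=1e-6):
    """Simulate RMSNorm where activations and norm weight are both sharded."""
    x_shards = [shard(x, r, world_size) for r in range(world_size)]
    # The norm weight is sliced exactly like the activations it scales.
    w_shards = [shard(weight, r, world_size) for r in range(world_size)]

    # Each rank computes a local sum of squares; summing them here stands in
    # for an all-reduce across ranks (assumption: norm spans the full dim).
    total_sq = sum(sum(v * v for v in xs) for xs in x_shards)
    scale = 1.0 / math.sqrt(total_sq / len(x) + eps)

    # Each rank scales only its own slice with its own weight slice.
    return [
        [v * scale * w for v, w in zip(xs, ws)]
        for xs, ws in zip(x_shards, w_shards)
    ]


x = [1.0, -2.0, 3.0, -4.0, 5.0, -6.0, 7.0, -8.0]
weight = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]

full = rms_norm(x, weight)
parts = sharded_rms_norm(x, weight, world_size=4)
gathered = [v for part in parts for v in part]
```

Concatenating the per-rank outputs in `gathered` should match `full` up to floating-point rounding, which is the property a sharded norm weight needs to preserve.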
Alternatives
No response
Additional context
No response