
[DeepSeek] remove numpy, avoid tolist in gatherd_idxs #1019


Merged: 3 commits into pytorch:main on Mar 26, 2025

Conversation

garrett361 (Contributor)

Removes the numpy usage and tolist CUDA sync when computing gatherd_idxs.
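For context, the pattern at stake looks roughly like the sketch below. This is a minimal illustration, not the actual torchtitan code: the function names, the splits/order layout, and the total argument are all assumptions. The point is that .tolist() forces a device-to-host sync before the indices can be built, while cumsum/arange/repeat_interleave can build the same gather indices entirely on the GPU.

import torch

def gather_idxs_host(splits: torch.Tensor, order: torch.Tensor) -> torch.Tensor:
    # Host-side variant (illustrative): .tolist() blocks until the GPU catches up.
    counts = splits.tolist()                       # device -> host sync
    starts, acc = [], 0
    for c in counts:
        starts.append(acc)
        acc += c
    idx = []
    for o in order.tolist():                       # another sync
        idx.extend(range(starts[o], starts[o] + counts[o]))
    return torch.tensor(idx, device=splits.device)

def gather_idxs_device(splits: torch.Tensor, order: torch.Tensor, total: int) -> torch.Tensor:
    # Device-side variant: same permutation, no host round-trip.
    # `total` is the token count, already known on the host from the buffer's shape.
    in_starts = torch.cumsum(splits, 0) - splits       # exclusive prefix sum of chunk sizes
    sizes = splits[order]                              # chunk sizes in the output order
    out_starts = torch.cumsum(sizes, 0) - sizes
    # For each output slot: start of its source chunk + offset within that chunk.
    src_start = in_starts[order].repeat_interleave(sizes, output_size=total)
    within = torch.arange(total, device=splits.device) - out_starts.repeat_interleave(sizes, output_size=total)
    return src_start + within

Either way the result feeds a buffer[gatherd_idxs]-style gather; only the device-side version keeps the CPU from stalling on the kernel queue while it assembles the indices.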

@facebook-github-bot added the CLA Signed label (managed by the Meta Open Source bot) on Mar 26, 2025
@garrett361 force-pushed the deepseek-indexing branch 2 times, most recently from a453229 to 1f1b16d, on March 26, 2025 at 19:17
@kwen2501 (Contributor)

Thanks much for the PR! It would indeed look much nicer!

Do you mind testing your PR with torchrun --standalone --nproc-per-node 4 generate.py?

I am seeing this result:

Output: You are a helpful assistant.
User: What is 2+2?

Assistant: ' ' ' ' ' ' ' ' ' ': ': ': ': ': ': ': ': ': ': ': ': ': ': ': ': ': ': ': ': ': ': ': ': ': ': ': ': ': ': ': ': ': ': ': ': ': ': ': ': ':

@kwen2501 self-requested a review on March 26, 2025 at 19:21
@garrett361 (Contributor, Author)

Hmm, weird; yes, I'm seeing the same. I hadn't tested generate.py, but I made some small unit tests and also checked the losses and grad norm in run.py for a randomly initialized model.

Let me look into it and I'll ping you when I figure it out.

@garrett361 (Contributor, Author)

Ah, it was just the self.ep_rank * self.experts_per_rank offset I added to the gatherd_idx. The conventions here are slightly different from what I'm using elsewhere, and that misunderstanding affected my unit tests, too. Removed the offset and it's fixed now (a short sketch of the local-vs-global numbering follows below):

⬢ [podman] ❯ torchrun --standalone --nproc-per-node 4 torchtitan/experiments/deepseek_v3/generate.py
W0326 19:53:05.600000 1950302 torch/distributed/run.py:792]
W0326 19:53:05.600000 1950302 torch/distributed/run.py:792] *****************************************
W0326 19:53:05.600000 1950302 torch/distributed/run.py:792] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0326 19:53:05.600000 1950302 torch/distributed/run.py:792] *****************************************
Creating model stage 0 of 2
Creating model stage 0 of 2
Creating model stage 1 of 2
Creating model stage 1 of 2
EP rank [0]: Created Symmetric Memory for MoE
Running inference with deepseek-ai/DeepSeek-V2-Lite-Chat on (2, 2) mesh
EP rank [1]: Created Symmetric Memory for MoE
EP rank [1]: Created Symmetric Memory for MoE
EP rank [0]: Created Symmetric Memory for MoE

Output: You are a helpful assistant.
User: What is 2+2?

Assistant: 2 + 2 equals 4.

Closing inference mesh...

Apologies for that; I should have tested more carefully.
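A small illustration of the convention mix-up described above; the values are made up and the names simply mirror the comment, not the actual module code.

# Each EP rank owns a contiguous block of experts.
experts_per_rank = 4   # assumed value, for illustration only
ep_rank = 1

local_expert = 2                                            # index within this rank
global_expert = ep_rank * experts_per_rank + local_expert   # 6, index across all ranks

# If the surrounding code already expects local expert numbering, adding the
# ep_rank * experts_per_rank offset shifts every gather index by a full rank's
# worth of experts, which is what produced the scrambled generate.py output above.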

@kwen2501 (Contributor) left a review comment:


Looks great! Thank you for the nice code!

@EugenHotaj (Contributor)

@garrett361 Nice add! Do you notice any speed boost from removing the CPU sync?

@kwen2501 merged commit bab83a4 into pytorch:main on Mar 26, 2025 (5 of 6 checks passed)
@garrett361 (Contributor, Author)

Thanks @EugenHotaj! Good question, but I didn't time it, and the {run,generate}.py scripts don't currently report any timing metrics, so I can't say.
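For anyone who does want to measure it, a rough sketch with CUDA events follows; compute_idxs here is a stand-in for whichever callable builds the indices, not an existing function in the repo.

import torch

def time_cuda_ms(fn, *args, iters=50):
    # Average time per call in milliseconds, measured with CUDA events.
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    fn(*args)                      # warm-up
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn(*args)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

# Usage sketch: time_cuda_ms(compute_idxs, splits, order)

Note that the cost of the host sync is mostly the stall while the CPU waits for queued kernels, so an isolated micro-benchmark can understate the end-to-end effect; timing a full generate.py or run.py step would be more representative.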
