PPO vs GRPO time and space efficiency #73

Open · Lineark opened this issue Feb 15, 2025 · 2 comments
Comments

Lineark commented Feb 15, 2025
You've mentioned that the specific RL algorithm doesn't matter much in terms of final accuracy, so it makes sense to prefer the more efficient method. Do you have any statistics or rough estimates of how PPO and GRPO compare in time and memory cost?
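For reference, the main structural difference on the memory side is that GRPO drops PPO's learned critic and instead normalizes rewards within a group of sampled responses per prompt. A minimal sketch of that advantage computation (illustrative only, not the TinyZero/verl implementation):

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages.

    rewards: (num_prompts, group_size) scalar reward for each sampled response.
    No value network is involved; the group mean acts as the baseline.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# PPO, by contrast, trains a separate critic (often roughly policy-sized) to
# estimate values for GAE, which adds its own parameters, gradients,
# optimizer state, and activations to the memory footprint.
```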

Lineark (author) commented Feb 15, 2025

From your experiment logs at https://wandb.ai/jiayipan/TinyZero/workspace, I compared the only GRPO run with the corresponding PPO run. The PPO run uses 8 machines with larger GPU memory (vs. 2 machines for the GRPO run), yet PPO's speedup over GRPO appears to be smaller than the ratio of compute power, so it seems that under this setting GRPO is the more efficient choice?
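To make the comparison concrete, the metric implied here is throughput normalized by GPU count. The numbers and GPUs-per-node below are placeholders, not values read off the wandb logs:

```python
def per_gpu_throughput(steps_per_hour: float, num_gpus: int) -> float:
    """Training steps per GPU-hour; higher means more compute-efficient."""
    return steps_per_hour / num_gpus

# Hypothetical illustration only (assumes 8 GPUs per machine):
ppo_eff = per_gpu_throughput(steps_per_hour=40.0, num_gpus=64)   # 8 machines
grpo_eff = per_gpu_throughput(steps_per_hour=15.0, num_gpus=16)  # 2 machines
print(f"PPO: {ppo_eff:.3f}  GRPO: {grpo_eff:.3f} steps / GPU-hour")
```

If PPO's wall-clock speedup is smaller than its GPU-count ratio, its per-GPU throughput comes out lower, which is the sense in which GRPO would be "more efficient" here.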

sworddish commented

It seems that if the base model can sometimes produce a good response and has "seen" good reasoning traces during the pre-training phase, the choice of RL method does not matter much.
