You've mentioned that the specific RL algorithm doesn't matter much (in terms of final accuracy). So it makes sense to prefer the more efficient method. Do you have any statistics or rough estimates of how they compare?
From your experiment logs here (https://wandb.ai/jiayipan/TinyZero/workspace), I compared the only GRPO run with the corresponding PPO run. The PPO run uses 8 machines with larger GPU memory (vs. 2 machines for the GRPO run), yet PPO's speedup over GRPO appears to be smaller than the compute ratio, so under this setting GRPO seems to be more efficient?
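A rough way to sanity-check this is to normalize each run's wall-clock throughput by the number of machines it used. Here is a minimal sketch of that calculation; the step times below are hypothetical placeholders, not values from the actual wandb logs:

```python
# Hedged sketch: compare PPO vs GRPO efficiency after normalizing by compute.
# All numbers are placeholders -- substitute values read from the runs at
# https://wandb.ai/jiayipan/TinyZero/workspace.

def per_machine_throughput(seconds_per_step: float, num_machines: int) -> float:
    """Training throughput (steps/sec) divided by the number of machines used."""
    steps_per_sec = 1.0 / seconds_per_step
    return steps_per_sec / num_machines

# Placeholder measurements (NOT from the actual logs):
ppo_eff = per_machine_throughput(seconds_per_step=120.0, num_machines=8)
grpo_eff = per_machine_throughput(seconds_per_step=180.0, num_machines=2)

print(f"PPO  steps/sec per machine: {ppo_eff:.5f}")
print(f"GRPO steps/sec per machine: {grpo_eff:.5f}")
# If GRPO's per-machine throughput is higher, it is the more compute-efficient
# choice under this setting (differences in GPU memory per machine aside).
```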
It seems that if the base model can sometimes produce a good response, and has 'seen' good reasoning traces during the pre-training phase, then the choice of RL method does not matter much.