You've mentioned that the specific RL algorithm doesn't matter much (in terms of final accuracy). So it makes sense to prefer the more efficient method. Do you have any statistics or rough estimates of how they compare?
From your experiment logs here (https://wandb.ai/jiayipan/TinyZero/workspace), I compared the only GRPO run with the corresponding PPO run. The PPO run uses 8 machines with larger GPU memory (vs. 2 machines for the GRPO run), yet PPO's speedup over GRPO appears to be smaller than the compute ratio, so under this setting GRPO seems to be more efficient?
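A rough way to sanity-check this is to normalize each run's wall-clock throughput by the number of machines it used. Here is a minimal sketch of that calculation; the step times below are hypothetical placeholders, not values from the actual wandb logs:

```python
# Hedged sketch: compare PPO vs GRPO efficiency after normalizing by compute.
# All numbers are placeholders -- substitute values read from the runs at
# https://wandb.ai/jiayipan/TinyZero/workspace.

def per_machine_throughput(seconds_per_step: float, num_machines: int) -> float:
    """Training throughput (steps/sec) divided by the number of machines used."""
    steps_per_sec = 1.0 / seconds_per_step
    return steps_per_sec / num_machines

# Placeholder measurements (NOT from the actual logs):
ppo_eff = per_machine_throughput(seconds_per_step=120.0, num_machines=8)
grpo_eff = per_machine_throughput(seconds_per_step=180.0, num_machines=2)

print(f"PPO  steps/sec per machine: {ppo_eff:.5f}")
print(f"GRPO steps/sec per machine: {grpo_eff:.5f}")
# If GRPO's per-machine throughput is higher, it is the more compute-efficient
# choice under this setting (differences in GPU memory per machine aside).
```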
It seems that if the base model can sometimes produce a good response, and has 'seen' good reasoning traces during the pre-training phase, then the choice of RL method does not matter much.