Infini-AI-Lab/M2PO

Prosperity before Collapse: How Far Can Off-Policy RL Reach with Stale Data on LLMs?

Haizhong Zheng¹, Jiawei Zhao², Beidi Chen¹
¹Carnegie Mellon University, ²Meta AI


TL;DR: Stale data can be as informative as on-policy data when it is exploited properly. We introduce M2PO (Second-Moment Trust Proxy Optimization), which constrains the second moment of the importance weights to stabilize training. Extensive evaluation across six model scales (1.7B–32B) shows that M2PO trains stably off-policy even when the data is stale by at least 256 model updates, and matches on-policy performance.
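As a rough illustration of what constraining the second moment of importance weights can look like, the sketch below computes the mean squared log-ratio between the current and behavior policies over a batch of tokens, then masks the most off-policy tokens until that statistic drops to a threshold. The function names, the use of the log-ratio, the greedy masking rule, the shrinking denominator, and the threshold value are all assumptions made for illustration; none of them is taken from this repository's code.

```python
def second_moment(log_ratios):
    """Mean of squared log importance ratios; 0.0 for an empty batch."""
    if not log_ratios:
        return 0.0
    return sum(r * r for r in log_ratios) / len(log_ratios)


def mask_until_below(log_ratios, threshold):
    """Return a keep-mask (list of bools).

    Tokens are dropped from largest to smallest |log-ratio| until the
    second moment over the remaining tokens is <= threshold.
    This greedy rule is an illustrative assumption.
    """
    keep = [True] * len(log_ratios)
    # Visit tokens from most to least off-policy.
    order = sorted(range(len(log_ratios)),
                   key=lambda i: abs(log_ratios[i]),
                   reverse=True)
    total = sum(r * r for r in log_ratios)
    count = len(log_ratios)
    for i in order:
        if count == 0 or total / count <= threshold:
            break
        keep[i] = False
        total -= log_ratios[i] ** 2
        count -= 1
    return keep


# Hypothetical per-token log-ratios: log pi_new(token) - log pi_old(token).
example_log_ratios = [0.05, -0.1, 0.9, 0.02, -0.6]
example_mask = mask_until_below(example_log_ratios, threshold=0.04)
```

In this toy batch only the two tokens with the largest deviations (0.9 and -0.6) are masked; the remaining tokens still contribute to the update.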

M2PO Overview

Figure 1: Comparison of on-policy GRPO and off-policy training under a staleness of 256 model updates on Qwen-2.5-32B. Left: Standard GRPO degrades with stale rollouts, while removing the trust region (GRPO no TR) reveals a clear prosperity-before-collapse phenomenon. In contrast, M2PO trains stably and matches on-policy performance even under high staleness. Right: The token clipping ratio shows that M2PO triggers far fewer clipping events than GRPO while still avoiding training collapse.
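One way to read the "token clipping ratio" in the right panel is as the fraction of tokens whose importance ratio falls outside a PPO-style range [1 - eps, 1 + eps]. The sketch below computes that fraction. The function name, the symmetric range, and the default eps = 0.2 are assumptions for illustration; the repository may measure clipping differently.

```python
import math


def clip_fraction(log_ratios, eps=0.2):
    """Fraction of tokens whose ratio exp(log_ratio) lies outside [1 - eps, 1 + eps]."""
    if not log_ratios:
        return 0.0
    clipped = sum(
        1 for lr in log_ratios
        if not (1.0 - eps <= math.exp(lr) <= 1.0 + eps)
    )
    return clipped / len(log_ratios)


# Hypothetical per-token log-ratios between the current and behavior policies.
example_fraction = clip_fraction([0.0, 0.1, 0.5, -0.5])
```

Here two of the four tokens (log-ratios 0.5 and -0.5) fall outside the range, so half of the batch counts as clipped.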

Getting Started

Our implementation is based on volcengine/verl (v0.4.0).

1. Environment Setup

cd project-folder

conda create -n verl05 python==3.11
conda activate verl05

USE_MEGATRON=0 bash scripts/install_vllm_sglang_mcore.sh
pip install latex2sympy2-extended
pip install math-verify

cd M2PO
pip3 install --no-deps -e .

2. Download & Preprocess Data

You can download the dataset using the following command:

# cd the project folder

bash train-scripts/generate_datasets.sh

3. Training

Train Qwen2.5-Math-7B on 8×H100 with M2PO (first script) or with the GRPO baselines (remaining scripts):

bash train-scripts/m2po-qwen-math-7b-s256.sh
bash train-scripts/grpo-qwen-math-7b-s0.sh
bash train-scripts/grpo-qwen-math-7b-s256.sh

See the train-scripts folder for more scripts.
