Prosperity before Collapse: How Far Can Off-Policy RL Reach with Stale Data on LLMs?

Haizhong Zheng¹, Jiawei Zhao², Beidi Chen¹
¹Carnegie Mellon University, ²Meta AI

TL;DR Our work shows that stale data can be as informative as on-policy data if exploited properly. We introduce M2PO (Second-Moment Trust Proxy Optimization), which constrains the second moment of importance weights to stabilize training. Extensive evaluation across six model scales (1.7B–32B) demonstrates that M2PO achieves stable off-policy training even with data stale by at least 256 updates, matching on-policy performance.
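To make the idea of a second-moment constraint concrete, here is a minimal pure-Python sketch. It is not the repository's implementation. It assumes the second moment is taken over per-token log importance ratios, and it uses a greedy rule that drops the most extreme tokens until the constraint holds. The function names `second_moment` and `mask_to_budget`, the threshold `tau=0.04`, and the sample values are all hypothetical, chosen only for illustration.

```python
def second_moment(log_ratios):
    """Mean of squared log importance ratios over a batch of tokens."""
    if not log_ratios:
        return 0.0
    return sum(x * x for x in log_ratios) / len(log_ratios)


def mask_to_budget(log_ratios, tau=0.04):
    """Greedily drop the most extreme tokens until the second moment of the
    kept tokens is at most tau. Returns a keep-mask (list of bool).

    Assumption: both the greedy rule and the default tau are illustrative.
    """
    squares = [x * x for x in log_ratios]
    # Visit tokens from the largest squared log-ratio to the smallest.
    order = sorted(range(len(squares)), key=lambda i: squares[i], reverse=True)
    keep = [True] * len(squares)
    total, count = sum(squares), len(squares)
    for i in order:
        if count == 0 or total / count <= tau:
            break  # constraint satisfied (or nothing left to drop)
        keep[i] = False
        total -= squares[i]
        count -= 1
    return keep


# Hypothetical log ratios log(pi_new / pi_behavior) for five tokens.
log_ratios = [0.01, -0.02, 0.5, 0.03, -1.0]
mask = mask_to_budget(log_ratios, tau=0.04)
kept = [x for x, k in zip(log_ratios, mask) if k]
print(mask, second_moment(kept))
```

In this sketch only the two outlier tokens are masked, while the three tokens with small ratios still contribute to the update. This is the intuition behind constraining a batch-level statistic rather than clipping each token independently.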

🗞️ News

[2025.10.2] Blog post released: Prosperity Before Collapse – M2PO.
[2025.10.2] Paper preprint available on arXiv.

Figure 1: Comparison of on-policy GRPO and off-policy training under a staleness of 256 model updates on Qwen-2.5-32B. Left: Standard GRPO suffers from degradation with stale rollouts, while removing the trust region (GRPO no TR) reveals a clear prosperity-before-collapse phenomenon. In contrast, M2PO achieves stable training and matches on-policy performance even under high staleness. Right: Token clipping ratio comparison shows that M2PO dramatically reduces clipping events compared to GRPO, while avoiding training collapse.
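For readers unfamiliar with the token clipping ratio mentioned in the caption, the sketch below shows one common way to count clipped tokens under a PPO-style trust region. It is an illustration, not the metric code used to produce the figure. The function name `clip_fraction` and the default `eps=0.2` are assumptions.

```python
def clip_fraction(ratios, advantages, eps=0.2):
    """Fraction of tokens whose gradient is cut off by PPO-style clipping.

    A token counts as clipped when the importance ratio has already moved
    past the trust region in the direction the advantage would push it.
    """
    if not ratios:
        return 0.0
    clipped = 0
    for r, a in zip(ratios, advantages):
        if (a > 0 and r > 1.0 + eps) or (a < 0 and r < 1.0 - eps):
            clipped += 1
    return clipped / len(ratios)


# Hypothetical per-token importance ratios and advantages.
ratios = [1.0, 1.3, 0.7, 1.3, 0.7]
advantages = [1.0, 1.0, 1.0, -1.0, -1.0]
print(clip_fraction(ratios, advantages))  # 0.4
```

With stale rollouts the ratios drift further from 1, so a per-token rule like this one discards more tokens, which is consistent with the high clipping ratio the caption reports for GRPO.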

Getting Started

Our implementation is based on volcengine/verl (v0.4.0).

1. Environment Setup

cd project-folder

conda create -n verl05 python==3.11
conda activate verl05

USE_MEGATRON=0 bash scripts/install_vllm_sglang_mcore.sh
pip install latex2sympy2-extended
pip install math-verify

cd M2PO
pip3 install --no-deps -e .
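After the steps above, a quick check that the main packages are importable can save a failed training launch. The snippet below is a small sketch. The module names in `required` are assumed import names for the packages installed above (for example, `math_verify` for `math-verify`), and the helper `missing_modules` is hypothetical.

```python
import importlib.util


def missing_modules(names):
    """Return the subset of top-level module names that cannot be found."""
    return [n for n in names if importlib.util.find_spec(n) is None]


# Assumed import names for the packages installed in this section.
required = ["verl", "vllm", "math_verify", "latex2sympy2_extended"]
missing = missing_modules(required)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules found.")
```

Run it inside the activated `verl05` environment. Any name it reports as missing points to an install step that needs to be repeated.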

2. Download & Preprocess Data

Download and preprocess the datasets with the following command:

# cd the project folder

bash train-scripts/generate_datasets.sh

3. Training

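Train Qwen2.5-Math-7B on 8×H100. The first script runs M2PO; the two GRPO scripts are, judging by their names, baselines at staleness 0 (s0) and 256 (s256):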

bash train-scripts/m2po-qwen-math-7b-s256.sh

bash train-scripts/grpo-qwen-math-7b-s0.sh

bash train-scripts/grpo-qwen-math-7b-s256.sh

See the train-scripts folder for more scripts.

Repository: Infini-AI-Lab/M2PO
