FloyedShen/VESPO


VESPO: Variational Sequence-Level Soft Policy Optimization for Stable Off-Policy LLM Training

Paper: https://arxiv.org/abs/2602.10693

📖 Overview

Off-policy updates are inevitable in RL for LLMs due to rollout staleness, asynchronous training, and training-inference mismatches. VESPO incorporates variance reduction into a variational formulation and derives a closed-form reshaping kernel that operates directly on sequence-level importance weights without token-level approximation or length normalization.
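As a point of reference, the sequence-level importance weight that the reshaping kernel operates on is the full likelihood ratio of the sampled response, i.e., the exponentiated sum of per-token log-probability differences, with no per-token clipping and no length normalization. A minimal sketch (the function name is ours, not the repository's):

```python
import math

def sequence_importance_weight(logp_pi, logp_mu):
    """Sequence-level IS weight W = pi(y|x) / mu(y|x), computed as the
    exponentiated sum of per-token log-prob differences (no token-level
    approximation, no length normalization)."""
    return math.exp(sum(p - q for p, q in zip(logp_pi, logp_mu)))

# Token log-probs under the target policy pi and the behavior policy mu
logp_pi = [-0.2, -1.1, -0.5]
logp_mu = [-0.3, -0.9, -0.7]
W = sequence_importance_weight(logp_pi, logp_mu)  # exp(0.1 - 0.2 + 0.2) = exp(0.1)
```

Because the log-differences are summed over the whole sequence before exponentiation, W can drift far from 1 for long or stale rollouts, which is exactly the regime the reshaping kernel is designed to stabilize.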

(Left) The proposal Q* balances proximity to both the behavior policy μ and the target π under an importance-weight budget. (Right) Training reward with staleness N=4: VESPO remains stable while GRPO and SAPO collapse. VESPO scales to 64× staleness and fully asynchronous training without divergence.

From a REINFORCE perspective, what matters is the effective coefficient on ∇log π, i.e., φ(W)=W·f'(W) (bottom row). VESPO's gamma-shaped kernel provides separate control over positive and negative advantages, offering more flexibility than hard clipping or fixed normalization.
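The relation φ(W) = W·f'(W) is easy to probe numerically. The smooth kernel below is an illustrative stand-in chosen only because its coefficient W·e^(−W) has a gamma-like rise-and-decay shape; it is not VESPO's closed-form kernel, and all function names here are ours:

```python
import math

def phi(f, w, eps=1e-6):
    """Effective REINFORCE coefficient phi(W) = W * f'(W),
    with f'(W) estimated by a central finite difference."""
    return w * (f(w + eps) - f(w - eps)) / (2 * eps)

# f(W) = W: plain importance sampling, so phi(W) = W (unbounded as W grows)
linear = lambda w: w
# f(W) = log W: phi(W) = 1 for all W (coefficient fully flattened)
logf = lambda w: math.log(w)
# Illustrative soft kernel (NOT VESPO's kernel): f(W) = 1 - exp(-W),
# whose coefficient phi(W) = W * exp(-W) rises, peaks at W = 1, then decays
soft = lambda w: 1.0 - math.exp(-w)

for w in (0.5, 1.0, 2.0, 4.0):
    print(w, phi(linear, w), phi(logf, w), phi(soft, w))
```

Plain importance sampling lets the coefficient grow without bound, the log objective erases the weight entirely, and a soft kernel sits in between, down-weighting sequences whose weights drift far from 1. Tuning this shape separately for positive and negative advantages is what the gamma-shaped kernel described above adds over hard clipping.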

📊 Main Results

VESPO's robustness extends to fully asynchronous training, where rollout and training run on separate node groups with multi-step policy lag.

Training dynamics under fully asynchronous training on Qwen3-30B-A3B-Base. VESPO maintains stable training and achieves the highest reward and benchmark accuracy.

Note: For complete results across different staleness ratios (N=4 to 64), model scales, and ablation studies, please refer to our paper.

🚀 Getting Started

The core VESPO policy loss is in core_algos.py. Training scripts are under recipe/vespo/run/.

1. Install — follow the veRL documentation to set up the environment.

2. Prepare data

cd recipe/vespo/tools
python preprocess_datasets.py

3. Train

Edit the model and data paths in the script, then launch with a Ray cluster:

# Synchronous (N=8, 32 GPUs)
bash recipe/vespo/run/sync/vespo_N_8.sh

# Fully asynchronous (48 rollout + 16 train GPUs)
bash recipe/vespo/run/fully_async/vespo_S_1.0_N_4.sh

Tip: Additional synchronous scripts for other staleness ratios (N=16, 32, 64) are available under recipe/vespo/run/sync/.

📝 Citation

If you find this work useful, please consider citing:

@misc{shen2026vespovariationalsequencelevelsoft,
  title={VESPO: Variational Sequence-Level Soft Policy Optimization for Stable Off-Policy LLM Training},
  author={Guobin Shen and Chenxiao Zhao and Xiang Cheng and Lei Huang and Xing Yu},
  year={2026},
  eprint={2602.10693},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2602.10693},
}

Attribution

Our implementation is based on a recent version of veRL.
