Off-policy updates are inevitable in RL for LLMs due to rollout staleness, asynchronous training, and training-inference mismatches. VESPO incorporates variance reduction into a variational formulation and derives a closed-form reshaping kernel that operates directly on sequence-level importance weights without token-level approximation or length normalization.
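To make "sequence-level importance weights" concrete, here is a minimal sketch (not the repo's implementation) of how such a weight is typically formed: the ratio W = π(y|x)/μ(y|x) computed as the exponential of the summed per-token log-probability difference, with no token-level clipping or length normalization. The token log-probs below are made-up toy values.

```python
import math

def sequence_importance_weight(logp_target, logp_behavior):
    """Sequence-level importance weight W = pi(y|x) / mu(y|x).

    Computed in log space as the sum over token log-prob differences,
    then exponentiated once for the whole sequence -- no per-token
    ratio approximation and no division by sequence length.
    """
    log_w = sum(logp_target) - sum(logp_behavior)
    return math.exp(log_w)

# Toy example: a three-token response where the target policy pi is
# slightly more confident than the behavior policy mu on every token.
logp_pi = [-1.0, -0.5, -2.0]
logp_mu = [-1.2, -0.6, -2.1]
W = sequence_importance_weight(logp_pi, logp_mu)  # exp(0.4) ~ 1.49
```

A reshaping kernel such as VESPO's then acts on this single scalar W per sequence, rather than on each token's ratio independently.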
Figure: (Left) The proposal Q* balances proximity to both the behavior policy μ and the target π under an importance-weight budget. (Right) Training reward with staleness N=4: VESPO remains stable while GRPO and SAPO collapse.

VESPO scales to 64× staleness and fully asynchronous training without divergence.
From a REINFORCE perspective, what matters is the effective coefficient on ∇log π, i.e., φ(W)=W·f'(W) (bottom row). VESPO's gamma-shaped kernel provides separate control over positive and negative advantages, offering more flexibility than hard clipping or fixed normalization.
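The contrast between hard clipping and a gamma-shaped kernel can be illustrated numerically. Below, `phi_clip` is the effective coefficient under PPO-style hard clipping (it drops to zero outside the trust region), while `phi_gamma` is an illustrative gamma-shaped coefficient φ(W) = W^a·e^(−bW) that rises, peaks, and decays smoothly. The exact VESPO kernel and its parameters are derived in the paper; `a` and `b` here are made-up for illustration.

```python
import math

def phi_clip(w, eps=0.2):
    # Hard clipping: f'(w) = 1 inside [1 - eps, 1 + eps], 0 outside,
    # so phi(w) = w * f'(w) is cut off abruptly at the clip boundary.
    return w if (1 - eps) <= w <= (1 + eps) else 0.0

def phi_gamma(w, a=2.0, b=1.0):
    # Illustrative gamma-shaped coefficient phi(w) = w^a * exp(-b*w):
    # small and large weights are down-weighted smoothly rather than
    # zeroed out, with a peak at w = a / b.
    return (w ** a) * math.exp(-b * w)

for w in [0.5, 1.0, 1.5, 3.0]:
    print(f"W={w:.1f}  clip={phi_clip(w):.4f}  gamma={phi_gamma(w):.4f}")
```

Using different (a, b) for positive- and negative-advantage sequences is what gives a gamma-shaped kernel the separate control that hard clipping lacks.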
VESPO's robustness extends to fully asynchronous training, where rollout and training run on separate node groups with multi-step policy lag.
Figure: Training dynamics under fully asynchronous training on Qwen3-30B-A3B-Base. VESPO maintains stable training and achieves the highest reward and benchmark accuracy.
> **Note**
> For complete results across different staleness ratios (N=4 to 64), model scales, and ablation studies, please refer to our paper.
The core VESPO policy loss is implemented in `core_algos.py`. Training scripts are under `recipe/vespo/run/`.
1. Install — follow the veRL documentation to set up the environment.
2. Prepare data

```shell
cd recipe/vespo/tools
python preprocess_datasets.py
```

3. Train
Edit the model and data paths in the script, then launch with a Ray cluster:
```shell
# Synchronous (N=8, 32 GPUs)
bash recipe/vespo/run/sync/vespo_N_8.sh

# Fully asynchronous (48 rollout + 16 train GPUs)
bash recipe/vespo/run/fully_async/vespo_S_1.0_N_4.sh
```

> **Tip**
> Additional synchronous scripts for other staleness ratios (N=16, 32, 64) are available under `recipe/vespo/run/sync/`.
If you find this work useful, please consider citing:
```bibtex
@misc{shen2026vespovariationalsequencelevelsoft,
      title={VESPO: Variational Sequence-Level Soft Policy Optimization for Stable Off-Policy LLM Training},
      author={Guobin Shen and Chenxiao Zhao and Xiang Cheng and Lei Huang and Xing Yu},
      year={2026},
      eprint={2602.10693},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2602.10693},
}
```

Our implementation is based on a recent version of veRL.