Off-policy updates are inevitable in RL for LLMs due to rollout staleness, asynchronous training, and training-inference mismatches. VESPO incorporates variance reduction into a variational formulation and derives a closed-form reshaping kernel that operates directly on sequence-level importance weights without token-level approximation or length normalization.
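To make "sequence-level importance weights" concrete, here is a minimal sketch (not the repo's implementation) of how such a weight is typically formed: the ratio W = π(y|x)/μ(y|x) computed as the exponential of the summed per-token log-probability difference, with no token-level clipping or length normalization. The token log-probs below are made-up toy values.

```python
import math

def sequence_importance_weight(logp_target, logp_behavior):
    """Sequence-level importance weight W = pi(y|x) / mu(y|x).

    Computed in log space as the sum over token log-prob differences,
    then exponentiated once for the whole sequence -- no per-token
    ratio approximation and no division by sequence length.
    """
    log_w = sum(logp_target) - sum(logp_behavior)
    return math.exp(log_w)

# Toy example: a three-token response where the target policy pi is
# slightly more confident than the behavior policy mu on every token.
logp_pi = [-1.0, -0.5, -2.0]
logp_mu = [-1.2, -0.6, -2.1]
W = sequence_importance_weight(logp_pi, logp_mu)  # exp(0.4) ~ 1.49
```

A reshaping kernel such as VESPO's then acts on this single scalar W per sequence, rather than on each token's ratio independently.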
Figure: (Left) The proposal Q* balances proximity to both the behavior policy μ and the target π under an importance-weight budget. (Right) Training reward with staleness N=4: VESPO remains stable while GRPO and SAPO collapse.

VESPO scales to 64× staleness and fully asynchronous training without divergence.
From a REINFORCE perspective, what matters is the effective coefficient on ∇log π, i.e., φ(W)=W·f'(W) (bottom row). VESPO's gamma-shaped kernel provides separate control over positive and negative advantages, offering more flexibility than hard clipping or fixed normalization.
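The contrast between hard clipping and a gamma-shaped kernel can be illustrated numerically. Below, `phi_clip` is the effective coefficient under PPO-style hard clipping (it drops to zero outside the trust region), while `phi_gamma` is an illustrative gamma-shaped coefficient φ(W) = W^a·e^(−bW) that rises, peaks, and decays smoothly. The exact VESPO kernel and its parameters are derived in the paper; `a` and `b` here are made-up for illustration.

```python
import math

def phi_clip(w, eps=0.2):
    # Hard clipping: f'(w) = 1 inside [1 - eps, 1 + eps], 0 outside,
    # so phi(w) = w * f'(w) is cut off abruptly at the clip boundary.
    return w if (1 - eps) <= w <= (1 + eps) else 0.0

def phi_gamma(w, a=2.0, b=1.0):
    # Illustrative gamma-shaped coefficient phi(w) = w^a * exp(-b*w):
    # small and large weights are down-weighted smoothly rather than
    # zeroed out, with a peak at w = a / b.
    return (w ** a) * math.exp(-b * w)

for w in [0.5, 1.0, 1.5, 3.0]:
    print(f"W={w:.1f}  clip={phi_clip(w):.4f}  gamma={phi_gamma(w):.4f}")
```

Using different (a, b) for positive- and negative-advantage sequences is what gives a gamma-shaped kernel the separate control that hard clipping lacks.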
VESPO's robustness extends to fully asynchronous training, where rollout and training run on separate node groups with multi-step policy lag.
Figure: Training dynamics under fully asynchronous training on Qwen3-30B-A3B-Base. VESPO maintains stable training and achieves the highest reward and benchmark accuracy.
> **Note**
> For complete results across different staleness ratios (N=4 to 64), model scales, and ablation studies, please refer to our paper.
The core VESPO policy loss is implemented in `core_algos.py`. Training scripts are under `recipe/vespo/run/`.
1. Install — follow the veRL documentation to set up the environment.
2. Prepare data

```shell
cd recipe/vespo/tools
python preprocess_datasets.py
```

3. Train
Edit the model and data paths in the script, then launch with a Ray cluster:
```shell
# Synchronous (N=8, 32 GPUs)
bash recipe/vespo/run/sync/vespo_N_8.sh

# Fully asynchronous (48 rollout + 16 train GPUs)
bash recipe/vespo/run/fully_async/vespo_S_1.0_N_4.sh
```

> **Tip**
> Additional synchronous scripts for other staleness ratios (N=16, 32, 64) are available under `recipe/vespo/run/sync/`.
If you find this work useful, please consider citing:
```bibtex
@misc{shen2026vespovariationalsequencelevelsoft,
      title={VESPO: Variational Sequence-Level Soft Policy Optimization for Stable Off-Policy LLM Training},
      author={Guobin Shen and Chenxiao Zhao and Xiang Cheng and Lei Huang and Xing Yu},
      year={2026},
      eprint={2602.10693},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2602.10693},
}
```

Our implementation is based on a recent version of veRL.