Hi Multiverse Team,
Thank you for the impressive results! We've been doing some analysis on the reported results and have a few questions.
Specifically, we’re trying to reproduce the results presented in the Multiverse paper, both with the released models and with a model we retrained ourselves using the released training code.
We’ve noticed a few differences we’re hoping to better understand:
- For the Autoregressive-32B, we could mostly reproduce the performance on AIME25 and MATH500. Interestingly, our reproduced AIME24 results are higher than reported, and this holds for both the released and reproduced models.
- On our end, the Multiverse-32B model underperforms relative to the paper’s reported numbers, although our retrained version gets closer on AIME25 and MATH500. The largest gap remains on AIME24.
For context, our reproduced AIME24/25 scores are computed using Avg@16 (i.e., averaged over 16 samples per prompt). We use the released Multiverse SGLang Engine to evaluate the Multiverse models.
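To make the metric concrete, here is a minimal sketch of how we compute Avg@16 (the `generate_sample` and `is_correct` helpers below are placeholders for our actual SGLang sampling and answer-grading code, not part of the released tooling):

```python
# Minimal sketch of our Avg@16 computation (hypothetical helper names:
# `generate_sample` and `is_correct` stand in for our actual sampling
# and answer-grading code).
from statistics import mean

def avg_at_k(prompts, generate_sample, is_correct, k: int = 16) -> float:
    """Accuracy averaged over k independent samples per prompt."""
    per_prompt_acc = []
    for prompt in prompts:
        # Draw k independent completions for this prompt.
        samples = [generate_sample(prompt) for _ in range(k)]
        # Fraction of the k samples graded as correct.
        per_prompt_acc.append(mean(is_correct(prompt, s) for s in samples))
    # Average the per-prompt accuracies over the benchmark.
    return mean(per_prompt_acc)
```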
Additionally, we’re curious about the s1.1 results. Table 5 of the s1 paper reports s1.1 scores of 56.7 (AIME24) and 50.0 (AIME25) without budget forcing. However, Table 1 of the Multiverse paper reports s1.1 at 52.9 (AIME24) and 41.7 (AIME25).
Would you mind suggesting what we should look into? We believe these differences might stem from the evaluation protocol. We also tested models such as DeepSeek-R1-Distill-Qwen-32B and were able to reproduce their reported results on AIME24, so our evaluation pipeline does not appear to be biased on AIME24, at least for those models. Would you be able to share an early preview of your evaluation protocol with us, or perhaps provide pointers on the prompt set to use for evaluation?