Hi Multiverse Team,
Thank you for the impressive results! We've been doing some analysis on the reported results and have a few questions.
Specifically, we’re trying to reproduce the results presented in the Multiverse paper, both with the released models and with a model we retrained ourselves using the released training code.
We’ve noticed a few differences we’re hoping to better understand:
- For the Autoregressive-32B, we could mostly reproduce the performance on AIME25 and MATH500. Interestingly, our reproduced AIME24 results are higher than reported, and this holds for both the released and reproduced models.
- On our end, the Multiverse-32B model underperforms relative to the paper’s reported numbers, although our retrained version gets closer on AIME25 and MATH500. The largest gap remains on AIME24.
For context, our reproduced AIME24/25 scores are computed using Avg@16 (i.e., averaged over 16 samples per prompt). We use the released Multiverse SGLang Engine to evaluate the Multiverse models.
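To make the metric concrete, here is a minimal sketch of how we compute Avg@16 (the `generate_sample` and `is_correct` helpers below are placeholders for our actual SGLang sampling and answer-grading code, not part of the released tooling):

```python
# Minimal sketch of our Avg@16 computation (hypothetical helper names:
# `generate_sample` and `is_correct` stand in for our actual sampling
# and answer-grading code).
from statistics import mean

def avg_at_k(prompts, generate_sample, is_correct, k: int = 16) -> float:
    """Accuracy averaged over k independent samples per prompt."""
    per_prompt_acc = []
    for prompt in prompts:
        # Draw k independent completions for this prompt.
        samples = [generate_sample(prompt) for _ in range(k)]
        # Fraction of the k samples graded as correct.
        per_prompt_acc.append(mean(is_correct(prompt, s) for s in samples))
    # Average the per-prompt accuracies over the benchmark.
    return mean(per_prompt_acc)
```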
Additionally, we’re curious about the s1.1 results. Table 5 of the s1 paper reports s1.1 scores of 56.7 (AIME24) and 50.0 (AIME25) without budget forcing. However, Table 1 of the Multiverse paper reports s1.1 at 52.9 (AIME24) and 41.7 (AIME25).
Would you mind suggesting what we should look into? We believe these differences might stem from the evaluation protocol. We also tested models such as DeepSeek-R1-Distill-Qwen-32B and were able to reproduce their reported results on AIME24, so our evaluation pipeline does not appear to be biased on AIME24, at least for those models. Would you be able to share an early preview of your evaluation protocol with us, or perhaps provide pointers on the prompt set to use for evaluation?