
Questions on Multiverse Evaluation Protocols #4

@TonyLianLong


Hi Multiverse Team,

Thank you for the impressive results! We've been doing some analysis on the reported results and have a few questions.

Specifically, we’re trying to reproduce the results presented in the Multiverse paper, both with the released models and with a model we reproduced ourselves using the released training code.

[Image: our reproduced results compared against the paper’s reported numbers]

We’ve noticed a few differences we’re hoping to better understand:

  1. For the Autoregressive-32B, we could mostly reproduce the performance on AIME25 and MATH500. Interestingly, our reproduced AIME24 results are higher than reported, and this holds for both the released and reproduced models.
  2. The Multiverse-32B model underperforms relative to the paper’s reported numbers on our end, even though our reproduced version gets closer on AIME25 and MATH500. The largest gap remains on AIME24.

For context, our reproduced AIME24/25 scores are computed as Avg@16, i.e., accuracy averaged over 16 independent samples per prompt. We use the released Multiverse SGLang Engine to evaluate the Multiverse models.

Additionally, we’re curious about the s1.1 results, which differ between the two papers:

| Source | AIME24 | AIME25 |
| --- | --- | --- |
| s1 paper, Table 5 (s1.1 without budget forcing; screenshot below) | 56.7 | 50.0 |
| Multiverse paper, Table 1 | 52.9 | 41.7 |

[Image: Table 5 from the s1 paper]

Would you mind suggesting what we should look into? We believe these differences might stem from the evaluation protocol. We also tested models such as DeepSeek-R1-Distill-Qwen-32B and were able to reproduce their reported AIME24 results, so our evaluation pipeline does not appear to be biased on AIME24 for other models. Would you be able to share an early preview of your evaluation protocol, or provide pointers on the prompt set to use for evaluation?
