Looking at the code below,
it looks like the evaluation can be run on either the "math_test" or the "math_500_test" dataset.
split: Literal["math_test", "math_500_test"] = "math_test",
What dataset was used to obtain the scores listed in Benchmark Results in README.md?
simple-evals/math_eval.py
Line 32 in ee3b031