Which dataset was used for the MATH eval?

Judging from the code linked below, the MATH eval can be run on either the "math_test" or the "math_500_test" dataset.
https://github.com/openai/simple-evals/blob/ee3b0318d8d1d9d72755a4120879be65f7c07e9e/math_eval.py#L32

Which of these two datasets was used to obtain the MATH scores listed under Benchmark Results in README.md?