Can AI models build a world which today's generative models can only dream of?
BuilderBench is a benchmark designed to facilitate research on open-ended exploration, embodied reasoning and reinforcement learning (RL). Features include:
- A parallelizable and hardware-accelerated simulator built using MuJoCo and JAX. Training a PPO policy to pick and place a block takes less than 5 minutes on one GPU and twelve CPU threads.
- A task-suite of 42 tasks ($\times$ 4 variations each), where each task requires qualitatively different reasoning capabilities.
- Single-file implementations of two self-supervised RL and four RL algorithms in JAX.
For more details, check out the project website and research paper.
We have tested the installation on Ubuntu 22.04 and Ubuntu 24.04 using Python 3.10.
Clone the repository and enter the main folder.
The main dependencies of the BuilderBench environments are mujoco, jax, and optax. To install the BuilderBench environments:

```bash
pip install -e .
```

To use the reference implementations or develop new algorithms:

```bash
pip install -e ".[all]"
```

The environment consists of a robot hand that can navigate in 3D space and interact with a set of cube-shaped blocks. A task corresponds to a physically stable target structure built using cubes. Tasks are specified by the positions of the blocks in the target structure. A central insight of BuilderBench is that, despite this seemingly simple setup, tasks can be arbitrarily complex and long-horizon, and can require multiple steps of high-level reasoning. The BuilderBench task-suite consists of over 40 such carefully curated tasks. Check out the project website for visualizations and the list of tasks. All tasks are defined in the create_task_data.py file.
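For intuition, a task specification can be pictured as an array of target cube positions. The sketch below is purely illustrative: the cube size and array layout are assumptions, not the actual format used by create_task_data.py.

```python
import numpy as np

CUBE_SIZE = 0.05  # illustrative edge length in meters; the real value lives in the XMLs

# One row per cube: the (x, y, z) center of that cube in the target structure.
# A two-cube tower: one cube on the table, one stacked directly on top.
two_cube_tower = np.array([
    [0.0, 0.0, CUBE_SIZE / 2],      # bottom cube resting on the table
    [0.0, 0.0, 3 * CUBE_SIZE / 2],  # top cube, one edge length higher
])
```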
The step function of the environment is parallelized using multi-threaded pooling, implemented by MuJoCo's rollout functionality in C++. The rest of the environment code is written in JAX in a JIT-friendly manner. The rollout function is invoked as a JAX callback, so the environments can be compiled end to end in JAX and enjoy the benefits of jit and vmap. For instance, a PPO policy can be trained in less than 5 minutes to successfully pick and place a cube.
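Here is a minimal, self-contained sketch of this callback pattern (not BuilderBench's actual code; `native_rollout_step` is a dummy stand-in for the real MuJoCo rollout call). The key point is that the host callback is opaque to JAX: only its output shape and dtype are declared, so the surrounding code can still be jit-compiled.

```python
import jax
import jax.numpy as jnp
import numpy as np

def native_rollout_step(state, action):
    """Stand-in for MuJoCo's C/C++ rollout; executes on the host, outside JAX tracing."""
    return np.asarray(state) + 0.01 * np.asarray(action)  # dummy dynamics

@jax.jit
def env_step(state, action):
    # pure_callback lets jit-compiled JAX code call into host-side code;
    # we only declare the callback's output shape/dtype ahead of time.
    next_state = jax.pure_callback(
        native_rollout_step,
        jax.ShapeDtypeStruct(state.shape, state.dtype),
        state,
        action,
    )
    reward = -jnp.linalg.norm(next_state)  # dummy reward, computed in pure JAX
    return next_state, reward

state, action = jnp.zeros(7), jnp.ones(7)
next_state, reward = env_step(state, action)
```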
- Rollout uses MuJoCo's native simulation code written in C/C++. This circumvents issues faced by MuJoCo MJX when running scenes with many contacts, as is the case when building with a large number of blocks.
- MuJoCo Warp allows scaling MuJoCo GPU simulation to much larger scenes. The main advantage of using MuJoCo Warp would be to run the environment completely on the GPU and simplify the entire training loop. We have combined BuilderBench with MuJoCo Warp in the warp branch. Currently, accurately simulating scenes and training with the Warp backend is 5 times slower than rollout, for two main reasons. First, Warp is still in a beta release and some features have not been implemented (for example, the no-slip solver). Second, we have not yet been able to tune the XML parameters so that training is both fast and accurate. Reach out if you want to collaborate to make this happen. The Warp backend will become the default once it is equally fast and we are able to manually solve all tasks in the BuilderBench task-suite using it.
To evaluate open-ended exploration, embodied reasoning, and generalization, we design the self-supervised protocol. As shown in the figure, agents in this protocol must explore the environment in a self-supervised manner and learn policies that can solve unseen tasks at test time. We also provide a supervised single-task protocol for debugging, meant to give researchers additional feedback. In this protocol, agents are trained and tested on the same task.
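Schematically, the protocol boils down to two phases. The function names below (`explore`, `update`, `evaluate`) are hypothetical placeholders, not BuilderBench APIs:

```python
def explore(policy):
    """Collect experience without task rewards or any test-task information."""
    return {"obs": [], "actions": []}  # stub trajectory

def update(policy, trajectory):
    """Self-supervised update, e.g. a goal-conditioned RL loss."""
    return policy

def evaluate(policy, task):
    """Roll out the frozen policy on a held-out task; return success."""
    return False  # stub

policy = {}
for _ in range(100):  # phase 1: self-supervised training, no test tasks visible
    policy = update(policy, explore(policy))

test_tasks = ["two-cube-tower"]  # hypothetical task name
results = {task: evaluate(policy, task) for task in test_tasks}  # phase 2
```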
Use the following command to run the MEGA algorithm in an environment with two cubes. The policy will be evaluated at fixed intervals on all tasks in the task-suite that correspond to two cubes.
```bash
cd impls
python play_ppo_mega.py --env_id=cube-2-play
```

Use the following command to run the PPO algorithm on the first task in an environment with one cube. The policy will be evaluated at fixed intervals on the same task.
```bash
cd impls
python ppo.py --env_id=cube-1-task1
```

By default, training runs store checkpoints at regular intervals in an impls/checkpoint/ folder. To visualize how these checkpoints perform, we provide code in impls/video.py. This script iterates over all the training runs present in the given folder (impls/checkpoint/ by default) and records and saves a video for every checkpoint of each training run. The code uses PPO's checkpoints as an example, but other algorithms can be visualized similarly.
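Conceptually, that amounts to walking the checkpoint tree and rendering one video per checkpoint. A hedged sketch of the loop, with stub `load_policy` / `rollout_frames` / `save_video` helpers standing in for the actual implementation:

```python
import os

def load_policy(path):             # hypothetical: restore a policy checkpoint
    return path

def rollout_frames(policy):        # hypothetical: run the policy, return frames
    return []

def save_video(frames, out_path):  # hypothetical: encode frames to a video file
    print(f"would write {out_path}")

checkpoint_root = os.path.join("impls", "checkpoint")
for run_name in sorted(os.listdir(checkpoint_root)):       # one dir per training run
    run_dir = os.path.join(checkpoint_root, run_name)
    for ckpt_name in sorted(os.listdir(run_dir)):          # one video per checkpoint
        policy = load_policy(os.path.join(run_dir, ckpt_name))
        save_video(rollout_frames(policy), f"{run_name}_{ckpt_name}.mp4")
```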
The core structure of the codebase is as follows:
- `builderbench/`
  - `assets/`: assets for defining MuJoCo models
  - `tasks/`: meta-data for all tasks
  - `xmls/`: XML files for defining MuJoCo models
  - `constants.py`: predefined constants used by the environment
  - `create_task_data.py`: task definition and task data creation
  - `build_block.py`: supervised single-task protocol environment definition
  - `build_block_play.py`: self-supervised multi-task protocol environment definition
  - `env_utils.py`: environment utilities
- `impls/`
  - `utils/`
    - `buffer.py`: replay buffer for off-policy algorithms
    - `evaluation.py`: supervised single-task protocol evaluation
    - `evaluation_play.py`: self-supervised multi-task protocol evaluation
    - `networks.py`: network definitions
    - `running_statistics.py`: normalization functions
    - `wrapper.py`: environment wrappers
  - `crl.py`: contrastive reinforcement learning (https://arxiv.org/abs/2206.07568)
  - `play_ppo_goalkde.py`: maximum entropy goal achievement (https://arxiv.org/pdf/2007.02832)
  - `play_ppo_sfl.py`: sampling for learnability (https://arxiv.org/abs/2408.15099)
  - `ppo.py`: proximal policy optimization (https://arxiv.org/abs/1707.06347)
  - `ppo_rnd.py`: random network distillation (https://arxiv.org/abs/1810.12894)
  - `sac.py`: soft actor-critic (https://arxiv.org/abs/1801.01290)
  - `video.py`: code to record videos of policy checkpoints
- MuJoCo Playground for environment structuring.
- MuJoCo for the multithreaded rollout functionality.
- MuJoCo Menagerie for the robot hand model.
- Brax for the reference proximal policy optimization (PPO) implementation.
- JaxGCRL for the reference contrastive RL implementation.
@misc{ghugare2025builderbench,
title={BuilderBench -- A benchmark for generalist agents},
author={Raj Ghugare and Catherine Ji and Kathryn Wantlin and Jin Schofield and Benjamin Eysenbach},
year={2025},
eprint={2510.06288},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2510.06288},
}
