Over the years, we at MongoDB have developed tooling within our correctness testing infrastructure
to make it easier to debug crashes (by collecting core dumps), hangs (by collecting thread stacks
and lock requests), and data corruption (by collecting data files). However, we have yet to develop
a comparable strategy for debugging race conditions. We still depend on an engineer to run the
failed test many times with additional logging, or to reason carefully about where in the code to
add a sleep. Technologies such as rr may help us form a better story for investigating race-related
issues without requiring an engineer to manually reproduce the failure.
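The "run the failed test many times" workflow described above can be sketched as a small shell helper. This is only an illustration: `repeat_until_failure` is a hypothetical function, `./path/to/failing_test` is a placeholder, and the rr usage in the comments assumes that `rr record` exits with the status of the program it recorded.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not part of this repository): run a command up to
# MAX_RUNS times and stop at the first failing run.
# Returns 0 if a failure was observed and 1 if every run passed.
repeat_until_failure() {
    local max_runs=$1
    shift
    local attempt
    for ((attempt = 1; attempt <= max_runs; attempt++)); do
        if ! "$@"; then
            echo "failed on attempt ${attempt}"
            return 0
        fi
    done
    echo "no failure in ${max_runs} attempts"
    return 1
}

# Manual approach: rerun the test itself (placeholder path).
#   repeat_until_failure 100 ./path/to/failing_test
#
# With rr: record every run so that the failing one can be replayed.
# --chaos asks rr to randomize scheduling, which may make races more likely.
#   repeat_until_failure 100 rr record --chaos ./path/to/failing_test
#   rr replay    # replays the most recent recording under gdb
```

The difference between the two approaches is what remains after the loop stops: the manual run leaves only logs, while the recorded run can be replayed deterministically as many times as needed.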
git clone https://github.com/visemet/mongodb-rr-experiment.git
cd mongodb-rr-experiment
The following instructions were adapted from https://github.com/mozilla/rr/wiki/Building-And-Installing.
sudo apt update
sudo apt install \
capnproto \
ccache \
clang \
cmake \
coreutils \
g++-multilib \
gdb \
git \
libcapnp-dev \
make \
manpages-dev \
ninja-build \
pkg-config \
python-pexpect \
python3-pexpect
git clone https://github.com/mozilla/rr.git
cd rr
git checkout 5.2.0
CC=clang CXX=clang++ cmake -B build/ -G Ninja -Ddisable32bit=ON .
cmake --build build/
sudo cmake --build build/ --target install
sudo sysctl kernel.perf_event_paranoid=1
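rr refuses to record when `kernel.perf_event_paranoid` is greater than 1, so it can be useful to confirm the setting took effect. The following is a minimal sketch; both function names are hypothetical, and the check assumes a Linux host that exposes `/proc/sys/kernel/perf_event_paranoid`.

```shell
#!/usr/bin/env bash

# Hypothetical helper: succeed when the given perf_event_paranoid level
# is low enough for rr, which needs a value of 1 or lower.
perf_paranoid_ok() {
    local level=$1
    [ "$level" -le 1 ]
}

# Hypothetical helper: read the current level from /proc and report on it.
check_rr_perf_setting() {
    local level
    level=$(cat /proc/sys/kernel/perf_event_paranoid)
    if perf_paranoid_ok "$level"; then
        echo "perf_event_paranoid=${level}: ok for rr"
    else
        echo "perf_event_paranoid=${level}: too restrictive for rr" >&2
        return 1
    fi
}

# Usage:
#   check_rr_perf_setting
```

Note that a value set with `sudo sysctl` does not survive a reboot. Persisting it, for example through a file under `/etc/sysctl.d/`, is left to the reader's distribution conventions.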
The following instructions were adapted from https://github.com/mongodb/mongo/wiki/Build-Mongodb-From-Source.
sudo apt install libcurl4-openssl-dev python-pip
git clone https://github.com/mongodb/mongo.git
cd mongo
git remote add visemet https://github.com/visemet/mongo.git
git fetch visemet mongodb-rr-experiment
git checkout visemet/mongodb-rr-experiment
python2 -m pip install -r etc/pip/dev-requirements.txt
python2 -m pip install --user psutil==5.4.8
You may notice when comparing the columns in the tables below that (1) there weren't any cases where
a failure could only be reproduced using rr, and (2) there were multiple cases where a failure could
only be reproduced manually. This shouldn't be interpreted as saying rr is ineffective. It is still
very likely that rr would save an engineer both time and effort when investigating a build failure.
The results simply demonstrate that it isn't possible to solely rely on rr as the answer to
investigating all race-related issues.

Build failure | Able to reproduce using rr? | Able to reproduce manually? |
---|---|---|
BF-9810 | | |
BF-9958 | ✓ | ✓ |
BF-10742 | ✓ | ✓ |
BF-10932 | ✓ | ✓ |

Build failure | Able to reproduce using rr? | Able to reproduce manually? |
---|---|---|
BF-6346 | | ✓ |
BF-8424 | ✓ | ✓ |
BF-9030 | | |

Build failure | Able to reproduce using rr? | Able to reproduce manually? |
---|---|---|
BF-7114 | | ✓ |
BF-7588 | ✓ | ✓ |
BF-7888 | | ✓ |
BF-8258 | | |
BF-8642 | ✓ | ✓ |
BF-9248 | | ✓ |
BF-9426 | | |
BF-9552 | ✓ | ✓ |
BF-9864 | | |
BF-10729 | ✓ | ✓ |
BF-11054 | ✓ | ✓ |