Start by using rr to record your application:
$ rr record /your/application --args
...
FAIL: oh no!
The entire execution, including the failure, was saved to disk. That recording can now be debugged.
$ rr replay
GNU gdb (GDB) ...
...
0x4cee2050 in _start () from /lib/ld-linux.so.2
(gdb)
Remember, you're debugging the recorded trace deterministically, not a live, nondeterministic execution. The replayed execution's address spaces, register contents, syscall data, etc. are exactly the same in every run.
Most of the common gdb commands can be used.
(gdb) break mozilla::dom::HTMLMediaElement::HTMLMediaElement
...
(gdb) continue
Continuing.
...
Breakpoint 1, mozilla::dom::HTMLMediaElement::HTMLMediaElement (this=0x61362f70, aNodeInfo=...)
...
If you need to restart the debugging session, for example because you missed breaking on some critical execution point, no problem. Just use gdb's run command to restart the replay.
(gdb) run
The program being debugged has been started already.
Start it from the beginning? (y or n) y
...
Breakpoint 1, mozilla::dom::HTMLMediaElement::HTMLMediaElement (this=0x61362f70, aNodeInfo=...)
...
(gdb)
The run command started another replay of your recording from the beginning. After the session restarted, the same execution was replayed again, and all your debugging state was preserved across the restart.
Note that the this pointer of the dynamically-allocated object was the same in both replay sessions. Memory allocations are exactly the same in each replay, meaning you can hard-code addresses you want to watch.
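Because addresses are stable across replays, a watchpoint on a hard-coded address can be kept in a small gdb command file and reused in every session. The following is a minimal sketch: the helper name write_watch_script, the output path, and the choice to watch one long-sized word are assumptions made for illustration, not part of rr.

```shell
#!/bin/sh
# Sketch: write a gdb command file that watches a fixed address.
# write_watch_script is a hypothetical helper, not an rr or gdb feature.
# Assumption: watching a single `long` at the address is what you want;
# adjust the cast to match the field you actually care about.
write_watch_script() {
    addr=$1   # e.g. the `this` value seen in an earlier replay
    out=$2    # path of the gdb command file to create
    cat > "$out" <<EOF
# Stop whenever the word at this fixed address is written.
watch *(long *) $addr
continue
EOF
}

# Example (not run here):
#   write_watch_script 0x61362f70 /tmp/watch.gdb
# then, at the (gdb) prompt of a replay session:
#   source /tmp/watch.gdb
```

The same file stays valid for later replays of the same recording, since the object lands at the same address each time.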
This video shows a quick demo of rr recording and replaying Firefox.
This video demonstrates rr's basic capabilities in a bit more detail.
cd /tmp
wget https://mozilla.github.io/rr/releases/rr-3.2.0-Linux-$(uname -m).rpm
sudo rpm -i rr-3.2.0-Linux-$(uname -m).rpm

cd /tmp
wget https://mozilla.github.io/rr/releases/rr-3.2.0-Linux-$(uname -m).deb
sudo dpkg -i rr-3.2.0-Linux-$(uname -m).deb
Everyone who's worked on a nontrivial application (like Firefox) has gone through the pain of debugging an intermittently-reproducible bug. Since nontrivial applications are nondeterministic, each execution is different, and you may require 5, 10, or even 100 runs just to see the bug manifest.
It's hard to debug these bugs with traditional techniques because single stepping, setting breakpoints, inspecting program state, etc., is all a waste of time if the program execution you're debugging ends up not even exhibiting the bug. Even when you can reproduce the bug consistently, important information such as the addresses of suspect objects is unpredictable from run to run. Given that software developers spend a lot of time finding and fixing bugs, nondeterminism has a major impact on their work.
And there are intermittent bugs that are so hard to reproduce that they're literally not worth the time to fix with traditional techniques. However, for big projects like Firefox with its half-billion users, a bug that only reproduces in 1 out of 10,000 test runs can still have a negative impact on users.
rr solves these problems by splitting debugging into two phases: first recording, in which the application's execution history is saved; then deterministic debugging of the saved trace, using gdb to control replay of the trace as many times as you want.
The saved execution history captures all nondeterminism in the program's execution. By replaying that trace in the right way, rr guarantees each debugging session is entirely deterministic. The memory layout is always the same, the addresses of objects don't change, register values are identical, syscalls return the same data, etc.
The benefit to developers is obvious: an intermittent bug can be recorded by a script over lunchtime, say, and then debugged at leisure in the afternoon. Multiple cores can be used in parallel to record failures. If you accidentally set a breakpoint in the wrong place and miss gathering critical information, your precious intermittent failure isn't lost. Just fix your breakpoint and then tell gdb to run the recording from the beginning again. Even for easily reproducible bugs, a repeatable, deterministic debugging session is a powerful tool on top of traditional debugging.
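The "record by a script over lunchtime" idea can be sketched as a loop that re-runs the program under the recorder until one run fails. This is only an illustration under stated assumptions: the function name record_until_failure and the RECORDER variable are hypothetical, and the loop assumes that rr record exits with the recorded program's exit status.

```shell
#!/bin/sh
# Sketch: re-run a flaky command under a recorder until one run fails.
# record_until_failure and RECORDER are hypothetical names for this example.
# Assumption: the recorder exits with the recorded program's exit status.
record_until_failure() {
    max_runs=$1
    shift
    # Defaults to "rr record"; set RECORDER to override (or to "" to run bare).
    recorder="${RECORDER-rr record}"
    run=1
    while [ "$run" -le "$max_runs" ]; do
        # $recorder is left unquoted on purpose so "rr record" splits in two.
        if ! $recorder "$@"; then
            echo "run $run failed; recording kept for replay"
            return 0
        fi
        run=$((run + 1))
    done
    echo "no failure in $max_runs runs"
    return 1
}

# Example (not run here):
#   record_until_failure 100 /your/application --args
```

Once the loop stops on a failure, that failing run is the most recent recording, so it can be debugged with rr replay as shown earlier.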
And for projects like Firefox which run literally millions of tests a day on a vast build and test infrastructure, intermittent failures in those test runs can be recorded on the infrastructure itself and then deterministically debugged at some later time, offline.
Tools like fuzzers and randomized fault injectors become even more powerful when used with rr. Those tools are very good at triggering some intermittent failure, but it's often hard to reproduce that same failure again to debug it. With rr, the randomized execution can simply be recorded. If the execution failed, then the saved recording can be used to deterministically debug the problem.
So rr lowers the cost of fixing intermittent bugs. This allows a new class of bugs to be fixed with the same amount of engineering time and money, which in turn produces higher-quality software for the same cost.
Deterministic debugging is an old idea; many systems have preceded rr. What makes rr different, in our opinion, are the design goals:
The overhead of rr depends on your application's workload. On Firefox test suites, rr's recording performance is quite usable: in the best cases we see slowdowns of 1.2x or less. A 1.2x slowdown means that if the suite takes 10 minutes to run by itself, it will take around 12 minutes to be recorded by rr. However, different test suites have different performance characteristics, so they have different overheads as well.
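The slowdown arithmetic above is just the baseline time multiplied by the slowdown factor. As a small sketch (estimate_recorded_minutes is a hypothetical helper, and the factor you pass in has to come from your own measurements):

```shell
#!/bin/sh
# Sketch: estimated recording time = baseline time * slowdown factor.
# estimate_recorded_minutes is a hypothetical helper for illustration only.
estimate_recorded_minutes() {
    # $1 = baseline minutes, $2 = slowdown factor (e.g. 1.2)
    LC_ALL=C awk -v base="$1" -v slowdown="$2" \
        'BEGIN { printf "%.1f\n", base * slowdown }'
}

# Example: estimate_recorded_minutes 10 1.2   prints 12.0
```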
Some of rr's limitations are inherent, and some will be removed in future releases.
This presentation provides an overview of the rr implementation and is meant for potential rr developers. There are some bonus slides intended to introduce rr to record/replay researchers.
The rr wiki contains pages that cover technical topics related to rr.
More information about rr will be posted in the future.