How rr works

rr records nondeterministic executions and debugs them deterministically.

Practical tool; version 1.2 is latest release. Used to debug Firefox.

Why?

Deterministic debugging: record nondeterministic failure once, debug deterministically forever.

Record intermittent test failures "at scale" online, debug the recordings offline at leisure.

Omniscient debugging: issue queries over program state changes; go backwards in time.

Overview

rr record prog --args
saves recording

rr replay
debugger socket drives replay of most recent recording

Most of an application's execution is deterministic.

rr records the nondeterministic parts.

Examples of nondeterministic inputs

Then during replay, emulate system calls and rdtsc by writing the saved nondeterministic data back to the tracee.

Shared-memory multitasking is a nondeterministic "input".

... but modern hardware can't record it efficiently. So rr doesn't record truly parallel executions.

Scheduling tasks

Can switch tasks at syscalls. Must preempt straight-line code too; and replay the preemptions deterministically.

Hardware performance counters (HPCs)

Recent chips count instructions-retired, branches-retired, ..., and can be programmed to interrupt after a count of x.

Simulate task preemption with HPC interrupts.

Idea: program insns-retired counter to interrupt after k . That k approximates a time slice.

Replaying preemption

Record the insn-retired counter value v to the trace file. During replay, program the interrupt for v. Voilà.

UNIX signals are recorded and replayed like task preemptions.

Record counter value v and signum. Replay by interrupting after v and "delivering" signum.

System requirements

Basic requirements

rr touches low-level details of machine architecture, by necessity; f.e. kernel syscall ABI.

Supporting more ISAs is "just work". x86-64 coming.

ARM chips don't have the performance counters that rr requires.

So no ARM support is possible at the moment.

Precise HPC events identify points in execution.

Precise replay of signals and preemption requires interrupting tracees at these events.

Performance counters are messier in reality

seccomp-bpf enables rr to selectively trace syscalls.

Only trap to rr for syscalls that can't be handled in the tracee. Over 100x faster in µbenchmarks.

Buffer syscalls; flush buffer as "super event"

Recorder implementation

Tasks are controlled through the ptrace API.

HPCs are controlled through the perf event API.

The first traced task is forked from rr. After that, clone() and fork()from tracees add new tasks.

And tasks die at exit().

Simplified recorder loop

    while live_task():
        task t = schedule()
        if not status_changed(t):
            resume_execution(t)
        state_change(t)
    

Scheduling a task

    task schedule():
        for each task t, round-robin:
            if is_runnable(t)
               or status_changed(t):
                return t
        tid = waitpid(ANY_CHILD_TASK)
        return task_map[tid]
  

Tasks changing status

    bool status_changed(task t):
        # Non-blocking
        return waitpid(t.tid, WNOHANG)

    # Deceptively simple: includes
    # syscalls, signals, ptrace
    # events ...
  

Resuming task execution

Invariant: At most one task is running userspace code. All other tasks are either idle or awaiting completion of a syscall.

Multiple running tasks suffer from shared-memory hazards.

rr doesn't attempt to record these hazards, so can't replay them deterministically.

Resuming a task, simplified

    void resume_execution(task t):
        ptrace(PTRACE_SYSCALL, t.tid)
        waitpid(t.tid)  # Blocking

    # Again, deceptively simple: traps
    # for syscalls, signals, ptrace
    # events ...
  

Most recorder work is done for state_change(task t).

But before looking at it, a few digressions ...

Generating time-slice interrupts

Trapping tracees at rdtsc

Tracees generate ptrace events by executing fork, clone, exit, and some other syscalls.

ptrace events exist for linux reasons that aren't interesting.

(rr tracees can share memory mappings with other processes.

Not possible to record efficiently in SW; needs kernel and/or HW support. Unsupported until then.)

Tracee events recorded by state_change()

Some syscalls must be executed atomically; can't switch task until syscall finishes.

Ex: mmap modifies address space, can race other syscalls.

On the other hand, some syscalls require switching; syscall can't finish until the task switches.

Ex: waitpid() only returns after child runs, changes state.

Problem: kernel writes non-atomic syscall outparams in an order that rr can't record.

Kernel outparam writes race rr tracees in userspace, syscalls.

Solution: allocate scratch space for the outparams of non-atomic syscalls. At syscall exit, write scratch data back to outparams.

→ rr orders outparam writes

POSIX signals can arrive at practically any point in execution and invoke signal handler code.

→ tracee code can be (almost) arbitrarily re-entered

Linux exits tracees out of syscalls with an ERESTART* error code before delivering signals. Syscall not always restarted after signal.

Sighandler nesting gets complex.

When a signal becomes pending

Sighandlers exit using the SYS_sigreturn syscall. rr uses these calls to help determine whether interrupted syscalls are restarted.

Tracees exit in unpredictable order at fatal signals like SIGABRT. Naïve waitpid() calls deadlock.

exit_group: same problem.

"Unstable" tracee exit

rr solves this by detaching and not waiting on affected tracees.

Breaks rr scheduling invariant.

Syscall buffer

ptrace traps are expensive. Better to do as much work in tracee process as possible.

Use seccomp-bpf to selectively trap syscalls.

Syscall hooks are LD_PRELOAD'd into tracees.

Hook functions record kernel return value and outparam data to the syscall buffer.

rr monkeypatches __kernel_vsyscall() in vdso to jump to rr trampoline.

Trampoline calls dispatcher, which calls rr hook if available.

Untraced syscalls are recorded to syscallbuf by tracee. Traced events recorded by the rr process "flush" the tracee's syscallbuf.

Lib falls back on traced syscalls.

Simplified example of syscallbuf hook function

static int sys_close(int fd)
{
	long ret;
	if (!start_buffer_syscall(SYS_close))
		/* Fall back on traced syscall.
                 * This generates a ptrace trap. */
		return syscall(SYS_close, fd);

	/* Untraced syscall. Does not generate
         * ptrace trap.*/
	ret = untraced_syscall1(SYS_close, fd);
        /* Save the result to syscall buffer. */
	return commit_syscall(SYS_close, ret);
}
  

How untraced syscalls are made

seccomp-bpf traps generate PTRACE_EVENT_SECCOMP in tracer process.

rr can then PTRACE_SYSCALL the tracee into traced syscall.

Problem: buffered syscalls don't trap to rr, by design. But may-block syscalls (f.e. waitpid()) require rr to schedule another task.

perf events to the rescue: "descheduled" event

Set event to fire on tracee context switch. Event traps to rr.

Buffered syscall blocks → context switch → rr trap

Generating desched events

Saved traces

Traces are saved to ~/.rr by default.

Stored on disk uncompressed. Trace compression is planned.

Trace directory contents

Replayer implementation

Emulate most syscalls using trace data.

Actually execute a small number.

Built around PTRACE_SYSEMU

SYSCALL runs tracee to syscall, executes it.

SYSEMU runs to syscall, doesn't execute it. rr replays side effects.

Replaying time-slice interrupts, in theory

Program instructions counter to interrupt after the recorded t number of instructions.

Tracee stops at t.

Replaying time-slice interrupts, in practice

Finding execution target, in practice

To reiterate, this is not sound.

Deterministic signals were raised by program execution. For example, *NULL = 42;

Replayed "naturally" in the course of execution.

Async signals were raised externally at some execution point during recording.

Replay to that execution point just as for time-slice interrupts.

Replay signal delivery by emulating

If there was a sighandler, restore recorded sigframe and registers at sighandler entry.

Otherwise, nothing else to do.

Replaying buffered syscalls

Read saved buffer from trace.

Replay each syscall as normal, but restore outparam data from records in the read buffer.

Debugger interface

Common commands supported.

c, s, si, b, bt, watch, info regs, thr, info thr ...

(gdb) call foo() can cause replay divergence.

So you're not allowed to do it … for now. Support coming.

Small stub translates from and to gdb remote protocol.

Then passes debugger requests up to rr replayer.

Replayer fulfills requests using ptrace() or cached data.

And resumes tracee execution when asked.

Breakpoints, int $3, stepi, watchpoints all raise SIGTRAP.

$ip, breakpoint table, gdb request, and $DR6 decode trap.

Future work

Checkpointing

Make a "deep fork" of tracee tree during replay.

Run code (or whatever) in copied tree, return to original.

Omniscient debugging (aka chroniclerr)

Use chronicle-style instrumentation to generate execution DB.

Query state changes in DB.

CHESS-style execution search; targeted recording

At each scheduling decision, make a checkpoint.

If execution reaches bad state, done. Else, resume checkpoint.

Other projects

Thanks from the rr team!

Appendix: rr for RnR people

Release 1.2 available today at

rr-project.org

Use cases

Design concerns

rr recorder overview

Trade-off: scheduling from userspace

Headache: kernel writes racing with userspace

rr replayer overview

Replayer headache: slack in counter interrupts

Recorder "fast mode": syscall buffering

Headache: many syscalls made internally in glibc

Headache: buffering syscalls that may block

Fun debugging tricks