3  The Profiling Agent

3.1 From Trigger to Stack Trace

The heart of Bistouri is a feedback loop between the kernel and userspace. We don’t want to profile everything all the time—that’s too expensive. Instead, we use a two-stage mechanism: identification and sampling.

The first stage is reactive. When a process calls exec() (handled by match_comm_on_exec), the kernel-side BPF program checks the new process’s command name against a set of rules stored in an LPM (Longest Prefix Match) trie. If the process matches our criteria—for example, a binary name we’ve been told to watch—the BPF program emits an event to the trigger_events ring buffer. This signals the userspace agent to “mark” the PID for monitoring by adding it to a fast-path allow-list hash map (pid_filter_map).
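The userspace half of this stage can be sketched as follows. This is a minimal model, not Bistouri’s actual code: TriggerEvent is a hypothetical event layout, and a std HashSet stands in for the real pid_filter_map BPF map.

```rust
use std::collections::HashSet;

/// Hypothetical shape of an event emitted by match_comm_on_exec.
struct TriggerEvent {
    pid: u32,
    comm: [u8; 16], // command name, fixed at 16 bytes in the kernel
}

/// Userspace side of stage one: on each trigger event, mark the PID in
/// the allow-list consulted by the sampling probe. Returns true if the
/// PID was newly marked.
fn handle_trigger(ev: &TriggerEvent, pid_filter: &mut HashSet<u32>) -> bool {
    pid_filter.insert(ev.pid)
}

fn main() {
    let mut filter = HashSet::new();
    let ev = TriggerEvent { pid: 4242, comm: *b"python3\0\0\0\0\0\0\0\0\0" };
    assert!(handle_trigger(&ev, &mut filter)); // newly marked
    assert!(filter.contains(&4242));
}
```

In the real agent the insert would be a BPF map update rather than a HashSet insert, but the control flow is the same: one event in, one mark written.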

The second stage is the actual sampling. We attach a BPF program to a software perf event (CPU_CLOCK) on every CPU. At a fixed frequency—by default 19Hz, a prime number chosen to avoid synchronization with periodic application tasks—the kernel interrupts the CPU and runs our handler. This handler checks the pid_filter_map. If the current PID isn’t there, we exit immediately with minimal overhead. If it is, we perform the “heavy” work: walking the stack for both kernel and user space.
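The per-sample decision can be sketched like this. Again a std HashSet models pid_filter_map, and the fabricated stack stands in for what bpf_get_stack would return; the real handler is BPF C, so this only illustrates the shape of the fast path.

```rust
use std::collections::HashSet;

/// Stand-in for the BPF-side pid_filter_map (in the real probe this is
/// a BPF hash map queried with bpf_map_lookup_elem).
type PidFilter = HashSet<u32>;

/// Sketch of the handle_perf decision: a single lookup on the fast
/// path, and the expensive stack walk only for marked PIDs.
fn on_perf_sample(pid: u32, filter: &PidFilter) -> Option<Vec<u64>> {
    if !filter.contains(&pid) {
        return None; // fast path: one hash lookup, then exit
    }
    // Slow path: in the kernel this would be bpf_get_stack for both
    // kernel and user space; here we just fabricate a stack.
    Some(vec![0xffff_ffff_8100_0000, 0x0000_5555_dead_beef])
}

fn main() {
    let mut filter = PidFilter::new();
    filter.insert(1234);
    assert!(on_perf_sample(1234, &filter).is_some()); // marked: walk stack
    assert!(on_perf_sample(9999, &filter).is_none()); // unmarked: bail out
}
```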

This design ensures that the cost of profiling is proportional to the number of processes we are actually interested in, while the cost for everything else is just a single hash map lookup.

3.2 BPF Map Lifecycle

Managing BPF maps requires a careful dance between the lifetime of the BPF object and the Rust handles that manipulate its maps. In Bistouri, we use libbpf-rs to manage the skeleton, but we’ve made a specific design choice regarding map access: we use a custom MapHandle.

Note (Design Decision): Decoupled Map Ownership

A MapHandle duplicates the underlying file descriptor of a BPF map. This is critical for our architecture because it allows the TriggerAgent (which determines which processes to profile) and the CaptureOrchestrator (which manages active PID filtering) to manipulate maps independently of the ProfilerAgent’s main skeleton.

By duplicating the FD, we avoid complex borrow-checking issues where multiple components would otherwise need to hold a reference to the entire BPF skeleton. This allows us to treat the maps themselves as the “source of truth” and the primary communication bus. The pid_filter_map is a persistent control plane, while the ring buffers serve as the data plane.
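The principle is easy to demonstrate with plain files, since it is the same underlying mechanism (dup(2) on a file descriptor): a duplicated descriptor is an independent Rust value referring to the same kernel object. This is only an analogy, not the BPF map code itself.

```rust
use std::fs::File;
use std::io::{Read, Seek, SeekFrom, Write};

/// Write through one handle, read through a dup'ed one: both refer to
/// the same kernel object, just as two MapHandles refer to one BPF map.
fn demo() -> std::io::Result<String> {
    let path = std::env::temp_dir().join("bistouri-fd-demo");
    let mut writer = File::options()
        .create(true)
        .read(true)
        .write(true)
        .truncate(true)
        .open(path)?;
    let mut reader = writer.try_clone()?; // dup(2) under the hood

    writer.write_all(b"pid:1234")?;
    // Duplicated descriptors share one file offset, so rewind first.
    reader.seek(SeekFrom::Start(0))?;

    let mut buf = String::new();
    reader.read_to_string(&mut buf)?;
    Ok(buf)
}

fn main() {
    assert_eq!(demo().unwrap(), "pid:1234");
}
```

Because each handle owns its own descriptor, neither component needs to borrow the skeleton, and each can be dropped independently without invalidating the other.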

3.3 Ring Buffer vs Perf Buffer

Bistouri uses the modern BPF_MAP_TYPE_RINGBUF rather than the older BPF_MAP_TYPE_PERF_EVENT_ARRAY. This was a conscious decision based on two performance factors: memory efficiency and data ordering.

Perf buffers allocate a separate memory region per CPU. If you have 128 CPUs, you have 128 buffers. This often leads to wasted memory if one CPU is idle while another is dropping samples. The modern Ring Buffer is a single, shared memory region for all CPUs. It supports high-performance reservation-based allocation, allowing us to write stack traces directly into the buffer memory (bpf_ringbuf_reserve) without extra copies.
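A toy model of the reserve/submit protocol shows why the shared buffer preserves ordering: records are claimed in order and filled in place, and the consumer can only read up to the first record that has not been committed yet. This is a simplification of what bpf_ringbuf_reserve/bpf_ringbuf_submit do in the kernel, with a Vec instead of a fixed mmap'ed region.

```rust
struct Record {
    data: Vec<u8>,
    committed: bool,
}

struct Ring {
    records: Vec<Record>,
    read_pos: usize,
}

impl Ring {
    fn new() -> Self {
        Ring { records: Vec::new(), read_pos: 0 }
    }

    /// bpf_ringbuf_reserve analogue: claim a slot, fill it in later.
    fn reserve(&mut self, size: usize) -> usize {
        self.records.push(Record { data: vec![0; size], committed: false });
        self.records.len() - 1
    }

    /// bpf_ringbuf_submit analogue: publish the record to the consumer.
    fn submit(&mut self, slot: usize, payload: &[u8]) {
        let r = &mut self.records[slot];
        r.data[..payload.len()].copy_from_slice(payload);
        r.committed = true;
    }

    /// Consumer: drain records in reservation order, stopping at the
    /// first slot still being written by a producer.
    fn drain(&mut self) -> Vec<Vec<u8>> {
        let mut out = Vec::new();
        while self.read_pos < self.records.len() && self.records[self.read_pos].committed {
            out.push(self.records[self.read_pos].data.clone());
            self.read_pos += 1;
        }
        out
    }
}

fn main() {
    let mut ring = Ring::new();
    let a = ring.reserve(3); // CPU 0 reserves first
    let b = ring.reserve(3); // CPU 1 reserves next
    ring.submit(b, b"two"); // CPU 1 finishes first...
    assert!(ring.drain().is_empty()); // ...but cannot be read past slot `a`
    ring.submit(a, b"one");
    assert_eq!(ring.drain(), vec![b"one".to_vec(), b"two".to_vec()]);
}
```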

Furthermore, because all CPUs share one buffer, the ring buffer preserves the temporal order of events across CPUs; with per-CPU perf buffers, cross-CPU ordering has to be reconstructed in userspace. Unlike older profilers that rely on a dedicated blocking thread to poll BPF maps, Bistouri uses a custom AsyncRingBuffer. This wrapper registers the ring buffer’s epoll file descriptor with the Tokio reactor. When the kernel pushes data, Tokio wakes the profiling task, which then drains the buffer. This integrates profiling naturally into the async event loop without sacrificing throughput or starving other tasks.

3.4 LPM Trie for Cgroup Matching

While the current implementation uses the comm_lpm_trie primarily for command-name matching, the choice of an LPM trie (Longest Prefix Match) is strategic. In eBPF, we often need to match hierarchical or string-based data. Command names (comm) are fixed at 16 bytes, but we want the ability to match prefixes (e.g., matching every process whose comm begins with python).

The LPM trie is the standard tool for this in the kernel. It allows us to store keys with a prefix length, and the BPF helper bpf_map_lookup_elem will find the most specific match. This same structure will naturally extend to cgroup paths as we evolve, where we might want to profile everything under a specific slice or sub-container.
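The matching semantics can be sketched in plain Rust. This is not the kernel’s trie implementation—BPF_MAP_TYPE_LPM_TRIE stores keys far more efficiently—but it shows the contract: every rule carries a prefix length, and lookup returns the most specific matching rule. Byte-aligned prefixes are assumed for simplicity.

```rust
/// A rule as it would be stored in the comm_lpm_trie: a prefix length
/// in bits, the key bytes, and an attached value.
struct Rule {
    prefix_bits: usize,
    key: Vec<u8>,
    value: &'static str,
}

/// Longest-prefix lookup: among all rules whose prefix matches `comm`,
/// return the one with the longest prefix, as the kernel trie does.
fn lpm_lookup<'a>(rules: &'a [Rule], comm: &[u8]) -> Option<&'a str> {
    rules
        .iter()
        .filter(|r| {
            let n = r.prefix_bits / 8; // byte-aligned prefixes only
            comm.len() >= n && comm[..n] == r.key[..n]
        })
        .max_by_key(|r| r.prefix_bits) // most specific match wins
        .map(|r| r.value)
}

fn main() {
    let rules = vec![
        Rule { prefix_bits: 48, key: b"python".to_vec(), value: "python-any" },
        Rule { prefix_bits: 56, key: b"python3".to_vec(), value: "python3" },
    ];
    assert_eq!(lpm_lookup(&rules, b"python3.12"), Some("python3"));
    assert_eq!(lpm_lookup(&rules, b"python2"), Some("python-any"));
    assert_eq!(lpm_lookup(&rules, b"ruby"), None);
}
```

The same shape carries over to cgroup paths: a rule for a parent slice is simply a shorter prefix that loses to any more specific child rule.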

3.5 Batch Updates & Performance

To keep the probe overhead low, we avoid performing complex logic inside the BPF program attached to the perf event. The handle_perf program is the most frequently executed code path in our system.

We’ve optimized this by moving the filtering logic to the pid_filter_map. When the userspace agent decides to “monitor” a PID, it performs a single map update. The BPF side then performs a single map lookup. We’ve considered batching these updates for high-churn environments, but given our “trigger-based” philosophy, the rate of change in the filter map is typically low enough that individual updates are sufficient and provide lower latency between a trigger event and the start of profiling.

3.6 Error Recovery

One of the most overlooked aspects of eBPF design is “telemetry for the profiler itself.” If a stack walk fails in the kernel, or if we run out of space in a ring buffer, how does the operator know?

We implemented a dedicated errors ring buffer. Inside the BPF programs, we use a tagged union (ErrorEvent) to report different failure modes:

1. ERR_STACK_FETCH: bpf_get_stack returned a negative code (e.g., because the process exited while the stack was being walked).
2. ERR_RESERVE_STACK_RINGBUF: the high-volume stack event ring buffer was full and we dropped a sample.
3. ERR_RESERVE_TRIGGER_RINGBUF: the trigger event ring buffer was full.

In userspace, these are caught, decoded, and logged with actionable advice (e.g., “increase ring buffer size”). This allows us to distinguish between “there is no activity” and “the profiler is failing to capture data.” The use of a C union on the BPF side, mapped to a Rust enum in the agent, gives us a type-safe way to handle these heterogeneous error events without the overhead of multiple maps.
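The decode step can be sketched as tag dispatch over a raw record. The wire format here is hypothetical (a one-byte tag followed by a little-endian i64 payload); the real ErrorEvent is a C struct with a union and its own layout, but the idea is the same: one tag, one enum variant per failure mode.

```rust
/// Userspace mirror of the BPF-side tagged union (layout assumed).
#[derive(Debug, PartialEq)]
enum ErrorEvent {
    StackFetch { errno: i64 }, // ERR_STACK_FETCH with the negative code
    ReserveStackRingbuf,       // ERR_RESERVE_STACK_RINGBUF
    ReserveTriggerRingbuf,     // ERR_RESERVE_TRIGGER_RINGBUF
}

/// Decode one raw record from the errors ring buffer.
fn decode(raw: &[u8]) -> Option<ErrorEvent> {
    let (&tag, rest) = raw.split_first()?;
    match tag {
        0 => {
            let bytes: [u8; 8] = rest.get(..8)?.try_into().ok()?;
            Some(ErrorEvent::StackFetch { errno: i64::from_le_bytes(bytes) })
        }
        1 => Some(ErrorEvent::ReserveStackRingbuf),
        2 => Some(ErrorEvent::ReserveTriggerRingbuf),
        _ => None, // unknown tag: kernel and agent disagree on versions
    }
}

fn main() {
    let mut raw = vec![0u8];
    raw.extend_from_slice(&(-14i64).to_le_bytes()); // e.g. -EFAULT
    assert_eq!(decode(&raw), Some(ErrorEvent::StackFetch { errno: -14 }));
    assert_eq!(decode(&[1]), Some(ErrorEvent::ReserveStackRingbuf));
    assert_eq!(decode(&[9]), None);
}
```

Returning None on an unknown tag, rather than panicking, keeps the agent resilient when it runs against a newer or older BPF object.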

3.7 Agent Lifecycle

The lifecycle of the agent is managed through a strict sequence of loading, attaching, and polling. We use Rust’s ownership model to ensure that BPF links (which keep the programs attached) are not dropped prematurely.

stateDiagram-v2
    [*] --> Unloaded : "build()"
    Unloaded --> Loaded : "load()"
    Loaded --> Attached : "attach_perf_events()"
    Attached --> Polling : "start_polling()"
    Polling --> Attached : "cancel.cancel()"
    Attached --> [*] : "Drop"

    note right of Loaded : "Maps created, PROG_LOAD called"
    note right of Attached : "Progs linked to CPU_CLOCK"
    note right of Polling : "Tokio task running AsyncRingBuffer"

The LoadedProfilerAgent struct acts as a guard for the active profiling session. When the struct is dropped, the BPF links and the ring buffer are cleaned up in a specific sequence. This prevents use-after-free-like scenarios where the kernel might try to push data into a ring buffer that userspace has already unmapped.
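Rust drops struct fields in declaration order, so the teardown sequence can be encoded simply by how the guard’s fields are ordered: links first (programs detach, so nothing new is pushed), ring buffer second (now safe to unmap). The sketch below demonstrates that drop-order guarantee with tracing stand-ins; the field names are illustrative, not LoadedProfilerAgent’s actual fields.

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// Records its label into a shared log when dropped.
struct Tracer(&'static str, Rc<RefCell<Vec<&'static str>>>);
impl Drop for Tracer {
    fn drop(&mut self) {
        self.1.borrow_mut().push(self.0);
    }
}

/// Field order encodes teardown order: Rust drops fields top to bottom.
struct Session {
    links: Tracer,    // dropped first: programs detach
    ring_buf: Tracer, // dropped second: buffer can now be unmapped
}

fn teardown_order() -> Vec<&'static str> {
    let log = Rc::new(RefCell::new(Vec::new()));
    let session = Session {
        links: Tracer("links", Rc::clone(&log)),
        ring_buf: Tracer("ring_buf", Rc::clone(&log)),
    };
    drop(session);
    Rc::try_unwrap(log).unwrap().into_inner()
}

fn main() {
    assert_eq!(teardown_order(), vec!["links", "ring_buf"]);
}
```

Leaning on field order (rather than a hand-written Drop impl) keeps the invariant next to the data it protects, where a refactor is most likely to notice it.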


Auto-generated from commit 02320e5 by Gemini 3.1 Pro. Last updated: 2026-05-10