Observability at Nanosecond Resolution

Most observability tools add more latency than the bugs they help you find. We built our own.

By Anokuro Engineering

We have a rule at Anokuro: if your observability tooling adds more latency than the bugs it helps you find, you have made your system worse. This sounds obvious. It is not how most companies operate.

We measured the overhead of every major commercial APM tool on our ad-serving hot path. The results were disqualifying.

The Problem With Commercial APM

We tested Datadog, New Relic, Dynatrace, Honeycomb, and Grafana Cloud's agent-based instrumentation on our bid-serving infrastructure. The methodology was straightforward: run our standard benchmark suite (200k req/s sustained, measuring p999 latency) with and without each agent installed.

The results:

  • Datadog APM agent: 4.7% latency increase at p99, 11.2% at p999
  • New Relic: 3.1% at p99, 8.8% at p999
  • Dynatrace OneAgent: 6.4% at p99, 14.9% at p999
  • Honeycomb (OpenTelemetry SDK): 2.3% at p99, 5.7% at p999
  • Grafana Alloy: 2.8% at p99, 7.1% at p999

Our total latency budget is 10ms. Our p999 target is 8ms. A 14.9% overhead at p999 means Dynatrace alone would cost us roughly 1.2ms — more than half of AnokuroDB's entire read latency budget. Even Honeycomb's 2.3% overhead works out to roughly 180 microseconds against the 8ms p999 target, on every tail request. We serve 200,000 requests per second. That overhead is not free. It is revenue.

The root cause is the same across all of them: runtime instrumentation. These tools inject function wrappers, intercept system calls, and allocate memory on the hot path. They were designed for web applications where 200ms response times are acceptable. Our 10ms budget is 20x tighter.

Compile-Time Instrumentation in Zig

We built our own tracing library for Zig that uses comptime to eliminate all runtime cost when tracing is disabled.

The core idea is simple. In Zig, comptime blocks are evaluated at compile time and produce no runtime code unless the result is used. We define trace points as comptime-conditional function calls:

/// Record `value` at trace point `name`. Compiles to nothing when the
/// point is disabled in `trace_config`.
pub fn tracePoint(comptime name: []const u8, value: u64) void {
    // The `comptime` check is resolved during compilation: a disabled
    // point leaves no branch and no call in the generated code.
    if (comptime trace_config.isEnabled(name)) {
        writeToRingBuffer(name, value, rdtsc());
    }
}

When tracing is disabled for a given trace point, the entire function body is eliminated by the compiler. Not "optimized away by LLVM hopefully." Eliminated. The generated assembly contains zero instructions for disabled trace points. We verified this by disassembling every hot path function.

When tracing is enabled, the cost is one rdtsc instruction (7-25 cycles depending on microarchitecture) and one write to a thread-local ring buffer: no system calls, no allocations, no locks.

The trace configuration is a compile-time constant, which means you cannot enable or disable tracing at runtime without recompilation. This is a deliberate constraint. Runtime-configurable tracing requires a branch on the hot path. That branch pollutes the branch predictor's state for every request, not just traced ones. We measured the cost of a single unpredictable branch on our hot path at 3-7 nanoseconds per request. Over 200k req/s, that is 600 microseconds to 1.4 milliseconds of wasted CPU time per second — for a feature flag check.

We ship two builds: one with tracing disabled (production default) and one with tracing enabled (deployed to a shadow fleet that receives mirrored traffic). The shadow fleet gives us full observability without adding a single nanosecond to production latency.

The Collection Pipeline

Trace data from enabled instances flows through a three-stage pipeline:

Stage 1: Thread-Local Ring Buffers. Each worker thread writes trace events to a pre-allocated, lock-free ring buffer. The buffer is sized at 64KB per thread (enough for approximately 2,000 trace events before wrapping). There is no synchronization between threads. There is no contention. If the buffer fills before the exporter drains it, old events are silently overwritten. We lose data rather than block the hot path. This is the correct tradeoff for ad serving.

Stage 2: Batched UDP Export. A dedicated exporter thread (pinned to its own core, running at lower priority) reads from all ring buffers and batches trace events into UDP datagrams. We use UDP, not TCP, because we refuse to let backpressure from the collection pipeline affect the serving path. If the aggregation service is slow or unreachable, datagrams are dropped. The exporter thread constructs 8KB datagrams (MTU-safe with jumbo frames on our internal network) containing 200-400 trace events each, serialized with our custom binary format at approximately 20 bytes per event.

The exporter runs at 100ms intervals. At 200k req/s with 6 trace points per request, that is roughly 120,000 events per flush, fitting in 300-400 datagrams. Total network overhead: about 2.4MB of trace payload per flush, roughly 24MB/s per serving instance. Negligible on 25GbE.

Stage 3: Gleam Aggregation Service. The trace datagrams arrive at our aggregation service, written in Gleam running on the BEAM. This is where the real computation happens: parsing the binary trace format, maintaining sliding-window histograms for each trace point, computing streaming percentiles (p50, p99, p999, p9999) using a t-digest data structure, and persisting aggregated metrics to AnokuroDB for historical analysis.

We chose Gleam for the aggregation service because it must stay alive. Trace data arrives in bursts. Network partitions happen. Individual serving instances restart. The BEAM's supervision trees and per-process garbage collection mean that a malformed datagram crashes one handler process, not the entire aggregator. The process restarts in microseconds and continues processing the next datagram. We have never had to restart the aggregation service manually. It has been running for four months.

The Gleam aggregator handles 12 million trace events per second across our fleet. CPU utilization sits at 23% on a 16-core machine. The BEAM's scheduler distributes work across cores without any manual thread management. We spawn one lightweight process per serving instance (currently 48 instances), and each process maintains its own t-digest state. No shared mutable state. No locks. No data races.

Custom Dashboards

Our dashboards are built with TanStack Start and Recharts. This was not a complex engineering decision — it is the same stack we use for everything else at Anokuro that needs a web interface.

The dashboard connects to the Gleam aggregator via WebSocket and receives pre-computed percentile updates at 1-second resolution. We render four primary charts per service:

  1. Latency heatmap: p50, p90, p99, p999, and p9999 as stacked bands over time. This immediately reveals whether a latency regression affects the median or only the tail.
  2. Trace point breakdown: per-function latency contribution, showing exactly which stage of the request pipeline is responsible for tail latency spikes.
  3. Throughput overlay: requests per second correlated with latency, making it trivial to distinguish load-induced latency from code-induced latency.
  4. Ring buffer saturation: percentage of events dropped per instance. If this exceeds 5%, we either need more trace points disabled or a faster exporter.

The dashboards are server-rendered with zero client-side JavaScript for the initial paint. Recharts hydrates for interactivity (zoom, hover tooltips, time range selection), but the critical information is visible before any JS loads. An engineer triaging a latency incident at 3am does not need to wait for a JavaScript bundle to see if the system is on fire.

Why We Open-Sourced the Zig Tracing Library

The Zig tracing library (zig-nanotrace) is open source as of February 2026. The aggregation service and dashboards are not.

The reasoning: the tracing library solves a universal problem. Every Zig project that cares about performance needs zero-overhead instrumentation, and the existing options (manually inserting std.time.Timer calls, using the built-in tracing allocator) are either high-overhead or limited in scope. By open-sourcing zig-nanotrace, we improve the Zig ecosystem, attract contributors who find and fix edge cases we have not encountered, and establish Anokuro as a credible Zig shop for recruiting purposes.

The aggregation pipeline is not open-sourced because it is tightly coupled to our specific operational requirements and would need significant generalization work to be useful outside our infrastructure. We would rather ship a focused, high-quality tracing library than a half-baked end-to-end observability platform.

The Result

Our production serving fleet runs with zero observability overhead. Our shadow fleet provides full nanosecond-resolution tracing at a cost of roughly 24MB/s of trace traffic per instance and one additional CPU core per machine for the exporter thread.

When we investigate a latency regression, we have per-function timing data at nanosecond resolution, correlated across every stage of the request pipeline, with one-second granularity, going back 90 days. We can answer questions like "did the p999 of our frequency capping range scan increase after the last AnokuroDB compaction strategy change?" in under 30 seconds.

The commercial APM tools we evaluated cannot answer that question at all. They do not have the resolution. And they would cost us 2-15% of our latency budget for the privilege of not being able to answer it.

We measure everything. We pay for nothing on the hot path. That is the only acceptable tradeoff.

Copyright © 2026 Anokuro Pvt. Ltd. Singapore. All rights reserved.