Tail Latency Is the Only Metric That Matters
Your p50 is a lie. Your p99 is optimistic. We optimize for p999 because that is where revenue dies.
Every monitoring dashboard in the industry defaults to showing you average response time. This metric is useless. It is worse than useless — it is actively misleading. An average response time of 3ms can hide a p999 of 120ms, and if you are serving programmatic advertising at scale, that 120ms tail is where your revenue goes to die.
We optimize for p999 at Anokuro. Not p50. Not p95. Not even p99. p999. This is not perfectionism. It is arithmetic.
The Math That Changes Everything
We serve 200,000 bid requests per second during sustained peak traffic. At this volume, tail latencies are not edge cases. They are steady state.
- p99 (1 in 100 requests is slower): 2,000 slow requests per second
- p999 (1 in 1,000 requests is slower): 200 slow requests per second
- p9999 (1 in 10,000 requests is slower): 20 slow requests per second
At p999, 200 requests per second exceed the latency threshold. Not per day. Not per hour. Per second. Over the course of a day, that is 17.28 million requests that experienced tail latency. Each of those requests is a bid for ad inventory. Each bid that arrives late is a bid that loses the auction.
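The arithmetic above is worth making concrete. A few lines of Go reproduce it; the 200,000 req/s figure comes from the text, and the helper name is ours:

```go
package main

import "fmt"

// slowRequests computes how many requests per second and per day fall
// into the tail beyond a given percentile, at a fixed request rate.
func slowRequests(rps int, percentile float64) (perSecond float64, perDay float64) {
	fraction := 1.0 - percentile // share of requests in the tail
	perSecond = float64(rps) * fraction
	perDay = perSecond * 86_400 // seconds per day
	return perSecond, perDay
}

func main() {
	perSec, perDay := slowRequests(200_000, 0.999)
	// At 200k req/s, the p999 tail is 200 requests every second,
	// or 17.28 million per day.
	fmt.Printf("p999 tail: %.0f slow requests/s, %.2f million/day\n", perSec, perDay/1e6)
}
```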
Here is the part most engineers do not internalize: in a real-time bidding auction, there is no second chance. If your bid response arrives after the exchange's deadline (typically 100ms, with many exchanges enforcing 80ms), your bid is discarded. You do not lose by being the lowest bidder. You lose by not bidding at all. The revenue impact is binary: you either bid in time, or you earn zero on that impression.
Our p999 directly determines what percentage of the available ad inventory we can effectively compete for. When our p999 was 47ms, our effective bid rate (bids submitted within the exchange deadline) was 99.82%. When we reduced p999 to 8ms, our effective bid rate became 99.97%. That 0.15 percentage point improvement, applied to our daily impression volume, represents a revenue increase that justifies the entire engineering team's annual cost.
Why Averages and Medians Lie
The arithmetic mean of a latency distribution is dominated by the bulk of fast requests and hides the tail. A service with p50 of 1ms and p999 of 200ms has a mean of approximately 1.2ms. The mean tells you the system is fast. The tail tells you 200 requests per second are 200x slower than the median.
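The ~1.2ms figure falls out of a weighted average over an idealized two-point distribution (a simplification of ours, for illustration only):

```go
package main

import "fmt"

// meanLatencyMs models an idealized two-point latency distribution:
// 99.9% of requests at 1ms, 0.1% at 200ms. The mean barely registers
// the tail, which is exactly why it misleads.
func meanLatencyMs() float64 {
	return 0.999*1.0 + 0.001*200.0
}

func main() {
	// Roughly 1.2ms, despite a tail that is 200x slower than the median.
	fmt.Printf("mean = %.3fms\n", meanLatencyMs())
}
```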
The median (p50) is better than the mean but still insufficient. Our p50 is 1.1ms. It has been 1.1ms for six months. It barely changes when we deploy new code, when traffic patterns shift, or when hardware degrades. The p50 is the system's happy path, and the happy path almost never breaks.
The tail is where everything breaks. The tail is where garbage collection pauses manifest. Where lock contention spikes. Where CPU cache thrashing occurs. Where the kernel schedules your thread off-core at the wrong moment. Where NVMe write latency spikes during internal garbage collection. Where TCP retransmits add 200ms because a single packet was dropped.
We do not look at p50 dashboards. We do not alert on p50. Our primary monitoring view shows p99, p999, and p9999, with the p999 trace in bold. Everything else is noise.
Techniques That Actually Move the Tail
Most "performance optimization" advice targets the median. Use a faster serializer. Add a cache. Optimize your SQL queries. These are fine for p50. They barely move p999. The tail is caused by systemic issues, not algorithmic ones. Here is what actually works:
Hedged Requests
When a request to AnokuroDB exceeds 1ms (our p90 threshold), we immediately fire a duplicate request to a different replica. We take whichever response arrives first. This adds load — roughly 10% more queries — but it eliminates a category of tail latency caused by transient issues on a single replica (compaction stalls, cache misses, kernel scheduling delays).
The implementation is straightforward in Gleam. Our ad orchestration service spawns each replica read as a separate BEAM process; both reads reply to a shared subject, and the caller takes whichever response arrives first. A sketch against gleam_erlang's process module (BEAM receive timeouts are milliseconds, so the 1ms hedge threshold is passed as `hedge_after_ms`; `db.read` and `ReadTimeout` are our internal API):

```gleam
import gleam/erlang/process

pub fn hedged_read(
  key: Key,
  hedge_after_ms: Int,
  budget_ms: Int,
) -> Result(Value, ReadError) {
  // Both replica reads reply to the same subject, so whichever
  // response lands first is the one we take.
  let reply = process.new_subject()
  process.spawn(fn() { process.send(reply, db.read(key, replica: Primary)) })
  case process.receive(reply, hedge_after_ms) {
    Ok(result) -> result
    // The primary exceeded the hedge threshold: fire a duplicate
    // read at a secondary replica and race both replies.
    Error(Nil) -> {
      process.spawn(fn() { process.send(reply, db.read(key, replica: Secondary)) })
      case process.receive(reply, budget_ms) {
        Ok(result) -> result
        Error(Nil) -> Error(ReadTimeout)
      }
    }
  }
}
```
Hedged requests reduced our p999 read latency from 3.2ms to 1.8ms. The 10% additional query load is a cost we pay happily.
Pre-Computed Fallbacks
For every bid request, we pre-compute a fallback response during idle CPU cycles. The fallback uses cached user segment data and a simplified auction algorithm that can execute in under 500 microseconds. If the primary bid computation exceeds our deadline, we return the fallback instead of timing out.
The fallback bid is suboptimal — it does not use real-time features like recency-weighted segment scores — but a suboptimal bid is infinitely better than no bid. In practice, the fallback path activates on approximately 0.08% of requests. The revenue from those fallback bids is pure upside that we would have lost entirely without this mechanism.
Aggressive Timeouts
Every external call in our serving path has a timeout, and every timeout is tighter than you would expect. AnokuroDB reads: 2ms. Segment enrichment service: 3ms. Total request budget: 10ms. If any stage exceeds its budget, we degrade gracefully rather than waiting.
Most systems set timeouts at the p99 or p999 of the downstream service. We set them at approximately p99.5 and rely on fallback paths for the rest. This means we intentionally abort a small number of requests that would have succeeded if we had waited. That is the correct tradeoff. Waiting for a slow response ties up a worker thread, increases queuing delay for subsequent requests, and causes cascading tail latency. Failing fast and falling back keeps the tail bounded.
Infrastructure Choices Driven by Tail Latency
Our infrastructure decisions are made through the lens of p999, not throughput or cost.
Thread-per-core, no sharing. Our Zig-based serving processes pin one worker thread to each physical CPU core. There is no thread pool. There is no work stealing. A request that arrives on core 7 is processed entirely on core 7. This eliminates cross-core cache line bouncing, which we measured at 40-80ns per occurrence. At 6 potential cache line bounces per request, that is up to 480ns of jitter eliminated.
Pre-allocate all memory at startup. Our serving processes allocate all memory during initialization: connection buffers, request parsing buffers, response serialization buffers, and the local segment cache. After startup, the hot path performs zero heap allocations. malloc is non-deterministic. It can take anywhere from 20ns to 50 microseconds depending on fragmentation, lock contention, and whether the kernel needs to map new pages. We refuse to put a non-deterministic operation on the hot path.
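One way to keep the hot path allocation-free is a fixed pool of buffers carved out once at startup. This Go sketch only illustrates the pattern (Go still has a GC, which is precisely why our serving layer is Zig):

```go
package main

import "fmt"

// BufferPool hands out fixed-size buffers that were all allocated
// during initialization. Acquire and Release are channel operations
// only: the hot path never touches the heap allocator.
type BufferPool struct {
	free chan []byte
}

func NewBufferPool(count, size int) *BufferPool {
	p := &BufferPool{free: make(chan []byte, count)}
	for i := 0; i < count; i++ {
		p.free <- make([]byte, size) // all allocation happens here, once
	}
	return p
}

// Acquire blocks when every buffer is in use, which doubles as
// backpressure under overload instead of unbounded allocation.
func (p *BufferPool) Acquire() []byte  { return <-p.free }
func (p *BufferPool) Release(b []byte) { p.free <- b[:cap(b)] }

func main() {
	pool := NewBufferPool(1024, 64*1024) // 1024 x 64KiB at startup
	buf := pool.Acquire()
	fmt.Println(len(buf)) // 65536
	pool.Release(buf)
}
```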
No garbage collection on the critical path. This is why the serving layer is Zig, not Gleam or TypeScript. The BEAM's per-process GC is excellent — but "excellent GC" is still GC. A 200-microsecond GC pause is invisible in most systems. In ours, it is 2% of the total latency budget. The Gleam orchestration layer handles coordination where latency budgets are larger (tens of milliseconds), and the Zig serving layer handles the microsecond-sensitive path.
Batched I/O with io_uring. Our serving instances use io_uring for all network I/O, replacing per-call system calls with shared submission and completion queues. (This is not full kernel bypass in the DPDK sense; data still flows through kernel socket buffers. The win is amortized syscall overhead.) A standard recv system call costs 1-3 microseconds including the context switch. An io_uring submission costs approximately 200 nanoseconds when batched. At 200k req/s, this saves 160-560 milliseconds of CPU time per second per core.
The Case Study: 47ms to 8ms
In September 2025, our p999 serving latency was 47ms. Today it is 8ms. This is how the improvements decomposed:
| Change | p999 Impact |
|--------|-------------|
| Thread-per-core pinning | 47ms → 38ms |
| Pre-allocated memory (zero malloc on hot path) | 38ms → 29ms |
| Hedged reads to AnokuroDB | 29ms → 21ms |
| io_uring batched network I/O | 21ms → 16ms |
| Pre-computed fallback responses | 16ms → 11ms |
| Aggressive timeout + fallback at every stage | 11ms → 8ms |
No single change was responsible. Tail latency is not caused by one bottleneck. It is caused by the accumulation of dozens of small sources of non-determinism. Eliminating each one shaves milliseconds. The total is transformative.
The revenue impact of moving from 47ms to 8ms p999 is the single largest performance-driven revenue improvement in Anokuro's history. We cannot share the exact number, but it funded the entire AnokuroDB project and the engineering headcount to support it.
The Discipline
Optimizing for p999 requires a different engineering discipline than optimizing for throughput. Throughput optimization asks "how do we handle more requests?" p999 optimization asks "why is this specific request slow?" The first question has architectural answers. The second has forensic answers. You must trace individual slow requests through every layer of the system and understand the precise sequence of events that caused each one.
This is harder. It is also more valuable. Your p50 serves the customers who were going to be happy anyway. Your p999 determines whether you can compete for the impressions that everyone else's systems are too slow to bid on.
Tail latency is not a metric. It is the metric. Everything else is a vanity number.