The Cost of Every Abstraction
We measured the nanosecond cost of every abstraction in our stack. Most of them were not worth it.
There is a lie that programmers tell themselves: "This abstraction is free." No abstraction is free. Some are cheap. Some are so expensive that they consume more CPU time than the business logic they are abstracting. We know this because we measured every one of them.
At Anokuro, we serve 200,000+ ad auctions per second. Our total latency budget is 10 milliseconds. When your budget is 10ms and you spend it 200,000 times per second, every nanosecond is a line item. We treat nanoseconds the way accountants treat dollars: every one must be justified.
The Methodology
We do not trust microbenchmarks. A tight loop running the same function 10 million times tells you what the CPU cache and branch predictor can do, not what your production code does. Our measurement methodology combines three approaches:
perf stat on production traffic. We run perf stat -e cycles,instructions,cache-misses,branch-misses on our production binaries under real traffic. This gives us the actual cycle counts, not synthetic ones. We record these metrics continuously and track them per-commit in our CI pipeline. Any commit that increases cycles-per-request by more than 2% is flagged for review.
Timing with Zig's std.time.Timer. In development builds, we instrument critical sections with std.time.Timer. The overhead of the timer itself is 18ns on our hardware (measured by timing an empty section 100 million times and computing the mean). We subtract this from all measurements.
Custom instrumentation via RDTSC. For sub-nanosecond measurements, we read the CPU timestamp counter directly. This gives us cycle-accurate measurements with roughly 0.3ns resolution on our 3.4GHz hardware. We use this for comparing individual instructions and small code sequences.
All numbers in this post are from production hardware: AMD EPYC 9454 processors, DDR5-4800 ECC memory, running Linux 6.8 with performance governor pinned to maximum frequency. Turbo boost is disabled because it introduces measurement variance.
Virtual Dispatch vs. Comptime Dispatch
This is the biggest win we found. Our storage engine has a Compactor interface with multiple implementations (time-window compaction, leveled compaction, size-tiered compaction). The original design used virtual dispatch via function pointers in a vtable, the standard approach in C and C++.
Virtual dispatch (function pointer through vtable):
- Mean latency: 3.2ns
- P99 latency: 8.1ns
- The P99 spike is caused by vtable cache misses. The vtable is 64 bytes (8 function pointers). When the L1 cache line containing the vtable is evicted (which happens under memory pressure from concurrent requests), the CPU stalls for a cache fetch.
Comptime dispatch (Zig comptime with inline for):
- Mean latency: 0.4ns
- P99 latency: 0.6ns
- No vtable. No indirection. The compiler generates a specialized function for each compaction strategy. The dispatch is a direct call or, more often, fully inlined.
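The two dispatch shapes can be sketched in C. C has no comptime, so the Zig version is approximated here by a direct call to a strategy known at compile time, which the compiler can inline; the `Compactor`-style interface, `pick_candidate`, and the size-tiered policy below are illustrative names, not our real API.

```c
#include <stddef.h>

/* Virtual dispatch: the classic C vtable, one indirect call per use. */
typedef struct {
    int (*pick_candidate)(const int *sizes, size_t n);
} CompactorVTable; /* hypothetical interface mirroring the post's Compactor */

/* One concrete strategy: pick the smallest SSTable candidate. */
static int size_tiered_pick(const int *sizes, size_t n) {
    size_t best = 0;
    for (size_t i = 1; i < n; i++)
        if (sizes[i] < sizes[best]) best = i;
    return (int)best;
}

static const CompactorVTable size_tiered = { .pick_candidate = size_tiered_pick };

/* The caller pays for the indirection: load the pointer, then call. */
int pick_virtual(const CompactorVTable *vt, const int *sizes, size_t n) {
    return vt->pick_candidate(sizes, n);
}

/* "Comptime" analog: the strategy is fixed at compile time, so the call
 * is direct and the compiler is free to inline it away entirely. */
static inline int pick_direct(const int *sizes, size_t n) {
    return size_tiered_pick(sizes, n);
}
```

In Zig, comptime generics go further: the compiler emits a specialized function per strategy, so even the dispatch branch disappears.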
The difference is 2.8ns per call on the mean. Our compactor is called 847 times per bid request on average (once per SSTable candidate). That is 2,371ns saved per request, or about 2.4 microseconds. At 200,000 requests per second, that is 474 milliseconds of CPU time saved per second. On a 128-core machine, that is 0.37% of total compute. It adds up.
We converted every virtual dispatch site in the hot path to comptime dispatch. Total savings: 11.2 microseconds per request, or 2.24 CPU-seconds per second across all requests. We reclaimed 1.75% of our total CPU budget from this single optimization category.
JSON Parsing: The Silent Killer
Our original protocol between the bid orchestration layer (Gleam) and the storage engine (Zig) used JSON over HTTP. This was the prototype. It should not have survived as long as it did.
We measured the cost of JSON serialization and deserialization for a typical bid-request context payload (user segments, frequency caps, campaign eligibility flags). The payload is 2.3KB as JSON.
JSON (using simdjson, one of the fastest JSON parsers available):
- Serialize: 1,840ns
- Deserialize: 920ns
- Total round-trip: 2,760ns
- Payload size: 2,300 bytes
Binary protocol (custom, schema-driven, generated at comptime in Zig):
- Serialize: 89ns
- Deserialize: 41ns
- Total round-trip: 130ns
- Payload size: 847 bytes
The binary protocol is 21x faster and its payloads are 2.7x smaller. simdjson is a marvel of SIMD engineering, and it is still 21x slower than not parsing JSON at all.
The payload size reduction matters as much as the CPU savings. At 200,000 requests/second, the JSON protocol consumed 460MB/s of internal network bandwidth. The binary protocol consumes 169MB/s. On a 25GbE network shared with replication traffic, that bandwidth difference is the difference between 14.7% and 5.4% utilization. Lower utilization means lower tail latency on the network, which means lower P99 for everything.
We designed the binary protocol to be zero-copy where possible. Fixed-size fields (integers, floats, booleans) are read directly from the buffer without copying. Variable-length fields (strings, byte arrays) are stored with a 4-byte length prefix and accessed as slices into the original buffer. The deserialization function returns a struct of pointers into the input buffer, not a struct of owned copies.
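The zero-copy decode shape can be sketched in C. The field names and layout below are illustrative, not our real schema, and the sketch assumes both sides agree on byte order (a common assumption for an internal protocol between machines of the same architecture):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* A decoded view into the wire buffer. Fixed-size fields are read out
 * directly; variable-length fields are borrowed slices into the buffer.
 * The view is only valid while the buffer stays alive. */
typedef struct {
    uint64_t user_id;        /* fixed-size field */
    uint32_t segment_count;  /* fixed-size field */
    const uint8_t *segments; /* borrowed: points into the input buffer */
    uint32_t segments_len;   /* from the 4-byte length prefix */
} BidContextView;

/* Zero-copy decode: no allocation, no copy of the variable-length
 * payload. Returns 0 on success, -1 on a malformed/short buffer. */
static int decode_bid_context(const uint8_t *buf, size_t len,
                              BidContextView *out) {
    if (len < 8 + 4 + 4) return -1;          /* bounds-check fixed header */
    memcpy(&out->user_id, buf, 8);           /* memcpy: unaligned-safe read */
    memcpy(&out->segment_count, buf + 8, 4);
    memcpy(&out->segments_len, buf + 12, 4); /* 4-byte length prefix */
    if (len < 16 + (size_t)out->segments_len) return -1;
    out->segments = buf + 16;                /* slice, not a copy */
    return 0;
}
```

The `memcpy` calls for the fixed fields compile down to plain loads; the variable-length field never moves at all, which is what makes the 41ns deserialize number plausible.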
HashMap Lookup: Open Addressing vs. Chaining
Our user-segment lookup table maps segment IDs (64-bit integers) to segment metadata. We tested two implementations:
std.HashMap (Zig's standard library, open addressing with Robin Hood hashing):
- Lookup (hit): 28ns
- Lookup (miss): 22ns
Custom array-backed hash map (open addressing, linear probing, power-of-two size):
- Lookup (hit): 11ns
- Lookup (miss): 8ns
The standard library hash map is general-purpose. It handles arbitrary load factors, supports arbitrary hash functions, and provides good performance for diverse workloads. Our workload is not diverse. Our keys are 64-bit integers with good entropy. Our load factor is fixed at 0.5 (we know the dataset size at startup). Linear probing with power-of-two sizing gives us cache-friendly sequential reads on probe chains, and the low load factor keeps chains short.
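A minimal sketch of that specialized map in C (our real implementation is in Zig; the capacity, hash mix, and the use of key 0 as the empty marker are simplifications for this example):

```c
#include <stdint.h>

/* Fixed-capacity open-addressing map: power-of-two size so index masking
 * replaces modulo, linear probing, and a load factor kept at or below 0.5
 * by sizing at startup. Key 0 is reserved as the empty marker. */
#define MAP_CAP 16

typedef struct {
    uint64_t keys[MAP_CAP];
    uint32_t vals[MAP_CAP];
} SegmentMap;

/* Cheap integer mix (a Murmur-style finalizer). Keys already have good
 * entropy per the post, so a heavyweight hash is unnecessary. */
static uint64_t mix(uint64_t k) {
    k ^= k >> 33;
    k *= 0xff51afd7ed558ccdULL;
    k ^= k >> 33;
    return k;
}

/* Insert or overwrite. Assumes the map is never full (load factor 0.5),
 * otherwise this probe loop would not terminate. */
static void map_put(SegmentMap *m, uint64_t key, uint32_t val) {
    for (uint64_t i = mix(key) & (MAP_CAP - 1);; i = (i + 1) & (MAP_CAP - 1)) {
        if (m->keys[i] == 0 || m->keys[i] == key) {
            m->keys[i] = key;
            m->vals[i] = val;
            return;
        }
    }
}

/* Probe sequentially from the hashed slot. Sequential probes stay on the
 * same or adjacent cache lines, which is the point of linear probing. */
static int map_get(const SegmentMap *m, uint64_t key, uint32_t *out) {
    for (uint64_t i = mix(key) & (MAP_CAP - 1);; i = (i + 1) & (MAP_CAP - 1)) {
        if (m->keys[i] == key) { *out = m->vals[i]; return 1; } /* hit */
        if (m->keys[i] == 0) return 0;                          /* miss */
    }
}
```

A miss terminates at the first empty slot it probes, which is why the low load factor makes misses (8ns) cheaper than hits.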
The 17ns saved per lookup matters because we do 12-30 segment lookups per bid request. At 20 lookups on average, that is 340ns per request, or 68 milliseconds of CPU time per second across all traffic.
React Reconciliation at Scale
Our ad-serving dashboard displays real-time bidding data. One view shows a table of 50,000 active campaigns with live-updating metrics (impressions, clicks, spend, CTR, CPC). Every 2 seconds, we receive a WebSocket update with delta changes for approximately 3,000 campaigns.
Naive approach (update state, let React reconcile):
- Reconciliation time: 340ms for 50,000 rows
- UI thread blocked for 340ms
- Visible jank, missed frames, scrolling stutters
Virtualized table with surgical updates (TanStack Virtual + mutable ref for data, React only renders visible rows):
- Reconciliation time: 2.1ms for ~40 visible rows
- Delta application to backing data: 0.8ms for 3,000 updates
- Total: 2.9ms, no visible jank
The insight is that React's reconciliation is O(n) in the number of elements in the virtual DOM tree. If you put 50,000 rows in the tree, React diffs 50,000 rows, even if only 3,000 changed. Virtualization reduces n from 50,000 to ~40 (the number of visible rows). The delta updates are applied directly to a mutable ref that React does not track, so they do not trigger reconciliation at all. Only the visible rows re-render, and only when the viewport changes or a visible row's data changes.
We considered moving to a non-React framework (Solid, Svelte) for this view. We decided against it because the virtualization approach solves the problem within React, and the rest of our 247k-line frontend codebase is React. Rewriting one view in a different framework creates a maintenance burden that exceeds the performance benefit.
The Philosophy
Every abstraction in our codebase must answer one question: what does this cost, and is it worth it?
An abstraction that costs 0.4ns and saves an engineer 20 minutes of debugging per month is worth it. An abstraction that costs 3.2ns and saves 5 minutes of typing is not. The calculation is not always this clean, but the principle is: measure the cost in nanoseconds, measure the benefit in engineering time, and do the division.
We track a metric we call "nanoseconds per abstraction" (NPA) for every layer in our stack. It is a table in our internal wiki with three columns: the abstraction, its measured cost, and the engineering justification for that cost. If an abstraction cannot justify its nanoseconds, it gets replaced with a lower-cost alternative or removed entirely.
This is not premature optimization. Premature optimization is guessing which code is slow and optimizing it without measurement. We measure first. We optimize only what the data says is expensive. The difference is that we measure everything, so we catch the costs that other teams never notice because they never looked.
The result: our ad-serving stack processes 200,000 requests per second on 40% fewer cores than our initial architecture. That is not 40% less code or 40% fewer features. It is 40% less hardware doing the same work, because we refused to accept the nanosecond costs that most teams never bother to question.
Abstractions are not free. Measure them. Make them justify their existence. Your P99 latency will thank you.