Why Every Ad Platform Wastes 80% of Its Compute

The ad-tech industry runs on JVM bloat, Python inference, and architectures designed for organizational convenience rather than performance.

By Anokuro Engineering · Ad Tech

In 2025, we spent three months auditing the compute profiles of three major ad platforms. Two were platforms we considered buying instead of building. One was a partner whose infrastructure we integrated with closely enough to measure.

The finding that changed our roadmap: more than 80% of CPU cycles on the critical bid-serving path were spent on overhead, not on the actual work of deciding which ad to show. Serialization. Garbage collection. Network hops between microservices. TLS handshakes between components that run on the same physical machine. Protobuf encoding and decoding the same data four times as it passes through four services.

This is not an exaggeration. This is measurement. And it explains why most ad platforms need 50-200 servers to handle traffic that should run on 10.

The Audit

We profiled the bid path of each platform using a combination of eBPF-based CPU profiling, network tap analysis for inter-service traffic measurement, and heap dump analysis for memory overhead. The platforms were cooperative (two were evaluating us as a technology partner). We ran the profiling under production-representative load: 100,000 bid requests per second with real-world bid request payloads from OpenRTB 2.6 traffic.

Platform A was a JVM stack (Kotlin on the JVM, Kafka, gRPC, Redis). Platform B was Go services with protobuf everywhere. Platform C was the classic ad-tech chimera: Java bid server, Python ML inference, Node.js creative selection, PostgreSQL for campaign data, Redis for caching, Kafka for event streaming.

The CPU breakdown on the bid-serving hot path, averaged across all three:

| Activity | % of CPU Time |
|---|---|
| Serialization/deserialization | 31% |
| Garbage collection | 18% |
| Network stack (inter-service) | 14% |
| TLS/encryption (internal) | 8% |
| Memory allocation overhead | 11% |
| Actual bid logic | 18% |

Eighteen percent of compute spent on the actual job. The rest is infrastructure tax.

Serialization: The Silent Killer

Every ad platform we examined uses protobuf or JSON on the bid path. Usually both: JSON from the SSP, protobuf between internal services, JSON back to the SSP. Each serialization boundary costs real CPU time.

An OpenRTB 2.6 bid request is typically 2-8KB of JSON. Parsing that JSON in a JVM-based service using Jackson takes 40-80 microseconds. That sounds small. It is not. At 200,000 requests per second, JSON deserialization alone consumes 8-16 CPU-seconds per second, the equivalent of 8-16 fully occupied cores. On a 64-core machine, that is 12-25% of the entire machine just for initial parsing.

But it gets worse. Platform C parses the same bid request JSON in the Java bid server, serializes it to protobuf, sends it to the Python ML service, which deserializes the protobuf, runs inference, serializes the response to protobuf, sends it back to Java, which deserializes it, then serializes a different protobuf to the creative selection service. The same fundamental data is serialized and deserialized six times before a bid response is produced.

We measured the cumulative serialization cost on Platform C: 2.1ms of CPU time per bid request. Their total bid latency was 6.8ms. Thirty-one percent of the time, the machine is converting data between wire formats instead of making decisions.

Our approach: the bid request is parsed from JSON once at the network edge. After that, it exists as a native Zig struct in shared memory. Every component that needs bid request data reads it directly from the struct. No serialization. No deserialization. No copies. The struct is laid out in memory to match the access pattern of our bid logic, with hot fields (publisher ID, geo, device type, floor price) in the first cache line.

Total serialization cost in our pipeline: one JSON parse at ingestion (12 microseconds in our SIMD-accelerated parser) and one JSON serialization for the bid response (8 microseconds). Twenty microseconds total, versus 2,100 microseconds on Platform C.

Garbage Collection: Paying Rent on Memory You Already Own

The JVM's garbage collector is a marvel of engineering. It is also catastrophically wrong for real-time bid serving.

We instrumented Platform A's JVM with GC logging at millisecond resolution during a 4-hour peak traffic window. The results:

  • Young GC pauses: 1.2ms average, occurring every 800ms. Individually harmless.
  • Mixed GC pauses: 8-15ms, occurring every 45 seconds. Each one blows the P99 latency budget.
  • Full GC pauses: 45-120ms, occurring 3 times per hour. Each one causes thousands of bid requests to time out.

During a single full GC pause of 87ms at 200k req/s, approximately 17,400 bid requests are either queued (adding latency) or dropped. At a $2 CPM, each dropped impression has an expected value of $0.002. 17,400 dropped impressions per GC pause, 3 pauses per hour, 24 hours: roughly $2,500/day in lost revenue from garbage collection alone.

Platform A's engineers knew this. They had spent months tuning GC parameters: heap size, GC algorithm (ZGC), concurrent marking thresholds. They got the full GC frequency down from 12/hour to 3/hour. But they cannot eliminate GC pauses because the JVM requires garbage collection. It is not optional.

Go is better but not exempt. Platform B's Go services showed P99 GC pauses of 0.5ms with the latest Go 1.23 runtime. Individually acceptable. But they ran 8 Go services on the bid path, and GC pauses are uncorrelated between services. The probability of at least one service experiencing a GC pause during any given bid request was 11% under load. Eleven percent of bid requests hit at least one GC pause somewhere in the pipeline.

We do not have a garbage collector. Zig gives us explicit memory management. Request-scoped memory uses arena allocators that are freed in a single operation when the bid response is sent. Long-lived state uses pool allocators with deterministic lifetimes. There are zero GC pauses. Not low. Zero. Our P99 latency is determined by actual work, not by the runtime deciding to pause and clean up.

The Microservice Tax

Platform C had 9 services on the critical bid path:

  1. Load balancer
  2. Request router
  3. Bid request validator
  4. User segment lookup
  5. ML feature extraction
  6. ML inference
  7. Campaign selection
  8. Creative optimization
  9. Bid response formatter

Each hop between services involves: a TCP connection (or connection pool management), TLS handshake (or session resumption), serialization of the request, network transit (even on the same machine, loopback adds 20-50 microseconds), deserialization of the request, processing, serialization of the response, network transit back, deserialization of the response.

We measured the per-hop overhead at 180-350 microseconds depending on payload size. Nine services means 8 inter-service hops. Overhead: 1.4-2.8ms per bid request. On a 6.8ms total latency, that is 20-41% spent on services talking to each other.

This architecture exists because of organizational structure, not technical requirements. Nine teams own nine services. Each team deploys independently. Each team chose their own language and framework. The architecture reflects the org chart, as Conway's Law predicts.

We run the entire bid pipeline in a single process. User segment lookup, feature extraction, inference, campaign selection, creative optimization, and response formatting all happen as function calls within one Zig binary sharing one address space. Inter-"service" communication is a function call: 2 nanoseconds, not 200 microseconds. A hundred thousand times faster.

The tradeoff is real: we cannot deploy the ML model independently of the bid server. We cannot scale the inference layer separately from the routing layer. We accept these tradeoffs because the performance benefit is overwhelming and our team is small enough that organizational boundaries do not dictate architecture.

What This Means in Practice

Our production bid server handles 200,000 requests per second on 4 servers. Each server is a bare-metal machine with 32 cores and 128GB RAM. Total compute: 128 cores.

Platform C handles similar traffic on 120 servers (a mix of instance types, but roughly equivalent to 480 cores for the bid path alone). They use 3.75x more compute to do the same job.

Our infrastructure cost for bid serving: approximately $14,000/month. An equivalent deployment using Platform C's architecture on the same cloud provider: approximately $58,000/month. The delta is $528,000/year.

This is not because we are smarter. It is because we made different architectural decisions: no garbage collector, no serialization between components, no microservice overhead, no interpreted languages on the hot path. Each decision individually saves 10-30% of compute. Combined, they compound multiplicatively.

The Industry Will Not Change

We do not expect the ad-tech industry to adopt this approach. The incentive structure prevents it. Most ad platforms are built by large organizations where team autonomy, independent deployability, and language choice freedom are valued more highly than raw efficiency. Microservices exist because they solve organizational problems, even though they create performance problems.

And for most platforms, the 80% waste is absorbed into infrastructure budgets that nobody scrutinizes because the margins are good enough. When you are making 30% margins on $500M revenue, spending an extra $5M on servers is a rounding error.

We operate differently because we have to. We are a 12-person company competing against platforms backed by billions in funding. We cannot afford to waste 80% of our compute. Every dollar of infrastructure cost comes directly from our runway. Efficiency is not a philosophy for us. It is survival.

Copyright © 2026 Anokuro Pvt. Ltd. Singapore. All rights reserved.