Zero-Allocation HTTP Parsing in Zig

Our HTTP parser allocates exactly zero bytes on the heap. Here is how and why that matters for ad serving.

By Anokuro Engineering · Infrastructure

Our ad serving infrastructure handles 200,000 HTTP requests per second per node. At that volume, every allocation in the request path is a tax. Memory allocators take locks. Locks cause contention. Contention causes tail latency. Tail latency costs money -- a bid that arrives 2 milliseconds late is a bid that loses.

So we wrote an HTTP parser that allocates zero bytes on the heap. Not "almost zero." Not "zero in the common case." Zero. Always. Here is how.

Why HTTP Parsing Is on the Critical Path

An ad bid request arrives as an HTTP POST with a JSON body. The naive processing pipeline looks like this:

  1. Read bytes from socket into kernel buffer
  2. Copy to user-space buffer (first copy)
  3. Parse HTTP headers, allocating strings for header names and values (first allocation)
  4. Copy body into a separate buffer (second copy)
  5. Parse JSON body, allocating strings for keys and values (second allocation)
  6. Process the bid
  7. Serialize response
  8. Write to socket

Steps 2 through 5 are pure overhead. They exist because general-purpose HTTP libraries assume you want owned, heap-allocated strings that outlive the parse phase. But in ad serving, we process the request and discard it within the same event loop iteration. Nothing outlives the request. Every allocation is wasted work.

The State Machine: Slices, Not Strings

Our parser is a hand-written state machine that operates on a borrowed byte slice and produces pointer-and-length pairs pointing back into the original buffer. It never copies. It never allocates.

const std = @import("std");

const HeaderField = struct {
    name: []const u8,  // points into input buffer
    value: []const u8, // points into input buffer
};

const ParseResult = struct {
    method: []const u8,
    path: []const u8,
    headers: [MAX_HEADERS]HeaderField,
    header_count: u8,
    body: []const u8,
};

pub fn parse(input: []const u8) ParseError!ParseResult {
    var result: ParseResult = undefined;
    var pos: usize = 0;

    // Method
    const method_end = std.mem.indexOfScalar(u8, input[pos..], ' ') orelse return error.InvalidMethod;
    result.method = input[pos..][0..method_end];
    pos += method_end + 1;

    // ... continues for path, version, headers, body
}

The ParseResult struct lives entirely on the stack -- just over 2 KB with the layout above. It contains up to 64 header fields, each of which is two slices (pointer + length). None of these slices own their data. They are views into the original input buffer. When the function returns, the caller has a complete parse result with zero heap allocations.

The maximum header count is 64, hardcoded. If a request has more than 64 headers, we return error.TooManyHeaders. This is a deliberate constraint. HTTP requests with more than 64 headers are either misconfigured or malicious. We do not pay a runtime cost to handle them gracefully.
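The elided header loop follows the same borrow-don't-copy pattern. Here is a rough C sketch of the idea -- the function and type names are ours for illustration, not the production code. Each header becomes a (pointer, length) view into the input, and the loop bails out with the TooManyHeaders analogue as soon as the 64-header cap is hit:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_HEADERS 64

typedef struct { const char *ptr; size_t len; } slice_t;  /* borrowed view */
typedef struct { slice_t name; slice_t value; } header_t;

/* Parse "Name: value\r\n" lines until a blank line. Returns the header
 * count, -1 on malformed input, -2 when the 64-header cap is exceeded.
 * Every slice points back into buf; nothing is copied or allocated. */
static int parse_headers(const char *buf, size_t len, header_t *out) {
    size_t pos = 0;
    int count = 0;
    while (pos < len) {
        const char *line = buf + pos;
        const char *crlf = memchr(line, '\r', len - pos);
        if (!crlf || (size_t)(crlf - buf) + 1 >= len || crlf[1] != '\n')
            return -1;                       /* malformed line ending */
        size_t line_len = (size_t)(crlf - line);
        if (line_len == 0) return count;     /* blank line: end of headers */
        if (count == MAX_HEADERS) return -2; /* error.TooManyHeaders analogue */
        const char *colon = memchr(line, ':', line_len);
        if (!colon) return -1;               /* malformed header */
        out[count].name = (slice_t){ line, (size_t)(colon - line) };
        const char *v = colon + 1;
        while (v < line + line_len && *v == ' ') v++; /* skip optional SP */
        out[count].value = (slice_t){ v, (size_t)(line + line_len - v) };
        count++;
        pos += line_len + 2;                 /* step past CRLF */
    }
    return -1;                               /* no terminating blank line */
}
```

The fixed `header_t` array is the C analogue of the `[MAX_HEADERS]HeaderField` field: the caller provides the storage, so the parser itself never touches the heap.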

Edge Cases Without Allocations

The hard part of zero-allocation parsing is not the happy path. It is the edge cases.

Chunked transfer encoding. Chunked encoding is the reason most HTTP parsers allocate -- you do not know the body size upfront, so you need a growable buffer. We solve this differently. Our parser does not reassemble chunks. It returns a ChunkedBody struct that contains an iterator over chunk boundaries within the original buffer:

const ChunkedBody = struct {
    input: []const u8,
    positions: [MAX_CHUNKS]ChunkPosition,
    chunk_count: u16,

    pub fn iterate(self: *const ChunkedBody) ChunkIterator {
        return .{ .body = self, .index = 0 };
    }
};

The downstream JSON parser accepts a ChunkIterator and reads across chunk boundaries transparently. No reassembly. No allocation. The chunks stay where they are in the input buffer, and consumers iterate over them.
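The iterator itself is simple. A C sketch of its shape, assuming ChunkPosition is an (offset, length) pair into the original buffer -- the field names and the MAX_CHUNKS value are guesses for illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_CHUNKS 128 /* assumed cap, analogous to MAX_HEADERS */

typedef struct { size_t offset; size_t len; } chunk_pos_t; /* guessed layout */
typedef struct {
    const char *input;                  /* the original request buffer */
    chunk_pos_t positions[MAX_CHUNKS];  /* chunk boundaries found at parse time */
    unsigned chunk_count;
} chunked_body_t;

typedef struct { const chunked_body_t *body; unsigned index; } chunk_iter_t;
typedef struct { const char *ptr; size_t len; } slice_t;

/* Yield the next chunk as a view into the original buffer; len 0 = done.
 * No reassembly: the bytes stay exactly where the socket read put them. */
static slice_t chunk_next(chunk_iter_t *it) {
    if (it->index >= it->body->chunk_count) return (slice_t){ NULL, 0 };
    chunk_pos_t p = it->body->positions[it->index++];
    return (slice_t){ it->body->input + p.offset, p.len };
}
```

A consumer that needs a contiguous view can still copy the chunks out itself, but nothing in the parser forces that copy on every request.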

HTTP pipelining. When multiple requests arrive in a single TCP read, the parser returns the first complete request and the byte offset where the next request begins. The caller re-invokes the parser on input[offset..]. No buffering of partial requests between calls -- the I/O layer handles that with a ring buffer that we will describe below.
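The caller side of that contract can be sketched in C, assuming a simplified parse function that reports how many bytes it consumed (zero meaning the request is incomplete) -- the real API returns a full ParseResult alongside the offset:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the real parser: consume one "request"
 * terminated by a blank line and report its length; 0 = incomplete. */
static size_t parse_one(const char *buf, size_t len) {
    for (size_t i = 0; i + 3 < len; i++)
        if (memcmp(buf + i, "\r\n\r\n", 4) == 0) return i + 4;
    return 0; /* partial request: the I/O layer's ring buffer keeps it */
}

/* Handle every complete request in a single TCP read, returning the
 * offset where the leftover partial request (if any) begins. */
static size_t drain(const char *buf, size_t len, int *served) {
    size_t off = 0;
    while (off < len) {
        size_t used = parse_one(buf + off, len - off);
        if (used == 0) break;  /* incomplete: stop, wait for more bytes */
        (*served)++;           /* process the bid here in real code */
        off += used;           /* re-invoke on input[off..] */
    }
    return off;
}
```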

Malformed requests. We do not allocate error context. Parse errors are a Zig error union (error.InvalidMethod, error.MalformedHeader, error.BodyTooLarge) plus the byte offset where parsing failed, stored in an out-parameter. In debug builds, we log the surrounding 64 bytes for diagnosis. In production, we return 400 and move on. Spending cycles on error reporting for malformed requests is a gift to attackers.
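The out-parameter pattern looks like this in C. The error names mirror the Zig error set, but check_method itself is a hypothetical fragment, not our code:

```c
#include <assert.h>
#include <stddef.h>

typedef enum { PARSE_OK, ERR_INVALID_METHOD, ERR_MALFORMED_HEADER } parse_err_t;

/* Validate that the method token is uppercase ASCII terminated by SP.
 * On failure, report the exact byte offset through the out-parameter --
 * a fixed-size error code plus an integer, never an allocated message. */
static parse_err_t check_method(const char *buf, size_t len, size_t *fail_at) {
    for (size_t i = 0; i < len; i++) {
        if (buf[i] == ' ') return PARSE_OK;
        if (buf[i] < 'A' || buf[i] > 'Z') {
            *fail_at = i;
            return ERR_INVALID_METHOD;
        }
    }
    *fail_at = len;
    return ERR_INVALID_METHOD; /* ran out of input before the SP */
}
```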

Header values with continuations. RFC 7230 deprecated header line folding, and we do not support it. Any request with a continuation line gets error.DeprecatedLineFolding. We checked our production traffic over 30 days: exactly zero legitimate requests used line folding. 14,000 malformed ones did. Dropping support was a security improvement.

Benchmarks: Parsing Throughput

We benchmark our parser against three established implementations:

  • picohttpparser (C): The fastest widely-used HTTP parser. Used by H2O server.
  • llhttp (C): Node.js's HTTP parser. Successor to http_parser.
  • std.http (Zig stdlib): Zig's standard library HTTP implementation.

Test: Parse a realistic ad bid request (412 bytes, 11 headers, POST with 187-byte body). Single core, AMD EPYC 7763, compiled with -O ReleaseFast. Throughput measured as bytes of HTTP input parsed per second.

| Parser | Throughput | Allocations/request |
|----------------|------------|---------------------|
| picohttpparser | 1.7 GB/s | 0 |
| llhttp | 1.3 GB/s | 0 (callback-based) |
| Zig std.http | 0.4 GB/s | 3-7 |
| Ours | 2.1 GB/s | 0 |

We beat picohttpparser by 24%. The reason is not algorithmic superiority -- picohttpparser's SIMD scanning is clever and we borrowed the idea. The difference is that our parser is specialized for our request shape. We know the method is POST (for bid requests) or GET (for health checks). We know the headers we care about (Content-Length, Content-Type, X-Request-ID, Authorization). We have a fast path that checks for these specific headers using the first 4 bytes as a u32 comparison:

const first4 = std.mem.readInt(u32, header_start[0..4], .little);
switch (first4) {
    std.mem.readInt(u32, "Cont", .little) => {
        // Content-Length or Content-Type -- check next bytes
    },
    std.mem.readInt(u32, "Auth", .little) => {
        // Authorization
    },
    std.mem.readInt(u32, "X-Re", .little) => {
        // X-Request-ID
    },
    else => {
        // Unknown header -- still parse, but skip fast-path
    },
}

This compiles down to one unaligned four-byte load and a short chain of integer comparisons -- no per-character scanning, one branch per candidate header, done. Generic parsers cannot do this because they do not know which headers matter.
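For readers outside Zig, here is the same trick in C: load four bytes with memcpy (which compilers fold into a single 32-bit load) and switch on precomputed little-endian constants. The constants are derived from the header names, not copied from our source, and the code assumes a little-endian host:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Pack 4 ASCII bytes as the little-endian u32 a 4-byte load would produce. */
#define TAG4(a, b, c, d) \
    ((uint32_t)(a) | (uint32_t)(b) << 8 | (uint32_t)(c) << 16 | (uint32_t)(d) << 24)

enum header_class { H_CONTENT, H_AUTH, H_REQUEST_ID, H_OTHER };

/* Classify a header by its first four bytes with one load and one switch.
 * memcpy avoids strict-aliasing issues and compiles to a single load. */
static enum header_class classify(const char *header_start) {
    uint32_t first4;
    memcpy(&first4, header_start, 4); /* assumes little-endian host */
    switch (first4) {
        case TAG4('C', 'o', 'n', 't'): return H_CONTENT;    /* Content-* */
        case TAG4('A', 'u', 't', 'h'): return H_AUTH;       /* Authorization */
        case TAG4('X', '-', 'R', 'e'): return H_REQUEST_ID; /* X-Request-ID */
        default:                       return H_OTHER;
    }
}
```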

io_uring Integration: The Full Zero-Copy Pipeline

The parser is one piece. The full pipeline -- from NIC to parsed request -- is designed to eliminate every copy.

On Linux, we use io_uring with fixed buffers. At startup, we register a ring of 4KB buffers with the kernel via io_uring_register_buffers. When a read completes, the data is already sitting in one of those pre-registered buffers: there is no per-request read() system call and no per-operation buffer mapping or validation -- the kernel fills our buffers as part of completing the request.

The parser then operates on these buffers in-place. The ParseResult points into the io_uring buffer. The JSON parser reads from the same buffer through the ChunkIterator. The bid processing logic reads field values from the same buffer. From NIC to business logic, the payload lands in our address space once and is never copied again.

NIC -> DMA -> io_uring buffer -> parse (zero-copy) -> JSON parse (zero-copy) -> bid logic
                |                                                                    |
                +---- single buffer, never copied, never reallocated ----------------+

When bid processing completes, we return the io_uring buffer to the kernel for reuse. The entire request lifecycle touches one buffer.
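Minus the actual io_uring calls, that buffer lifecycle reduces to a fixed pool with a free list. A C sketch of the acquire → parse → release cycle -- the pool size and function names are illustrative:

```c
#include <assert.h>
#include <stddef.h>

#define POOL_BUFS 4
#define BUF_SIZE 4096

/* A fixed pool standing in for the registered-buffer ring: buffers are
 * acquired for a request and returned when processing ends, so the
 * request path never allocates. */
static char pool_mem[POOL_BUFS][BUF_SIZE];
static int free_list[POOL_BUFS] = { 0, 1, 2, 3 };
static int free_top = POOL_BUFS;

static char *buf_acquire(void) {
    if (free_top == 0) return NULL; /* pool exhausted: apply backpressure */
    return pool_mem[free_list[--free_top]];
}

static void buf_release(char *buf) {
    /* Recover the buffer's index from its address and push it back. */
    free_list[free_top++] = (int)((buf - &pool_mem[0][0]) / BUF_SIZE);
}
```

When the pool runs dry, the right move at this layer is to stop reading from the socket rather than to allocate, which is exactly the backpressure io_uring's buffer ring gives you for free.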

Why This Matters at Scale

At 200,000 requests per second, eliminating allocations removes approximately 600,000 malloc/free pairs per second (a typical parser does 3+ allocations per request). Each malloc call takes ~50 nanoseconds uncontended and up to 2 microseconds under contention. At scale, that is the difference between a p99 latency of 4 milliseconds and a p99 of 1.8 milliseconds.

We measured this directly. Before the zero-allocation parser (using Zig's std.http), our ad serving p99 was 4.1ms. After: 1.8ms. The parser itself accounts for about 40% of that improvement. The other 60% comes from the io_uring zero-copy pipeline eliminating read() syscalls.

Our allocation rate in the entire HTTP request path -- from TCP accept to response write -- is zero. The only allocations in the bid processing path are in the bid logic itself (constructing the response), and those come from a per-request arena that resets without freeing individual objects.
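That arena is worth sketching, because it is the last piece of the zero-allocation story: a bump pointer into a fixed region, where "free everything" is a single store. The names and the 64 KB budget below are illustrative, not our production values:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ARENA_SIZE 65536 /* illustrative per-request budget */

typedef struct {
    uint8_t buf[ARENA_SIZE];
    size_t used;
} arena_t;

/* Bump-allocate with 8-byte alignment; NULL when the request's budget
 * is exhausted (a hard per-request cap, in the spirit of MAX_HEADERS). */
static void *arena_alloc(arena_t *a, size_t n) {
    size_t aligned = (a->used + 7) & ~(size_t)7;
    if (aligned + n > ARENA_SIZE) return NULL;
    a->used = aligned + n;
    return a->buf + aligned;
}

/* Reset frees every object at once: no per-object free, no locks. */
static void arena_reset(arena_t *a) { a->used = 0; }
```

Because the arena is reset, not freed, response construction costs a pointer bump per object and one store per request, regardless of how many objects the bid logic created.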

Zero allocation is not a parlor trick. It is a systems design principle. If your hot path allocates, your hot path has a latency tax that scales with concurrency. We removed the tax.

Copyright © 2026 Anokuro Pvt. Ltd. Singapore. All rights reserved.