Inside Our Storage Engine: A Zig Love Letter
Page-aligned I/O, zero-copy reads, and an allocator strategy that would make your ORM cry.
We are building a database. Not because we wanted to -- we wanted to serve ads faster than anyone in Southeast Asia. But after watching PostgreSQL burn 40% of our CPU on query planning for workloads that never change shape, we decided the industry's general-purpose tools were holding us back. So we wrote a storage engine in Zig. This is how it works.
Page-Aligned I/O: Respecting the Hardware
Every read and every write in our storage engine operates on 4KB-aligned pages. Not because the textbooks say so -- because our NVMe drives say so. Modern SSDs have an internal page size of 4KB. If you write 100 bytes, the drive reads a 4KB page internally, modifies your 100 bytes, and writes the whole page back. This is write amplification, and it is the silent killer of storage throughput.
We align everything to 4KB boundaries at the application level. Page headers are exactly 64 bytes. Row data is packed into the remaining 4,032 bytes with no padding between fixed-width columns. Variable-length data gets overflow pages. None of this is novel -- it is what every serious database does. The difference is that in Zig, we express this as comptime-known layouts:
```zig
const std = @import("std");

// Variants shown here are illustrative.
const PageType = enum(u8) {
    data,
    overflow,
    index,
};

const PageHeader = extern struct {
    page_id: u64,
    lsn: u64,
    checksum: u32,
    slot_count: u16,
    free_space_offset: u16,
    page_type: PageType,
    _reserved: [39]u8, // pads the header to exactly 64 bytes
};

comptime {
    std.debug.assert(@sizeOf(PageHeader) == 64);
    std.debug.assert(@alignOf(PageHeader) == 8);
}
```
That comptime block is not a test. It is a compile-time guarantee. If someone adds a field and breaks the layout, the project does not compile. In C, you would discover this with a segfault in production. In Rust, you would write a #[test] that runs in CI. In Zig, the compiler refuses to emit a binary. This distinction matters when a misaligned page read corrupts customer data.
On Linux, we submit I/O through io_uring with O_DIRECT and page-aligned buffers. On macOS, we use kqueue with F_NOCACHE. Both paths bypass the kernel page cache entirely. We manage our own buffer pool because we know our access patterns better than the kernel ever will -- sequential scans for time-range queries, point lookups for real-time bid enrichment, and absolutely nothing resembling the random access patterns the kernel assumes.
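One alignment detail worth making concrete: O_DIRECT rejects I/O unless the user buffer (as well as the file offset and length) is suitably aligned. A minimal sketch, assuming nothing beyond the standard library — `std.heap.page_allocator` hands back whole pages, so its allocations satisfy the buffer-alignment requirement for free:

```zig
const std = @import("std");

const PAGE_SIZE = 4096;

pub fn main() !void {
    // page_allocator allocates directly from the OS in page units,
    // so the returned buffer is always 4KB-aligned.
    const buf = try std.heap.page_allocator.alloc(u8, PAGE_SIZE);
    defer std.heap.page_allocator.free(buf);
    std.debug.assert(@intFromPtr(buf.ptr) % PAGE_SIZE == 0);
}
```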
Zero-Copy Reads
When an ad auction query asks for the last 30 seconds of bid data, here is what does not happen: we do not copy data from kernel buffers into user-space buffers, then into deserialization buffers, then into application objects. That four-copy path is what most ORMs give you. It is also why most ORMs cannot sustain 200,000 queries per second.
Our read path is zero-copy. Pages are memory-mapped into the process address space. A "read" is a pointer dereference. The PageHeader struct above is not deserialized from bytes -- it is the bytes on disk, reinterpreted in place:
```zig
fn readPage(self: *BufferPool, page_id: u64) *const PageHeader {
    const offset = page_id * PAGE_SIZE;
    // Reinterpret the mapped bytes in place. The mmap region is
    // page-aligned, so any 4KB-multiple offset is sufficiently
    // aligned for PageHeader; @alignCast asserts that invariant.
    const base: [*]const u8 = self.mmap_base + offset;
    return @ptrCast(@alignCast(base));
}
```
No allocation. No copy. No serialization overhead. The page on disk is the page in memory. We validate the CRC32 checksum on first access and cache the validation result in a bitset. Subsequent reads skip even the checksum.
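The validate-once scheme can be sketched with a bit per page (field and function names here are our own, not necessarily the engine's):

```zig
const std = @import("std");

// Returns true if the page's CRC32 is known-good, hashing it at most
// once: the bitset remembers which pages have already been verified.
fn pageChecksumOk(
    validated: *std.DynamicBitSet,
    page_id: usize,
    page: []const u8,
    stored_crc: u32,
) bool {
    if (validated.isSet(page_id)) return true; // verified earlier; skip the hash
    if (std.hash.Crc32.hash(page) != stored_crc) return false; // corrupt page
    validated.set(page_id);
    return true;
}
```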
Our custom page eviction policy is a clock-sweep variant tuned for time-series access patterns. Recent pages (last 60 seconds of ad data) get pinned. Historical pages use a two-handed clock where the sweep frequency adapts to memory pressure. Under our production workload -- 87% of queries hitting the last 5 minutes of data -- the buffer pool hit rate sits at 99.2%.
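The single-handed clock sweep that the two-handed variant builds on is small enough to show in full. A sketch, not the production policy (which adds the second hand and adaptive sweep frequency):

```zig
// Each frame has a reference bit set on access. The hand sweeps,
// clearing bits as it goes; the first unpinned frame whose bit is
// already clear becomes the eviction victim ("second chance").
// Caller must guarantee at least one frame is unpinned.
fn findVictim(ref_bits: []bool, pinned: []const bool, hand: *usize) usize {
    while (true) {
        const i = hand.*;
        hand.* = (hand.* + 1) % ref_bits.len;
        if (pinned[i]) continue; // recent pages are never evicted
        if (ref_bits[i]) {
            ref_bits[i] = false; // referenced since last sweep: spare it once
        } else {
            return i; // not referenced for a full sweep: evict
        }
    }
}
```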
Allocator Strategy: Three Allocators, Zero General-Purpose
Zig does not have a default allocator. This is the most underrated feature in the language. Every function that allocates takes an Allocator parameter, which means we can enforce allocation discipline at the type level.
We use exactly three allocators in hot paths:
Arena allocator for queries. Each query gets a thread-local arena. All intermediate results, sort buffers, and hash tables allocate from this arena. When the query completes, we reset the arena pointer to the beginning. Zero individual frees. Zero fragmentation. The arena is pre-allocated at 2MB per thread and never grows during normal operation.
Pool allocator for connections. Each client connection gets a fixed-size block from a pool. Connection state, authentication tokens, and prepared statement caches live here. When a connection disconnects, the block returns to the pool. No system calls for malloc or free in the connection lifecycle.
Page allocator for buffer pool pages. Large allocations for the buffer pool go through mmap directly. No heap fragmentation from 4KB page allocations because they never touch the heap.
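The per-query arena maps naturally onto Zig's `std.heap.FixedBufferAllocator`. A sketch of the scheme described above, assuming the 2MB thread-local buffer from the text (names are ours):

```zig
const std = @import("std");

// Pre-allocated, thread-local arena backing store: 2MB per thread,
// never grows during normal operation.
threadlocal var query_buf: [2 * 1024 * 1024]u8 = undefined;

fn runQuery() !void {
    var fba = std.heap.FixedBufferAllocator.init(&query_buf);
    const alloc = fba.allocator();

    // Intermediate results, sort buffers, hash tables: all from the arena.
    const sort_buf = try alloc.alloc(u64, 1024);
    _ = sort_buf;

    // One pointer reset reclaims everything. No individual frees,
    // no fragmentation, no system calls.
    fba.reset();
}
```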
The general-purpose allocator (std.heap.GeneralPurposeAllocator or std.heap.c_allocator) is banned in hot paths. We enforce this with a wrapper allocator in debug builds that panics if called from any function in the query execution or connection handling call graph. Three engineers have been saved from accidental allocations in code review by this trap.
WAL: Batched Group Commits
Our write-ahead log does not fsync on every commit. It batches.
When a transaction commits, it appends its log records to an in-memory WAL buffer and enters a wait queue. A dedicated WAL writer thread wakes up every 500 microseconds (or when the buffer hits 64KB, whichever comes first), writes the entire batch to the WAL file with a single pwritev call, calls fdatasync, and wakes all waiting transactions simultaneously.
At our write throughput of 45,000 transactions per second, this means each fdatasync covers roughly 22 transactions. The amortized cost of durability drops from ~200 microseconds per transaction to ~9 microseconds.
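The amortization arithmetic checks out at compile time (the constants are the ones from the text):

```zig
const std = @import("std");

// Back-of-envelope check of the group-commit batching math.
comptime {
    const tps = 45_000; // transactions per second
    const flush_interval_us = 500; // WAL writer wakeup period
    const flushes_per_sec = 1_000_000 / flush_interval_us; // 2,000
    const txns_per_flush = tps / flushes_per_sec; // ~22 per fdatasync
    std.debug.assert(txns_per_flush == 22);
    // ~200 us of fdatasync spread across ~22 transactions ≈ 9 us each
    std.debug.assert(200 / txns_per_flush == 9);
}
```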
Each WAL record includes a CRC32 checksum computed with Zig's comptime capabilities:
```zig
fn serializeRecord(comptime T: type, record: *const T) WalEntry {
    // Take the record by pointer so `data` references caller-owned
    // storage rather than a stack temporary that dies on return.
    const bytes = std.mem.asBytes(record);
    return .{
        .checksum = std.hash.Crc32.hash(bytes),
        .len = bytes.len,
        .data = bytes,
        .type_id = comptime typeId(T),
    };
}
```
The comptime typeId function generates a unique identifier for each record type at compile time. During recovery, we use this to dispatch to the correct deserialization path without any runtime type registry. The WAL replay code is a switch on comptime-known type IDs. The compiler can verify exhaustiveness -- if we add a new record type and forget to handle it in recovery, the code does not compile.
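One way to realize this (a hedged sketch: hashing the type name is an assumption on our part, and the real `typeId` scheme may differ) pairs a comptime hash with an exhaustive switch:

```zig
const std = @import("std");

// Derive a stable id from the type name at compile time.
fn typeId(comptime T: type) u32 {
    return comptime std.hash.Crc32.hash(@typeName(T));
}

// Recovery dispatch as an exhaustive switch: add a variant to
// RecordKind without a matching arm and this no longer compiles.
const RecordKind = enum { insert, update, delete };

fn replay(kind: RecordKind, payload: []const u8) void {
    switch (kind) {
        .insert => applyInsert(payload),
        .update => applyUpdate(payload),
        .delete => applyDelete(payload),
    }
}

// Stubs standing in for the real replay paths.
fn applyInsert(p: []const u8) void { _ = p; }
fn applyUpdate(p: []const u8) void { _ = p; }
fn applyDelete(p: []const u8) void { _ = p; }
```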
Benchmarks: The Numbers
We benchmark against RocksDB and LevelDB because they are the storage engines people actually embed. We do not benchmark against SQLite because our workload is nothing like SQLite's target use case.
Test machine: AMD EPYC 7763 (64 cores), 256GB DDR4-3200, Samsung PM9A3 NVMe. Workload: 80% point reads, 15% range scans (last 30 seconds), 5% writes. Key size: 16 bytes. Value size: 256 bytes. 100 million pre-loaded rows.
| Engine | Throughput (ops/sec) | p50 latency | p99 latency |
|--------|---------------------|-------------|-------------|
| RocksDB (tuned) | 312,000 | 42 us | 380 us |
| LevelDB | 187,000 | 71 us | 920 us |
| Ours | 1,467,000 | 8 us | 47 us |
That is a 4.7x throughput improvement over tuned RocksDB and an 8.1x improvement on p99 latency. The difference comes from three places: no compaction stalls (we use a different approach for time-series data that we will cover in a future post), zero-copy reads eliminating memcpy overhead, and our custom allocator strategy avoiding lock contention on the system allocator.
We are not claiming our engine is universally better than RocksDB. RocksDB handles a staggering variety of workloads. We handle one workload -- time-series ad data with known access patterns -- and we handle it extremely well.
The Honest Part
Building a storage engine is the worst kind of engineering project: it takes 10x longer than you expect, the bugs are 10x harder to find, and the first version is always wrong. We have rewritten the buffer pool three times. The WAL format is on its fourth iteration. We spent six weeks on a lock-free page latch that turned out to be slower than a simple mutex.
But when we watch a bid auction resolve in 1.2 milliseconds end-to-end, with the storage layer contributing 47 microseconds of that, we know this is the right call. General-purpose databases are a compromise. We decided to stop compromising.