Why We Are Building Our Own Database
Every database we tested added milliseconds we refused to accept. So we started writing our own in Zig.
There is a number that governs everything we build at Anokuro: 10 milliseconds. That is the total latency budget for an ad-serving decision. Not 10ms for the database. 10ms for everything: receive the bid request over the wire, deserialize it, look up user segments, run the auction logic, select a creative, serialize the response, and send it back. The database gets maybe 2ms of that budget. Probably less.
We benchmarked 14 databases under realistic ad-auction workloads. Every single one failed.
The Benchmark
Our test workload models real traffic from the Southeast Asian programmatic advertising market. The characteristics that matter:
- 200,000 bid requests per second sustained, with spikes to 380k during prime-time inventory surges
- Read-heavy with hot keys: the top 1% of user segments account for 40% of lookups
- Time-series writes: every bid, impression, and click is an append-only event with a timestamp key
- Mixed point and range queries: point lookups for user profiles, range scans for frequency capping (show me all impressions for user X in the last 3600 seconds)
- P99 read latency must be under 2ms. Not P50. P99.
We ran every database on identical hardware: bare-metal servers with 128GB RAM, NVMe drives rated at 800k random read IOPS, and 25GbE networking. No containers. No orchestration layers. Just the database and our load generator.
What We Found
PostgreSQL with connection pooling via PgBouncer hit a wall at 80k reads/second. P99 climbed to 8ms. Index scans on our time-series tables were the bottleneck. We tried partitioning by hour. It helped. Not enough.
Redis gave us sub-millisecond reads for the hot path, but our dataset is 340GB. We are not paying for 340GB of RAM across a cluster, and disk-backed Redis (via Redis on Flash) introduced latency jitter that blew our P99 budget on the range scan path.
ScyllaDB was the closest. We got 180k reads/second with a P99 of 2.4ms. Close. But the range scan performance for frequency capping was inconsistent. The LSM compaction strategy that Scylla inherits from Cassandra is optimized for write throughput, not for bounded-latency range reads over a sliding time window.
ClickHouse crushed the analytics queries but is not designed for point lookups with sub-millisecond latency. It is a column store. We need a row store for the hot path and column-oriented access for offline analytics. Two databases means two consistency models, two failure domains, and two sets of operational overhead we do not want.
DuckDB, TiKV, FoundationDB, CockroachDB, Aerospike, Dragonfly, KeyDB, RocksDB (raw), LMDB, and Memgraph all failed on at least one of our requirements. The details fill a 40-page internal document. The short version: general-purpose databases are general-purpose. Our workload is not.
The Decision
We started writing AnokuroDB in August 2025. It is written in Zig.
The choice of Zig was not about hype. We evaluated Zig, Rust, C, and C++ for this project. Rust's borrow checker is excellent for application code, but we found it actively hostile when writing a storage engine. The lifetime annotations for a lock-free skip list with epoch-based reclamation turned into a type-theory research project. We need to ship a database, not prove theorems.
C is too error-prone at our team's velocity. We write a lot of code fast, and C's lack of bounds checking, defer semantics, or any form of compile-time computation made it a non-starter.
Zig gives us what we actually need:
- Manual memory control without a garbage collector. We use arena allocators for request-scoped memory and pool allocators for long-lived structures. Every allocation has an explicit lifetime.
- Comptime for zero-cost generics. Our B-tree implementation is parameterized over key type, value type, page size, and comparison function, and the entire specialization happens at compile time. The generated code is identical to hand-written C for each specific instantiation.
- No hidden control flow. No operator overloading, no hidden allocations, no exceptions. When we read Zig code, we know exactly what the machine will do.
- C ABI compatibility without FFI wrappers. We link directly against io_uring for async I/O and use POSIX APIs without any marshalling overhead.
The Architecture
AnokuroDB uses a modified LSM-tree optimized for our specific access pattern:
The memtable is a lock-free skip list sorted by a composite key: (segment_id, timestamp). This key structure means that range scans for frequency capping (all events for segment X in the last N seconds) are sequential reads through the skip list, not random access.
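The point of the composite key is that tuple ordering clusters all of a segment's events together. A simplified stand-in for the skip list (a plain sorted list in Python; the names and data are illustrative, and the real structure is lock-free) shows why the range scan becomes one contiguous slice:

```python
import bisect

# Simplified stand-in for the memtable: a sorted list of
# (segment_id, timestamp) composite keys. Tuple comparison gives
# the same ordering the skip list maintains.
memtable = sorted([
    (7, 100), (7, 250), (7, 900),
    (9, 120), (9, 400),
    (12, 50),
])

def scan_segment_window(keys, segment_id, start_ts, end_ts):
    """All events for one segment in [start_ts, end_ts]: one contiguous slice."""
    lo = bisect.bisect_left(keys, (segment_id, start_ts))
    hi = bisect.bisect_right(keys, (segment_id, end_ts))
    return keys[lo:hi]

print(scan_segment_window(memtable, 7, 200, 1000))  # → [(7, 250), (7, 900)]
```

Because every segment-7 key is adjacent in the ordering, the scan never touches another segment's entries; in the skip list this is a sequential forward walk, not random access.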
Compaction uses a time-window strategy instead of size-tiered or leveled compaction. SSTables are organized into time buckets (1-minute, 1-hour, 1-day). Recent data stays in small, frequently-read tables. Old data gets compacted into larger, rarely-read tables. This is the key insight that ScyllaDB's general-purpose compaction missed for our workload: ad events have a natural time decay. A frequency cap cares about the last hour. Yesterday's impressions are analytics data, not serving data.
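The bucket-selection logic can be sketched as follows (a hypothetical illustration; the thresholds match the 1-minute/1-hour/1-day tiers above, but the function and its exact rules are ours, not AnokuroDB's actual compaction code):

```python
# Hypothetical sketch of time-window bucket assignment: an SSTable's
# bucket is chosen by how old its newest entry is, not by its size.
BUCKETS = [
    (3600,          60),     # newest entry < 1h old: 1-minute tables
    (86400,         3600),   # newest entry < 1d old: 1-hour tables
    (float("inf"),  86400),  # older: 1-day tables
]

def bucket_for(newest_ts, now):
    """Return (granularity_s, bucket_start_ts) for an SSTable."""
    age = now - newest_ts
    for threshold, granularity in BUCKETS:
        if age < threshold:
            # Align the bucket's start time to its granularity.
            return granularity, (newest_ts // granularity) * granularity
    raise AssertionError("unreachable: last threshold is infinite")

# A table whose newest event is 90 seconds old lands in a 1-minute bucket.
print(bucket_for(newest_ts=10_000, now=10_090))  # → (60, 9960)
```

Compaction then only ever merges tables within the same tier, so the hot, recent tiers stay small and cheap to read while old data settles into large tables the serving path rarely touches.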
The block cache uses a clock-pro eviction policy instead of LRU. Under our read distribution (heavy hot keys), clock-pro gives us a 12% higher hit rate than LRU with the same memory budget. We measured this.
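For readers unfamiliar with clock-style eviction, here is a minimal sketch of the basic CLOCK algorithm that clock-pro builds on (the real policy adds hot/cold page classification and test pages on top; this class is our simplified illustration, not AnokuroDB's cache):

```python
class ClockCache:
    """Simplified CLOCK eviction: a rotating hand gives each entry a
    second chance (reference bit) before evicting it."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = []   # list of [key, referenced_bit]
        self.index = {}   # key -> slot position
        self.hand = 0

    def access(self, key):
        """Touch a key; returns True on a cache hit."""
        if key in self.index:
            self.slots[self.index[key]][1] = 1  # hit: set reference bit
            return True
        if len(self.slots) < self.capacity:
            self.index[key] = len(self.slots)
            self.slots.append([key, 1])
            return False
        # Sweep the hand, clearing reference bits until a victim is found.
        while self.slots[self.hand][1]:
            self.slots[self.hand][1] = 0
            self.hand = (self.hand + 1) % self.capacity
        victim, _ = self.slots[self.hand]
        del self.index[victim]
        self.slots[self.hand] = [key, 1]
        self.index[key] = self.hand
        self.hand = (self.hand + 1) % self.capacity
        return False
```

Unlike LRU, CLOCK never reorders entries on a hit; it just flips a bit, which is why clock-family policies are cheap under heavy concurrent reads and, with clock-pro's refinements, resist the scan pollution that hurts LRU under skewed distributions.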
The WAL writes to NVMe using io_uring with O_DIRECT, bypassing the kernel page cache entirely. We do our own buffering in userspace because we know our write pattern better than the kernel does. Each WAL entry is checksummed with xxHash (0.3ns per byte, versus 4.2ns for CRC32C on our hardware).
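The framing of a checksummed WAL entry looks roughly like this (a hypothetical Python sketch of the layout; the real engine uses xxHash, so `zlib.crc32` stands in here only because it ships in the standard library, and the real writer additionally pads entries to the device block size, as O_DIRECT requires aligned, block-sized I/O):

```python
import struct
import zlib

# Hypothetical WAL entry layout: [length u32][checksum u32][payload],
# little-endian. Checksum is over the payload only.

def encode_entry(payload: bytes) -> bytes:
    """Frame one WAL entry with its length and checksum."""
    return struct.pack("<II", len(payload), zlib.crc32(payload)) + payload

def decode_entry(buf: bytes) -> bytes:
    """Parse and verify one WAL entry; raises on a torn/corrupt write."""
    length, checksum = struct.unpack_from("<II", buf)
    payload = buf[8:8 + length]
    if zlib.crc32(payload) != checksum:
        raise ValueError("WAL entry failed checksum")
    return payload

entry = encode_entry(b"bid:user42:creative7")
assert decode_entry(entry) == b"bid:user42:creative7"
```

The checksum is what makes recovery safe: on restart, replay stops at the first entry that fails verification, which marks the torn tail of the last in-flight write.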
The query interface is not SQL. It is a purpose-built binary protocol with exactly 6 operations: point-get, multi-get, range-scan, put, delete, and batch-write. There is no query parser, no query planner, no query optimizer. The client library constructs binary messages that map directly to storage engine operations.
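To show what "binary messages that map directly to storage operations" means in practice, here is a hedged sketch of what a `put` frame could look like (the opcode values and field layout are our illustration, not AnokuroDB's actual wire format):

```python
import struct

# Illustrative opcode table for the six operations.
OPS = {"point_get": 1, "multi_get": 2, "range_scan": 3,
       "put": 4, "delete": 5, "batch_write": 6}

def encode_put(key: bytes, value: bytes) -> bytes:
    """Frame: [opcode u8][key_len u16][value_len u32][key][value], little-endian."""
    return struct.pack("<BHI", OPS["put"], len(key), len(value)) + key + value

def decode_put(msg: bytes):
    """Parse a put frame back into (key, value)."""
    op, key_len, value_len = struct.unpack_from("<BHI", msg)
    assert op == OPS["put"]
    key = msg[7:7 + key_len]
    value = msg[7 + key_len:7 + key_len + value_len]
    return key, value

msg = encode_put(b"segment:7", b"\x01\x02\x03")
assert decode_put(msg) == (b"segment:7", b"\x01\x02\x03")
```

With fixed-offset headers like this, the server dispatches on a single byte and hands the key straight to the storage engine; there is nothing to parse, plan, or optimize.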
Current Numbers
As of this writing, AnokuroDB on our benchmark hardware delivers:
- Point reads: P50 0.18ms, P99 0.7ms
- Range scans (1-hour window): P50 0.4ms, P99 1.4ms
- Sustained write throughput: 420,000 events/second
- Read throughput: 310,000 point reads/second while sustaining 200k writes/second
Our P99 read latency is 3.4x lower than the best general-purpose database we tested.
What This Costs
Building a database is expensive. We have three engineers working on AnokuroDB full-time. The storage engine, compaction, replication, and operational tooling will take another 6 months before we call it production-ready.
But the math works. Our ad-serving latency directly determines auction win rate. Every millisecond of latency improvement translates to a measurable increase in bid competitiveness. We ran the analysis: the revenue impact of the latency improvement pays for the engineering investment in 7 months.
We do not build infrastructure for the joy of building infrastructure. We build it because the alternatives are too slow, and too slow costs money.
The code is not open source yet. It might be eventually. Right now we are focused on one thing: making it fast enough for production. Not fast enough for benchmarks. Fast enough for 200,000 bid requests per second with money on every single one.