The Unreasonable Effectiveness of io_uring
io_uring changed what is possible in userspace I/O. Our database would not exist without it.
We are building a database. Not because we wanted to (nobody should want to build a database), but because the performance characteristics we need for real-time ad serving do not exist in any off-the-shelf solution. Sub-millisecond range scans across time-series data with concurrent high-throughput writes and strict durability guarantees. We evaluated everything. PostgreSQL, ScyllaDB, ClickHouse, TiKV, FoundationDB. All excellent databases. None fast enough for our specific access patterns at our specific scale.
So we wrote our own, in Zig, and the single technology that makes it viable is io_uring.
What io_uring Is
io_uring is a Linux kernel interface for asynchronous I/O, introduced in kernel 5.1 (2019). The core insight is elegant: instead of making syscalls for every I/O operation, the kernel and userspace share two ring buffers in memory. The application writes I/O requests into the submission queue (SQ). The kernel writes completions into the completion queue (CQ). Both sides can process their respective queues without any syscall overhead for individual operations.
Traditional I/O in Linux works like this:
- Application calls read() (syscall, context switch to kernel)
- Kernel performs the read
- Kernel returns to userspace (context switch back)
- Repeat for every single I/O operation
With io_uring:
- Application writes 50 read operations into the submission queue (no syscall)
- Application calls io_uring_enter() once (single syscall for all 50 operations)
- Kernel processes all 50 reads
- Application reads 50 completions from the completion queue (no syscall)
One syscall instead of 50. But the real gain is deeper than syscall reduction. The submission and completion queues are memory-mapped, so the kernel and userspace share them with zero copy. With kernel-side polling enabled (IORING_SETUP_SQPOLL), the kernel polls the submission queue continuously, eliminating even the single io_uring_enter() call. True zero-syscall I/O.
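The two queues traffic in small fixed-layout structs. A simplified Zig rendering of the kernel's entry types, to make the mechanism concrete (field subset only: the real io_uring_sqe in linux/io_uring.h carries more fields and unions, while io_uring_cqe really is this small):

```zig
// Simplified sketch of the kernel's ring entries. The real io_uring_sqe
// has more fields and unions; this keeps only the parts discussed here.
const SubmissionEntry = extern struct {
    opcode: u8, // e.g. IORING_OP_READ, IORING_OP_WRITE
    flags: u8, // e.g. IOSQE_IO_LINK
    fd: i32, // file descriptor to operate on
    off: u64, // file offset
    addr: u64, // userspace buffer address
    len: u32, // number of bytes
    user_data: u64, // opaque tag, echoed back in the completion
};

const CompletionEntry = extern struct {
    user_data: u64, // tag copied verbatim from the submission
    res: i32, // bytes transferred on success, -errno on failure
    flags: u32, // completion flags (e.g. buffer selection info)
};
```

The user_data field is the correlation mechanism: the kernel copies it unchanged from SQE to CQE, so applications typically store a pointer or index there to route each completion back to its request.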
How Our Database Uses io_uring
Our database is a log-structured merge tree (LSM-tree) optimized for time-series advertising data: impressions, clicks, bids, and budget state. The workload is write-heavy (87% writes) with range scans that must complete in under 1 millisecond for real-time bid decisioning.
We use io_uring for three categories of operations:
Batched Disk Reads for Range Scans
A range scan across an LSM-tree may touch dozens of sorted string table (SST) files across multiple levels. Traditional approaches issue individual pread() calls for each block, serializing disk access. With io_uring, we submit all block reads simultaneously:
```zig
fn submitRangeScanReads(ring: *IoUring, blocks: []const BlockRef) !void {
    for (blocks) |*block| {
        const sqe = try ring.getSubmissionEntry();
        sqe.prepRead(
            block.fd,
            block.buffer,
            block.size,
            block.offset,
        );
        // Tag the SQE so its completion can be matched back to the block.
        sqe.user_data = @intFromPtr(block);
    }
    _ = try ring.submit();
}
```
For a typical range scan touching 34 blocks across 8 SST files, this reduces the operation from 34 syscalls to 1. On NVMe SSDs with high queue depth capability, the disk controller processes these reads in parallel. Our benchmarks show a range scan over 34 blocks completing in 0.31ms with io_uring versus 1.12ms with sequential pread() calls. The speedup is not 34x because NVMe drives have internal parallelism that sequential reads can partially exploit, but the 3.6x improvement from eliminating syscall overhead and maximizing queue depth is consistent and significant.
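The submission function above has a matching reap loop on the completion side. A sketch in the same style (copyCompletion is an assumed wrapper method that pops one CQE; when completions are already sitting in the shared ring, popping them costs no syscall):

```zig
// Sketch: drain one completion per submitted block read, using the
// article's assumed wrapper API (IoUring, BlockRef, copyCompletion).
fn reapRangeScanReads(ring: *IoUring, expected: usize) !void {
    var done: usize = 0;
    while (done < expected) : (done += 1) {
        const cqe = try ring.copyCompletion();

        // Recover the request from the tag stored at submission time.
        const block: *const BlockRef = @ptrFromInt(cqe.user_data);
        _ = block; // hand the filled buffer to the scan iterator

        // By io_uring convention, a negative res is -errno; a positive
        // res is the byte count (short reads would need handling here).
        if (cqe.res < 0) return error.BlockReadFailed;
    }
}
```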
Linked Operations for Atomic Writes
LSM-tree writes involve appending to a write-ahead log (WAL) and then acknowledging the write. These two operations must be atomic: if the WAL append succeeds, the acknowledgment must be sent. If the WAL append fails, the acknowledgment must not be sent.
io_uring supports linked operations via the IOSQE_IO_LINK flag. When an SQE is linked to the next SQE, the second operation only executes if the first succeeds. We use this to link WAL writes with completion notifications:
```zig
// The eventfd payload must outlive the asynchronous write, so it lives
// in static memory rather than on the function's stack.
const eventfd_one: u64 = 1;

fn submitLinkedWrite(ring: *IoUring, wal_fd: fd_t, data: []const u8, offset: u64, completion_eventfd: fd_t) !void {
    // First: write to WAL
    const sqe_write = try ring.getSubmissionEntry();
    sqe_write.prepWrite(wal_fd, data, offset);
    sqe_write.flags |= IOSQE_IO_LINK; // Link to next operation

    // Second: signal completion (only executes if the write succeeds)
    const sqe_signal = try ring.getSubmissionEntry();
    sqe_signal.prepWrite(completion_eventfd, std.mem.asBytes(&eventfd_one), 0);

    _ = try ring.submit();
}
```
This gives us atomic write-then-notify semantics without any application-level error handling between the two operations. The kernel guarantees the ordering. We measured write acknowledgment latency at 0.089ms with linked operations versus 0.14ms with separate syscalls and application-level sequencing. The 37% improvement comes from eliminating the userspace round-trip between the write and the notification.
Fixed-Buffer Registration
io_uring allows pre-registering buffers with the kernel via IORING_REGISTER_BUFFERS. Registered buffers are pinned in physical memory, eliminating the kernel's need to pin/unpin pages for every I/O operation. For a database that performs millions of I/O operations per second, the page pinning overhead of unregistered buffers is substantial.
We maintain a pool of 4,096 registered buffers, each 64KB (matching our SST block size). When issuing reads, we use IORING_OP_READ_FIXED instead of IORING_OP_READ, referencing a buffer by its registered index rather than by pointer. This avoids the get_user_pages_fast() kernel path for every read, saving approximately 0.8 microseconds per operation. At 1.2 million read operations per second, that is 960 milliseconds of CPU time saved per second. Nearly an entire core.
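Wiring this up is a one-time registration at startup plus a different opcode per read. A sketch, again against an assumed wrapper (registerBuffers wrapping IORING_REGISTER_BUFFERS, prepReadFixed wrapping IORING_OP_READ_FIXED):

```zig
const block_size: usize = 64 * 1024; // matches the SST block size
const pool_count: usize = 4096;

// One-time setup: allocate the pool and register it. The kernel pins
// these pages once, for the lifetime of the ring.
fn setupFixedBuffers(ring: *IoUring, allocator: std.mem.Allocator) ![][]u8 {
    const pool = try allocator.alloc([]u8, pool_count);
    for (pool) |*buf| {
        buf.* = try allocator.alignedAlloc(u8, std.mem.page_size, block_size);
    }
    try ring.registerBuffers(pool); // assumed wrapper over IORING_REGISTER_BUFFERS
    return pool;
}

// Per-read: name the buffer by registered index instead of by pointer,
// skipping the per-operation page pinning path.
fn readFixed(ring: *IoUring, fd: i32, buf_index: u16, offset: u64) !void {
    const sqe = try ring.getSubmissionEntry();
    sqe.prepReadFixed(fd, buf_index, block_size, offset);
    _ = try ring.submit();
}
```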
Benchmarks
We benchmarked our database's I/O layer in three configurations on the same hardware (AMD EPYC 7763, Samsung PM9A3 NVMe, Linux 6.8):
| Configuration | Throughput (ops/s) | P50 Latency | P99 Latency |
|--------------|-------------------|-------------|-------------|
| read()/write() synchronous | 189,000 | 4.2us | 47us |
| epoll + non-blocking I/O | 412,000 | 2.1us | 23us |
| io_uring with registered buffers | 1,147,000 | 0.8us | 8.3us |
io_uring delivers 2.8x the throughput of epoll and 6.1x the throughput of synchronous I/O under high concurrency (256 concurrent operations). The latency improvements are equally dramatic: P99 latency drops from 47 microseconds to 8.3 microseconds.
The throughput advantage widens with concurrency. At 16 concurrent operations, io_uring is 1.4x faster than epoll. At 256 concurrent operations, it is 2.8x. At 1,024 concurrent operations, it is 3.9x. io_uring's submission queue batching amortizes the per-operation overhead more effectively as concurrency increases.
Advanced Patterns
Beyond the basics, we use several advanced io_uring features:
Multishot receive (IORING_RECV_MULTISHOT): For our database's network layer, a multishot receive submits a single SQE that generates a CQE for every incoming packet. (Multishot is a flag applied to IORING_OP_RECV rather than a separate opcode.) Traditional approaches require re-submitting a receive operation after every packet. Multishot reduces submission queue pressure by 95% for our replication protocol, which processes 340,000 messages per second per connection.
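At the SQE level, multishot is a small delta on a normal receive: the IORING_RECV_MULTISHOT flag lives in the SQE's ioprio field, and multishot receives draw their buffers from a provided-buffer group. A sketch against the assumed wrapper API:

```zig
// Sketch: arm one multishot receive. The kernel then posts a CQE per
// incoming message until the operation errors or is cancelled; only
// then does it need re-arming. Connection is a hypothetical type.
fn armMultishotRecv(ring: *IoUring, conn_fd: i32, conn: *Connection) !void {
    const sqe = try ring.getSubmissionEntry();
    sqe.prepRecv(conn_fd, null, 0); // no buffer attached up front
    sqe.ioprio |= IORING_RECV_MULTISHOT; // multishot flag (ioprio field)
    sqe.flags |= IOSQE_BUFFER_SELECT; // buffers come from a provided group
    sqe.user_data = @intFromPtr(conn);
    _ = try ring.submit();
}
```

While the operation stays armed, each CQE carries the IORING_CQE_F_MORE flag; a CQE without it signals that the receive has terminated and must be re-submitted.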
Registered file descriptors (IORING_REGISTER_FILES): Similar to registered buffers, pre-registering file descriptors eliminates fget()/fput() overhead in the kernel's file descriptor table lookup. For our database with 2,000+ open SST files, this removes measurable contention on the file descriptor table's read-copy-update (RCU) mechanism under concurrent access.
Kernel-side polling (IORING_SETUP_SQPOLL): A dedicated kernel thread polls the submission queue, eliminating the need for io_uring_enter() syscalls entirely. Our write-ahead log uses this because WAL appends are latency-critical and continuous. The trade-off is a dedicated CPU core for the kernel polling thread, which is worthwhile for our write-heavy workload.
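Setup is a flag plus an idle timeout at ring creation (a sketch; initWithParams is an assumed wrapper constructor, while the flag and the io_uring_params fields are the kernel's):

```zig
const linux = std.os.linux;

// Sketch: create a ring whose submission queue is polled by a kernel
// thread, trading a dedicated core for zero-syscall submission.
fn initPolledRing() !IoUring {
    var params = std.mem.zeroes(linux.io_uring_params);
    params.flags = linux.IORING_SETUP_SQPOLL;
    params.sq_thread_idle = 2000; // ms of idleness before the poller sleeps

    // After the idle timeout, the poll thread parks itself; the SQ ring
    // then advertises IORING_SQ_NEED_WAKEUP and a single io_uring_enter()
    // wakes it. Under steady write load, submission is a memory write.
    return IoUring.initWithParams(256, &params); // assumed wrapper ctor
}
```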
Buffer group selection (IORING_OP_PROVIDE_BUFFERS): Instead of pre-assigning a buffer to each read operation, we provide a group of buffers and let the kernel choose one at completion time. This eliminates the need to predict which reads will complete first, reducing buffer management complexity in our read path.
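The flow has three steps: publish a group of buffers, mark reads as buffer-selecting, and read the chosen buffer id out of the completion. A sketch (wrapper methods assumed; IOSQE_BUFFER_SELECT and IORING_CQE_BUFFER_SHIFT are the kernel's names):

```zig
const group_id: u16 = 1;

// Step 1: hand `count` buffers of `buf_size` bytes to buffer group
// `group_id` (IORING_OP_PROVIDE_BUFFERS).
fn provideBuffers(ring: *IoUring, pool: []u8, buf_size: u32, count: u16) !void {
    const sqe = try ring.getSubmissionEntry();
    sqe.prepProvideBuffers(pool.ptr, buf_size, count, group_id);
    _ = try ring.submit();
}

// Step 2: issue a read with no buffer attached; the kernel picks one
// from the group at completion time.
fn readWithSelectedBuffer(ring: *IoUring, fd: i32, len: u32, offset: u64) !void {
    const sqe = try ring.getSubmissionEntry();
    sqe.prepRead(fd, null, len, offset);
    sqe.flags |= IOSQE_BUFFER_SELECT;
    sqe.buf_group = group_id;
    _ = try ring.submit();
}

// Step 3: the chosen buffer's id arrives in the CQE flags.
fn selectedBufferId(cqe_flags: u32) u16 {
    return @intCast(cqe_flags >> IORING_CQE_BUFFER_SHIFT);
}
```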
The Trade-Off: Linux Only
io_uring is Linux-only. There is no equivalent on macOS, FreeBSD, or Windows. This is the single largest drawback.
For production, this is not a problem. Our servers run Linux. Every cloud provider runs Linux. If you are deploying a database, it runs on Linux.
For development, it is a problem. Half our engineering team uses macOS laptops. We maintain a kqueue fallback for macOS that provides the same API surface but uses kqueue for event notification and falls back to synchronous I/O for operations that io_uring would batch. The fallback is approximately 2x slower than the io_uring path, which is fine for development but means performance testing must happen on Linux.
The abstraction layer is thin: 600 lines of Zig that present a unified AsyncIO interface. The io_uring implementation is 400 lines. The kqueue fallback is 200 lines (simpler because it supports fewer advanced features). Comptime dispatch in Zig means zero runtime overhead for the abstraction:
```zig
const AsyncIO = if (builtin.os.tag == .linux)
    @import("io/uring.zig").IoUring
else if (builtin.os.tag == .macos)
    @import("io/kqueue.zig").KqueueIO
else
    @compileError("unsupported platform");
```
No vtable. No function pointers. The compiler generates specialized code for each platform. On Linux, you get raw io_uring performance with zero abstraction cost.
Why io_uring Changes the Equation
Before io_uring, building a high-performance storage engine in userspace required one of two approaches: use O_DIRECT with thread pools to simulate async I/O (what RocksDB does), or use libaio with its restrictions and limitations (no buffered I/O, no network I/O, fixed-size submission). Both approaches are cumbersome and leave performance on the table.
io_uring provides a single interface for disk I/O, network I/O, timers, file operations, and inter-process communication. It supports both buffered and direct I/O. It batches submissions. It supports operation linking. It provides zero-copy with registered buffers. It eliminates syscalls with kernel-side polling.
Our database would not exist without it. Not because we could not have built something, but because the performance targets we need, sub-millisecond range scans at 1 million operations per second with strict durability, are not achievable with traditional Linux I/O interfaces on commodity hardware. io_uring closes the gap between what the hardware can do and what the operating system lets us do.
That gap used to be enormous. With io_uring, it is nearly zero. And that changes what is possible for small teams building storage engines. You no longer need kernel bypass (DPDK/SPDK) or custom kernel modules to saturate modern NVMe hardware. You just need io_uring and the willingness to learn a new programming model.
The unreasonable effectiveness of io_uring is not that it makes I/O faster. It is that it makes fast I/O accessible.