Memory-Mapped I/O for Time-Series Ad Data

mmap is not a database strategy, unless you control the entire stack and know exactly what you are doing.

By Anokuro Engineering · Infrastructure

Andy Pavlo's 2022 paper "Are You Sure You Want to Use MMAP in Your Database Management System?" made a compelling case against mmap for general-purpose databases. The TL;DR: the OS page cache makes eviction decisions that are wrong for database workloads, TLB shootdowns kill multi-core performance, and you cannot control I/O scheduling. He is right. For general-purpose databases, mmap is a trap.

We use mmap anyway. Here is why we are not wrong.

Why the Conventional Wisdom Does Not Apply

Pavlo's paper identifies four problems with mmap in databases:

  1. Transactional safety: mmap pages can be written back at any time, breaking WAL ordering
  2. I/O stalls: page faults are synchronous and unpredictable
  3. Error handling: I/O errors on mmap'd pages become SIGBUS signals, not return codes
  4. Performance: the OS knows nothing about your access patterns

Every one of these is a real problem. And every one of them is solvable when your workload has specific, known characteristics. Ours does.

Our storage engine handles time-series ad data with these properties:

  • Append-heavy: 95% of writes are appends to the most recent time partition. We never update historical data.
  • Time-ordered: data is physically ordered by timestamp. Queries are time-range scans.
  • Rarely updated: once written, ad impression and bid records are immutable. The only mutations are late-arriving attribution events, which account for less than 0.3% of writes.
  • Scan-dominated: the primary query pattern is "give me all bid records for campaign X in the last N minutes." This is a sequential scan over a contiguous region.

This workload is the best case for mmap. Sequential scans over append-only, time-ordered data are exactly what the OS page cache was designed for. The page cache's LRU eviction policy is correct here because old data is cold data -- queries almost never touch records older than 24 hours.

Custom Page Fault Handling with madvise

The first problem with naive mmap is unpredictable page faults. When you access a page that is not resident, the kernel stalls your thread, issues a synchronous read, and blocks until the I/O completes. For a database serving latency-sensitive ad auctions, a random 200-microsecond stall is unacceptable.

We solve this with proactive faulting and madvise hints.

Pre-faulting hot regions. When a time partition is opened, we pre-fault the pages covering the most recent 5 minutes of data using madvise(MADV_WILLNEED):

fn prefaultPartition(self: *Partition, from_ts: i64) void {
    const start_offset = self.timestampToOffset(from_ts);
    const end_offset = self.currentWriteOffset();
    const region = self.mmap_base[start_offset..end_offset];

    // Advise kernel to read ahead
    std.posix.madvise(
        @alignCast(region.ptr), // madvise wants a page-aligned [*]align(...) u8
        region.len,
        std.posix.MADV.WILLNEED,
    ) catch |err| {
        log.warn("madvise WILLNEED failed: {}", .{err});
    };

    // Touch each page to force population
    var offset: usize = 0;
    while (offset < region.len) : (offset += PAGE_SIZE) {
        _ = @as(*volatile u8, @ptrCast(&region[offset])).*;
    }
}

The MADV_WILLNEED hint tells the kernel to start reading pages asynchronously. The subsequent volatile reads force-fault each page. After this function completes, the hot region is fully resident. Queries that hit this region will never stall on a page fault.

Releasing cold regions. When a partition ages beyond our 24-hour retention window, we call madvise(MADV_DONTNEED) to release the physical pages back to the OS without unmapping the virtual address range:

fn releaseColdPages(self: *Partition, before_ts: i64) void {
    const end_offset = self.timestampToOffset(before_ts);
    std.posix.madvise(
        @alignCast(self.mmap_base.ptr),
        end_offset,
        std.posix.MADV.DONTNEED,
    ) catch {};
}

This is manual eviction. We decide what to evict based on our knowledge of query patterns, not the kernel's LRU. Hot data stays resident. Cold data is released. The transition is explicit, not probabilistic.

Sequential scan hints. When a time-range query starts, we issue madvise(MADV_SEQUENTIAL) on the target region. This tells the kernel to read ahead aggressively and drop pages behind the scan cursor:

fn beginRangeScan(self: *Partition, from: i64, to: i64) ScanIterator {
    const start = self.timestampToOffset(from);
    const end = self.timestampToOffset(to);

    std.posix.madvise(
        @alignCast(self.mmap_base[start..end].ptr),
        end - start,
        std.posix.MADV.SEQUENTIAL,
    ) catch {};

    return .{
        .partition = self,
        .current = start,
        .end = end,
    };
}

With sequential hints, the kernel pre-fetches pages 128KB ahead of our read position. A 30-second range scan over 50MB of data completes with zero page faults after the first few pages because the kernel's readahead outruns our scan cursor.

TLB Shootdowns: The Real Enemy

TLB (Translation Lookaside Buffer) shootdowns are the performance problem that kills mmap-based databases at scale. When one core modifies a page table entry -- because a page was evicted, remapped, or permissions changed -- it must invalidate the TLB entry on every other core that might have cached it. This is done via an IPI (inter-processor interrupt), which stalls the target core for 1-5 microseconds.

On a 64-core server handling concurrent queries, TLB shootdowns are devastating. Pavlo's paper measured up to 25% throughput loss from TLB shootdowns under concurrent mmap workloads. This is real and we have measured it ourselves.

Our mitigation: partitioning.

Each time partition (covering one hour of data) is mapped independently into a separate virtual address range. Queries are routed to the specific partitions they need. A query for "last 5 minutes" touches at most 2 partitions (the current hour and possibly the previous one). A query for "last 24 hours" touches at most 25 partitions.

The critical insight: TLB shootdowns only affect cores that have accessed the same virtual address range. If core 0 is scanning partition 14 and core 31 is scanning partition 7, a page eviction in partition 14 does not cause a TLB shootdown on core 31. By partitioning our data into independent mappings, we confine TLB shootdowns to the cores that are actually sharing data.
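A minimal C sketch of the partitioning idea, assuming hourly partition files and a hypothetical `Partition` struct (names and layout are illustrative): each file gets its own independent read-only mapping, and queries are routed by timestamp so unrelated scans never share a virtual address range.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define MAX_PARTITIONS 25  /* 24h of retention spans at most 25 hourly partitions */

typedef struct {
    int64_t hour_start;  // inclusive start of the hour, unix seconds
    void   *base;        // this partition's own independent mapping
    size_t  len;
} Partition;

typedef struct {
    Partition parts[MAX_PARTITIONS];
    int count;
} PartitionSet;

// Map one partition file read-only into its own address range.
int partition_open(Partition *p, const char *path, int64_t hour_start) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;
    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return -1; }
    p->len = (size_t)st.st_size;
    p->base = mmap(NULL, p->len, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);  // the mapping keeps the file alive
    p->hour_start = hour_start;
    return p->base == MAP_FAILED ? -1 : 0;
}

// Route a timestamp to the partition that covers it, or NULL.
Partition *partition_for(PartitionSet *set, int64_t ts) {
    for (int i = 0; i < set->count; i++)
        if (ts >= set->parts[i].hour_start &&
            ts < set->parts[i].hour_start + 3600)
            return &set->parts[i];
    return NULL;
}
```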

We measured the impact:

| Configuration | TLB shootdowns/sec | p99 query latency |
|-------------------|--------------------|-------------------|
| Single large mmap | 47,000 | 8.2 ms |
| Hourly partitions | 3,200 | 2.6 ms |
| With CPU affinity | 1,100 | 1.9 ms |

The addition of CPU affinity -- pinning query threads for a specific time range to the same NUMA node -- reduces cross-core TLB invalidations further. Queries for recent data (the hot partition) run on cores 0-15. Queries for historical data run on cores 16-31. The partitions' page tables are rarely shared across NUMA nodes, minimizing the blast radius of shootdowns.
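The pinning itself can be done with Linux CPU affinity. A minimal sketch, assuming glibc's pthread_setaffinity_np; the 0-15 / 16-31 split above is the post's, the helper below is illustrative:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <pthread.h>
#include <sched.h>

// Pin the calling thread to cores [first, last] inclusive.
// e.g. pin_to_cores(0, 15) for hot-partition query threads.
int pin_to_cores(int first, int last) {
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int c = first; c <= last; c++)
        CPU_SET(c, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof set, &set);
}
```

Keeping a partition's readers on one NUMA node means its page-table entries are cached in fewer cores' TLBs, so an invalidation has fewer cores to interrupt.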

Transactional Safety: The WAL Boundary

Pavlo's first objection -- that mmap breaks WAL ordering because the OS can flush dirty pages before the WAL is durable -- is completely valid. It is also irrelevant to our architecture.

Our mmap regions are read-only. We never write through mmap. The write path goes through a separate WAL + memtable pipeline:

  1. Write arrives and is appended to the WAL with fdatasync
  2. Write is inserted into an in-memory buffer (the memtable)
  3. When the memtable reaches 16MB, it is flushed to a new immutable file on disk
  4. The new file is mmap'd read-only and added to the partition

Because the mmap regions are PROT_READ only, the OS cannot write them back. There are no dirty pages to flush out of order. The WAL provides durability. The mmap provides fast reads. The two paths never interfere.
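The four steps above can be sketched in C as follows. The `Engine` struct, file paths, and buffer size are illustrative (the sketch uses a 64 KB buffer instead of 16 MB); the point is the ordering: fdatasync the WAL before acknowledging, and map the flushed file PROT_READ only.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

enum { MEMTABLE_CAP = 1 << 16 };  /* 64 KB here; the post's threshold is 16 MB */

typedef struct {
    int    wal_fd;
    char   buf[MEMTABLE_CAP];  // the in-memory buffer ("memtable")
    size_t used;
} Engine;

// Steps 1 + 2: durable WAL append, then insert into the in-memory buffer.
int engine_write(Engine *e, const void *rec, size_t len) {
    if (write(e->wal_fd, rec, len) != (ssize_t)len) return -1;
    if (fdatasync(e->wal_fd) != 0) return -1;    // durable before we ack
    if (e->used + len > MEMTABLE_CAP) return -1; // caller must flush first
    memcpy(e->buf + e->used, rec, len);
    e->used += len;
    return 0;
}

// Steps 3 + 4: flush the buffer to an immutable file, map it read-only.
const char *engine_flush(Engine *e, const char *path, size_t *out_len) {
    int fd = open(path, O_CREAT | O_TRUNC | O_RDWR, 0644);
    if (fd < 0) return NULL;
    if (write(fd, e->buf, e->used) != (ssize_t)e->used) { close(fd); return NULL; }
    fdatasync(fd);
    void *base = mmap(NULL, e->used, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);
    if (base == MAP_FAILED) return NULL;
    *out_len = e->used;
    e->used = 0;  // buffer reset; the WAL can be truncated past this point
    // PROT_READ mapping: the OS has no dirty pages to write back out of order
    return (const char *)base;
}
```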

Error Handling: SIGBUS

When an mmap'd page cannot be read (disk error, NFS timeout, etc.), the kernel delivers SIGBUS to the process. In a naive implementation, this kills the process. We install a SIGBUS handler that records the faulting address and sets a thread-local error flag:

fn sigbusHandler(sig: i32, info: *const std.posix.siginfo_t, ctx: ?*anyopaque) callconv(.c) void {
    _ = sig;
    const addr = @intFromPtr(info.fields.sigfault.addr);
    tls_fault_addr.* = addr;
    tls_fault_flag.* = true;

    // Advance the instruction pointer past the faulting load. The IP lives
    // in the ucontext_t passed as ctx (architecture-specific)
    advanceIP(ctx);
}

After every page access in a scan, we check the fault flag. If set, we log the error, skip the corrupted page, and continue the scan. This is ugly. It uses architecture-specific instruction pointer manipulation. But it converts a process-killing signal into a recoverable error, which is the difference between "the dashboard is down" and "the dashboard shows a gap in the data."

The Results: 3.1x Improvement in p99

We benchmarked our mmap-based read path against a read()-based implementation on the same data and hardware.

Test: 1,000 concurrent time-range queries, each scanning 30 seconds of data (~2MB). AMD EPYC 7763, 256GB RAM, Samsung PM9A3 NVMe. Data set: 500GB total, 6GB hot (last hour).

| Approach | p50 latency | p99 latency | Throughput (queries/sec) |
|----------------------|-------------|-------------|--------------------------|
| read() + buffer pool | 1.4 ms | 5.9 ms | 12,400 |
| mmap (naive) | 1.1 ms | 11.2 ms | 10,800 |
| mmap (our approach) | 0.9 ms | 1.9 ms | 18,700 |

Naive mmap is worse than read() at p99 because of unpredictable page faults and TLB shootdowns. Our approach -- pre-faulting, partition isolation, CPU affinity, sequential hints -- eliminates the tail latency problems while keeping mmap's advantage: zero system call overhead for reads.

The 3.1x improvement in p99 (5.9ms to 1.9ms) comes from three factors:

  1. No syscall overhead. With read(), every page fetched into the buffer pool costs a system call; mmap accesses are plain pointer dereferences.
  2. No buffer pool management. The read() approach maintains a user-space buffer pool with hash table lookups, reference counting, and eviction. mmap offloads this to the kernel.
  3. Pre-faulting eliminates stalls. Hot data is always resident. Cold data is explicitly released. The unpredictability that kills naive mmap is removed.

The Caveat

This works because we control every layer. We control the data format (fixed-size records, time-ordered). We control the access patterns (sequential scans, append-only writes). We control the hardware (NVMe with predictable latency, enough RAM to keep hot data resident). We control the OS configuration (huge pages enabled, vm.dirty_ratio tuned, NUMA-aware allocation).
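As a rough illustration of that OS configuration (the specific values below are assumptions for a sketch, not our production settings):

```shell
# Reserve explicit 2 MB huge pages for the hot mappings
sysctl -w vm.nr_hugepages=2048

# Start background writeback early and cap dirty memory so flushes stay smooth
sysctl -w vm.dirty_background_ratio=5
sysctl -w vm.dirty_ratio=10

# Keep allocations on the local NUMA node instead of interleaving
numactl --localalloc -- ./storage-engine
```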

If you are building a general-purpose database, do not use mmap. Pavlo is right and you should listen to him.

If you are building a specialized storage engine for a specific workload on hardware you control, mmap is not a strategy -- it is a weapon. You just have to know exactly where to aim it.

Copyright © 2026 Anokuro Pvt. Ltd. Singapore. All rights reserved.