I/O Stalls
One of the main reasons to avoid mmap() and manage memory ourselves is to control when I/O stalls happen. With a Buffer Pool, we can overlap computation on other threads & explicitly decide when to issue disk I/O. There’s also an opportunity to perform asynchronous I/O, but that is less common.
Page faults
If our database thread tries to fetch data at a virtual address that hasn’t been loaded from disk yet, a page fault is triggered (not a hardware interrupt). The OS immediately issues the disk read & suspends the thread. When the disk completes the transfer, the OS is notified with a disk interrupt. The thread is placed in the ready state and, at some undetermined later time, the OS scheduler runs it on a CPU again.
If we implement our own Buffer Pool, the thread is still suspended and still waits for the transfer from disk. So why not stick to mmap()?
Page faults aren’t hardware interrupts in the traditional sense – they’re CPU exceptions (trap/fault). The MMU detects an invalid virtual-to-physical mapping and synchronously traps into the kernel’s page fault handler. The handler then issues the I/O. This is different from device interrupts, which are asynchronous signals from hardware.
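To make the two paths concrete, here is a minimal sketch contrasting them; the file name pages.db and the 4 KiB page size are assumptions for illustration. With mmap(), the access itself may fault and the kernel issues the read implicitly; with an explicit pread() into a buffer we own, we choose the moment the I/O starts.

```c
/* Sketch: implicit fault-driven I/O vs. explicit read into our own buffer.
 * "pages.db" and PAGE_SIZE are illustrative assumptions. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#define PAGE_SIZE 4096

int main(void) {
    int fd = open("pages.db", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    /* mmap() path: no I/O happens at mapping time... */
    char *mapped = mmap(NULL, PAGE_SIZE, PROT_READ, MAP_PRIVATE, fd, 0);
    if (mapped == MAP_FAILED) { perror("mmap"); return 1; }
    char first = mapped[0];   /* ...this access may page-fault: the CPU traps
                                 into the kernel, which issues the disk read
                                 and suspends the thread until data arrives. */

    /* Buffer-pool path: we decide exactly when the disk read is issued. */
    char *frame = malloc(PAGE_SIZE);
    if (pread(fd, frame, PAGE_SIZE, 0) != PAGE_SIZE) { perror("pread"); return 1; }

    printf("%d %d\n", first, frame[0]);
    free(frame);
    munmap(mapped, PAGE_SIZE);
    close(fd);
    return 0;
}
```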
I/O timing
In all cases, the thread will be rescheduled by the OS when the disk completes the transfer to RAM and issues an interrupt. The key difference is when the disk I/O is started (when read() is called).
With a Buffer Pool, rather than immediately launch a disk I/O we can make more intelligent decisions that the OS cannot. The primary benefit is batching the I/O. Imagine we have multiple transactions:
- Transaction 1: needs page A
- Transaction 2: needs page B
- Transaction 3: needs page A
In this scenario, the buffer pool can see the access pattern but the OS cannot. The buffer pool can decide to prioritise reading page A, performing a single disk I/O that serves multiple threads.
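A minimal sketch of that idea follows; the page IDs, the file name pages.db, and the tiny frame table are illustrative assumptions. Requests from the three transactions arrive at the pool, and because page A is already resident when Transaction 3 asks for it, only two disk reads are issued.

```c
/* Sketch: deduplicating page requests so one disk read serves many transactions.
 * Page IDs, "pages.db", and the pool size are illustrative assumptions. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

#define PAGE_SIZE   4096
#define POOL_FRAMES 8

static char frames[POOL_FRAMES][PAGE_SIZE];
static int  frame_page[POOL_FRAMES];   /* which page each frame holds, -1 = empty */

/* Return the frame holding page_id, reading from disk only if no
 * earlier request already brought the page in. */
static char *fetch_page(int fd, int page_id) {
    for (int i = 0; i < POOL_FRAMES; i++)
        if (frame_page[i] == page_id)
            return frames[i];          /* already resident: no disk I/O */

    for (int i = 0; i < POOL_FRAMES; i++) {
        if (frame_page[i] == -1) {
            pread(fd, frames[i], PAGE_SIZE, (off_t)page_id * PAGE_SIZE);
            frame_page[i] = page_id;
            return frames[i];
        }
    }
    return NULL;                       /* pool full: eviction not shown here */
}

int main(void) {
    for (int i = 0; i < POOL_FRAMES; i++) frame_page[i] = -1;
    int fd = open("pages.db", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    /* Txn 1 wants page A (0), Txn 2 wants page B (1), Txn 3 wants page A again.
     * Only two preads are issued; the third request is served from the pool. */
    fetch_page(fd, 0);
    fetch_page(fd, 1);
    fetch_page(fd, 0);

    close(fd);
    return 0;
}
```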
TLB shoot-downs
Another risk is a severe performance hit when translation look-aside buffers (TLBs) must be invalidated.
With a buffer pool, a static region of memory is assigned to the process by the OS, and it maps to the same physical memory region. After first access, the TLB in each CPU core can hold the same translation until the DB process exits. The buffer pool reads pages into, and evicts pages from, this fixed physical region. Even under memory pressure, assuming mlock() is used, the virtual addresses always map to the same physical locations.
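One way this can be set up is sketched below; the pool size and mapping flags are assumptions rather than a prescription. The region is reserved once at startup and locked with mlock(), so its virtual-to-physical mapping stays fixed for the life of the process.

```c
/* Sketch: pinning a fixed buffer-pool region so its physical frames stay put.
 * The 64 MiB size is an illustrative assumption; locking it may require
 * raising RLIMIT_MEMLOCK or elevated privileges. */
#include <stdio.h>
#include <sys/mman.h>

#define POOL_BYTES (64UL << 20)   /* 64 MiB pool for the sketch */

int main(void) {
    /* One anonymous mapping for the whole pool, created at startup. */
    void *pool = mmap(NULL, POOL_BYTES, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (pool == MAP_FAILED) { perror("mmap"); return 1; }

    /* Lock it into RAM: the kernel may not evict these pages, so the
     * virtual addresses keep mapping to the same physical frames and
     * TLB entries stay valid for the lifetime of the process. */
    if (mlock(pool, POOL_BYTES) != 0) { perror("mlock"); return 1; }

    /* ... the buffer pool reads and evicts database pages within this region ... */

    munlock(pool, POOL_BYTES);
    munmap(pool, POOL_BYTES);
    return 0;
}
```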
With mmap(), each page is loaded into whatever physical frame the OS chooses at the time. On page reads and evictions, those frames change: after an eviction, it’s unlikely a page will be read back into the same physical frame. The changing physical frames mean the TLB entry must be invalidated on the CPU core doing the read, and if other cores hold the stale translation they must flush it too. An inter-processor interrupt is sent to all other cores, which is very expensive.
With mmap, virtual addresses stay constant (each file offset maps to the same VA), but the physical frames backing those VAs change on eviction/reload.