A collection of data engineering notes.

  • Why mmap() tanks databases

    I/O Stalls: Among the reasons we want to avoid mmap() and manage memory ourselves is control over when I/O stalls happen. With a buffer pool, we can overlap disk I/O with computation on other threads and explicitly decide when to issue that I/O. There’s also an opportunity to perform asynchronous I/O, but that is less common. Page faults…

  • How the Delta Format Achieves Isolation

    There is a very heavy penalty to pay in Delta Lake when transactions conflict. This note explores how the Delta format achieves isolation between transactions. In the best-case scenario we enjoy lock-free concurrency; in the worst case we waste compute and storage. That is the consequence of how concurrency is achieved with the…

  • How Delta Lakes Implement Atomicity

    I wanted to clarify for myself how we achieve Atomicity in Delta Lake, and it’s easier for me to compare it to traditional databases. Atomicity means that when we commit a transaction, it’s an all-or-nothing operation: either the transaction succeeds or it fails, with nothing in between. Consistency means that despite our transaction being atomic, if…