How Delta Lake Implements Atomicity

I wanted to clarify for myself how Delta Lake achieves atomicity, and the easiest way for me to do that is to compare it to traditional databases.

Atomicity means that committing a transaction is an all-or-nothing operation: either the whole transaction succeeds or none of it does, nothing in between. Consistency means that even an atomic transaction is rejected and rolled back if it violates some database constraint (e.g., a foreign key). Atomicity pertains to the mechanism, whereas consistency checks whether transactions make logical sense given the constraints of the data model.

Let's first cover how "traditional" databases like MySQL and Postgres achieve atomicity with a write-ahead log (WAL). It's best understood from the perspective of recovery:

We make changes to pages in memory, and the buffer pool flushes them to disk at unpredictable times. Imagine: we modify a page, the buffer pool keeps it in memory (maybe it's pinned by another thread), and the DB crashes before it's written to disk. That commit was not durable: the user received a success message, but the page only ever existed in RAM.

[Traditional DB - Without WAL]

Transaction modifies Page X in buffer pool
  ↓
User gets "COMMIT SUCCESS"
  ↓
💥 CRASH (before page flushes)
  ↓
Page X changes lost forever

By writing the WAL (with fsync) to disk before acknowledging the COMMIT, the buffer pool is free to flush pages whenever it wants (following something like LRU-K). If the DB crashes, we simply replay the log on restart, so no committed change is lost and no partial write survives. (A minimal Python sketch of this discipline follows the diagram below.)

[Traditional DB - With WAL]

1. Modify Page X in memory
2. Write changes to WAL + fsync  ✓
3. Return "COMMIT SUCCESS" to user
4. (Later) Flush Page X to disk

If crash happens at step 4:
→ Replay WAL on restart
→ Page X changes restored
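
To make the ordering concrete, here is a minimal sketch of the WAL discipline in Python. The file name, record format, and apply_to_buffer_pool stub are hypothetical; real systems like Postgres use binary log records, LSNs, and checkpoints, but the invariant is the same: fsync the log before acknowledging.

[Python sketch - WAL discipline]

import json
import os

WAL_PATH = "wal.log"  # hypothetical log file

def apply_to_buffer_pool(changes):
    """Stand-in for modifying pages in memory."""

def commit(changes):
    # Append the record and fsync BEFORE acknowledging the commit.
    record = json.dumps(changes) + "\n"
    with open(WAL_PATH, "a") as wal:
        wal.write(record)
        wal.flush()
        os.fsync(wal.fileno())     # the log record is now durable on disk
    apply_to_buffer_pool(changes)  # dirty pages can flush whenever
    print("COMMIT SUCCESS")        # only now is it safe to acknowledge

def recover():
    # On restart, replay every log record to restore committed state.
    if os.path.exists(WAL_PATH):
        with open(WAL_PATH) as wal:
            for line in wal:
                apply_to_buffer_pool(json.loads(line))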

So how does the transaction log in the Delta format give us atomicity? After all, it's not a WAL.

[Delta Lake Architecture]

Compute Engine (Spark/DuckDB/Trino)
  ↓
Delta Libraries (embedded)
  ↓
Object Storage (S3/GCS/ADLS)

In the Delta format, we have a compute engine (e.g., Spark, DuckDB, Trino). The engine manages data in memory, reading and writing Parquet files. The Delta libraries are embedded in the engine and intercept its reads and writes to the storage layer (our proverbial disk) – in this case, distributed object storage (S3, GCS, ADLS, etc.).
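
For orientation, here is what that stack looks like from user code. This is a sketch using the deltalake Python package (the delta-rs bindings); the table path is a local stand-in for an object-store URI like s3://bucket/events.

[Python sketch - Writing and reading a Delta table]

import pandas as pd
from deltalake import DeltaTable, write_deltalake

# pandas plays the role of the compute engine here; the Delta library
# manages the Parquet files and the transaction log underneath.
df = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})
write_deltalake("/tmp/events", df)  # commit 0: new Parquet file(s) + log entry

dt = DeltaTable("/tmp/events")
print(dt.version())  # 0 -- the first commit
print(dt.files())    # only the Parquet files visible in the current snapshot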

Here's the key difference: there is no recovery… what?

The compute engine does not update pages in place, so there's no unpredictable page flushing to worry about. Modified Parquet files are rewritten to storage as new files (never overwriting the existing ones), and a JSON log entry is appended describing the write operation (copy-on-write). The JSON log entry says:

  • Remove: file_A.parquet ← the one we modified
  • Add: file_B.parquet ← the new one with the modified data

Therefore, if there's a crash (typically some network error), either the JSON entry exists or it doesn't. Even if the new Parquet files were written, without a log entry they are logically invisible (orphaned) to the compute engine. Thus, the sequence is the opposite of a traditional database's: we write the Parquet files first, and only then write our JSON log (see the sketch after the commit-sequence diagram below).

When we write a single file in such storage systems, the write is guaranteed to be atomic: the object either appears in full or not at all. This is crucial, as we will see later when we discuss isolation in Delta Lake.

[Delta Lake Commit Sequence]

1. Write file_B.parquet (new data) ✓
2. Write JSON log entry atomically ✓
   {
     "remove": {"path": "file_A.parquet"},
     "add": {"path": "file_B.parquet"}
   }
3. Return "COMMIT SUCCESS"

If crash between steps 1 and 2:
→ file_B.parquet orphaned
→ No log entry → logically invisible
→ No recovery needed

Object storage guarantees:
Single file write is atomic ✓
This opens a new question! If every write adds new files rather than updating old ones, how do we handle the seemingly explosive storage growth?

