I wanted to clarify to myself how we achieve atomicity in Delta Lake, and it's easier for me to compare it to traditional databases.
Atomicity means that when we commit a transaction, it's an all-or-nothing operation: either the transaction succeeds or it fails, nothing in between. Consistency means that even though our transaction is atomic, if some database constraint (e.g., a foreign key) is violated, the operation is rejected and rolled back. Atomicity pertains to the mechanism, whereas consistency checks whether the transaction makes logical sense given the constraints of the data model.
Let's first cover how "traditional" databases like MySQL and Postgres achieve atomicity with a write-ahead log (WAL). It's best understood from the perspective of recovery:
We make changes to pages in memory – they can be flushed to disk at unpredictable times. Imagine: we modify a page, the buffer pool keeps it in memory (maybe it's pinned by another thread), and the DB crashes before it's written to disk. That means our commit was not durable – the user received a success message, but the page was still in RAM.
[Traditional DB - Without WAL]
Transaction modifies Page X in buffer pool
↓
User gets "COMMIT SUCCESS"
↓
💥 CRASH (before page flushes)
↓
Page X changes lost forever
By writing the WAL (with fsync) to disk before acknowledging the COMMIT, the buffer pool can flush pages whenever it wants (following something like LRU-K). If the DB crashes, we simply replay the log, so no partial writes survive.
[Traditional DB - With WAL]
1. Modify Page X in memory
2. Write changes to WAL + fsync ✓
3. Return "COMMIT SUCCESS" to user
4. (Later) Flush Page X to disk
If crash happens at step 4:
→ Replay WAL on restart
→ Page X changes restored
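To make the ordering concrete, here is a minimal sketch of the WAL idea in Python. This is not how MySQL or Postgres are actually implemented, and the helper names (wal_append, commit, replay) are made up; the only point is the order of operations: log + fsync first, acknowledge second, flush pages whenever.

import json
import os

WAL_PATH = "wal.log"   # assumed log location
pages = {}             # stand-in for the buffer pool (page_id -> value)

def wal_append(record: dict) -> None:
    """Append a log record and fsync BEFORE the commit is acknowledged."""
    with open(WAL_PATH, "a") as f:
        f.write(json.dumps(record) + "\n")
        f.flush()
        os.fsync(f.fileno())   # the record is durable even if pages never flush

def commit(page_id: str, value: str) -> None:
    pages[page_id] = value                          # 1. modify page in memory
    wal_append({"page": page_id, "value": value})   # 2. WAL + fsync
    print("COMMIT SUCCESS")                         # 3. ack only after the log is on disk
    # 4. the page itself can be flushed to disk later, at any time

def replay() -> dict:
    """On restart, rebuild page state by replaying the log."""
    state = {}
    if os.path.exists(WAL_PATH):
        with open(WAL_PATH) as f:
            for line in f:
                rec = json.loads(line)
                state[rec["page"]] = rec["value"]
    return state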
So how does the transaction log in the Delta format give us atomicity? After all, it's not a WAL.
[Delta Lake Architecture]
Compute Engine (Spark/DuckDB/Trino)
↓
Delta Libraries (embedded)
↓
Object Storage (S3/GCS/ADLS)
In the Delta format, we have a compute engine (e.g., Spark, DuckDB, Trino). The engine manages data in memory, reading and writing Parquet files. The Delta libraries are embedded in the engine to intercept and interact with the storage layer (our proverbial disk) – in this case, distributed object storage (S3, GCS, ADLS, etc.).
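As a concrete (hedged) example of what "engine + embedded Delta library + object storage" looks like from user code, here is a sketch using the deltalake Python package (the delta-rs bindings) and pyarrow, assuming both are installed. Local paths are used for illustration; the same calls work against s3:// or other object-store URIs.

import pyarrow as pa
from deltalake import DeltaTable, write_deltalake

# The embedded Delta library writes the Parquet data files AND the _delta_log/ JSON commits.
data = pa.table({"id": [1, 2, 3], "value": ["a", "b", "c"]})
write_deltalake("./my_table", data)   # creates _delta_log/00000000000000000000.json

# A second write adds new Parquet files plus a new JSON commit; nothing is updated in place.
more = pa.table({"id": [4], "value": ["d"]})
write_deltalake("./my_table", more, mode="append")

dt = DeltaTable("./my_table")
print(dt.version())   # latest committed version (1 after the two writes above)
print(dt.files())     # only the Parquet files referenced by the log are visible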
Here's the key difference: there is no recovery… what?
The compute engine does not update pages in place, so there's no need to worry about unpredictable page flushing to disk. New Parquet files are written to storage (added rather than overwriting the existing ones), and a JSON log entry is added describing the write operation (copy-on-write). The JSON log entry says:
Remove: file_A.parquet → the one we modified
Add: file_B.parquet → the new one with the modified data
Therefore, if there's a crash (typically some network error), either the JSON entry exists or it doesn't. Even if the new Parquet files exist, they are logically invisible (orphaned) to the compute engine. Thus, the sequence is the opposite of traditional databases: we write the Parquet files first, and only then write our JSON log.
When we write a single file in such storage systems, the write is guaranteed to be atomic. This is crucial, as we will see later when we discuss isolation in Delta Lake.
[Delta Lake Commit Sequence]
1. Write file_B.parquet (new data) ✓
2. Write JSON log entry atomically ✓
{
"remove": {"path": "file_A.parquet"},
"add": {"path": "file_B.parquet"}
}
3. Return "COMMIT SUCCESS"
If crash happens during or after step 1 (before step 2):
→ file_B.parquet orphaned
→ No log entry → logically invisible
→ No recovery needed
Object storage guarantees:
Single file write is atomic ✓
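Here is a minimal sketch of that commit sequence, assuming a local directory standing in for object storage. The function and variable names (commit_rewrite, table_dir, commit_version) are made up for illustration, and O_CREAT|O_EXCL stands in for the object store's atomic "create this file exactly once" primitive; a real implementation would write actual Parquet and use the store's own conditional-put or rename mechanism.

import json
import os
import uuid

def commit_rewrite(table_dir: str, commit_version: int,
                   old_file: str, new_rows: bytes) -> str:
    # 1. Write the NEW data file first. If we crash here, the file is merely
    #    orphaned: no log entry references it, so readers never see it.
    new_file = f"part-{uuid.uuid4().hex}.parquet"
    with open(os.path.join(table_dir, new_file), "wb") as f:
        f.write(new_rows)

    # 2. Only then publish the commit as a single JSON log file. Creating this
    #    one file either fully succeeds or fully fails: that is the atomic step.
    log_dir = os.path.join(table_dir, "_delta_log")
    os.makedirs(log_dir, exist_ok=True)
    log_path = os.path.join(log_dir, f"{commit_version:020d}.json")
    actions = [
        {"remove": {"path": old_file}},
        {"add": {"path": new_file}},
    ]
    fd = os.open(log_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)  # fail if it already exists
    with os.fdopen(fd, "w") as f:
        f.write("\n".join(json.dumps(a) for a in actions))
        f.flush()
        os.fsync(f.fileno())

    # 3. Ack only after the log entry exists.
    return "COMMIT SUCCESS"

Because step 2 is the only step that changes what readers can see, crashing anywhere before it leaves the table logically unchanged; that is the whole atomicity argument.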
This opens a new question! How do we handle this seemingly explosive storage growth?