Aayush Naik

@naikaayush

Delta Lake Write Internals: INSERT, UPDATE, DELETE From the Ground Up

Submitted Jun 25, 2026

Delta Lake makes a table mutable even though the underlying Parquet files are physically immutable. In this talk, we will dive into the internals of the INSERT, UPDATE, and DELETE operations. We start with the building blocks: Parquet (columnar storage) and the Delta log (a transaction record of all add and remove actions), which together constitute a Delta table. We then look at INSERT, which is straightforward: write a new Parquet file and append an add action to the log. The interesting one is UPDATE.
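
To make the table model concrete, here is a minimal in-memory sketch, not Delta Lake's actual implementation. The names `ToyDeltaTable`, `Action`, `live_files`, and the `part-0000N.parquet` naming are hypothetical; a dict of row lists stands in for immutable Parquet files, and a list of commits stands in for the Delta log.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Action:
    kind: str  # "add" or "remove"
    path: str  # the data file this action refers to


@dataclass
class ToyDeltaTable:
    # path -> rows; stands in for immutable Parquet files (assumption)
    files: dict = field(default_factory=dict)
    # ordered list of commits; each commit is a list of Actions
    log: list = field(default_factory=list)

    def live_files(self):
        """Replay the log in order to find the files in the current snapshot."""
        live = set()
        for commit in self.log:
            for action in commit:
                if action.kind == "add":
                    live.add(action.path)
                else:
                    live.discard(action.path)
        return live

    def insert(self, rows):
        """INSERT: write a new data file, then append an add action."""
        path = f"part-{len(self.files):05d}.parquet"
        self.files[path] = list(rows)            # the data file is written first
        self.log.append([Action("add", path)])   # then the commit makes it visible
        return path

    def scan(self):
        """Read every row from the files that are currently live."""
        return [row for path in sorted(self.live_files()) for row in self.files[path]]


table = ToyDeltaTable()
table.insert([{"id": 1, "qty": 5}])
table.insert([{"id": 2, "qty": 7}])
print(table.scan())
```

In this sketch a file only becomes part of the table once an add action for it is in the log, which is why INSERT never has to touch existing files.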

The bulk of the talk concentrates on UPDATE, where we walk through copy-on-write semantics. Starting with the naive approach (rewrite every row in the table for a one-row change), we show why it collapses, then build the real mechanism. A first scan finds the rows that need to be updated and retrieves the IDs of the files that contain them. A second scan reads only those files and modifies the matching rows according to the query. We then do three things:

  1. write the new Parquet files
  2. append an add action for each new file
  3. append a remove action for each affected file
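
The two scans and the commit described above can be sketched as follows. This is a toy model under stated assumptions, not Delta Lake's code: `update_copy_on_write` is a hypothetical helper, `files` is a dict of row lists standing in for Parquet files, and `log` is a list of commits holding `(kind, path)` tuples.

```python
def update_copy_on_write(files, log, predicate, change):
    """Toy copy-on-write UPDATE.

    files: dict path -> list of row dicts (stand-in for Parquet files)
    log: list of commits, each a list of ("add" | "remove", path) tuples
    predicate: row -> bool, selects the rows to update
    change: row -> new row, applied to the selected rows
    Returns the paths of the files that were rewritten.
    """
    # Replay the log to find the files in the current snapshot.
    live = set()
    for commit in log:
        for kind, path in commit:
            if kind == "add":
                live.add(path)
            else:
                live.discard(path)

    # Scan 1: find the files that contain at least one matching row.
    touched = [p for p in sorted(live) if any(predicate(r) for r in files[p])]
    if not touched:
        return []

    commit = []
    for n, old_path in enumerate(touched):
        # Scan 2: read only the touched files and copy every row,
        # changing the ones that match the predicate.
        new_rows = [change(dict(r)) if predicate(r) else dict(r)
                    for r in files[old_path]]
        new_path = f"part-v{len(log)}-{n}.parquet"  # hypothetical naming
        files[new_path] = new_rows                   # the old file is left untouched
        commit.append(("remove", old_path))
        commit.append(("add", new_path))

    log.append(commit)  # one commit carries all add and remove actions
    return touched


files = {
    "a.parquet": [{"id": 1, "qty": 5}, {"id": 2, "qty": 7}],
    "b.parquet": [{"id": 3, "qty": 9}],
}
log = [[("add", "a.parquet"), ("add", "b.parquet")]]
rewritten = update_copy_on_write(
    files, log, lambda r: r["id"] == 2, lambda r: {**r, "qty": 0}
)
print(rewritten, log[-1])
```

Note that the unchanged row with `id` 1 is copied into the new file as well. That copying of untouched rows is the write amplification the talk refers to.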

We then extend the same idea to DELETE, distinguishing three types of delete: full-table, partition, and predicate-based. We close by introducing the merge-on-read concept with the help of the deletion vectors feature.
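
As a rough illustration of merge-on-read, the sketch below records deleted row positions per file instead of rewriting the file, and the reader filters them out at scan time. The helpers `delete_merge_on_read` and `scan` are hypothetical, and a plain Python set of row indexes stands in for a deletion vector; the real on-disk encoding is not modeled here.

```python
def delete_merge_on_read(files, deletion_vectors, predicate):
    """Toy merge-on-read DELETE: mark matching rows, rewrite nothing.

    files: dict path -> list of row dicts (stand-in for Parquet files)
    deletion_vectors: dict path -> set of deleted row positions (assumption)
    """
    for path, rows in files.items():
        hits = {i for i, row in enumerate(rows) if predicate(row)}
        if hits:
            # Merge with any positions already marked as deleted for this file.
            deletion_vectors[path] = deletion_vectors.get(path, set()) | hits


def scan(files, deletion_vectors):
    """Reader side: skip the row positions listed in each file's deletion vector."""
    out = []
    for path in sorted(files):
        deleted = deletion_vectors.get(path, set())
        out.extend(row for i, row in enumerate(files[path]) if i not in deleted)
    return out


files = {
    "a.parquet": [{"id": 1}, {"id": 2}],
    "b.parquet": [{"id": 3}],
}
deletion_vectors = {}
delete_merge_on_read(files, deletion_vectors, lambda r: r["id"] == 2)
print(scan(files, deletion_vectors))
```

The write is cheap because no data file is copied, but every later read pays for the extra filtering step.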

Footnotes: data-skipping optimizations such as partition pruning, min/max values, and column statistics are mentioned but not covered in depth.

Takeaways

  • A systems-level mental model of copy-on-write and the internals of DML operations.
  • The trade-off between copy-on-write and merge-on-read (write amplification for the writer versus extra filtering for the reader).

Audiences
Data engineers working with lakehouse architectures (Delta/Iceberg/Hudi). Platform engineers who own pipeline performance and storage costs, and anyone who has wondered why a small UPDATE rewrites large amounts of data. Useful for those debugging write amplification or evaluating when to enable deletion vectors.

Bio
I work at e6data as part of the query engine team. I am currently working on adding DML support for Delta Lake tables to the Rust-based query engine.

Presentation Link
https://pitch.com/v/delta-lake-internals-tjm7wv

A link to the workshop I conducted as part of Rootconf
https://hasgeek.com/rootconf/build-a-sql-query-engine-from-scratch-workshop/


Hosted by

Jumpstart better data engineering and AI futures