Apurva

@apurvarathi

From WALs to Indexes: The Database Internals Hidden Inside Modern Lakehouses

Submitted Jun 23, 2026

Abstract

Lakehouses built on open table formats have emerged as the de facto architecture for modern analytical data systems, yet few practitioners appreciate how deeply database internals underpin their design. Modern open table formats are often described as metadata layers on top of Parquet files, but beneath the surface they have quietly reinvented many of the core ideas that powered databases for decades. Delta Lake uses transaction logs reminiscent of database write-ahead logs (WALs), Iceberg relies on hierarchical metadata structures that behave like indexes, while Hudi and Paimon draw inspiration from storage-engine concepts such as log-structured storage and compaction.

This talk explores how the move to object storage forced data systems to reimplement transactions, concurrency control, indexing, catalogs, compaction, and snapshot isolation, and what these design choices reveal about the convergence of databases and lakehouses. Through architectural comparisons of Delta Lake, Apache Iceberg, Apache Hudi, and Apache Paimon, attendees will gain a deeper understanding of the storage and metadata foundations that power modern analytical platforms.

We conclude by exploring a growing industry trend toward tiered architectures that combine specialised storage engines for hot data with open table formats for historical data, raising an intriguing question: are we beginning to come full circle after a decade of rebuilding database primitives on top of object stores?

Note: The talk assumes familiarity with data lakes and SQL analytics but does not require prior knowledge of open table formats.

Key Takeaways

  • Build an intuition for the database internals hidden beneath modern lakehouses, including transaction logs, snapshots, catalogs, metadata indexes, and storage layouts.
  • Leave with a framework for reasoning about the evolution of data architectures—from databases to data lakes to lakehouses—and where the industry may be headed next.

Target Audience

This session is intended for data engineers, data platform engineers, software engineers, and technical architects who want to understand the database and storage-system concepts that underpin modern lakehouse architectures built on open table formats such as Delta Lake, Iceberg, Hudi, and Paimon. Familiarity with data lakes or analytical data platforms is helpful, but no prior knowledge of open table formats is required.

Bio

Apurva Rathi is a data platform engineer with over a decade of experience building large-scale data systems at Atlassian, Meta, and Mastercard. Her work has ranged from data engineering to building self-service, governed data platforms that enable organizations to make data-driven decisions at scale.

Most recently, she led the development of core components of Atlassian’s next-generation data platform, including self-service transformation, ingestion, data sharing, and metrics capabilities. During her sabbatical, she has been exploring the internals of modern data systems, with a particular focus on open table formats, storage and query engines, stream processing, and AI-native data architectures. She has authored a series of technical deep-dives on lakehouse and streamhouse architectures and enjoys connecting contemporary data systems back to the fundamental concepts that inspired them.

Draft Slides

Coming Soon
ETA: 25th June

Elevator Pitch

https://www.loom.com/share/bb430bd50c154fcf90f178e8f336bb61

Hosted by

Jumpstart better data engineering and AI futures