Tickets

Loading…

Vishnu Vasanth

A new approach to building high-performance lakehouse compute engines for open table formats like Delta lake, Apache Iceberg, and Apache Hudi

Submitted May 15, 2024

Introduction

Platform engineering and data architecture teams are increasingly adopting object-store backed data lakehouses as their central, unified platform for workloads across Analytics as well as AI.

With the scale of such data lakehouses ranging from the 10s of TBs to the 100s of PBs, distributed compute engines like Spark, Trino / Presto, Flink, etc. are essential for workloads across:

  • Data ingestion
  • Data transformation / pre-processing
  • Data Querying / inference

This talk covers the common challenges data platform teams encouter with popular distributed compute engines at scale.

We then outline our approach to building a new class of hyper-efficient compute engine from scratch. We also outline how this new approach provides substantial advantges in a class of technically challenging workloads that combine one or more of:

  1. High concurrency
  2. Query Complexity
  3. High Data Volumes
  4. Stringent latency requirements

The talk will have a mix of presentation (slides), benchmarking, live demos, and audience Q&A.

Target Audience

Engineers, researchers and data architects with an interest in:

  • Massively parallel distributed compute platforms
  • The internals of existing and emerging compute engines
  • Composable open data platforms with an emphasis on object-store based data lakes and lakehouses with open table formats like Delta lake, Iceberg, and Hudi

Outline

With a query’s lifecycle as the frame of reference, we start with examining the strengths and weaknesses of the present engines.

While most distributed compute engines are available as Open Source (OSS) as well Commercial Open Source Software, all of them share commonalities on the following areas:
A - Monolithic, stateful and “VM-centric” Architectures
B - Centralized and static approach to distributed processing and execution

We then present how a clean-slate approach helped us build a system that overcomes the key limitations through the use of:
A - Disaggregated, stateless, and “kubernetes-native” Architecture
B - Decentralized and dynamic approach to distributed processing and execution

Takeaways, Impact

We will present findings from real-world workloads around how this new approach drives benefits across evaluation criteria that matter to platform engineering teams:

1 - A materially superior Price-Performance curve
2 - Eliminating system-wide Single Points of Failure (SPOF)
3 - Maintaining Deterministic tail latencies (p99) even under heavy loads and massive variability
4 - Efficient cluster utilization even when faced with data skew, variable task completion, etc.

Comments

{{ gettext('Login to leave a comment') }}

{{ gettext('Post a comment…') }}
{{ gettext('New comment') }}
{{ formTitle }}

{{ errorMsg }}

{{ gettext('No comments posted yet') }}

Hybrid Access Ticket

Hosted by

All about data science and machine learning

Supported by

Gold Sponsor

Atlassian unleashes the potential of every team. Our agile & DevOps, IT service management and work management software helps teams organize, discuss, and compl

Silver Sponsor