Platform engineering and data architecture teams are increasingly adopting object-store-backed data lakehouses as their central, unified platform for both analytics and AI workloads.

With such data lakehouses ranging in scale from tens of terabytes to hundreds of petabytes, distributed compute engines like Spark, Trino / Presto, and Flink are essential for workloads across:

Data ingestion
Data transformation / pre-processing
Data querying / inference

This talk covers the common challenges data platform teams encounter with popular distributed compute engines at scale.

We then outline our approach to building a new class of hyper-efficient compute engine from scratch. We also outline how this new approach provides substantial advantages in a class of technically challenging workloads that combine one or more of:

High concurrency
Query complexity
High data volumes
Stringent latency requirements

The talk will combine a slide presentation, benchmarking, live demos, and audience Q&A.

Target Audience

Engineers, researchers and data architects with an interest in:

Massively parallel distributed compute platforms
The internals of existing and emerging compute engines
Composable open data platforms with an emphasis on object-store based data lakes and lakehouses with open table formats like Delta Lake, Iceberg, and Hudi

Outline

With a query’s lifecycle as the frame of reference, we start by examining the strengths and weaknesses of present-day engines.

While most distributed compute engines are available as Open Source Software (OSS) as well as Commercial Open Source Software, all of them share commonalities in the following areas:
A - Monolithic, stateful and “VM-centric” Architectures
B - Centralized and static approach to distributed processing and execution

We then present how a clean-slate approach helped us build a system that overcomes the key limitations through the use of:
A - Disaggregated, stateless, and “Kubernetes-native” Architecture
B - Decentralized and dynamic approach to distributed processing and execution
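
To make the static-versus-dynamic distinction concrete, here is a minimal toy simulation. It is an illustrative sketch only, not a model of the engine discussed in the talk: the function names, the round-robin assignment, and the skewed workload are all assumptions chosen for the example. It compares the finish time (makespan) when tasks are assigned to workers up front against the finish time when each idle worker picks up the next pending task.

```python
import heapq

def static_makespan(durations, workers):
    """Assign tasks round-robin before execution starts.

    Returns the finish time of the most heavily loaded worker.
    """
    loads = [0.0] * workers
    for i, d in enumerate(durations):
        loads[i % workers] += d
    return max(loads)

def dynamic_makespan(durations, workers):
    """Hand each task, in order, to whichever worker becomes free first.

    Returns the time at which the last worker finishes.
    """
    # Min-heap holding the time at which each worker next becomes idle.
    free_at = [0.0] * workers
    heapq.heapify(free_at)
    for d in durations:
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + d)
    return max(free_at)

# Hypothetical skewed workload: every 4th task is 10x slower than the rest.
durations = [10.0 if i % 4 == 0 else 1.0 for i in range(16)]

print(static_makespan(durations, 4))   # 40.0: one worker receives every slow task
print(dynamic_makespan(durations, 4))  # 16.0: slow tasks spread across workers
```

In this contrived case the up-front plan sends all four slow tasks to the same worker, while assignment at run time spreads them out. Real engines face the same effect with skewed partitions and stragglers, though the mechanisms involved are far more elaborate than this sketch.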

Takeaways, Impact

We will present findings from real-world workloads on how this new approach delivers benefits across the evaluation criteria that matter to platform engineering teams:

1 - A materially superior Price-Performance curve
2 - Eliminating system-wide Single Points of Failure (SPOF)
3 - Maintaining deterministic tail latencies (p99) even under heavy load and highly variable workloads
4 - Efficient cluster utilization even when faced with data skew, variable task completion, etc.
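
For readers less familiar with the p99 metric in item 3, the following is a small sketch of how a tail percentile can be computed using the nearest-rank method. The helper name and the sample latencies are hypothetical and invented for illustration; they are not measurements from the talk.

```python
def percentile_nearest_rank(samples, p):
    """Return the p-th percentile (p is an integer in 1..100) by nearest rank."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Integer ceiling of p * n / 100, which avoids floating-point rounding.
    rank = (p * len(ordered) + 99) // 100
    return ordered[rank - 1]

# Hypothetical query latencies in milliseconds: mostly fast, two slow outliers.
latencies_ms = [20] * 98 + [900, 1500]

print(percentile_nearest_rank(latencies_ms, 50))  # 20
print(percentile_nearest_rank(latencies_ms, 99))  # 900
```

The median hides the two outliers entirely, while p99 exposes them, which is why tail percentiles are the usual yardstick for latency-sensitive workloads.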

Sat, 13 Jul 2024, 09:00 AM – 06:05 PM IST

Hosted by

The Fifth Elephant

Jump starting better data engineering and AI futures
