- Engineers who use databases (especially analytics databases) and want to understand what happens after they hit “run” on a query
- No prior database internals knowledge required
- We only expect some basic familiarity with Python (see below for why python)
Duration: 4 hours (3 hours hands-on, 40 mins of breaks, 20 mins of discussing real-world engines and questions)
We first look at some simple SQL queries like
select * from ...
select a, b from ...
select ... where y > 0
select sum(x) ...
and write python scripts by hand for each them. This gives us a starting point for what the engine should do. We then start building a proper engine -- reading data from Parquet files, SQL parsing, “operator” model (aka Volcano model). We then build operators one-by-one: projections, filters, aggregations and joins.
Mid-way we switch the execution from row-based to vectorized (columnar), supported by real benchmarks. This gives us a feel for one of the most important optimizations in modern query engines.
We adopt a codecrafters-inspired model, where each stage has tests and benchmarks that need to pass.
By the end of the workshop, participants will be able to understand:
- Compiled vs pipelined execution models
- How group-by (aggregations) and joins work internally
- Volcano operator model (open / next / close)
- SQL → AST → logical plan → physical plan
- Row vs columnar layout
- Row-based vs vectorized execution
- Compiled Execution — Python script that produces output for
select *, select x where y, etc. Starting point for what the engine must do.
- Tablescan — Here we start with the how. Read Parquet and output rows. Segue into the Volcano model while building this.
- Volcano Model (Theory) — Operator trees using simple open(), next(), close() methods. Operators are the unit of composition in query engines.
- SQL to Plan (Theory) — Use sqlparser to turn a SQL string into an AST. Walk through AST → physical plan (actual functions).
- (SKIPPED) Logical plans: all real engines first convert the AST to a logical plan. This is where query optimizations like join ordering, converting subqueries to joins, etc are actually performed. We skip this to keep this workshop focused.
- Projection — the
select part of a query.
- (SKIPPED) expression simplification, dictionary optimizations are not implemented. We briefly mention them.
- Filter — the
where part of a query. We also introduce filter pushdown (at a row group level).
- (SKIPPED) partition pruning, parquet page pruning, later materialization are some optimizations that modern query engines use. We won’t implement them.
- Aggregation — group-bys and aggregate functions like sum(), avg(), etc are performed by this operator. We implement a row-based version and then segue into vectorized execution.
- (SKIPPED) multi-threaded aggregations, sort-based aggregations are not covered.
- Vectorized Execution — Rewrite the engine (all operators) to process column batches instead of single rows. Benchmarks make the difference tangible.
- Joins — the
join part of a query. We look at two join algorithms: nested-loop and hash join.
- (SKIPPED) outer joins, anti-join, semi-join. We only look at
inner join.
- (SKIPPED) sort-merge join, perfect hash join, multi-threaded hash join, distributed broadcast join, distributed shuffled join are some alternate join implementations that we won’t look at.
- Further Reading — Logical plan, join ordering, plan optimisation, Parquet file format, multi-threaded execution, distributed execution. Papers: Volcano, MonetDB, Morsel-Driven, Compiled vs Vectorized.
Python, pyarrow, sqlparse
A skeleton repo with sample data will be provided. Please go through the set up steps before attending the workshop - no additional time will be provided for setup. uv will be the main tool we will be using, please use it! (We cannot help with issues arising from raw pip or system python usage).
Repository link (TODO)
It’s true that no query engine is written (purely) in Python. The engines we have worked on are in Rust and Java. In this workshop, we want to focus on the fundamentals instead of the specifics of a language. We believe that Python mostly gets out of the way, while also providing some of the libraries we need - namely pyarrow and sqlparse.
During the workshop, please type out the code by hand. You are free to use LLMs to understand things or ask questions, but we would prefer if you ask us instead :)
Here is an article which explains why typing out the code is much better for learning than an LLM doing it for you (and better than the old-school way of copy-pasting code).
Aayush Naik and Samyak Sarnayak: we work at the query engine team at e6data. We have worked on both the older Java-based engine (which was pretty much written from scratch) and the new Rust-based engine (using Apache DataFusion as the base).
- Aayush Naik: has worked on Decimal128 data-type support for the java engine and is currently working on adding DML support on Deltalake tables to the rust engine.
- Samyak Sarnayak: has worked on variant data-type support, distributed shuffled hash joins for the java engine and is currently working on bringing in distributed query execution to the rust engine.
This workshop is open for Rootconf members and for Rootconf Database Edition ticket buyers
This workshop is open to 30 participants (in-person) & hybrid access for remote attendees. Seats for in-person participants will be available on first-come-first-served basis. 🎟️
For inquiries about the workshop, contact +91-7676332020 or write to info@hasgeek.com.