Rootconf

Build a SQL query engine from scratch

Name: Build a SQL query engine from scratch
Start: 2026-06-12T13:30:00+05:30
End: 2026-06-12T17:30:00+05:30
Location: Sahaj Software

Hands-on workshop going from zero to a functional query engine - Rootconf Topical Edition on Databases

Fri, 12 Jun 2026, 01:30 PM – 05:30 PM IST

Sahaj Software, Bengaluru

Pinned update

Slides and code links: Hi everyone! Thank you for attending our workshop. Although we couldn’t cover everything, we hope you learnt a few things about how query engines work intern…

🚨 Venue changed. This workshop will take place at Sahaj Software.

Target audience

Engineers who use databases (especially analytics databases) and want to understand what happens after they hit “run” on a query
No prior database internals knowledge required
We only expect basic familiarity with Python (see below for why Python)

Workshop overview

Duration: 4 hours (3 hours hands-on, 40 mins of breaks, 20 mins of discussing real-world engines and questions)

We first look at some simple SQL queries like

select * from ...
select a, b from ...
select ... where y > 0
select sum(x) ...

and write Python scripts by hand for each of them. This gives us a starting point for what the engine should do. We then start building a proper engine -- reading data from Parquet files, SQL parsing, and the “operator” model (aka the Volcano model). We then build operators one by one: projections, filters, aggregations and joins.
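To illustrate the hand-written starting point, here is a minimal sketch of a script for a query shaped like `select sum(x) from t where y > 0`. The table name `t`, the column values and the in-memory rows are made up for this example; the workshop itself reads Parquet files, and its actual scripts may look different.

```python
# Hypothetical in-memory table standing in for a Parquet file.
rows = [
    {"x": 10, "y": 1},
    {"x": 20, "y": -3},
    {"x": 30, "y": 5},
]


def sum_x_where_y_positive(table):
    """Hand-written equivalent of: select sum(x) from t where y > 0"""
    total = 0
    for row in table:          # scan every row
        if row["y"] > 0:       # where y > 0
            total += row["x"]  # sum(x)
    # Note: this returns 0 for an empty input, whereas SQL would return NULL.
    return total


result = sum_x_where_y_positive(rows)
print(result)
```

Everything here is hard-coded for one query. A query engine generalizes each commented step (scan, filter, aggregate) into a reusable piece.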

Mid-way we switch the execution from row-based to vectorized (columnar), supported by real benchmarks. This gives us a feel for one of the most important optimizations in modern query engines.
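The difference between the two execution styles can be sketched in plain Python. This is only an illustration under stated assumptions: the data is synthetic, plain lists stand in for column batches, and the function names are hypothetical. The speedups seen in real engines come from running the per-batch work in native code (the workshop uses pyarrow for this), so this sketch shows the shape of the code, not the performance.

```python
# Hypothetical table with two integer columns, x and y.
N = 10_000
rows = [(i, i % 7 - 3) for i in range(N)]  # row layout: one tuple per row
col_x = [x for x, _ in rows]               # columnar layout: one list per column
col_y = [y for _, y in rows]


def sum_row_based(rows):
    """Row-at-a-time: every row passes through the filter and the sum individually."""
    total = 0
    for x, y in rows:
        if y > 0:
            total += x
    return total


def sum_batched(col_x, col_y, batch_size=1024):
    """Batch-at-a-time: each step works on a slice of a column."""
    total = 0
    for start in range(0, len(col_x), batch_size):
        xs = col_x[start:start + batch_size]
        ys = col_y[start:start + batch_size]
        # One pass over the whole batch; in a real engine this is a native kernel.
        total += sum(x for x, y in zip(xs, ys) if y > 0)
    return total


row_result = sum_row_based(rows)
batch_result = sum_batched(col_x, col_y)
print(row_result == batch_result)
```

Both functions compute `select sum(x) ... where y > 0`; only the unit of work changes from a single row to a batch of column values.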

We adopt a codecrafters-inspired model, where each stage has tests and benchmarks that need to pass.

Learning outcomes

By the end of the workshop, participants will be able to understand:

Compiled vs pipelined execution models
How group-by (aggregations) and joins work internally
Volcano operator model (open / next / close)
SQL → AST → logical plan → physical plan
Row vs columnar layout
Row-based vs vectorized execution
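The open / next / close operator model listed above can be sketched as follows. This is a minimal illustration, not the workshop's code: the class names, the use of `None` as the end-of-input signal, and the in-memory `Scan` (standing in for a Parquet table scan) are all assumptions made for this example. The `HashAggregate` operator also shows one common way group-by works internally: it consumes its whole input into a hash table before producing any output.

```python
from collections import defaultdict


class Scan:
    """Leaf operator over an in-memory list (stand-in for a Parquet scan)."""

    def __init__(self, rows):
        self.rows = rows
        self.pos = 0

    def open(self):
        self.pos = 0

    def next(self):
        if self.pos >= len(self.rows):
            return None  # None signals end of input (a convention of this sketch)
        row = self.rows[self.pos]
        self.pos += 1
        return row

    def close(self):
        pass


class Filter:
    """Passes through only the child rows that satisfy the predicate."""

    def __init__(self, child, predicate):
        self.child = child
        self.predicate = predicate

    def open(self):
        self.child.open()

    def next(self):
        while (row := self.child.next()) is not None:
            if self.predicate(row):
                return row
        return None

    def close(self):
        self.child.close()


class HashAggregate:
    """Blocking operator: group by `key` and sum `value` using a hash table."""

    def __init__(self, child, key, value):
        self.child = child
        self.key = key
        self.value = value
        self.results = iter(())

    def open(self):
        self.child.open()
        sums = defaultdict(int)
        # Drain the child completely before emitting any group.
        while (row := self.child.next()) is not None:
            sums[row[self.key]] += row[self.value]
        self.results = iter(sums.items())

    def next(self):
        item = next(self.results, None)
        if item is None:
            return None
        group, total = item
        return {self.key: group, "sum": total}

    def close(self):
        self.child.close()


def run(plan):
    """Pull rows from the root operator until it is exhausted."""
    plan.open()
    out = []
    while (row := plan.next()) is not None:
        out.append(row)
    plan.close()
    return out


# Hypothetical data for: select city, sum(amount) ... where amount > 0 group by city
rows = [
    {"city": "BLR", "amount": 10},
    {"city": "DEL", "amount": -5},
    {"city": "BLR", "amount": 7},
    {"city": "DEL", "amount": 3},
]
plan = HashAggregate(Filter(Scan(rows), lambda r: r["amount"] > 0), "city", "amount")
result = run(plan)
print(result)
```

Because every operator exposes the same three methods, operators compose into a tree without knowing what their children are.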

Workshop stages

Compiled Execution — Python script that produces output for select *, select x where y, etc. Starting point for what the engine must do.
Tablescan — Here we start with the how. Read Parquet and output rows. Segue into the Volcano model while building this.
Volcano Model (Theory) — Operator trees using simple open(), next(), close() methods. Operators are the unit of composition in query engines.
SQL to Plan (Theory) — Use sqlparser to turn a SQL string into an AST. Walk through AST → physical plan (actual functions).
(SKIPPED) Logical plans: all real engines first convert the AST to a logical plan. This is where query optimizations like join ordering, converting subqueries to joins, etc are actually performed. We skip this to keep this workshop focused.
Projection — the select part of a query.

(SKIPPED) expression simplification and dictionary optimizations are not implemented. We briefly mention them.

Filter — the where part of a query. We also introduce filter pushdown (at a row group level).

(SKIPPED) partition pruning, Parquet page pruning and late materialization are some optimizations that modern query engines use. We won’t implement them.

Aggregation — group-bys and aggregate functions like sum(), avg(), etc are performed by this operator. We implement a row-based version and then segue into vectorized execution.

(SKIPPED) multi-threaded aggregations and sort-based aggregations are not covered.

Vectorized Execution — Rewrite the engine (all operators) to process column batches instead of single rows. Benchmarks make the difference tangible.
Joins — the join part of a query. We look at two join algorithms: nested-loop and hash join.

(SKIPPED) outer joins, anti-join, semi-join. We only look at inner join.
(SKIPPED) sort-merge join, perfect hash join, multi-threaded hash join, distributed broadcast join, distributed shuffled join are some alternate join implementations that we won’t look at.

Further Reading — Logical plan, join ordering, plan optimisation, Parquet file format, multi-threaded execution, distributed execution. Papers: Volcano, MonetDB, Morsel-Driven, Compiled vs Vectorized.
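The two join algorithms named in the Joins stage can be sketched side by side. This is a minimal illustration of an inner equi-join on in-memory rows, not the workshop's implementation. The function names, the sample tables, and the choice to build the hash table on the left input are assumptions of this example. It also assumes join keys are never `None` (SQL would not match NULL keys) and that the two inputs have no column names in common.

```python
from collections import defaultdict


def nested_loop_join(left, right, left_key, right_key):
    """Compare every left row with every right row: O(len(left) * len(right))."""
    out = []
    for l_row in left:
        for r_row in right:
            if l_row[left_key] == r_row[right_key]:
                out.append({**l_row, **r_row})
    return out


def hash_join(left, right, left_key, right_key):
    """Build a hash table on one side, then probe it with the other side."""
    # Build phase: index the left rows by their join key.
    table = defaultdict(list)
    for l_row in left:
        table[l_row[left_key]].append(l_row)
    # Probe phase: one hash lookup per right row.
    out = []
    for r_row in right:
        for l_row in table.get(r_row[right_key], []):
            out.append({**l_row, **r_row})
    return out


# Hypothetical sample tables.
users = [
    {"user_id": 1, "name": "asha"},
    {"user_id": 2, "name": "ravi"},
    {"user_id": 3, "name": "meera"},
]
orders = [
    {"order_id": 100, "buyer_id": 1},
    {"order_id": 101, "buyer_id": 1},
    {"order_id": 102, "buyer_id": 3},
    {"order_id": 103, "buyer_id": 9},
]


def by_order(row):
    return row["order_id"]


nl_result = sorted(nested_loop_join(users, orders, "user_id", "buyer_id"), key=by_order)
hj_result = sorted(hash_join(users, orders, "user_id", "buyer_id"), key=by_order)
print(nl_result == hj_result)
```

The two functions return the same rows, possibly in a different order, which is why the results are sorted before comparing. Engines usually build the hash table on the smaller input to keep memory use down.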

Tooling

Python, pyarrow, sqloxide

A skeleton repo with sample data will be provided. Please go through the setup steps before attending the workshop, as no additional time will be provided for setup. uv will be the main tool we use, so please use it! (We cannot help with issues arising from raw pip or system Python usage.)

Code repository link

Why Python?

It’s true that no query engine is written (purely) in Python. The engines we have worked on are in Rust and Java. In this workshop, we want to focus on the fundamentals instead of the specifics of a language. We believe that Python mostly gets out of the way, while also providing some of the libraries we need - namely pyarrow and sqloxide.

About LLM usage

During the workshop, please type out the code by hand. You are free to use LLMs to understand things or ask questions, but we would prefer if you ask us instead :)
Here is an article which explains why typing out the code is much better for learning than an LLM doing it for you (and better than the old-school way of copy-pasting code).

About the instructors

Aayush Naik and Samyak Sarnayak: we work on the query engine team at e6data. We have worked on both the older Java-based engine (which was pretty much written from scratch) and the new Rust-based engine (using Apache DataFusion as the base).

Aayush Naik: has worked on Decimal128 data-type support for the Java engine and is currently working on adding DML support for Delta Lake tables to the Rust engine.
Samyak Sarnayak: has worked on variant data-type support and distributed shuffled hash joins for the Java engine and is currently working on bringing distributed query execution to the Rust engine.

How to attend this workshop

This workshop is open to Rootconf members and to Rootconf Database Edition ticket buyers.

This workshop is open to 30 participants (in-person), with hybrid access for remote attendees. Seats for in-person participants will be available on a first-come, first-served basis. 🎟️

Contact information ☎️

For inquiries about the workshop, contact +91-7676332020 or write to info@hasgeek.com.

Venue

Sahaj Software

3rd Floor, Sulochana Building,

365, 1st Cross Rd, 3rd Block, Santhosapuram, Koramangala 3 Block,

Bengaluru - 560034

Karnataka, IN

Hosted by

Rootconf

We care about site reliability, cloud costs, security and data privacy

Supported by

Venue host

Sahaj Software

Sahaj is an artisanal technology services company crafting purpose-built AI and data-led solutions for businesses.

Related events

Topical Edition on Databases: It worked in theory. Let’s talk about production.
