BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//HasGeek//NONSGML Funnel//EN
DESCRIPTION:Hands-on workshop going from zero to a functional query engine
  - Rootconf Topical Edition on Databases
X-WR-CALDESC:Hands-on workshop going from zero to a functional query engin
 e - Rootconf Topical Edition on Databases
NAME:Build a SQL query engine from scratch
X-WR-CALNAME:Build a SQL query engine from scratch
REFRESH-INTERVAL;VALUE=DURATION:PT12H
SUMMARY:Build a SQL query engine from scratch
TIMEZONE-ID:Asia/Kolkata
X-PUBLISHED-TTL:PT12H
X-WR-TIMEZONE:Asia/Kolkata
BEGIN:VEVENT
SUMMARY:Build a SQL query engine from scratch
DTSTART:20260612T080000Z
DTEND:20260612T120000Z
DTSTAMP:20260607T060625Z
UID:session/NNPRVgDuHrKS2Ck5Qp5v8t@hasgeek.com
SEQUENCE:19
CREATED:20260607T020618Z
DESCRIPTION:## Target audience\n- Engineers who use databases (especially 
 analytics databases) and want to understand what happens after they hit "r
 un" on a query\n- No prior database internals knowledge required\n- We onl
 y expect some basic familiarity with Python (see below for why python)\n\n
 ## Workshop overview\nDuration: 4 hours (3 hours hands-on\, 40 mins of bre
 aks\, 20 mins of discussing real-world engines and questions)\n\nWe first 
 look at some simple SQL queries like\n- `select * from ...`\n- `select a\,
  b from ...`\n- `select ... where y > 0`\n- `select sum(x) ...`\n\nand wri
 te python scripts by hand for each them. This gives us a starting point fo
 r _what_ the engine should do. We then start building a proper engine -- r
 eading data from Parquet files\, SQL parsing\, "operator" model (aka Volca
 no model). We then build operators one-by-one: projections\, filters\, agg
 regations and joins.\n\nMid-way we switch the execution from row-based to 
 vectorized (columnar)\, supported by real benchmarks. This gives us a feel
  for one of the most important optimizations in modern query engines.\n\nW
 e adopt a codecrafters-inspired model\, where each stage has tests and ben
 chmarks that need to pass.\n\n## Learning outcomes\nBy the end of the work
 shop\, participants will be able to understand:\n\n- Compiled vs pipelined
  execution models\n- How group-by (aggregations) and joins work internally
 \n- Volcano operator model (open / next / close)\n- SQL → AST → logica
 l plan → physical plan\n- Row vs columnar layout\n- Row-based vs vectori
 zed execution\n\n## Workshop stages\n1. **Compiled Execution** — Python 
 script that produces output for `select *`\, `select x where y`\, etc. Sta
 rting point for _what_ the engine must do.\n2. **Tablescan** — Here we s
 tart with the _how_. Read Parquet and output rows. Segue into the Volcano 
 model while building this.\n3. **Volcano Model (Theory)** — Operator tre
 es using simple open()\, next()\, close() methods. Operators are the unit 
 of composition in query engines.\n4. **SQL to Plan (Theory)** — Use sqlp
 arser to turn a SQL string into an AST. Walk through AST → physical plan
  (actual functions).\n5. (SKIPPED) Logical plans: all real engines first c
 onvert the AST to a logical plan. This is where query optimizations like j
 oin ordering\, converting subqueries to joins\, etc are actually performed
 . We skip this to keep this workshop focused.\n6. **Projection** — the `
 select` part of a query.\n  - (SKIPPED) expression simplification\, dictio
 nary optimizations are not implemented. We briefly mention them.\n7. **Fil
 ter** — the `where` part of a query. We also introduce filter pushdown (
 at a row group level).\n  - (SKIPPED) partition pruning\, parquet page pru
 ning\, later materialization are some optimizations that modern query engi
 nes use. We won't implement them.\n8. **Aggregation** — group-bys and ag
 gregate functions like sum()\, avg()\, etc are performed by this operator.
  We implement a row-based version and then segue into vectorized execution
 .\n  - (SKIPPED) multi-threaded aggregations\, sort-based aggregations are
  not covered.\n9. **Vectorized Execution** — Rewrite the engine (all ope
 rators) to process column batches instead of single rows. Benchmarks make 
 the difference tangible.\n10. **Joins** — the `join` part of a query. We
  look at two join algorithms: nested-loop and hash join.\n   - (SKIPPED) o
 uter joins\, anti-join\, semi-join. We only look at `inner join`.\n   - (S
 KIPPED) sort-merge join\, perfect hash join\, multi-threaded hash join\, d
 istributed broadcast join\, distributed shuffled join are some alternate j
 oin implementations that we won't look at.\n11. **Further Reading** — Lo
 gical plan\, join ordering\, plan optimisation\, Parquet file format\, mul
 ti-threaded execution\, distributed execution. Papers: [Volcano](https://p
 aperhub.s3.amazonaws.com/dace52a42c07f7f8348b08dc2b186061.pdf)\, [MonetDB]
 (https://15721.courses.cs.cmu.edu/spring2024/papers/04-execution1/boncz-ci
 dr2005.pdf)\, [Morsel-Driven](https://15721.courses.cs.cmu.edu/spring2024/
 papers/08-scheduling/p743-leis.pdf)\, [Compiled vs Vectorized](https://157
 21.courses.cs.cmu.edu/spring2024/papers/06-vectorization/p2209-kersten.pdf
 ).\n\n## Tooling\nPython\, pyarrow\, sqlparse\n\nA skeleton repo with samp
 le data will be provided. Please go through the set up steps before attend
 ing the workshop - no additional time will be provided for setup. [`uv`](h
 ttps://docs.astral.sh/uv/) will be the main tool we will be using\, please
  use it! (We cannot help with issues arising from raw `pip` or system pyth
 on usage).\n\nRepository link (TODO)\n\n### Why Python?\nIt's true that no
  query engine is written (purely) in Python. The engines we have worked on
  are in Rust and Java. In this workshop\, we want to focus on the fundamen
 tals instead of the specifics of a language. We believe that Python mostly
  gets out of the way\, while also providing some of the libraries we need 
 - namely pyarrow and sqlparse.\n\n### About LLM usage\nDuring the workshop
 \, please type out the code by hand. You are free to use LLMs to understan
 d things or ask questions\, but we would prefer if you ask us instead :)\n
 [Here](https://haskellforall.com/2026/05/type-out-the-code) is an article 
 which explains why typing out the code is much better for learning than an
  LLM doing it for you (and better than the old-school way of copy-pasting 
 code).\n\n## About the instructors\n**Aayush Naik** and **Samyak Sarnayak*
 *: we work at the query engine team at e6data. We have worked on both the 
 older Java-based engine (which was pretty much written from scratch) and t
 he new Rust-based engine (using Apache DataFusion as the base).\n- [Aayush
  Naik](https://www.linkedin.com/in/naikaayush/): has worked on Decimal128 
 data-type support for the java engine and is currently working on adding D
 ML support on Deltalake tables to the rust engine.\n- [Samyak Sarnayak](ht
 tps://www.linkedin.com/in/samyaks/): has worked on variant data-type suppo
 rt\, distributed shuffled hash joins for the java engine and is currently 
 working on bringing in distributed query execution to the rust engine.\n\n
 ## How to attend this workshop\nThis workshop is open for **[Rootconf memb
 ers](https://hasgeek.com/rootconf#memberships)** and for **[Rootconf Datab
 ase Edition ticket buyers](https://hasgeek.com/rootconf/topical-edition-on
 -databases#tickets)**\n\n**This workshop is open to 30 participants (in-pe
 rson) & hybrid access for remote attendees. Seats for in-person participan
 ts will be available on first-come-first-served basis.** 🎟️    \n\n##
  Contact information ☎️\nFor inquiries about the workshop\, contact [+
 91-7676332020](tel:+917676332020) or write to [info@hasgeek.com](mailto:in
 fo@hasgeek.com).
LAST-MODIFIED:20260607T020621Z
LOCATION:Bangalore - https://hasgeek.com/rootconf/build-a-sql-query-engine
 -from-scratch-workshop/
ORGANIZER;CN=Rootconf:MAILTO:no-reply@hasgeek.com
URL:https://hasgeek.com/rootconf/build-a-sql-query-engine-from-scratch-wor
 kshop/
BEGIN:VALARM
ACTION:display
DESCRIPTION:Build a SQL query engine from scratch in 5 minutes
TRIGGER:-PT5M
END:VALARM
END:VEVENT
END:VCALENDAR
