Anudeep
@anudeepk
From Transactional Bottlenecks to Lightning-Fast Analytics
Submitted Mar 30, 2025
Topic of your submission:
Distributed data systems
Type of submission:
30 mins talk
I am submitting for:
Rootconf Annual Conference 2025
In this session we will talk about how we used change data capture (CDC), combined with scalable state management in HUDI and a powerful OLAP engine like Trino, to scale up the performance and reliability of analytics, improve database reliability by offloading analytics workloads, and cater to our data governance needs.
The session also touches on the architectural aspects that made this not just a one-problem solution but a platform that multiple teams could use to onboard their workloads onto the same pattern without writing a single line of code. The platform currently replicates 200+ tables (and growing) from 5+ different databases at sub-minute latency, which removed 50,000+ QPS from those databases, improved analytical queries touching those DBs by 10X, and reduced platform cost by 10% by cutting down the read-replica counts of the DBs.
The session will also talk about how Debezium monitors the tables under observation for changes and streams them to Kafka topics. The real intelligence lies in the consumer of those changes and in how they are interpreted for the different use cases we wanted to achieve through CDC - one of them being building an analytical store using HUDI, a lakehouse table format. We will touch upon some features of HUDI that make it apt as a data-store sibling to an RDBMS (and even NoSQL DBMS) for running analytical workloads, and also upon why traditional DBMSes are insufficient or too expensive for analytics.
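To give a flavour of the consumer-side interpretation described above, here is a minimal Python sketch of reading Debezium change events from Kafka; the topic name, broker address, and the handle_upsert/handle_delete stubs are hypothetical placeholders for illustration, not our production code.

# Minimal sketch of a Debezium change-event consumer (illustrative only).
import json
from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "dbserver1.inventory.orders",          # hypothetical Debezium topic: <server>.<schema>.<table>
    bootstrap_servers=["localhost:9092"],  # hypothetical broker address
    value_deserializer=lambda v: json.loads(v.decode("utf-8")) if v else None,
    auto_offset_reset="earliest",
)

def handle_upsert(row):
    # Stub: in a pipeline like ours, rows would be staged here for a Hudi upsert.
    print("upsert", row)

def handle_delete(row):
    # Stub: deletes would be propagated to the Hudi table.
    print("delete", row)

for message in consumer:
    event = message.value
    if event is None:                        # tombstone record, nothing to interpret
        continue
    payload = event.get("payload", event)    # Debezium envelope (with or without schema wrapper)
    op = payload.get("op")                   # "c"=create, "u"=update, "d"=delete, "r"=snapshot read
    if op in ("c", "u", "r"):
        handle_upsert(payload["after"])
    elif op == "d":
        handle_delete(payload["before"])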
We will talk a bit about Trino and how it leverages the columnar file formats and metadata of these lakehouse formats to execute queries that scale horizontally. We used Trino, but lakehouse formats like HUDI open up a wide choice of query engines depending on your data volumes, ops, and spending capabilities. The concepts are transferable to other formats such as Iceberg and Delta Lake, and to several other query engines - managed or self-deployed, distributed or single-node.
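As a concrete illustration, the following is a small sketch of running an analytical query against a lakehouse table through Trino's Python client; the coordinator host, catalog, schema, and table names are hypothetical and would depend on your own deployment.

import trino  # pip install trino

# Hypothetical coordinator, catalog, and schema names - adjust to your deployment.
conn = trino.dbapi.connect(
    host="trino-coordinator.example.com",
    port=8080,
    user="analytics",
    catalog="hudi",        # connector pointing at the lakehouse tables
    schema="replicated",
)
cur = conn.cursor()

# An analytical query that would otherwise hammer the OLTP database's read replicas.
cur.execute("""
    SELECT status, count(*) AS orders
    FROM orders
    WHERE created_at >= date '2025-01-01'
    GROUP BY status
""")
for row in cur.fetchall():
    print(row)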
Some of these aspects have been touched upon in this blog: https://medium.com/allthatscales/from-transactional-bottlenecks-to-lightning-fast-analytics-74e0d3fff1c0
The audience of this session will take away the concepts of CDC, how Debezium works, and how one can leverage CDC to serve multiple use cases in their pipelines. Learning about HUDI will help them understand how lakehouse formats are changing the way data management is approached - especially for large-scale analytics - and how disaggregated storage and decoupling compute from storage not only improve platform costs but also, for a lot of use cases, prevent getting tied down to specific query engines and databases. When all of that works together, you get a powerful platform that improves the reliability and cost of databases and makes analytics a breeze.
I am the Head of Data and Platform Engineering at Uptycs, Inc. - a CNAPP company that develops cybersecurity solutions for:
EDR, XDR, CWPP, CIEM, CSPM, KSPM, SSPM, SCSM, AISPM, DSPM, ... and more ... with the aim of providing a unified platform that gives an enterprise the ability to manage the security of its entire infrastructure from code to runtime. What this translates to is an exabyte-scale data platform that ingests 100+ million EPS, i.e. close to a petabyte of data daily, and runs 500k+ queries a day that end up scanning 500TB+ of data.
I will be presenting with https://www.linkedin.com/in/aakashsankritya/ - a brilliant data engineer from Uptycs who recently moved to Swiggy and co-authored the blog.