Rohan Vadaje
@rohan_v
The Zen state with Hudi
Submitted Apr 20, 2025
Topic of your submission:
Distributed data systems
Type of submission:
Other
I am submitting for:
Rootconf Annual Conference 2025
{Describe your talk/session in 2-3 paragraphs}
In this session, we’ll talk about how we tackled state management at scale, achieving the benefits of both OLTP and OLAP systems (real-time updates with fast analytical queries), all while handling billions of records. We’ll walk through how we leveraged Apache Hudi and Apache Spark to power a platform that ingests over 500GB of data per minute, manages a mutable state of 10B+ records, and delivers accurate, low-latency insights to user-facing security dashboards.
At Uptycs, our sensors collect trillions of security events daily from endpoints, containers, cloud services, and identity systems. A key challenge was managing vulnerability and inventory data that continuously evolves. With the data initially stored as daily ORC files, the system struggled to answer time-sensitive questions like “what changed in the last 24 hours?” due to the cost and latency of full-table scans. Apache Hudi’s upsert and incremental query capabilities allowed us to treat our data lake like a database — giving us the ability to track state changes in near real-time with snapshot isolation.
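To give a flavour of the incremental-query pattern described above, here is a minimal sketch of the Hudi read options involved. The option names are Hudi's standard Spark datasource options; the helper function, instant timestamp, and column names are illustrative, not our production code.

```python
def incremental_read_options(begin_instant: str) -> dict:
    """Hudi incremental query: return only records committed after `begin_instant`,
    instead of scanning the full table."""
    return {
        "hoodie.datasource.query.type": "incremental",
        "hoodie.datasource.read.begin.instanttime": begin_instant,
    }

# Usage (requires a SparkSession with the Hudi bundle on the classpath), e.g. to
# answer "what changed since this commit instant?":
#   df = (spark.read.format("hudi")
#              .options(**incremental_read_options("20250419000000"))
#              .load(table_path))
```

This is what turns a “what changed in the last 24 hours?” question from a full-table scan into a read of only the commits after a checkpointed instant.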
To scale this architecture, we forked Apache Hudi and made enhancements to support a single Spark job ingesting into 100+ customer-specific Hudi tables. This enabled the creation of a No-Code State Platform, eliminating the operational complexity of managing hundreds of Spark jobs, reducing scheduling delays, and optimizing cluster resource usage. The system supports diverse workloads — from CDC-style vulnerability tracking to inventory snapshots and trend statistics — all in a multi-tenant environment. We’ll also share insights on tuning compaction, clustering, and metadata to make Hudi production-ready at this scale.
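As a rough illustration of the multi-table ingestion idea, the sketch below shows how a single job can derive per-tenant Hudi write options and fan a micro-batch out to many tables. The option keys are standard Hudi write options; the naming scheme, record key, and precombine columns are assumptions for illustration, not the actual forked implementation.

```python
def hudi_write_options(tenant_id: str, dataset: str) -> dict:
    """Build per-tenant Hudi write options for one logical table (hypothetical
    naming scheme: <dataset>_<tenant_id>)."""
    return {
        "hoodie.table.name": f"{dataset}_{tenant_id}",
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.write.recordkey.field": "record_id",    # assumed key column
        "hoodie.datasource.write.precombine.field": "updated_at",  # assumed ordering column
    }

# Inside the single Spark job, each micro-batch would be split by tenant and each
# slice written to its own table, e.g.:
#   for tenant in tenants:
#       opts = hudi_write_options(tenant, "vulnerabilities")
#       (batch_df.filter(batch_df.tenant_id == tenant)
#                .write.format("hudi").options(**opts)
#                .mode("append").save(base_path + "/" + opts["hoodie.table.name"]))
```

The operational win is that table-level routing lives in configuration rather than in a fleet of per-tenant Spark jobs, which is what makes a No-Code platform possible.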
Another key innovation was building a custom backpressure mechanism in Spark Structured Streaming — something not natively supported. By monitoring the JVM heap usage, we throttled data ingestion dynamically, preventing OutOfMemory (OOM) errors and ensuring a stable, self-regulating infrastructure even under massive load spikes.
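The core of such a backpressure loop is a policy that maps observed heap pressure to a target ingestion rate (e.g. fed into a source rate limit such as Kafka's `maxOffsetsPerTrigger` between micro-batches). The sketch below shows one plausible policy; the watermarks and backoff factors are hypothetical, not the production values.

```python
def target_rate(heap_used_frac: float, base_rate: int,
                low: float = 0.60, high: float = 0.80) -> int:
    """Throttle ingestion as JVM heap pressure rises (illustrative thresholds).

    Below the low watermark: run at full rate. Above the high watermark:
    clamp to 25% of base. In between: back off linearly."""
    if heap_used_frac >= high:
        return int(base_rate * 0.25)
    if heap_used_frac >= low:
        scale = 1.0 - 0.75 * (heap_used_frac - low) / (high - low)
        return int(base_rate * scale)
    return base_rate
```

In practice the heap fraction would come from JVM metrics (e.g. `MemoryMXBean` on the executor side), and the resulting rate would be applied before the next micro-batch starts, giving the self-regulating behaviour described above.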
Scale we dealt with using Apache Hudi
Managed 10+ billion mutable records across 100+ logical tables.
Ingested ~500GB/min via a single Spark job powered by a forked Hudi implementation.
Diverse workloads: CDC ingestion, inventory tracking, trend/stat analytics.
Eliminated orchestration complexity with a No-Code ingestion platform.
Improved query latencies and cluster efficiency using clustering, metadata table, and incremental reads.
Enabled real-time insights while keeping costs low and data quality high.
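For readers curious about the tuning surface mentioned above (compaction, clustering, metadata table), here is an illustrative slice of the relevant Hudi config keys. The keys are real Hudi options; the values and sort columns are placeholders, not our production settings.

```python
# Illustrative Hudi tuning knobs for a high-throughput MERGE_ON_READ workload.
TUNING_OPTIONS = {
    "hoodie.metadata.enable": "true",                # metadata table: avoid slow file listings
    "hoodie.compact.inline.max.delta.commits": "5",  # compact after N delta commits (placeholder)
    "hoodie.clustering.inline": "true",              # rewrite small files / co-locate data
    "hoodie.clustering.inline.max.commits": "4",     # clustering cadence (placeholder)
    "hoodie.clustering.plan.strategy.sort.columns": "tenant_id,asset_id",  # assumed sort key
}
```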
{Mention 1-2 takeaways from your session}
How to use Apache Hudi to bridge the gap between OLTP and OLAP, enabling stateful, real-time, analytical processing over a data lake.
How forking Hudi and customizing it enabled us to build a single, multi-tenant Spark ingestion job, simplifying ops and unlocking massive scale.
How we implemented a custom backpressure mechanism in Spark Streaming based on JVM memory, leading to improved system reliability.
{Which audience segment is your talk/session going to beneficial for?}
This session will benefit data platform engineers, architects, and security data engineers dealing with large-scale, stateful datasets who are looking to simplify operations while scaling real-time analytics. It’s also relevant to anyone working on data lakehouse architectures, CDC pipelines, or multi-tenant ingestion systems. Additionally, database enthusiasts who enjoy diving into database internals and storage engine design will find value here, as Apache Hudi brings database-like capabilities to data lakes.
{Add your bio - what you do; where you work}
I’m Rohan, a senior data platform engineer at Nutanix, building large-scale log ingestion systems. Previously at Uptycs, I led the development of a real-time state platform using Apache Hudi, enabling scalable, multi-tenant CDC and inventory tracking. I have 9+ years of experience building big data systems and actively contribute to open-source projects like Apache Hudi and Apache Spark. I’m deeply passionate about distributed systems and love diving into the internals to solve complex data infrastructure problems. I’ll be co-presenting this session with https://www.linkedin.com/in/anudeep-kumar/, Head of Data Platform.