Robin
Streaming Data Lakehouse at Scale: Learnings from building a 100TB+/Day Near Real-Time Lakehouse with Apache Flink and Iceberg
Submitted Apr 20, 2025
Topic of your submission:
Distributed data systems
Type of submission:
30 mins talk
I am submitting for:
Rootconf Annual Conference 2025
How do you stream large volumes of data into Apache Iceberg tables from multiple data centers and cloud providers, and make it queryable in under 15 minutes? At Flipkart Data Platform, we have gone through this journey and would like to share what we learned and the practical challenges of solving this at scale.
Flipkart’s data platform ingests more than 120TB (compressed) of data every single day, and the complexity of handling this data has grown significantly with our shift toward near real-time analytics and a hybrid cloud environment.
Traditional batch-oriented ingestion tools like Camus were no longer sufficient to meet the evolving needs of the platform — especially in the areas of:
- Observability: How can we track data availability and freshness precisely? Is there a 5-minute lag or 5-hour delay?
- Hybrid Cloud Compatibility: How do we maintain data guarantees and observability when data is generated across multiple on-prem and cloud regions?
- Cost Efficiency: How can we reduce cloud spend by leveraging idle on-prem compute during off-peak hours?
- Incremental Processing: How do we support complex stateful operations like window joins with data completeness guarantees, even under high-throughput streaming scenarios?
In this talk, we’ll share our experiences building, deploying, and operating the Streaming Lakehouse architecture developed by the Flipkart Data Platform under a project code-named Lambert. Lambert leverages open source technologies like Apache Flink and Apache Iceberg to ingest data and make it queryable within 15 minutes, while delivering strong consistency, completeness, and observability guarantees, even across hybrid cloud environments.
We’ll dive into how we leverage Flink’s core capabilities to measure and enforce end-to-end data freshness, and why a custom watermarking solution is needed to provide deep insight into pipeline lag and incremental data completeness.
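To make the watermarking idea concrete, here is a minimal Python sketch of how a table-level watermark could be derived from per-source watermarks and turned into a freshness lag and a window-completeness check. It is an illustration only and not Lambert's actual implementation: the `SourceWatermark` type, the function names, the source names, and the sample timestamps are all hypothetical, and it assumes each source reports an event-time watermark in epoch milliseconds.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class SourceWatermark:
    """Latest event-time watermark reported by one source (hypothetical shape)."""
    source: str        # e.g. a region or data center; names here are illustrative
    watermark_ms: int  # assumption: all events up to this event time have been ingested


def table_watermark(sources: Dict[str, SourceWatermark]) -> int:
    """A table is only as fresh as its slowest source, so take the minimum."""
    if not sources:
        raise ValueError("at least one source watermark is required")
    return min(s.watermark_ms for s in sources.values())


def freshness_lag_ms(sources: Dict[str, SourceWatermark], now_ms: int) -> int:
    """How far the table's watermark trails the wall clock, in milliseconds."""
    return max(0, now_ms - table_watermark(sources))


def is_window_complete(sources: Dict[str, SourceWatermark], window_end_ms: int) -> bool:
    """Conservative convention for this sketch: a window is complete once the
    table watermark has reached the window's end timestamp."""
    return table_watermark(sources) >= window_end_ms


# Sample values (made up): wall clock at 60 min, one source 5 min behind, one 1 min behind.
now_ms = 3_600_000
sources = {
    "onprem-dc": SourceWatermark("onprem-dc", 3_300_000),
    "cloud-region": SourceWatermark("cloud-region", 3_540_000),
}

SLA_MS = 15 * 60 * 1000  # the 15-minute queryability target from the abstract
lag = freshness_lag_ms(sources, now_ms)  # 300_000 ms, i.e. 5 minutes
within_sla = lag <= SLA_MS               # True for these sample values
```

Taking the minimum across sources is what lets a single number answer "is there a 5-minute lag or a 5-hour delay?" for a table fed from several regions: one slow region holds the table watermark back, and downstream jobs that wait on `is_window_complete` never read a partially filled window.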
We’ll also cover our cost optimization strategies using excess on-prem compute during off-peak hours, the use of Flink Autoscaler to handle shifting traffic patterns, and the technical challenges in building a unified Iceberg table over data generated across multiple cloud and on-prem regions.
Key Takeaways:
- Core building blocks for architecting a real-time lakehouse platform in a hybrid cloud setup
- How to enforce data freshness SLAs using custom observability tooling
- Cloud cost optimization strategies — how we dynamically shift compute loads between on-prem and cloud
- Dynamic scaling with Flink Autoscaler to handle fluctuating traffic in real time
- Challenges (and solutions) in building a unified Iceberg table across multiple clouds and regions
Intended Audience:
- Data engineers building or operating real-time data pipelines at scale
- Platform and infrastructure teams managing hybrid cloud or multi-region setups
- Architects and tech leads designing streaming lakehouse or data lake architectures
- SREs and observability engineers focused on data freshness, availability, and cost efficiency
About us:
Robin is a Principal Architect at Flipkart’s Data Platform team with over 16 years of experience in building high-scale, distributed systems. He has been instrumental in driving the architecture and evolution of Flipkart’s data infrastructure, with a strong focus on scalability, reliability, and cost efficiency.
Bharat is an SDE-3 at Flipkart’s Data Platform team, specializing in building large-scale streaming data pipelines. With deep expertise in technologies like Apache Flink, Apache Kafka, and Spark Streaming, Bharat has led the development of several critical components of Flipkart’s real-time data processing stack.