The Fifth Elephant 2025 Annual Conference CfP
Srikanth Sundarrajan
@sriksun Presenter
Submitted Jun 8, 2025
At InMobi, data is foundational to everything we do, powering use cases from personalization to predictive modeling across our products and platforms. Our data platform ingests, processes, and stores petabytes of data, and supports a wide spectrum of users, from product analysts to ML engineers.
In this talk, I’ll walk through how we’ve architected the platform to enable rapid experimentation for data scientists, while keeping infrastructure costs in check, a balancing act critical to our success.
We’ll begin with an overview of InMobi’s data ecosystem and technical stack, covering our use of distributed storage, Spark for large-scale compute, and the orchestration tools that bind it all together. From there, I’ll motivate why fast turnaround times for ML experiments, from feature engineering to model training, are crucial to InMobi’s applied science workflows. The need for fast iterations must be met without sacrificing resource efficiency, especially at our scale.
This led us to define three core tenets that guide how our platform is designed and optimized:
- What is stored: minimizing redundant and stale data, preferring late materialization and pointer-based joins.
- What is processed: structuring compute patterns to limit unnecessary shuffles and redundant reads.
- How efficiently we process it: the focus of the rest of this talk, especially around Spark.
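To make the first tenet concrete, here is a minimal, self-contained sketch (plain Python, with hypothetical table and column names) of late materialization: filter on a narrow key/metric table first, and dereference the wide feature payloads by pointer only for the rows that survive.

```python
# Narrow table: just a key and the metric we filter on.
events = [
    {"id": 1, "clicks": 120},
    {"id": 2, "clicks": 3},
    {"id": 3, "clicks": 87},
]

# Wide payloads, looked up by pointer (key) only when needed.
features = {
    1: {"geo": "IN", "device": "android"},
    2: {"geo": "US", "device": "ios"},
    3: {"geo": "SG", "device": "android"},
}

def high_click_rows(min_clicks):
    # Prune early on the narrow table; materialize (join) late.
    for row in events:
        if row["clicks"] >= min_clicks:
            yield {**row, **features[row["id"]]}

result = list(high_click_rows(50))  # only ids 1 and 3 are materialized
```

In a real pipeline the same idea applies at petabyte scale: the narrow table is scanned cheaply, and the expensive wide join happens only over the filtered subset.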
In particular, we’ve invested deeply in instrumentation and observability within Spark. We extended the Spark Event Listener interface to extract rich runtime metrics, configuration state, and query plans. But unlike basic Spark UIs or log aggregators, our observability stack is more than a matter-of-fact event history: we use it to surface performance bottlenecks, suboptimal parallelism, and other tuning opportunities.
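To illustrate the kind of signal a listener yields, here is a minimal sketch that mines Spark’s event log (the JSON-lines file the history server reads; a custom SparkListener receives the same events in-process). Field names follow Spark’s event-log schema but should be treated as illustrative, and the two events below are synthetic stand-ins for a real log file.

```python
import json

def summarize_task_runtime(lines):
    """Sum executor run time (ms) per stage from SparkListenerTaskEnd events."""
    per_stage = {}
    for line in lines:
        event = json.loads(line)
        if event.get("Event") != "SparkListenerTaskEnd":
            continue
        stage = event["Stage ID"]
        run_ms = event.get("Task Metrics", {}).get("Executor Run Time", 0)
        per_stage[stage] = per_stage.get(stage, 0) + run_ms
    return per_stage

# Synthetic events standing in for a real event-log file.
log = [
    json.dumps({"Event": "SparkListenerTaskEnd", "Stage ID": 0,
                "Task Metrics": {"Executor Run Time": 1200}}),
    json.dumps({"Event": "SparkListenerTaskEnd", "Stage ID": 0,
                "Task Metrics": {"Executor Run Time": 800}}),
]

totals = summarize_task_runtime(log)
```

Aggregates like this, per stage and per task, are the raw material for spotting skew, under-parallelized stages, and misconfigured jobs.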
Building on this, we’ve integrated with Vertex AI’s Agent Development Kit (ADK) to develop a multi-agent recommender system. These agents collaboratively reason over Spark metrics, source code context, prior review history, and active Git branches to suggest tuning recommendations, auto-generate pull requests, and flag regressions. The goal is not just to observe inefficiencies but to act on them.
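The collaboration pattern can be sketched in a few lines of plain Python. This is a toy pipeline, not the real ADK API: each agent reads a shared context and appends findings, and a final agent turns findings into recommendation drafts. All names, thresholds, and config keys here are hypothetical.

```python
def metrics_agent(ctx):
    # Flags a heavy shuffle from runtime metrics (threshold is illustrative).
    if ctx["metrics"]["shuffle_read_gb"] > 100:
        ctx["findings"].append("heavy shuffle: consider a broadcast join")
    return ctx

def config_agent(ctx):
    # Flags configuration left at Spark's default.
    if ctx["conf"].get("spark.sql.shuffle.partitions", 200) == 200:
        ctx["findings"].append("default shuffle partitions: tune for data size")
    return ctx

def recommender_agent(ctx):
    # Turns findings into draft recommendations (stand-in for PR generation).
    ctx["recommendations"] = [f"PR suggestion: {f}" for f in ctx["findings"]]
    return ctx

def run_pipeline(metrics, conf):
    ctx = {"metrics": metrics, "conf": conf, "findings": []}
    for agent in (metrics_agent, config_agent, recommender_agent):
        ctx = agent(ctx)
    return ctx["recommendations"]

recs = run_pipeline({"shuffle_read_gb": 250}, {})
```

In the real system each agent is an LLM-backed reasoner with richer inputs (source code, review history, Git branches), but the shared-context, sequential-handoff shape is the same.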
We orchestrate this flow periodically, using job metadata and cost traces to drive down infrastructure waste over time, both proactively and as part of postmortem feedback loops.
If time permits, we’ll walk through a minimalist demo of this flow end-to-end.
We’ll conclude by sharing some key learnings and outcomes, including measurable cost savings, reduced iteration time for data scientists, and improved visibility across stakeholders. Finally, we’ll look at what’s next: expanding beyond Spark, generalizing the recommender agent framework, and making performance tuning collaborative, explainable, and self-correcting by design.
Srikanth Sundarrajan is a seasoned architect with over 25 years of industry experience, including more than 15 years specializing in large-scale data processing and distributed systems. A passionate open-source advocate, he is a member of the Apache Software Foundation and has served on the Project Management Committees (PMC) of several Apache projects. Currently, he leads platform initiatives at InMobi Technologies, driving innovation and scalability across their systems.