The Fifth Elephant 2025 Annual Conference CfP
Srikanth Sundarrajan
@sriksun Presenter
Submitted Jun 8, 2025
At InMobi, data is foundational to everything we do, powering use cases from personalization to predictive modeling across our products and platforms. Our data platform ingests, processes, and stores petabytes of data, and supports a wide spectrum of users, from product analysts to ML engineers.
In this talk, I’ll walk through how we’ve architected the platform to enable rapid experimentation for data scientists, while keeping infrastructure costs in check, a balancing act critical to our success.
We’ll begin with an overview of InMobi’s data ecosystem and technical stack, covering our use of distributed storage, Spark for large-scale compute, and the orchestration tools that bind it all together. From there, I’ll motivate why fast turnaround times for ML experiments, from feature engineering to model training, are crucial to InMobi’s applied science workflows. The need for fast iterations must be met without sacrificing resource efficiency, especially at our scale.
This led us to define three core tenets that guide how our platform is designed and optimized:
- What is stored: minimizing redundant and stale data, preferring late materialization and pointer-based joins.
- What is processed: structuring compute patterns to limit unnecessary shuffles and redundant reads.
- How efficiently we process it: the focus of the rest of this talk, especially around Spark.
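To make the first tenet concrete, here is a minimal, self-contained sketch (plain Python, with hypothetical table and column names) of late materialization: filter on a narrow key/metric table first, and dereference the wide feature payloads by pointer only for the rows that survive.

```python
# Narrow table: just a key and the metric we filter on.
events = [
    {"id": 1, "clicks": 120},
    {"id": 2, "clicks": 3},
    {"id": 3, "clicks": 87},
]

# Wide payloads, looked up by pointer (key) only when needed.
features = {
    1: {"geo": "IN", "device": "android"},
    2: {"geo": "US", "device": "ios"},
    3: {"geo": "SG", "device": "android"},
}

def high_click_rows(min_clicks):
    # Prune early on the narrow table; materialize (join) late.
    for row in events:
        if row["clicks"] >= min_clicks:
            yield {**row, **features[row["id"]]}

result = list(high_click_rows(50))  # only ids 1 and 3 are materialized
```

In a real pipeline the same idea applies at petabyte scale: the narrow table is scanned cheaply, and the expensive wide join happens only over the filtered subset.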
In particular, we’ve invested deeply in instrumentation and observability within Spark. We extended the Spark Event Listener interface to extract rich runtime metrics, configuration state, and query plans. But unlike basic Spark UIs or log aggregators, our observability stack is more than a matter-of-fact event history: we use it to surface performance bottlenecks, suboptimal parallelism, and other tuning opportunities.
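To illustrate the kind of signal a listener yields, here is a minimal sketch that mines Spark’s event log (the JSON-lines file the history server reads; a custom SparkListener receives the same events in-process). Field names follow Spark’s event-log schema but should be treated as illustrative, and the two events below are synthetic stand-ins for a real log file.

```python
import json

def summarize_task_runtime(lines):
    """Sum executor run time (ms) per stage from SparkListenerTaskEnd events."""
    per_stage = {}
    for line in lines:
        event = json.loads(line)
        if event.get("Event") != "SparkListenerTaskEnd":
            continue
        stage = event["Stage ID"]
        run_ms = event.get("Task Metrics", {}).get("Executor Run Time", 0)
        per_stage[stage] = per_stage.get(stage, 0) + run_ms
    return per_stage

# Synthetic events standing in for a real event-log file.
log = [
    json.dumps({"Event": "SparkListenerTaskEnd", "Stage ID": 0,
                "Task Metrics": {"Executor Run Time": 1200}}),
    json.dumps({"Event": "SparkListenerTaskEnd", "Stage ID": 0,
                "Task Metrics": {"Executor Run Time": 800}}),
]

totals = summarize_task_runtime(log)
```

Aggregates like this, per stage and per task, are the raw material for spotting skew, under-parallelized stages, and misconfigured jobs.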
Building on this, we’ve integrated with Vertex AI’s Agent Development Kit (ADK) to develop a multi-agent recommender system. These agents collaboratively reason over Spark metrics, source code context, prior review history, and active Git branches to suggest tuning recommendations, auto-generate pull requests, and flag regressions. The goal is not just to observe inefficiencies but to act on them.
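The collaboration pattern can be sketched in a few lines of plain Python. This is a toy pipeline, not the real ADK API: each agent reads a shared context and appends findings, and a final agent turns findings into recommendation drafts. All names, thresholds, and config keys here are hypothetical.

```python
def metrics_agent(ctx):
    # Flags a heavy shuffle from runtime metrics (threshold is illustrative).
    if ctx["metrics"]["shuffle_read_gb"] > 100:
        ctx["findings"].append("heavy shuffle: consider a broadcast join")
    return ctx

def config_agent(ctx):
    # Flags configuration left at Spark's default.
    if ctx["conf"].get("spark.sql.shuffle.partitions", 200) == 200:
        ctx["findings"].append("default shuffle partitions: tune for data size")
    return ctx

def recommender_agent(ctx):
    # Turns findings into draft recommendations (stand-in for PR generation).
    ctx["recommendations"] = [f"PR suggestion: {f}" for f in ctx["findings"]]
    return ctx

def run_pipeline(metrics, conf):
    ctx = {"metrics": metrics, "conf": conf, "findings": []}
    for agent in (metrics_agent, config_agent, recommender_agent):
        ctx = agent(ctx)
    return ctx["recommendations"]

recs = run_pipeline({"shuffle_read_gb": 250}, {})
```

In the real system each agent is an LLM-backed reasoner with richer inputs (source code, review history, Git branches), but the shared-context, sequential-handoff shape is the same.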
We orchestrate this flow periodically, using job metadata and cost traces to drive down infrastructure waste over time, both proactively and as part of postmortem feedback loops.
If time permits, we’ll walk through a minimalist demo of this flow end-to-end.
We’ll conclude by sharing some key learnings and outcomes, including measurable cost savings, reduced iteration time for data scientists, and improved visibility across stakeholders. Finally, we’ll look at what’s next: expanding beyond Spark, generalizing the recommender agent framework, and making performance tuning collaborative, explainable, and self-correcting by design.
Srikanth Sundarrajan is a seasoned architect with over 25 years of industry experience, including more than 15 years specializing in large-scale data processing and distributed systems. A passionate open-source advocate, he is a member of the Apache Software Foundation and has served on the Project Management Committees (PMC) of several Apache projects. Currently, he leads platform initiatives at InMobi Technologies, driving innovation and scalability across their systems.