Srikanth Sundarrajan

@sriksun Presenter

Driving ML experimentation without breaking the infra budget

Submitted Jun 8, 2025

At InMobi, data is foundational to everything we do, powering use cases from personalization to predictive modeling across our products and platforms. Our data platform ingests, processes, and stores petabytes of data and supports a wide spectrum of users, from product analysts to ML engineers.

In this talk, I’ll walk through how we’ve architected the platform to enable rapid experimentation for data scientists, while keeping infrastructure costs in check, a balancing act critical to our success.

We’ll begin with an overview of InMobi’s data ecosystem and technical stack, covering our use of distributed storage, Spark for large-scale compute, and the orchestration tools that bind it all together. From there, I’ll explain why fast turnaround times for ML experiments, from feature engineering to model training, are crucial to InMobi’s applied science workflows. The need for fast iteration must be met without sacrificing resource efficiency, especially at our scale.

This led us to define three core tenets that guide how our platform is designed and optimized:

  1. What is stored - minimizing redundant and stale data, preferring late materialization and pointer-based joins.

  2. What is processed - structuring compute patterns to limit unnecessary shuffles and redundant reads.

  3. How efficiently we process it - the focus of the rest of this talk, especially around Spark.
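
To make the first tenet concrete, here is a minimal sketch of late materialization with a pointer-based join, written in plain Python rather than Spark. The data layout, the field names, and the `late_materialized_join` helper are all hypothetical and are not taken from InMobi's platform. The sketch only illustrates the general idea: filter and join on narrow (key, pointer) rows first, and fetch the wide payload only for the rows that survive.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical narrow index row: (user_id, pointer into the wide event store).
EventPointer = Tuple[str, int]


def late_materialized_join(
    pointers: Iterable[EventPointer],
    active_users: Set[str],
    wide_store: Dict[int, dict],
) -> List[dict]:
    """Join on narrow (key, pointer) rows first; fetch wide rows last."""
    # Step 1: filter/join using only keys and pointers (cheap, narrow rows).
    surviving = [ptr for user_id, ptr in pointers if user_id in active_users]
    # Step 2: materialize the wide payload only for rows that survived.
    return [wide_store[ptr] for ptr in surviving]


# Illustrative data only; a real wide row would carry many more columns.
wide_store = {
    0: {"user_id": "u1", "payload": "x" * 8},
    1: {"user_id": "u2", "payload": "y" * 8},
    2: {"user_id": "u1", "payload": "z" * 8},
}
pointers = [("u1", 0), ("u2", 1), ("u1", 2)]
rows = late_materialized_join(pointers, {"u1"}, wide_store)
print(len(rows))
```

The wide rows for users outside `active_users` are never read, which is where the storage and I/O savings would come from in a real system.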

In particular, we’ve invested deeply in instrumentation and observability within Spark. We extended Spark’s event listener interface to extract rich runtime metrics, configuration state, and query plans. Unlike basic Spark UIs or log aggregators, our observability stack is more than a matter-of-fact event history: we use it to surface performance bottlenecks, suboptimal parallelism, and other tuning opportunities.
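
As a rough illustration of the kind of analysis described above, the sketch below flags low parallelism and task skew from per-stage task durations. It is plain Python and makes several assumptions: the `StageMetrics` record, the `findings` function, and the thresholds are hypothetical, and in Spark itself the raw numbers would be collected by a listener (for example via `SparkListener` callbacks such as `onTaskEnd`) rather than hard-coded.

```python
from dataclasses import dataclass
from statistics import median
from typing import List


@dataclass
class StageMetrics:
    """Hypothetical per-stage record, as a listener might emit it."""
    stage_id: int
    task_durations_ms: List[int]
    available_cores: int


def findings(stage: StageMetrics, skew_ratio: float = 3.0) -> List[str]:
    """Return tuning hints for one stage; thresholds are illustrative."""
    hints: List[str] = []
    durations = stage.task_durations_ms
    if not durations:
        return hints
    # Fewer tasks than cores means part of the cluster sat idle.
    if len(durations) < stage.available_cores:
        hints.append("low_parallelism")
    # A straggler far above the median suggests skewed partitions.
    med = median(durations)
    if med > 0 and max(durations) / med >= skew_ratio:
        hints.append("task_skew")
    return hints


# Made-up numbers, chosen only to exercise each branch.
stages = [
    StageMetrics(1, [100, 110, 95, 900], available_cores=4),
    StageMetrics(2, [200, 210], available_cores=8),
    StageMetrics(3, [100, 105, 98, 102], available_cores=4),
]
report = {s.stage_id: findings(s) for s in stages}
print(report)
```

Hints like these are the sort of signal that could be passed downstream to a recommender; the actual metrics and rules used on the platform are not specified in this abstract.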

Building on this, we’ve integrated with Vertex AI’s Agent Development Kit (ADK) to develop a multi-agent recommender system. These agents collaboratively reason over Spark metrics, source code context, prior review history, and active Git branches to suggest tuning changes, auto-generate pull requests, and flag regressions. The goal is not just to observe inefficiencies but to act on them.

We orchestrate this flow periodically, using job metadata and cost traces to drive down infrastructure waste over time, both proactively and as part of postmortem feedback loops.

If time permits, we’ll walk through a minimalist demo of this flow end-to-end.

We’ll conclude by sharing key learnings and outcomes, including measurable cost savings, reduced iteration time for data scientists, and improved visibility across stakeholders. Finally, we’ll look at what’s next: expanding beyond Spark, generalizing the recommender agent framework, and making performance tuning collaborative, explainable, and self-correcting by design.

Bio

Srikanth Sundarrajan is a seasoned architect with over 25 years of industry experience, including more than 15 years specializing in large-scale data processing and distributed systems. A passionate open-source advocate, he is a member of the Apache Software Foundation and has served on the Project Management Committees (PMC) of several Apache projects. Currently, he leads platform initiatives at InMobi Technologies, driving innovation and scalability across their systems.

