Sat, Nov 8, 2025 · 02:45 PM – 05:30 PM IST
Shivam Gupta
@shivamgupta
Submitted Oct 6, 2025
Ever wondered what it takes to run a low-latency, high-throughput ML platform serving over 3 million requests per second? In this talk, we’ll dive deep into the engineering challenges of operating at extreme scale — from handling synchronous ML inference in real-time ad serving to ensuring predictable latency under heavy load. We’ll explore the evolution of our architecture — how Jetty-based application logic interacts with TensorFlow Serving in a latency-critical path.
We’ll discuss real-world lessons on ML inference optimization, timeout budgeting, and a query planner that optimizes I/O calls. You’ll learn how we tuned system configs, optimized model serving, and built guardrails for graceful degradation without hurting user experience, all while running in production at massive scale.
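To make the latency-critical call path concrete, here is a minimal Java sketch of timeout budgeting with graceful degradation, in the spirit of the Jetty-to-TensorFlow-Serving path described above. It is not InMobi's production code: InferenceClient, the budget handling, and fallbackScores are illustrative assumptions; the remaining request budget bounds the synchronous model call, and default scores are served if inference cannot answer in time.

import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the real model-serving stub (e.g. a TensorFlow
// Serving gRPC client); the interface and names are illustrative only.
interface InferenceClient {
    double[] predict(float[] features);
}

public class BudgetedInference {
    private final InferenceClient client;
    private final Duration requestBudget;   // total latency budget for one ad request
    private final double[] fallbackScores;  // guardrail: served when the model misses its budget

    public BudgetedInference(InferenceClient client, Duration requestBudget, double[] fallbackScores) {
        this.client = client;
        this.requestBudget = requestBudget;
        this.fallbackScores = fallbackScores;
    }

    // Spend at most the remaining budget on synchronous inference;
    // degrade gracefully to default scores instead of breaching the latency target.
    public double[] score(float[] features, Duration alreadySpent) {
        Duration remaining = requestBudget.minus(alreadySpent);
        if (remaining.isNegative() || remaining.isZero()) {
            return fallbackScores; // budget already exhausted upstream: skip the model call
        }
        return CompletableFuture
                .supplyAsync(() -> client.predict(features))
                .orTimeout(remaining.toMillis(), TimeUnit.MILLISECONDS)
                .exceptionally(ex -> fallbackScores) // timeout or failure -> default scores
                .join();
    }
}

A real serving path would split the budget across each downstream step (user lookup, candidate retrieval, model scoring) rather than tracking a single remaining figure, but the pattern is the same.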
Key Takeaways:
Practical strategies to reduce ML inference latency in high-QPS environments.
Design patterns and system configs that help achieve consistent sub-100ms response times under real-world load (see the config sketch below).
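As one illustration of the kind of server-side config involved, here is a minimal Jetty setup sketch; the thread-pool size, idle timeout, and accept-queue values are placeholders, not the tuned numbers discussed in the talk.

import org.eclipse.jetty.server.Server;
import org.eclipse.jetty.server.ServerConnector;
import org.eclipse.jetty.util.thread.QueuedThreadPool;

public class TunedServer {
    public static void main(String[] args) throws Exception {
        // Bound the worker pool so load spikes queue instead of spawning unbounded threads.
        QueuedThreadPool threadPool = new QueuedThreadPool(512, 64); // max, min (placeholder values)
        threadPool.setName("ad-serving");

        Server server = new Server(threadPool);

        ServerConnector connector = new ServerConnector(server);
        connector.setPort(8080);
        connector.setIdleTimeout(5_000);     // ms; drop idle connections quickly under load
        connector.setAcceptQueueSize(1024);  // cap the TCP accept backlog
        server.addConnector(connector);

        server.start();   // handlers for the actual request path would be registered before this
        server.join();
    }
}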
Who Should Attend:
Backend Engineers building latency-sensitive or ML-driven systems.
MLOps & Infra Engineers managing large-scale model deployments.
Architects designing scalable, fault-tolerant inference and serving pipelines.
Speaker Bio
Shivam Gupta
Staff Engineer @Inmobi
DSP team