Call for submissions: Platform Engineering Meet-ups

Share your journey of building platforms that power engineering teams

Shivam Gupta

@shivamgupta

Building Low-Latency Ad-Serving Platforms — Lessons from Running 3M QPS in Production

Submitted Oct 6, 2025

Ever wondered what it takes to run a low-latency, high-throughput ML platform serving over 3 million requests per second? In this talk, we’ll dive deep into the engineering challenges of operating at extreme scale — from handling synchronous ML inference in real-time ad serving to ensuring predictable latency under heavy load. We’ll explore the evolution of our architecture — how Jetty-based application logic interacts with TensorFlow Serving in a latency-critical path.

We’ll discuss real-world lessons on ML inference optimization, timeout budgeting, caching strategies, and horizontal scalability in Kubernetes. You’ll learn how we tuned system configs, optimized model serving, and built guardrails for graceful degradation without hurting user experience — all while running in production at massive scale.
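To make the timeout-budgeting and graceful-degradation ideas concrete, here is a minimal sketch (not the speaker's actual implementation; the budget value, function names, and fallback payload are illustrative assumptions) of enforcing a hard deadline on a synchronous inference call and serving a fallback response instead of an error when the model is slow:

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError
import time

TOTAL_BUDGET_MS = 100  # hypothetical per-request latency budget

def score_with_fallback(pool, budget_ms, inference, fallback):
    """Run inference under a hard deadline; degrade to a fallback on timeout."""
    future = pool.submit(inference)
    try:
        return future.result(timeout=budget_ms / 1000)
    except TimeoutError:
        future.cancel()  # don't let slow calls pile up and starve the pool
        return fallback  # serve a default ad rather than an error

pool = ThreadPoolExecutor(max_workers=2)
# A fast model call returns its score; a slow one falls back gracefully.
fast = score_with_fallback(pool, 50, lambda: "model-score", "default-ad")
slow = score_with_fallback(pool, 50,
                           lambda: (time.sleep(0.5), "model-score")[1],
                           "default-ad")
print(fast, slow)  # model-score default-ad
pool.shutdown(wait=False, cancel_futures=True)
```

The same shape applies in a JVM service (e.g. a `Future.get(budget, MILLISECONDS)` around the TensorFlow Serving call): the key design choice is that the deadline is owned by the caller, so tail latency stays bounded even when the model backend degrades.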

Key Takeaways:

Practical strategies to reduce ML inference latency in high-QPS environments.
Design patterns and system configs that help achieve consistent sub-100ms response times under real-world load.

Who Should Attend:

Backend Engineers building latency-sensitive or ML-driven systems.
MLOps & Infra Engineers managing large-scale model deployments.
Architects designing scalable, fault-tolerant inference and serving pipelines.

Speaker Bio
Shivam Gupta
Staff Engineer @ InMobi
DSP team

