Platform Engineering meet-up - Nov 8

Real systems. Real engineers. Real lessons.

Shivam Gupta

@shivamgupta

Building Low-Latency ML Inference at Ad-serving Scale

Submitted Oct 6, 2025

Ever wondered what it takes to run a low-latency, high-throughput ML platform serving over 3 million requests per second? In this talk, we’ll dive deep into the engineering challenges of operating at extreme scale — from handling synchronous ML inference in real-time ad serving to ensuring predictable latency under heavy load. We’ll explore the evolution of our architecture — how Jetty-based application logic interacts with TensorFlow Serving in a latency-critical path.
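
To make the shape of that latency-critical path concrete, here is a minimal sketch of a synchronous call to TensorFlow Serving's REST predict endpoint with a hard per-request deadline, roughly as it might sit inside the application tier. The host (tf-serving), model name (ctr_model), and the 5 ms / 20 ms timeouts are illustrative placeholders, not the production values discussed in the talk.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Sketch: synchronous inference call on the request path, with a
// strict connect timeout and a hard deadline on the whole call.
public class InferenceClient {
    private static final HttpClient CLIENT = HttpClient.newBuilder()
            .connectTimeout(Duration.ofMillis(5))   // fail fast on connect
            .build();

    public static String predict(String instancesJson) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
                // TensorFlow Serving's standard REST predict route;
                // host and model name are hypothetical.
                .uri(URI.create("http://tf-serving:8501/v1/models/ctr_model:predict"))
                .timeout(Duration.ofMillis(20))      // hard deadline for the call
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(
                        "{\"instances\": " + instancesJson + "}"))
                .build();
        HttpResponse<String> response =
                CLIENT.send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();                      // JSON: {"predictions": [...]}
    }

    public static void main(String[] args) throws Exception {
        System.out.println(predict("[[0.1, 0.7, 0.3]]"));
    }
}
```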

We’ll discuss real-world lessons on ML inference optimization, timeout budgeting, and the query planner we built to optimize I/O calls. You’ll learn how we tuned system configs, optimized model serving, and built guardrails for graceful degradation without hurting user experience, all while running in production at massive scale.
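
As a minimal illustration of timeout budgeting with graceful degradation, the sketch below gives the model call a fixed slice of the overall request budget and, on timeout, falls back to a cheap default score instead of failing the ad request. The 15 ms budget and the fallback value are hypothetical numbers for the example, not the figures from the talk.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

// Sketch: enforce a per-call budget and degrade gracefully on timeout.
public class BudgetedInference {
    static final double FALLBACK_SCORE = 0.0;  // safe default when the model is slow

    static double scoreWithBudget(CompletableFuture<Double> modelCall, long budgetMs) {
        return modelCall
                .orTimeout(budgetMs, TimeUnit.MILLISECONDS) // this call's slice of the budget
                .exceptionally(ex -> FALLBACK_SCORE)        // degrade instead of failing
                .join();
    }

    public static void main(String[] args) {
        // Simulated model call that overruns its budget.
        CompletableFuture<Double> slowCall = CompletableFuture.supplyAsync(() -> {
            try { Thread.sleep(50); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            return 0.87;
        });
        System.out.println(scoreWithBudget(slowCall, 15)); // prints the fallback, 0.0
    }
}
```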

Key Takeaways:

Practical strategies to reduce ML inference latency in high-QPS environments.
Design patterns and system configs that help achieve consistent sub-100ms response times under real-world load (one such pattern is sketched after this list).
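
One such pattern, sketched minimally below, is a concurrency-limit bulkhead in front of the model server: it caps in-flight inference calls and sheds excess load immediately rather than letting queues build and tail latency grow. The permit count is an illustrative number, not a recommendation from the talk.

```java
import java.util.Optional;
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Sketch: bounded concurrency in front of the model server keeps
// tails predictable; saturation is reported to the caller instead
// of being absorbed as queueing delay.
public class InferenceBulkhead {
    private final Semaphore permits = new Semaphore(256); // illustrative limit

    // Returns empty when the system is saturated, so the caller can
    // use a default score instead of waiting out the budget.
    public Optional<Double> score(Supplier<Double> inferenceCall) {
        if (!permits.tryAcquire()) {
            return Optional.empty();          // shed load: no queueing
        }
        try {
            return Optional.of(inferenceCall.get());
        } finally {
            permits.release();
        }
    }
}
```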

Who Should Attend:

Backend Engineers building latency-sensitive or ML-driven systems.
MLOps & Infra Engineers managing large-scale model deployments.
Architects designing scalable, fault-tolerant inference and serving pipelines.

Speaker Bio
Shivam Gupta
Staff Engineer @InMobi
DSP team

