Sat, Nov 8, 2025 · 02:45 PM – 05:30 PM IST
Shivam Gupta
@shivamgupta
Submitted Oct 6, 2025
Ever wondered what it takes to run a low-latency, high-throughput ML platform serving over 3 million requests per second? In this talk, we’ll dive deep into the engineering challenges of operating at extreme scale — from handling synchronous ML inference in real-time ad serving to ensuring predictable latency under heavy load. We’ll explore the evolution of our architecture — how Jetty-based application logic interacts with TensorFlow Serving in a latency-critical path.
We’ll discuss real-world lessons on ML inference optimization, timeout budgeting, and a query planner that optimizes I/O calls. You’ll learn how we tuned system configs, optimized model serving, and built guardrails for graceful degradation without hurting user experience, all while running in production at massive scale.
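To make the latency-critical call path concrete, here is a minimal Java sketch of timeout budgeting with graceful degradation, in the spirit of the Jetty-to-TensorFlow-Serving path described above. It is not InMobi's production code: InferenceClient, the budget handling, and fallbackScores are illustrative assumptions; the remaining request budget bounds the synchronous model call, and default scores are served if inference cannot answer in time.

import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the real model-serving stub (e.g. a TensorFlow
// Serving gRPC client); the interface and names are illustrative only.
interface InferenceClient {
    double[] predict(float[] features);
}

public class BudgetedInference {
    private final InferenceClient client;
    private final Duration requestBudget;   // total latency budget for one ad request
    private final double[] fallbackScores;  // guardrail: served when the model misses its budget

    public BudgetedInference(InferenceClient client, Duration requestBudget, double[] fallbackScores) {
        this.client = client;
        this.requestBudget = requestBudget;
        this.fallbackScores = fallbackScores;
    }

    // Spend at most the remaining budget on synchronous inference;
    // degrade gracefully to default scores instead of breaching the latency target.
    public double[] score(float[] features, Duration alreadySpent) {
        Duration remaining = requestBudget.minus(alreadySpent);
        if (remaining.isNegative() || remaining.isZero()) {
            return fallbackScores; // budget already exhausted upstream: skip the model call
        }
        return CompletableFuture
                .supplyAsync(() -> client.predict(features))
                .orTimeout(remaining.toMillis(), TimeUnit.MILLISECONDS)
                .exceptionally(ex -> fallbackScores) // timeout or failure -> default scores
                .join();
    }
}

A real serving path would split the budget across each downstream step (user lookup, candidate retrieval, model scoring) rather than tracking a single remaining figure, but the pattern is the same.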
Key Takeaways:
Practical strategies to reduce ML inference latency in high-QPS environments.
Design patterns and system configs that help achieve consistent sub-100ms response times under real-world load (see the config sketch below).
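As one illustration of the kind of server-side config involved, here is a minimal Jetty setup sketch; the thread-pool size, idle timeout, and accept-queue values are placeholders, not the tuned numbers discussed in the talk.

import org.eclipse.jetty.server.Server;
import org.eclipse.jetty.server.ServerConnector;
import org.eclipse.jetty.util.thread.QueuedThreadPool;

public class TunedServer {
    public static void main(String[] args) throws Exception {
        // Bound the worker pool so load spikes queue instead of spawning unbounded threads.
        QueuedThreadPool threadPool = new QueuedThreadPool(512, 64); // max, min (placeholder values)
        threadPool.setName("ad-serving");

        Server server = new Server(threadPool);

        ServerConnector connector = new ServerConnector(server);
        connector.setPort(8080);
        connector.setIdleTimeout(5_000);     // ms; drop idle connections quickly under load
        connector.setAcceptQueueSize(1024);  // cap the TCP accept backlog
        server.addConnector(connector);

        server.start();   // handlers for the actual request path would be registered before this
        server.join();
    }
}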
Who Should Attend:
Backend Engineers building latency-sensitive or ML-driven systems.
MLOps & Infra Engineers managing large-scale model deployments.
Architects designing scalable, fault-tolerant inference and serving pipelines.
Speaker Bio
Shivam Gupta
Staff Engineer @Inmobi
DSP team