Fri 12 – Sat 13 Jun 2026, 09:00 AM – 06:00 PM IST
Vasudev Jamwal
@vasujamwal
Submitted Apr 30, 2026
Category: War Stories & Lessons Learned
When your database is in the critical path of every ad auction, failure isn’t abstract. A misconfigured cluster costs you money in real time. A CPU spike at 1 AM means your bidder is throttling while your competitors’ bidders are not.
InMobi’s DSP runs multiple purpose-built Aerospike clusters on Kubernetes, peaking at 6 million QPS across workloads that have almost nothing in common — real-time user segment lookups, ML embedding serving, frequency-cap enforcement, event deduplication. After years of running this at scale, we’ve collected a set of failures and near-misses that the documentation doesn’t warn you about.
This talk goes through a few of them — incidents where the root cause turned out to be a default we never questioned, a data model decision that looked fine on day one, or a capacity assumption that held until it suddenly didn’t. Each one taught us something we couldn’t have learned without the production traffic to trigger it.
Beyond the failures, we’ll cover what we’ve built around Aerospike to keep it operational: caching layers, circuit-breaker patterns tuned per cluster, and the observability that now gives us early warning before things go wrong.
Shivam Gupta
Shivam is a Staff Software Engineer at InMobi working on the DSP platform, the real-time bidding infrastructure that processes millions of ad auctions per second across InMobi’s global footprint.
shivam.gupta@inmobi.com
Slide deck
https://docs.google.com/presentation/d/1_-TPbgA_TyI2Iinl-6VhvuHRUy1d4cwzdlxR8ul6Tt0/edit?usp=sharing