Rows, columns, and consequences

Speak at Rootconf’s Special Edition on Databases

Vasudev Jamwal

@vasujamwal

What Breaks When Aerospike Hits 6 Million QPS

Submitted Apr 30, 2026

Category: War Stories & Lessons Learned


Abstract

When your database is in the critical path of every ad auction, failure isn’t abstract. A misconfigured cluster costs you money in real time. A CPU spike at 1 AM means your bidder is throttling while your competitors’ bidders are not.

InMobi’s DSP runs multiple purpose-built Aerospike clusters on Kubernetes, peaking at 6 million QPS across workloads that have almost nothing in common — real-time user segment lookups, ML embedding serving, frequency-cap enforcement, event deduplication. After years of running this at scale, we’ve collected a set of failures and near-misses that the documentation doesn’t warn you about.

This talk goes through a few of them — incidents where the root cause turned out to be a default we never questioned, a data model decision that looked fine on day one, or a capacity assumption that held until it suddenly didn’t. Each one taught us something we couldn’t have learned without the production traffic to trigger it.

Beyond the failures, we’ll cover what we’ve built around Aerospike to keep it operational: caching layers, circuit-breaker patterns tuned per cluster, and the observability that now gives us early warning before things go wrong.
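To make "circuit-breaker patterns tuned per cluster" concrete, here is a minimal sketch of a count-based breaker in Python. It is not InMobi's implementation: the class, the cluster names, and the thresholds are all hypothetical, and a production breaker would likely also track latency and error rates rather than only consecutive failures.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Illustrative count-based circuit breaker (hypothetical, not from the talk)."""

    def __init__(self, failure_threshold: int, reset_timeout_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold  # consecutive failures before tripping
        self.reset_timeout_s = reset_timeout_s      # how long to stay open before probing
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout_s:
            return "half_open"
        return "open"

    def call(self, fn: Callable[[], T], fallback: Callable[[], T]) -> T:
        # While open, skip the database entirely and degrade to the fallback.
        if self.state == "open":
            return fallback()
        try:
            result = fn()
        except Exception:
            self._failures += 1
            # Trip on reaching the threshold, or re-open if a half-open probe failed.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            return fallback()
        # Any success closes the breaker and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result


# Hypothetical per-cluster tuning: a latency-critical cluster trips sooner
# and retries sooner than one whose callers can tolerate more errors.
BREAKERS = {
    "frequency_cap": CircuitBreaker(failure_threshold=3, reset_timeout_s=1.0),
    "user_segments": CircuitBreaker(failure_threshold=10, reset_timeout_s=5.0),
}
```

A caller would wrap each lookup as `BREAKERS["frequency_cap"].call(read_from_cluster, use_default)`, where both callables are placeholders for the real read and the degraded path (for example, a cached or default value).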


Key Takeaways

  • Default database configurations are tuned for correctness, not for extreme QPS — understanding what each setting trades off is what separates operating from just running a database
  • Data models that look fine at low scale can become infrastructure problems over time — record growth is a design concern, not just a storage concern
  • Every database has hidden resource costs that only surface at migration time — know your overhead before you need to
  • Resilience at this scale isn’t about the database being reliable — it’s about designing every layer around it to degrade gracefully when it isn’t

Target Audience

  • Senior engineers and architects operating databases in the critical path of production traffic
  • Platform and SRE engineers managing stateful, high-throughput systems on Kubernetes
  • Engineers using or evaluating Aerospike at scale — or running any low-latency key-value store under real load
  • Engineers who have hit — or expect to hit — scaling limits in production and want to know what breaks first and why

Speaker Bios

Shivam Gupta
Shivam is a Staff Software Engineer at InMobi, working on the DSP platform: the real-time bidding infrastructure that processes millions of ad auctions per second across InMobi’s global footprint.

Vasudev Singh Jamwal
Vasudev is a Senior Engineer at InMobi, working on distributed systems and Aerospike infrastructure on Kubernetes as part of the DSP platform.
