Call for submissions: Platform Engineering meet-ups

Share your journey of building platforms that power engineering teams

Parth Agrawal

@ag_parth

Druid at 700 MBps: How We Stopped Babysitting Our Observability Stack

Submitted May 8, 2026

{Describe your session in 2 paragraphs}
Confluent’s observability platform handles around 7 million events per second. Apache Druid is what makes that work. It’s where our metrics land and where every dashboard, alert, and on-call investigation eventually pulls from. This session is about how we actually operate Druid in production: how we lay out ingestion, the data modeling choices that bit us before they helped us, and the kinds of failures you only really learn about by hitting them.
I’ll spend most of the time on the two changes that made the biggest difference for us. The first is how we split our clusters into a scraping tier for ingestion and recent queries, and a historical tier for older data. The second is the set of automations we built to handle the stuff Druid operators see all the time: stuck tasks, segments that won’t balance, queries that run forever, coordinator weirdness. None of it is fancy, but together it’s the difference between getting paged at 3am and not.

{Mention 1-2 takeaways from your session}

  1. How separating workloads, rather than just scaling a single system, can solve a class of reliability problems that more capacity won’t.

  2. A practical way to think about which operational pain points are worth automating away, and which ones aren’t.

{Which audiences is your session going to be beneficial for?}
SREs and platform engineers running observability or large-scale data systems, and anyone thinking about how to scale operations without scaling the on-call rotation. The patterns will land hardest if you operate a real-time analytics or metrics store, but most of the ideas apply to any stateful data infrastructure under heavy load.

{Add your bio - who you are; where you work}
Parth Agrawal, Senior Software Engineer at Confluent. I have been working on the observability platform behind Confluent Cloud for the last 3 years.

Hosted by

We care about site reliability, cloud costs, security and data privacy