Arjun Mahishi

Bringing Down MTTR: Building an AI-Powered Diagnostic Platform for Database Support

Submitted Jun 24, 2026

Title: Bringing Down MTTR: Building an AI-Powered Diagnostic Platform for Database Support
Author: Arjun Mahishi (arjun.mahishi@gmail.com; Cockroach Labs)
Session type: Talk (30 mins)
Track: Building & implementing AI tools & agents in production
Submission for: The Fifth Elephant
Status of this doc: Draft (still iterating on it; will be done before 30th June)


Abstract

When a customer reports a problem with their CockroachDB cluster, support engineers need to sift through debug zips containing logs, system table dumps, CPU/heap profiles, metrics, and traces -- often hundreds of megabytes of diagnostic data. The traditional workflow involved
downloading these artifacts to individual laptops, running ad-hoc shell scripts, and context-switching between ticketing systems, secure file transfer tools, and runbooks. Mean Time to Resolve (MTTR) suffered.

We built a centralized diagnostic platform that makes all customer artifacts available on a cloud-backed filesystem, with investigation tools pre-installed on a VM, accessible through a web app. Then we layered AI agents on top to generate preliminary root cause analyses and
let engineers chat with an agent that can query, search, and correlate across all the diagnostic data.

The key design decision: instead of integrating with off-the-shelf observability platforms like Datadog, Loki, or Grafana -- which are built for live telemetry, not post-mortem analysis of debug artifacts -- we exposed everything as a filesystem. This lets AI agents use ripgrep, jq,
DuckDB, and Python -- tools that are heavily represented in LLM training data -- rather than requiring custom APIs or proprietary query interfaces. The diagnostic data from CockroachDB has a unique shape (system table CSVs, custom profile formats, interleaved multi-node logs) that no single
observability tool models well.
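
To make the filesystem idea concrete, here is a minimal sketch of how an agent tool could wrap standard CLI binaries. It is an illustration under assumptions, not the platform's actual code: the allowlist, the `buildToolCommand` helper, and the `/mnt/artifacts/ticket-123` mount point are all hypothetical names. The sketch only prepares the command; it does not run it.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
)

// allowedTools is a hypothetical allowlist. The proposal names ripgrep, jq,
// DuckDB, and Python; these are their usual binary names.
var allowedTools = map[string]bool{
	"rg":      true,
	"jq":      true,
	"duckdb":  true,
	"python3": true,
}

// ErrToolNotAllowed is returned when the agent requests a binary that is
// not on the allowlist.
var ErrToolNotAllowed = errors.New("tool not allowed")

// buildToolCommand prepares (but does not run) a CLI invocation whose working
// directory is the folder where one ticket's artifacts are mounted.
func buildToolCommand(artifactRoot, tool string, args ...string) (*exec.Cmd, error) {
	if !allowedTools[tool] {
		return nil, fmt.Errorf("%w: %q", ErrToolNotAllowed, tool)
	}
	cmd := exec.Command(tool, args...)
	cmd.Dir = artifactRoot // relative paths in args resolve against the artifacts
	return cmd, nil
}

func main() {
	// Hypothetical mount point and search pattern, for illustration only.
	cmd, err := buildToolCommand("/mnt/artifacts/ticket-123", "rg", "-n", "slow query", ".")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("would run:", cmd.Args, "in", cmd.Dir)
}
```

The point of the sketch is that the agent-facing surface is just "a command and a directory": there is no per-data-type API to design, document, or teach to the model.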

This talk covers:

  • The problem: What makes database diagnostic data fundamentally different from standard observability data, and why existing platforms don’t fit
  • Architecture: Cloud filesystem abstraction, DuckDB for SQL over diagnostic dumps, WebSocket-based chat interface, metadata storage
  • AI agent design: A homemade agent loop built in Go with tool calling and skills, model-agnostic across multiple LLM providers, and why we built our own instead of using agent frameworks
  • Why filesystems beat custom APIs for agents: The core design principle that shaped the entire platform
  • Measuring AI quality in production: LLM-as-judge evals, sentiment analysis, resolution tracking, and success scoring -- because user feedback alone isn’t enough
  • Production learnings: What worked, what didn’t, and what we’d do differently
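
As a rough sketch of what a model-agnostic agent loop with tool calling can look like in Go: the loop depends only on a small interface, so any provider can sit behind it. Every name here (`Model`, `Tool`, `runAgent`, `scriptedModel`, `grep_logs`) is hypothetical and chosen for illustration; the real platform's loop, including skills and compaction, is not shown.

```go
package main

import "fmt"

// Message is one turn in the conversation history.
type Message struct {
	Role    string // "user", "assistant", or "tool"
	Content string
}

// ToolCall is a model's request to run a named tool with one argument.
type ToolCall struct {
	Name string
	Arg  string
}

// Reply is either a final answer (Call == nil) or a tool call.
type Reply struct {
	Text string
	Call *ToolCall
}

// Model is the only provider-specific seam: one implementation per LLM vendor.
type Model interface {
	Next(history []Message) (Reply, error)
}

// Tool runs one action and returns its output as text.
type Tool func(arg string) (string, error)

// runAgent alternates between the model and the tools until the model gives a
// final answer or the step budget runs out.
func runAgent(m Model, tools map[string]Tool, question string, maxSteps int) (string, error) {
	history := []Message{{Role: "user", Content: question}}
	for step := 0; step < maxSteps; step++ {
		reply, err := m.Next(history)
		if err != nil {
			return "", err
		}
		if reply.Call == nil {
			return reply.Text, nil
		}
		// Tool failures are fed back to the model rather than aborting the loop.
		result := "unknown tool: " + reply.Call.Name
		if tool, ok := tools[reply.Call.Name]; ok {
			out, err := tool(reply.Call.Arg)
			if err != nil {
				result = "tool error: " + err.Error()
			} else {
				result = out
			}
		}
		history = append(history,
			Message{Role: "assistant", Content: "call " + reply.Call.Name + " " + reply.Call.Arg},
			Message{Role: "tool", Content: result},
		)
	}
	return "", fmt.Errorf("no final answer after %d steps", maxSteps)
}

// scriptedModel is a stand-in for a real provider client: it asks for one
// tool call, then summarizes the tool output.
type scriptedModel struct{}

func (scriptedModel) Next(history []Message) (Reply, error) {
	last := history[len(history)-1]
	if last.Role == "tool" {
		return Reply{Text: "Finding: " + last.Content}, nil
	}
	return Reply{Call: &ToolCall{Name: "grep_logs", Arg: "error"}}, nil
}

func main() {
	tools := map[string]Tool{
		"grep_logs": func(arg string) (string, error) {
			return "2 matches for " + arg, nil
		},
	}
	answer, err := runAgent(scriptedModel{}, tools, "Why is node 2 slow?", 5)
	fmt.Println(answer, err)
}
```

Swapping providers in this sketch means writing another `Model` implementation; the loop, the tools, and the step budget stay unchanged.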

The talk includes a live demo of the platform.


Key Takeaways

  1. Off-the-shelf observability platforms are built for live telemetry, not post-mortem analysis of debug artifacts
  2. LLMs have already seen Unix tools extensively in training -- expose your data as files and let agents use ripgrep/jq/DuckDB instead of building bespoke APIs
  3. Model-agnostic agent design pays off -- building your own agent loop gives you control over tool calling, compaction, and multi-model support without framework lock-in
  4. Measuring AI quality requires multiple signals -- combine user feedback with LLM-as-judge, sentiment analysis, and resolution tracking
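
One way to picture "multiple signals" is a weighted average over whichever signals were actually collected for a conversation. The sketch below is an assumption for illustration only: the `Signals` struct, the `successScore` function, and the weights are hypothetical and are not the scoring the platform uses.

```go
package main

import "fmt"

// Signals holds per-conversation quality signals, each normalized to [0, 1].
// A nil pointer means the signal was not collected.
type Signals struct {
	UserFeedback *float64 // explicit rating from the engineer, often absent
	JudgeScore   *float64 // LLM-as-judge rating
	Sentiment    *float64 // sentiment of the engineer's messages
	Resolved     *float64 // 1 if the ticket was resolved, else 0
}

// Hypothetical weights, chosen only to make the example concrete.
const (
	weightFeedback  = 0.4
	weightJudge     = 0.3
	weightSentiment = 0.1
	weightResolved  = 0.2
)

// successScore returns the weighted average of the signals that are present.
// The boolean is false when no signal was collected at all.
func successScore(s Signals) (float64, bool) {
	var total, weight float64
	add := func(v *float64, w float64) {
		if v != nil {
			total += *v * w
			weight += w
		}
	}
	add(s.UserFeedback, weightFeedback)
	add(s.JudgeScore, weightJudge)
	add(s.Sentiment, weightSentiment)
	add(s.Resolved, weightResolved)
	if weight == 0 {
		return 0, false
	}
	return total / weight, true
}

func main() {
	// No explicit user feedback here: the score falls back to the other signals.
	judge, resolved := 0.8, 1.0
	score, ok := successScore(Signals{JudgeScore: &judge, Resolved: &resolved})
	fmt.Println(score, ok)
}
```

Renormalizing by the weights of the signals that are present means a conversation without explicit feedback still gets a score, which is the situation that makes user feedback alone insufficient.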

