Sathish

@sathish316

Agentic approach to Observability-based RCA using Semantic and Text2SQL Engines

Submitted Jan 9, 2026

Session Description

An AIOps system for root-cause analysis (RCA) of software incidents must triangulate data and signals from multiple sources and reason about the alert or incident in the context of those signals:

  • Code changes/deployments, Config changes, Feature flag rollouts
  • Metrics anomalies and outliers
  • Log errors/exceptions
  • System architecture

In this talk, we will discuss an RCA Engine that we have built at Atlassian using multi-source signal analysis over Observability/MELT sources (Metrics, Events, Logs, Traces) and context graphs. We propose a novel approach to RCA-MELT analysis, covering:

  1. Multi-Source Signal Analysis: How we query, analyze and correlate signals across metrics, logs and service context graphs.
  2. Agentic approach to metrics/log analysis: A deep dive into using a Semantic Engine combined with Text2SQL capabilities to allow agents to autonomously query and reason about metrics and logs.
  3. Evaluation: The quantitative results of this approach, highlighting its accuracy in metric/log query generation and in fault identification within our fault-simulation suites.
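To make the second point concrete, here is a minimal sketch of the idea of "find the relevant metric, then generate a query for it". It is not the engine described in this talk: the metric catalog, the `metrics` table and its columns, and the function names are all hypothetical. Token-overlap (Jaccard) similarity stands in for a real semantic/embedding search, and a fixed SQL template stands in for LLM-based Text2SQL.

```python
import re

# Hypothetical metric catalog: metric name -> human-readable description.
METRIC_CATALOG = {
    "http_5xx_rate": "rate of http server error responses 5xx per service",
    "db_conn_pool_wait_ms": "time requests wait for a database connection from the pool",
    "jvm_gc_pause_ms": "jvm garbage collection pause duration",
}


def tokenize(text):
    """Lowercase the text and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def rank_metrics(alert_text, catalog, top_k=2):
    """Rank catalog metrics by Jaccard overlap with the alert text.

    This is a stand-in for semantic search; a real system would likely
    compare embeddings instead of raw tokens.
    """
    alert_tokens = tokenize(alert_text)
    scored = []
    for name, description in catalog.items():
        desc_tokens = tokenize(description)
        union = alert_tokens | desc_tokens
        score = len(alert_tokens & desc_tokens) / len(union) if union else 0.0
        scored.append((score, name))
    # Highest score first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for score, name in scored[:top_k] if score > 0]


def build_metric_query(metric, service, minutes):
    """Fill a fixed SQL template (a stand-in for LLM-generated SQL).

    Inputs are validated against an allow-list and a strict pattern
    because they are interpolated into the query string.
    """
    if metric not in METRIC_CATALOG:
        raise ValueError(f"unknown metric: {metric}")
    if not re.fullmatch(r"[a-z0-9_-]+", service):
        raise ValueError(f"invalid service name: {service}")
    return (
        "SELECT ts, value FROM metrics "
        f"WHERE name = '{metric}' AND service = '{service}' "
        f"AND ts >= NOW() - INTERVAL '{int(minutes)} minutes' "
        "ORDER BY ts"
    )


if __name__ == "__main__":
    alert = "High rate of 5xx error responses on checkout service"
    for metric in rank_metrics(alert, METRIC_CATALOG):
        print(build_metric_query(metric, "checkout", 30))
```

Separating retrieval from query generation keeps the query step constrained to metrics that actually exist in the catalog, which is one plausible way to reduce invalid generated queries.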

Key Takeaways

  1. Multi-Agent Architecture: How to orchestrate complex diagnostic tasks among specialized agents.
  2. Semantic Layer: Semantic understanding of incidents and metrics to find the metrics relevant to debugging.
  3. Text2SQL Layer: Translating intent into accurate metrics/logs queries.
  4. Reliable Analysis: Techniques for combining LLMs with statistical models to accurately filter high-volume data (metrics and logs) during an incident.
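As an illustration of the fourth takeaway, the sketch below shows one simple way a statistical model can shrink high-volume metric data before an LLM reasons over it. It is an assumption-laden example, not the technique used in the talk: it applies a robust z-score (median and median absolute deviation) to each series and keeps only the series with a strong outlier. The function names, the sample data and the 3.5 threshold are illustrative choices.

```python
from statistics import median


def robust_z_scores(values):
    """Return modified z-scores based on the median and the MAD.

    The 0.6745 factor makes the score comparable to a standard z-score
    for normally distributed data. Returns zeros when the MAD is zero.
    """
    if not values:
        return []
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return [0.0 for _ in values]
    return [0.6745 * (v - med) / mad for v in values]


def anomalous_series(series_by_name, threshold=3.5):
    """Return names of series whose largest |z| meets the threshold.

    Names are ordered from most to least anomalous, so only a short,
    ranked list needs to be handed to a downstream LLM prompt.
    """
    peaks = {}
    for name, values in series_by_name.items():
        peak = max((abs(z) for z in robust_z_scores(values)), default=0.0)
        if peak >= threshold:
            peaks[name] = peak
    return sorted(peaks, key=peaks.get, reverse=True)


if __name__ == "__main__":
    # Hypothetical per-minute samples for two metrics.
    samples = {
        "latency_ms": [100, 102, 99, 101, 100, 98, 400],
        "cpu_percent": [50, 51, 49, 50, 52, 50, 51],
    }
    print(anomalous_series(samples))
```

The median and MAD are used instead of the mean and standard deviation because a single large spike inflates the latter and can hide the very outlier being searched for.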

Target Audience:

  • Software Engineers
  • Site Reliability Engineers (SREs) and DevOps Engineers
  • AI/ML Engineers interested in practical applications of agents

Bio

Vinith Kumar is a Machine Learning Engineer with 10+ years of experience building AI/ML systems for SaaS and enterprise products. He currently works at Atlassian on AI agents, observability, and root cause analysis. He previously led ML teams at Factors.ai and Freshworks.

Sathish Kumar is a Senior Principal Engineer at Atlassian working on problems in ITOps, alert/incident management, and AIOps. He has spent close to 20 years in the industry working on large-scale distributed systems, ML ranking systems, and B2C/B2B products and platforms.

Hosted by

Jumpstart better data engineering and AI futures