Traditional observability often means sifting through fragmented data and relying on manual expertise, leading to slow incident resolution. At Atlan, we’ve transformed our approach by building Atlan Scout, an AI-native observability system. This system has dramatically cut our P90 incident mitigation time by 70% through proactive anomaly detection and automated root cause analysis.
This talk shares our journey of developing Atlan Scout, an AI agent that leverages ClickHouse for high-cardinality metrics and VictoriaMetrics for time-series data. We’ll show how Scout correlates data across infrastructure layers and delivers actionable insights directly via Slack, moving us from reactive firefighting to proactive resolution.
As distributed systems scale, complexity skyrockets, rendering traditional monitoring insufficient for rapid incident response. Teams struggle with alert fatigue, prolonged outages, and the high cost of expert dependency. This talk offers a practical, field-tested blueprint for engineering teams looking to harness AI for building next-generation observability. We’ll share real production experience, measurable results (including a 70% reduction in resolution time), and lessons learned, empowering you to build faster, more reliable incident response systems.
This session will provide a technical deep dive into building an AI-powered observability system, “Atlan Scout,” focusing on the architecture, challenges, and learnings.
- We’ll start with the common pitfalls of traditional observability and why these approaches often fall short during critical incidents: information is fragmented, and diagnosis depends too heavily on specialized expertise. We’ll then share our vision for democratizing incident response through AI.
- From there, we’ll take a deep dive into the technical heart of “Atlan Scout,” our AI-powered system. You’ll learn about our data foundation, built with ClickHouse for high-cardinality metrics and VictoriaMetrics for time-series data, and how our AI pipeline uses LLMs for anomaly detection and pattern matching. We’ll also cover the integration layer connecting Slack, Grafana, and Kubernetes events, along with our strategies for handling complex correlation queries across vast datasets.
- Next, we’ll focus on how we built proactive issue detection: the ML models we use for spotting anomalies, the correlation algorithms that cut diagnosis time, our use of decision trees for automated root cause analysis, and how real-time pattern matching against historical incidents helps us identify problems faster.
- We’ll openly discuss the engineering challenges we faced (managing scale with high-velocity metrics, ensuring data quality, building rich context for our LLMs, and achieving sub-second latency for critical alerts) and the solutions we engineered.
- We’ll share the results and their impact, including the significant reduction in our P90 incident resolution time, fewer escalations, and a much-improved developer experience.
- Finally, we’ll wrap up with lessons learned from our AI observability journey (what worked and what didn’t), along with specific ClickHouse performance optimization techniques and a glimpse into our future roadmap, which includes automated remediation and predictive maintenance.
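To make the idea of proactive anomaly detection concrete, here is a minimal sketch of a trailing-window z-score detector. This is a deliberately simple statistical stand-in and not Scout's actual approach, which the talk describes as ML models plus LLM-based pattern matching. The function name `zscore_anomalies`, the window size, the threshold, and the sample latency series are all hypothetical and chosen only for illustration.

```python
from statistics import mean, stdev


def zscore_anomalies(values, window=5, threshold=3.0):
    """Return indices of points that deviate from the mean of the
    preceding `window` points by more than `threshold` standard deviations.

    Hypothetical helper for illustration; not part of Atlan Scout.
    """
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)  # sample standard deviation of the window
        if sigma == 0:
            # Perfectly flat history: treat any change as a deviation.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Made-up request latencies in milliseconds with one obvious spike.
latency_ms = [100, 102, 99, 101, 100, 103, 98, 450, 101, 100]
print(zscore_anomalies(latency_ms))  # prints [7]
```

A detector like this only flags single-metric outliers. The correlation and root cause steps described above would need to combine such signals across infrastructure layers.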
Attendees will learn:
- A technical blueprint for constructing AI-powered observability systems.
- Best practices for utilizing ClickHouse and VictoriaMetrics at scale for observability workloads.
- Effective ML techniques for anomaly detection and root cause analysis in production environments.
- Integration patterns for creating seamless AI-human collaboration in incident response workflows.
- Practical lessons from our experience achieving a 70% reduction in incident response time.
Target audience:
- Primary: Site Reliability Engineers, Platform Engineers, Data Engineers involved in building or managing observability infrastructure.
- Secondary: Engineering Leaders, DevOps Managers, and Technical Architects looking to enhance incident response processes and system reliability through AI.
Aayush works with the Platform Engineering team at Atlan as an Engineering Manager, where he currently leads the Observability & Incident Management initiatives.