Rootconf 2025 Annual Conference CfP

Speak at Rootconf 2025 Annual Conference

Preeti

@preetidewani

Reducing Alert Fatigue: AI Agents for Improved Observability

Submitted Apr 18, 2025

Introduction

Imagine being the on-call engineer for a high-traffic system. Every alert that comes through might be important, or it might just be noise. Over time, that constant stream of notifications becomes overwhelming. Critical issues get buried, engineers burn out, and the system’s reliability suffers. We’ve come to treat this as a normal part of working with distributed systems. But it doesn’t have to be.

Bringing On-Call Sanity Back: How AI Is Transforming Observability

In this talk, I will walk through how we transformed a chaotic environment that generated thousands of daily alerts into a system that boosts engineer productivity instead of hindering it. We’ll discuss how we proactively addressed the alert fatigue problem before it escalated and fostered a culture where the solution became ingrained in engineering practices.

Our breakthrough came when we approached the alert fatigue problem from an engineering perspective. We asked ourselves, “If engineers write tests before releasing code to production, why can’t similar tools be integrated into the CI/CD workflow to test the usefulness of configured alerts?”

How the AI Agent Works

This talk goes beyond theory to show how we built our solution:

  • The agent starts with historical analysis, learning from past alerts to identify patterns that separate noise from meaningful signals.

  • It takes context into account, not just the configuration of the alert. This includes how previous alerts were handled, what was happening in the system at the time, and whether a deployment was in progress.

  • During development, it evaluates new alert configurations before they’re deployed. That way, engineers get feedback early on whether an alert is likely to be useful.
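To make the idea of reviewing an alert rule before deployment concrete, here is a minimal sketch. It is not the agent described in this talk: it swaps in a plain frequency heuristic over past firings, and the `AlertRecord` schema, the `noise_ratio` and `review_rule` helpers, the 0.5 threshold, and the choice to skip firings that happened during a deployment are all hypothetical assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AlertRecord:
    """One past firing of an alert rule (hypothetical schema)."""
    rule: str            # name of the alert rule that fired
    actionable: bool     # did the on-call engineer have to act on it?
    during_deploy: bool  # was a deployment in progress when it fired?


def noise_ratio(history: list[AlertRecord], rule: str,
                ignore_deploys: bool = True) -> Optional[float]:
    """Return the fraction of past firings of `rule` that needed no action.

    Firings during a deployment are skipped by default (an assumption:
    deploy-time alerts are treated as context, not as evidence).
    Returns None when there is no usable history for the rule.
    """
    firings = [r for r in history
               if r.rule == rule and not (ignore_deploys and r.during_deploy)]
    if not firings:
        return None
    return sum(not r.actionable for r in firings) / len(firings)


def review_rule(history: list[AlertRecord], rule: str,
                max_noise: float = 0.5) -> str:
    """Give early feedback on a rule, e.g. from a CI step."""
    ratio = noise_ratio(history, rule)
    if ratio is None:
        return "unknown"  # no history: leave the decision to a human
    return "likely-noise" if ratio > max_noise else "likely-useful"


# Made-up history, for illustration only.
history = [
    AlertRecord("disk_usage_high", actionable=False, during_deploy=False),
    AlertRecord("disk_usage_high", actionable=False, during_deploy=False),
    AlertRecord("disk_usage_high", actionable=True, during_deploy=False),
    AlertRecord("disk_usage_high", actionable=False, during_deploy=True),
    AlertRecord("api_5xx_rate", actionable=True, during_deploy=False),
]

for name in ("disk_usage_high", "api_5xx_rate", "brand_new_rule"):
    print(name, "->", review_rule(history, name))
```

A CI job could run a check like `review_rule` against each changed alert definition and surface the verdict as a review comment, so the feedback arrives before the alert ever pages anyone.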

Challenges we faced along the way

Building something like this was not easy.

  • Balancing precision and recall: Filtering out the noise without missing real problems.
  • Building trust: Engineers didn’t want a black box, so we made sure the system explained why it was classifying something as noise.
  • Integration complexity: Plugging into different alerting tools and CI/CD workflows meant building integrations that were flexible and reliable.
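The precision and recall trade-off can be made concrete with a small calculation. In this sketch the positive class is "noise": precision is the share of suppressed alerts that really were noise, and recall is the share of all noise that was caught. The `precision_recall` helper and the sample labels are illustrative assumptions, not data or code from our system.

```python
def precision_recall(predicted_noise: list[bool],
                     actual_noise: list[bool]) -> tuple[float, float]:
    """Compute precision and recall for a noise classifier.

    predicted_noise[i] is True when the classifier called alert i noise;
    actual_noise[i] is True when alert i really needed no action.
    """
    pairs = list(zip(predicted_noise, actual_noise))
    tp = sum(p and a for p, a in pairs)        # noise correctly suppressed
    fp = sum(p and not a for p, a in pairs)    # real problems suppressed
    fn = sum(not p and a for p, a in pairs)    # noise that still paged someone
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up labels for four alerts, for illustration only.
predicted = [True, True, False, False]
actual = [True, False, True, False]
precision, recall = precision_recall(predicted, actual)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Under this framing, low precision is the dangerous failure (a real incident gets hidden), while low recall only means some noise still reaches the on-call engineer.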

What You’ll Take Away

  • Considerations for developing AI tools.
  • Strategies for enhancing AI tool performance based on specific use cases.
  • A functional AI agent for noise reduction.

Come for the AI, stay for the on-call sanity. Let’s discover how AI can transform observability from a reactive burden to a proactive advantage that makes both your systems and teams healthier.
