Rootconf Mini 2024 (on 22nd & 23rd Nov)

Geeking out on systems and security since 2012

Tarushi Bhandari

@tarushib

Skynet for Incidents: Intelligent Incident Management using Ansible Playbooks and RAG based LLM

Submitted Oct 30, 2024

Idea

Deep observability is crucial for any software system: when something fails, it gives us the visibility needed to resolve the issue quickly. To simplify handling these failures, we can create automated runbooks (predefined solutions to common problems), which can significantly reduce the time it takes to resolve incidents.

This proposal aims to use RAG (Retrieval-Augmented Generation) based LLMs, with Ansible Playbooks as runbooks, to respond to alerts from Grafana or other alerting tools. The system either makes an initial attempt at solving the problem or suggests playbooks to the team working on the issue.

Logs and metrics aggregated from the applications will be stored and monitored, with alerts configured for specific scenarios such as low disk space or high memory load. Each alert will carry relevant information, such as the error or its description, and will be sent as a webhook to a handler API. This handler will implement a RAG-based LLM backed by a knowledge base that describes the existing playbooks and their functions in detail, acting as a knowledge repository for incident management and resolution. The handler can then analyse the alert and either recommend a playbook or execute it directly. It can also track resolutions, so that over time it becomes a resolution knowledge base that others can query for previously resolved incidents.
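The flow described above can be sketched roughly as follows. This is a minimal illustration, not the proposed implementation: the playbook names and descriptions are hypothetical, the payload shape is only loosely modelled on an alerting webhook (the `alerts`, `labels` and `annotations` keys are assumptions), and a simple bag-of-words cosine similarity stands in for the embedding-based retrieval a real RAG pipeline would use. The LLM generation step is omitted; in practice the retrieved candidates would be passed to the model as context.

```python
import math
import re
from collections import Counter

# Hypothetical knowledge base: playbook file name -> description of what it does.
PLAYBOOKS = {
    "free_disk_space.yml": "clean old logs and temporary files when disk space usage is high",
    "restart_service.yml": "restart an application service that is down or unresponsive",
    "clear_memory_cache.yml": "drop caches and restart workers when memory load is high",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, k=2):
    """Return up to k (playbook, score) pairs, best match first."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(desc))), name)
              for name, desc in PLAYBOOKS.items()]
    scored.sort(reverse=True)
    return [(name, score) for score, name in scored[:k] if score > 0]


def alert_text(payload):
    """Flatten the alert names and descriptions of a payload into one query string."""
    parts = []
    for alert in payload.get("alerts", []):
        parts.append(alert.get("labels", {}).get("alertname", ""))
        parts.append(alert.get("annotations", {}).get("description", ""))
    return " ".join(p for p in parts if p)


def handle_alert(payload, auto_execute_threshold=0.5):
    """Decide whether to execute, suggest, or escalate for an incoming alert."""
    matches = retrieve(alert_text(payload))
    if not matches:
        return {"action": "escalate", "playbook": None, "candidates": []}
    best, score = matches[0]
    action = "execute" if score >= auto_execute_threshold else "suggest"
    return {"action": action, "playbook": best, "candidates": matches}


def ansible_command(playbook, host, dry_run=True):
    """Build (but do not run) an ansible-playbook command line for one host."""
    cmd = ["ansible-playbook", playbook, "--limit", host]
    if dry_run:
        cmd.append("--check")  # check mode: report changes without applying them
    return cmd


payload = {
    "alerts": [{
        "labels": {"alertname": "DiskSpaceLow"},
        "annotations": {"description": "disk space usage is high on /var"},
    }]
}
result = handle_alert(payload)
command = ansible_command(result["playbook"], "web-01")
```

The threshold between suggesting and executing is an arbitrary placeholder here; a real deployment would likely want that decision to depend on how risky the playbook is, not only on the match score.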

Advantages:
Faster incident resolution and root cause analysis.
Automation of the first steps in problem-solving, reducing mean time to resolution (MTTR).
A smart knowledge repository that improves with use.
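To make the MTTR and knowledge repository points concrete, here is a small sketch of how tracked resolutions might be recorded and queried. The `Resolution` record and its field names are hypothetical, chosen only for illustration; a real system would persist these records rather than keep them in memory.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Resolution:
    """One resolved incident (hypothetical record shape)."""
    alert_name: str
    playbook: str
    opened_at: datetime
    resolved_at: datetime


def mttr(records):
    """Mean time to resolution across the records, or None if there are none."""
    if not records:
        return None
    total = sum((r.resolved_at - r.opened_at for r in records), timedelta())
    return total / len(records)


def past_fixes(records, alert_name):
    """Playbooks that previously resolved alerts with the given name."""
    return [r.playbook for r in records if r.alert_name == alert_name]


start = datetime(2024, 11, 22, 10, 0)
history = [
    Resolution("DiskSpaceLow", "free_disk_space.yml", start, start + timedelta(minutes=10)),
    Resolution("ServiceDown", "restart_service.yml", start, start + timedelta(minutes=20)),
]
average = mttr(history)
```

With these two sample records, resolved in 10 and 20 minutes, the mean comes out at 15 minutes.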

Takeaways

Leveraging AI in the observability sphere to simplify existing processes
How to leverage existing, commonly used open source observability tools to modernise incident response
Understanding how an LLM-based approach could change incident response
A simple proof of concept using Grafana, Loki and Ansible

Key Audience

Infrastructure Engineers, SREs

