Rootconf Mini 2024

Geeking out on systems and security since 2012

Tarushi Bhandari

@tarushib

Skynet for Incidents: Intelligent Incident Management using Ansible Playbooks and RAG based LLM

Submitted Oct 30, 2024

Idea

Deep observability is crucial for any software system: when something fails, it gives us the visibility we need to resolve the issue quickly. To simplify handling these issues, we can create automated runbooks, that is, predefined solutions to common problems, which can significantly reduce the time it takes to resolve incidents.

This proposal aims to use RAG (Retrieval-Augmented Generation) based LLMs together with Ansible Playbooks as runbooks that respond to alerts from Grafana or other alerting tools, either making an initial attempt at solving the problem or suggesting playbooks to the team working on the issue.

Logs and metrics aggregated from the applications will be stored and monitored, with alerts configured for specific scenarios such as disk space or memory load issues. Each alert will carry relevant information, such as the error or description, and will be sent as a webhook to a handler API. This handler will implement a RAG based LLM backed by a knowledge base of existing playbooks and detailed descriptions of what they do, acting as a knowledge repository for incident management and resolution. The handler can then analyse the alert and either recommend a playbook or execute it directly. It can also track resolutions, so that over time it becomes a resolution knowledge base that others can query for previously resolved incidents.
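The alert-to-playbook flow described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the proposal's implementation: the playbook names and descriptions, the `title` and `message` alert fields, and the helper functions are all hypothetical, and a simple bag-of-words cosine similarity stands in for the embedding retrieval and LLM step that a real RAG pipeline would use. The command is only built, never run, and uses the `ansible-playbook` CLI's `--limit` and `--check` (dry run) options.

```python
import math
import re
from collections import Counter

# Hypothetical knowledge base: playbook file -> description of what it does.
PLAYBOOKS = {
    "free_disk_space.yml": "clean up old logs and temporary files when disk space usage is high",
    "restart_service.yml": "restart an application service that is down or not responding",
    "reduce_memory_load.yml": "find and restart processes causing high memory load",
}


def tokenize(text):
    """Lowercase the text and count its alphabetic words."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend_playbook(alert, min_score=0.1):
    """Return (playbook, score) for the best match, or (None, score) if nothing is close enough.

    Stand-in for the retrieval step: a real handler would embed the alert,
    query a vector store and let the LLM reason over the retrieved playbooks.
    """
    query = tokenize(f"{alert.get('title', '')} {alert.get('message', '')}")
    scored = [(cosine(query, tokenize(desc)), name) for name, desc in PLAYBOOKS.items()]
    score, name = max(scored)
    return (name, score) if score >= min_score else (None, score)


def build_command(playbook, host, check=True):
    """Build (but do not run) an ansible-playbook command; --check makes it a dry run."""
    cmd = ["ansible-playbook", playbook, "--limit", host]
    if check:
        cmd.append("--check")
    return cmd


if __name__ == "__main__":
    # Simplified, hypothetical alert payload as it might arrive on the webhook.
    alert = {"title": "Disk space low", "message": "disk usage above 90% on /var"}
    playbook, score = recommend_playbook(alert)
    if playbook:
        print("Suggested:", " ".join(build_command(playbook, "app-server-1")))
    else:
        print("No confident match; escalate to the on-call engineer")
```

Returning `None` below a similarity threshold keeps the handler in "suggest" mode when it is unsure, which matches the idea of recommending a playbook to the team rather than always executing one.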

Advantages:
Faster resolution and root cause analysis.
Automates the first steps in problem-solving, reducing mean time to resolution (MTTR).
Creates a smart knowledge repository that improves with use.
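To make the last two points concrete, here is a minimal sketch of a resolution log that grows with each handled incident and can report MTTR. The `Resolution` and `ResolutionLog` names and their fields are hypothetical, chosen only for illustration; a real deployment would more likely persist these records in a database or in the same store that backs the RAG retrieval.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Resolution:
    """One resolved incident (hypothetical record shape)."""
    alert: str
    playbook: str
    opened: datetime
    resolved: datetime


class ResolutionLog:
    """In-memory log of resolved incidents that others can query later."""

    def __init__(self):
        self.records = []

    def record(self, resolution):
        self.records.append(resolution)

    def search(self, keyword):
        """Return past resolutions whose alert text contains the keyword."""
        keyword = keyword.lower()
        return [r for r in self.records if keyword in r.alert.lower()]

    def mttr(self):
        """Mean time to resolution: average of (resolved - opened), or None if empty."""
        if not self.records:
            return None
        total = sum((r.resolved - r.opened for r in self.records), timedelta())
        return total / len(self.records)


if __name__ == "__main__":
    log = ResolutionLog()
    start = datetime(2024, 10, 30, 9, 0)
    log.record(Resolution("Disk space low on /var", "free_disk_space.yml",
                          start, start + timedelta(minutes=10)))
    log.record(Resolution("High memory load on app server", "reduce_memory_load.yml",
                          start, start + timedelta(minutes=30)))
    print("MTTR:", log.mttr())
    print("Past disk incidents:", [r.playbook for r in log.search("disk")])
```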

Takeaways

Leveraging AI in the observability sphere to simplify existing processes
How to leverage existing, commonly used open source observability tools to modernise incident response
Understanding how an LLM based approach could change incident response
A simple proof of concept using Grafana, Loki and Ansible

Key Audience

Infrastructure Engineers, SREs
