Nov 2024
22 Fri 09:00 AM – 05:15 PM IST
23 Sat 09:00 AM – 06:15 PM IST
Tarushi Bhandari
Deep observability is crucial for any software system: when something fails, it gives us the visibility needed to resolve the issue quickly. To simplify handling these failures, we can create automated runbooks, predefined solutions to common problems, which can significantly reduce the time it takes to resolve incidents.
This proposal uses a RAG (Retrieval-Augmented Generation) based LLM together with Ansible playbooks acting as runbooks. The system responds to alerts from Grafana or other alerting tools, either making an initial attempt at fixing the problem or suggesting playbooks to the team working on the issue.
Logs and metrics aggregated from the applications will be stored and monitored, and alerts can be configured for specific scenarios such as disk space issues, high memory load and so on. Each alert carries relevant information, such as the error or description, and is sent as a webhook to a handler API. This handler runs a RAG-based LLM backed by a knowledge database that describes the existing playbooks and their functions in detail, so it acts as a knowledge repository for incident management and resolution. The handler can then analyse the alert and either recommend a playbook or execute it directly. It can also track resolutions, so that over time it becomes a resolution knowledge base that others can query for previously resolved incidents.
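The retrieval step of this flow can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation from the talk: the playbook file names, their descriptions, the `recommend` function and the `auto_threshold` parameter are all hypothetical, the payload shape is only loosely modelled on a Grafana-style alert webhook (an `alerts` list with `labels` and `annotations`), and plain bag-of-words cosine similarity stands in for the embedding search and LLM that a real RAG pipeline would use.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Playbook:
    path: str         # hypothetical playbook file name
    description: str  # what the playbook does, as stored in the knowledge base


# Hypothetical knowledge base; a real one would be built from the team's playbooks.
KNOWLEDGE_BASE = [
    Playbook("cleanup_disk.yml", "Free disk space by rotating logs and clearing temporary files"),
    Playbook("restart_service.yml", "Restart a service with high memory usage or a memory leak"),
    Playbook("renew_certs.yml", "Renew expired TLS certificates and reload the web server"),
]


def tokenize(text):
    """Turn text into a bag of lowercase words."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two bags of words (0.0 if either is empty)."""
    dot = sum(count * b[token] for token, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def alert_text(payload):
    """Flatten the alert names and annotations of a webhook payload into one string."""
    parts = []
    for alert in payload.get("alerts", []):
        parts.append(alert.get("labels", {}).get("alertname", ""))
        parts.extend(alert.get("annotations", {}).values())
    return " ".join(parts)


def recommend(payload, auto_threshold=0.5):
    """Pick the closest playbook and decide whether to suggest or execute it."""
    query = tokenize(alert_text(payload))
    scored = [(cosine(query, tokenize(pb.description)), pb) for pb in KNOWLEDGE_BASE]
    score, best = max(scored, key=lambda pair: pair[0])
    if score == 0.0:
        # Nothing in the knowledge base matches: hand the alert to a human.
        return {"action": "escalate", "playbook": None, "score": 0.0}
    action = "execute" if score >= auto_threshold else "suggest"
    return {"action": action, "playbook": best.path, "score": round(score, 3)}


if __name__ == "__main__":
    example = {
        "alerts": [
            {
                "labels": {"alertname": "DiskSpaceLow"},
                "annotations": {"description": "Disk space on /var is below 10 percent"},
            }
        ]
    }
    print(recommend(example))
```

With the example payload, the handler suggests `cleanup_disk.yml` rather than executing it, because the match is below the default threshold. Keeping automatic execution behind a threshold is one way to let the system attempt a fix only when the match is strong and fall back to a suggestion, or to a human, otherwise. An "execute" decision could then be passed to the `ansible-playbook` command, and each decision could be logged to build up the resolution knowledge base described above.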
Advantages:
Faster resolution time and Root Cause Analysis
Automates the first steps in problem-solving, reducing mean time to resolution (MTTR).
Creation of a smart knowledge repository that is enhanced with usage.
Leverages AI in the observability sphere to simplify existing processes.
How to leverage existing and commonly used Open Source observability tools to modernise incident response
Understanding how an LLM-based approach could change incident response
A simple proof of concept using Grafana, Loki and Ansible
Infrastructure Engineers, SRE