The Fifth Elephant 2024 Annual Conference (12th & 13th July)
Maximising the Potential of Data — Discussions around data science, machine learning & AI
13 Jul 2024 (Sat), 09:00 AM – 06:05 PM IST
Atri Mandal
The ever-increasing volume of alerts generated by monitoring tools poses a significant challenge for IT Operations teams. A substantial portion of these alerts are duplicates or false positives, overwhelming ITOps practitioners and hindering the timely identification of critical issues. Traditional methods for managing alert floods, such as manual filtering, are ineffective, prone to human error, and often become stale as the alert rules and configurations evolve. Conversely, deep learning-based approaches offer high accuracy but come with substantial computational and infrastructure costs.
This paper presents a novel, lightweight, and cost-effective solution for reducing alert noise in IT Operations. Our approach leverages a combination of log clustering algorithms and semantic similarity techniques that incorporate online learning for adaptability. Alerts are often triggered by tools monitoring interconnected systems, so they carry inherent patterns, which we exploit to group similar alerts into clusters. Our experimental results demonstrate the effectiveness of the proposed approach across three key metrics: a high matching rate (upwards of 90%) for incoming alerts, a significant reduction in alert noise (over 99%) for matched alerts, and a high throughput exceeding 1000 alerts per second. To address potential storage concerns, we propose a single-signature representation for each cluster, which prevents exponential growth in storage costs.
This research offers a practical and efficient solution for IT Operations teams to tackle alert fatigue, improve their ability to identify critical issues, and ultimately enhance overall system health.
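To make the idea of grouping alerts under a single signature per cluster more concrete, here is a minimal Python sketch. It is not the implementation described in the talk: the digit-based masking rule, the position-wise token similarity (used here as a simple stand-in for the log clustering and semantic similarity techniques mentioned above), the 0.6 threshold, and the names `tokenize`, `Cluster`, and `AlertGrouper` are all assumptions made for illustration.

```python
from __future__ import annotations

import re
from dataclasses import dataclass

WILDCARD = "<*>"
# Assumed masking rule: any token containing a digit is treated as variable.
_VARIABLE = re.compile(r"\d")


def tokenize(message: str) -> list[str]:
    """Split on whitespace and mask tokens that look variable."""
    return [WILDCARD if _VARIABLE.search(tok) else tok for tok in message.split()]


@dataclass
class Cluster:
    signature: list[str]  # the only representation stored for the cluster
    count: int = 1        # number of alerts absorbed so far


def similarity(signature: list[str], tokens: list[str]) -> float:
    """Fraction of positions that agree; a wildcard in the signature matches anything."""
    if not tokens or len(signature) != len(tokens):
        return 0.0
    same = sum(1 for s, t in zip(signature, tokens) if s == WILDCARD or s == t)
    return same / len(tokens)


class AlertGrouper:
    """Hypothetical online grouper: one pass, one signature per cluster."""

    def __init__(self, threshold: float = 0.6):
        self.threshold = threshold  # assumed value, not taken from the talk
        self.clusters: list[Cluster] = []

    def add(self, message: str) -> int:
        """Assign an alert to a cluster and return that cluster's index."""
        tokens = tokenize(message)
        best_idx, best_score = -1, 0.0
        for idx, cluster in enumerate(self.clusters):
            score = similarity(cluster.signature, tokens)
            if score > best_score:
                best_idx, best_score = idx, score
        if best_idx >= 0 and best_score >= self.threshold:
            cluster = self.clusters[best_idx]
            # Online update: positions that disagree are generalised to wildcards,
            # so the cluster never stores more than one signature.
            cluster.signature = [
                s if s == t else WILDCARD for s, t in zip(cluster.signature, tokens)
            ]
            cluster.count += 1
            return best_idx
        self.clusters.append(Cluster(signature=tokens))
        return len(self.clusters) - 1
```

In this sketch, feeding "Service auth restarted on node alpha" and then "Service billing restarted on node beta" to the same `AlertGrouper` places both alerts in one cluster, and the stored signature generalises to `Service <*> restarted on node <*>`. Because each cluster keeps only that one signature and a counter, storage grows with the number of distinct alert patterns, not with the number of alerts.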
What is the problem?
Why do we need Alert Grouping?
Leveraging Log Clustering and Semantic Similarity
System Architecture
Experimental Evaluation and Results
Conclusion and Future Work
References
Who is the audience for your session?
The target audience for the session is:
What problem/pain are you trying to solve (for the audience)?
The session focuses on addressing the critical challenge of alert flooding in IT Operations. The high volume of operational alerts, often with duplicates and false positives, causes alert fatigue and makes timely identification of critical issues difficult. This results in elevated business costs because of resource wastage and potential downtime from delayed resolution.
How will participants benefit from your session? What are the practical and specific ways in which they will be able to apply the knowledge they gain, and beyond just general awareness.