IT-Bench: A First-of-a-Kind Extensible Open-Source Framework for Benchmarking AI Agents in IT Operations
Submitted: Apr 9, 2025
Topic: SRE
Submission type: 30-minute talk
Submitted for: Rootconf Annual Conference 2025
Description
IT Operations (ITOps) underpins modern cloud-native infrastructure, ensuring the reliability, performance, and security of applications deployed across container orchestrators and distributed environments. As organizations embrace GenAI-powered ITOps—developing agentic solutions for failure detection, root cause analysis, remediation, and more—a significant challenge arises: the lack of standardized benchmarks, test suites, and leaderboards to evaluate and compare these emerging solutions.
Unlike domains such as Software Engineering and Code Generation, where robust benchmarking systems already exist, ITOps lacks a unified framework for evaluating the effectiveness of AI-powered ITOps agents. The core challenges include:
- (a) Simulating realistic, complex incident scenarios, and
- (b) Building dynamic, interactive environments where agents can detect, diagnose, and remediate issues in real time.
In this session, we will introduce ITBench, an open-source, cloud-native, and extensible benchmarking framework purpose-built to evaluate AI-driven ITOps solutions. ITBench supports diverse, real-world incident simulations on standardized applications and provides a systematic approach to assessing AI agents across domains such as SRE, CISO, and FinOps. We will share our development journey and showcase how ITBench fosters innovation in intelligent IT operations.
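To make the benchmarking loop concrete, here is a minimal Python sketch of the kind of harness such a framework implies: inject a fault into a running application, let an agent investigate, and score its diagnosis against ground truth. All names here (IncidentScenario, run_benchmark, naive_agent) are hypothetical illustrations for this proposal, not ITBench's actual API.

```python
"""Illustrative sketch only: class and function names are hypothetical
stand-ins, not ITBench's actual API. They mirror the high-level loop
described above: deploy an app with an injected fault, let an agent
investigate, then score the outcome against ground truth."""

from dataclasses import dataclass


@dataclass
class IncidentScenario:
    """A simulated incident: a fault injected into a running app stack."""
    name: str
    fault: str                 # e.g. "kill-db-pod", "saturate-cpu"
    expected_root_cause: str   # ground truth used for scoring


@dataclass
class AgentResult:
    diagnosed_root_cause: str
    remediated: bool
    seconds_to_diagnose: float


def run_benchmark(scenarios, agent):
    """Run each scenario, hand it to the agent, and compare the
    agent's diagnosis against the scenario's ground truth."""
    results = []
    for scenario in scenarios:
        result = agent(scenario)
        correct = result.diagnosed_root_cause == scenario.expected_root_cause
        results.append((scenario.name, correct, result.remediated))
    return results


# Toy agent that always blames the database, for demonstration only.
def naive_agent(scenario):
    return AgentResult("db-pod-crash", remediated=False, seconds_to_diagnose=42.0)


if __name__ == "__main__":
    scenarios = [
        IncidentScenario("checkout-outage", "kill-db-pod", "db-pod-crash"),
        IncidentScenario("latency-spike", "saturate-cpu", "cpu-throttling"),
    ]
    for name, correct, remediated in run_benchmark(scenarios, naive_agent):
        print(f"{name}: diagnosis {'correct' if correct else 'wrong'}, "
              f"remediated={remediated}")
```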
Takeaways
This session offers both a conceptual understanding of and hands-on demo experience with the evolving landscape of GenAI in ITOps. Attendees will explore the challenges of the ITOps domain and how GenAI can address them through practical demos covering:
- Incident generation
- Application & Observability stack
- Leaderboard
- ITOps AI agent benchmarking
Participants will:
- Gain insights into key challenges in automating IT operations
- Explore the role and impact of LLM-powered agents in real-world ITOps
- Experience a live demo using ITBench to benchmark GenAI-based agents
- Learn to design and contribute realistic failure scenarios for agent evaluation
- Apply structured benchmarking methodologies to assess agent performance (see the sketch after this list)
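As one illustration of such a methodology, the sketch below aggregates per-scenario outcomes into leaderboard-style numbers. The metrics shown (diagnosis accuracy, remediation rate, mean time to diagnose) are assumptions chosen for illustration, not ITBench's official scoring scheme.

```python
"""Illustrative only: metric names and data are assumptions, not
ITBench's official scoring scheme. Shows how per-scenario outcomes
might be rolled up into leaderboard-style numbers."""

from statistics import mean

# Hypothetical per-scenario outcomes:
# (diagnosis_correct, remediated, seconds_to_diagnose)
runs = [
    (True,  True,  180.0),
    (True,  False, 240.0),
    (False, False, 600.0),
]

diagnosis_rate = mean(1.0 if ok else 0.0 for ok, _, _ in runs)
remediation_rate = mean(1.0 if rem else 0.0 for _, rem, _ in runs)
# Mean time to diagnose, computed over successful diagnoses only.
mttd = mean(secs for ok, _, secs in runs if ok)

print(f"diagnosis accuracy: {diagnosis_rate:.0%}")
print(f"remediation rate:   {remediation_rate:.0%}")
print(f"MTTD (successful):  {mttd:.0f}s")
```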
Beneficial For
- SREs
- Infrastructure teams
- Cloud platform teams
- SystemOps
- DevOps
Open Source
- IT-Bench GitHub: https://github.com/IBM/ITBench
- IT-Bench Incident Scenarios: https://github.com/IBM/ITBench-Scenarios
- IT-Bench SRE Agent: https://github.com/IBM/itbench-sre-agent
Presenters
Mudit Verma
Mudit Verma is a Research Manager and Senior Research Engineer at IBM Research Lab – India. With over nine years of experience in distributed systems, cloud computing, and telecom modernization, he is a co-inventor on more than 25 patents and a co-author of multiple research papers at top-tier conferences. His current focus is observability and IT operations for large-scale cloud-native systems. Mudit holds bachelor's and master's degrees in Computer Science from BITS Pilani and KTH Royal Institute of Technology, Sweden, respectively.
Harshit Kumar
Harshit Kumar is a Senior Technical Staff Member at IBM India Research Laboratory, specializing in AIOps, Conversational AI, and Information Retrieval. He leads the development of AI-driven solutions for IT Operations and Services at IBM and has received several accolades, including IBM Outstanding Technical Achievement Awards, Research Awards, and Patent Awards. Harshit holds a Ph.D. in Computer Science and Engineering from Seoul National University.