Rajmohan C

Diagnosing data pipeline failures with LLM agents: from research to production, and the open challenges

Submitted Jun 19, 2026

Modern enterprise data platforms rely on complex data pipelines for data transformation and integration, with thousands running every day to move data across systems. When a pipeline run fails, the error you see is usually not the root cause. Failures surface far from where they originate, and the component that throws the error is rarely the one at fault. Diagnosing them therefore means reasoning across logs, pipeline structure, transformation logic, schemas, and other observability data to find where the failure actually originated, not where it surfaced. Most observability tooling just hands you raw logs, metrics, and piles of metadata, and leaves the reasoning to you. We built LLM agents that do this reasoning automatically: they classify the failure, trace the causal chain back to the true root cause, and propose grounded remediation. We shipped them into a production data platform (IBM Watsonx.data integration), where they now run against live pipelines.

This talk is the honest arc of that journey: how the agents work, and the decisions that mattered far more than model choice — log and pipeline preprocessing, reasoning rigor, handling knowledge gaps, hallucination control, cost optimization, and human-in-the-loop hooks. We’ll cover how we evaluated the system early on with limited data, and how we measure the real-world impact of these agents. We’ll also look at skill-based agents for cases where the agent has direct environment access, and the new challenges that opens up. And we’ll delve into the problem we’re actively tackling now: how do you know the agent’s diagnosis was actually right, at scale, when there’s no labelled data for every failure? How do you reliably improve the agent over time?

Takeaways:

  1. A generalized diagnosis-agent architecture for data pipelines: classify, trace the causal chain (symptom vs. cause), recommend.
  2. What turns a demo into a trusted DataOps system: tooling, preprocessing, reasoning rigor, hallucination control, cost optimization, and keeping a human in the loop, all of which matter more than the base model.
  3. Why evaluating diagnosis agents is hard: failure-complexity tiers, no ground-truth labels, and scoring reasoning paths rather than answers. Their evaluation remains a genuinely open problem.
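
To make the classify, trace, recommend loop from the first takeaway concrete, here is a minimal Python sketch on a toy three-stage pipeline. It is an illustration under stated assumptions, not the system described in this talk: there is no LLM in it, a keyword matcher stands in for the classification step, and every name (`Stage`, `classify`, `trace_root_cause`, `recommend`, `diagnose`, the stage names and messages) is hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Stage:
    """One pipeline stage (hypothetical structure, for illustration only)."""
    name: str
    upstream: list[str] = field(default_factory=list)
    error: Optional[str] = None    # fatal error text from this stage's logs
    anomaly: Optional[str] = None  # non-fatal signal, e.g. a schema-drift warning


def classify(error: str) -> str:
    """Toy keyword classifier standing in for an LLM classification step."""
    text = error.lower()
    if "column" in text or "schema" in text:
        return "schema_mismatch"
    if "timeout" in text or "connection" in text:
        return "connectivity"
    return "unknown"


def trace_root_cause(stages: dict[str, Stage], failed: str) -> list[str]:
    """Walk upstream from the failing stage and return the longest path
    that ends at a stage reporting an error or anomaly.
    The first element is the symptom, the last is the suspected cause."""
    best = [failed]
    stack = [[failed]]
    while stack:
        path = stack.pop()
        node = stages[path[-1]]
        if (node.error or node.anomaly) and len(path) > len(best):
            best = path
        for up in node.upstream:
            if up not in path:  # guard against cycles
                stack.append(path + [up])
    return best


def recommend(category: str, root: str) -> str:
    """Map a failure category to a templated remediation hint."""
    hints = {
        "schema_mismatch": f"Check recent schema changes at stage '{root}'.",
        "connectivity": f"Check connection settings and availability at stage '{root}'.",
    }
    return hints.get(category, f"Escalate stage '{root}' for human review.")


def diagnose(stages: dict[str, Stage], failed: str) -> dict:
    category = classify(stages[failed].error or "")
    chain = trace_root_cause(stages, failed)
    return {
        "category": category,
        "symptom": chain[0],
        "root_cause": chain[-1],
        "chain": chain,
        "recommendation": recommend(category, chain[-1]),
    }


# Toy pipeline: the error is thrown in 'load', but the drift began in 'extract'.
stages = {
    "extract": Stage("extract", anomaly="source dropped column customer_id"),
    "transform": Stage("transform", upstream=["extract"]),
    "load": Stage("load", upstream=["transform"],
                  error="Column 'customer_id' not found"),
}
result = diagnose(stages, "load")
print(result)
```

In this sketch the stage that throws the error (`load`) is reported as the symptom, while the upstream stage carrying the earliest signal (`extract`) is reported as the suspected root cause. The unknown-category fallback routes to human review, which is one simple way to picture a human-in-the-loop hook.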

Benefits to the Ecosystem:
AI agents are being incorporated into data platforms everywhere to diagnose and remediate pipeline failures, but little has been said publicly about what makes these agents trustworthy in production, or about the uncomfortable fact that almost no one can rigorously measure whether their diagnosis agent is correct across complex scenarios, let alone improve it reliably over time. This talk shares those lessons and challenges.

Bio:
Rajmohan C is a Senior Research Engineer at IBM Software Innovation Labs (formerly IBM Research), Bengaluru, with over a decade of applied-research experience in Data & AI and a consistent track record of taking innovations from research into products with real-world impact. His past work spans data lineage and provenance, data quality, and GenAI for tabular-data tasks; he currently works on Agentic DataOps, specializing in AI-driven diagnosis of data pipelines. He has several peer-reviewed papers and patents to his name, and has contributed to multiple enterprise-scale product GAs over the course of his career.

Draft slides:
To be added later.

Elevator pitch:
To be added later


Hosted by

Jumpstart better data engineering and AI futures