rajesh

@rajeshbalamohan_blr

Sachin Chaurasiya

@sachinchaurasiya

Debugging Agents in Production

Submitted Jun 1, 2026

Distributed systems are nearly impossible to debug without custom-built observability and tracing tools. Multi-agent systems are no different.

A single user request can trigger many concurrent actions across multiple agents, and it quickly becomes hard to tell where in the workflow things went wrong.

From the customer’s perspective, the most important thing you can optimize for is how quickly you can find an issue and fix it, particularly when failures cannot be reproduced consistently.

At Isotopes AI, we invested heavily in a production-quality observability platform with three main goals:

  • you should not need a shell window to debug
  • the tooling itself should point out problems
  • there should be a neat way to aggregate across multiple instances

Specifically, we needed to follow a single session across multiple machines, track one agent’s behaviour across many sessions, and look into a single worker for every session it was running at a given moment.
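The three views described above can be sketched as three queries over one shared event record. This is a minimal illustration, not the platform's actual schema: the `Event` fields, the agent and worker names, and the timestamps are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    # Hypothetical record: one unit of agent work on one worker.
    session_id: str
    agent: str
    worker: str
    start: float  # seconds, arbitrary clock
    end: float


def by_session(events, session_id):
    """One session across all machines, in start-time order."""
    matching = (e for e in events if e.session_id == session_id)
    return sorted(matching, key=lambda e: e.start)


def by_agent(events, agent):
    """One agent's events, grouped by the session they belong to."""
    grouped = defaultdict(list)
    for e in events:
        if e.agent == agent:
            grouped[e.session_id].append(e)
    return dict(grouped)


def sessions_on_worker(events, worker, at):
    """Every session a worker was running at a given moment."""
    return {
        e.session_id
        for e in events
        if e.worker == worker and e.start <= at < e.end
    }


# Made-up sample data for illustration only.
EVENTS = [
    Event("s1", "planner", "w1", 0.0, 2.0),
    Event("s1", "sql", "w2", 2.0, 5.0),
    Event("s2", "planner", "w1", 1.0, 3.0),
    Event("s2", "sql", "w2", 3.0, 4.0),
]
```

Because all three queries read the same records, the session, agent, and worker views cannot drift apart. A real deployment would presumably back them with an indexed store rather than a list scan.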

This talk is a practitioner’s account of the observability tooling we built to meet that need, and of the architectural choices that let us retain the replayability of events.

At its core is an event-sourced view of a session: every exchange behind an answer is captured, so the entire session can be traced and replayed after the fact. Using real production traces, we will look at three things that help us fix customer issues once a system is live:

  • Making the context window work for you — every model carries its own finite context window, and the interesting question isn’t “are we close to the limit” but how that window is being spent: instructions, tool definitions, prior conversation, injected schema and sample data, retrieved memory. We will break down that composition and show how to keep the window lean — trimming what an agent doesn’t need — without the system losing the context it actually depends on.

  • Seeing problems that live between agents — the hardest failures aren’t inside any one agent; they emerge in the hand-offs: a request that quietly retries, a step that re-enters itself, an issue that only shows up when you look across every agent at once. We will show how we surface these cross-agent problems in a single view, and what they reveal about a system under production load.

  • Spending model time wisely — when a request feels slow, the model itself is usually not the whole story. A surprising amount of wall-clock time hides in places you wouldn’t first look, and once you can see where it goes, the fix is often small and obvious. We will walk through one such case that changed how we think about latency in agent systems.
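To make the first point concrete, here is a minimal sketch of a context-window composition breakdown. The category names follow the list above, but the token counts are invented for illustration and the function is hypothetical, not the tooling shown in the talk.

```python
def composition(parts):
    """Return (name, tokens, share_of_total) tuples, largest first.

    `parts` maps a context category to the tokens it occupies.
    """
    total = sum(parts.values())
    if total == 0:
        return []
    ranked = sorted(parts.items(), key=lambda kv: kv[1], reverse=True)
    return [(name, tokens, tokens / total) for name, tokens in ranked]


# Made-up token counts for one hypothetical request.
CONTEXT = {
    "instructions": 1200,
    "tool_definitions": 3000,
    "conversation": 2500,
    "schema_and_samples": 2800,
    "retrieved_memory": 500,
}

for name, tokens, share in composition(CONTEXT):
    print(f"{name:20s} {tokens:6d} {share:6.1%}")
```

Ranking by share rather than checking the total against the limit is what turns "are we close to the limit" into "what is the window being spent on".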

The key takeaway: how to follow a single user question through every agent it touched, and which signals actually tell you whether the system is healthy in production.
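As a rough illustration of that takeaway, the sketch below replays an append-only event log to rebuild the path one question took, and flags hand-offs that happened more than once. The log format, field names, and agent names are assumptions made for this example.

```python
from collections import Counter

# Hypothetical append-only log of hand-offs; `seq` is a global order.
LOG = [
    {"seq": 1, "session": "s1", "src": "user", "dst": "planner"},
    {"seq": 2, "session": "s1", "src": "planner", "dst": "sql"},
    {"seq": 3, "session": "s2", "src": "user", "dst": "planner"},
    {"seq": 4, "session": "s1", "src": "planner", "dst": "sql"},
    {"seq": 5, "session": "s1", "src": "sql", "dst": "planner"},
]


def replay(log, session):
    """Rebuild the ordered path of agents one session passed through."""
    events = sorted(
        (e for e in log if e["session"] == session),
        key=lambda e: e["seq"],
    )
    path = []
    for e in events:
        if not path:
            path.append(e["src"])  # the origin of the first hand-off
        path.append(e["dst"])
    return path


def repeated_handoffs(log, session):
    """Hand-offs seen more than once: a hint of a quiet retry or loop."""
    counts = Counter(
        (e["src"], e["dst"]) for e in log if e["session"] == session
    )
    return {pair: n for pair, n in counts.items() if n > 1}
```

Here `replay(LOG, "s1")` walks the session in order, and `repeated_handoffs(LOG, "s1")` reports the planner-to-sql hand-off that occurred twice, the kind of signal that is invisible when each agent is inspected on its own.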

Target audience: Engineers who build, operate, or debug LLM agent systems in production

BIO
Rajesh Balamohan is an engineer at Isotopes AI. Before that he worked at Salesforce and Waii, and at companies such as Cloudera/Hortonworks on big data performance tuning.

Sachin Chaurasiya is an engineer at Isotopes AI. Before that he worked at Collate, where he contributed heavily to OpenMetadata.

Hosted by

Jumpstart better data engineering and AI futures