heetgala

@heetgala

Stop Shipping AI on Vibes: AI Evals for Agents

Submitted Jun 24, 2026

About this session

Every team building with LLMs has lived the same story. The demo works beautifully, everyone’s impressed, it ships. Then a real user asks a slightly different question and the answer comes back confidently wrong.

Someone tweaks the prompt to fix it and quietly breaks three other things that nobody notices for weeks. The honest truth is that most AI features today are shipped on vibes: we skim a few outputs, they look fine, and we hope. That works right up until your single prompt grows into a retrieval (RAG) pipeline, and that pipeline grows into an agent that plans, calls tools, and makes decisions on its own. At that point “looks fine” stops being an answer.

AI evals are how you replace hope with evidence: a repeatable way to ask “is my AI doing the right thing?” and get a number you can trust before your users find out for you. This session tells the story of why that matters, and how we built a platform to make evaluation the default, not an afterthought.

First — what should an agentic eval actually check?

Here’s the catch with agents: the answer is only half the story. A plain prompt has one input and one output, so you grade the output and you’re done. An agent takes a journey — it interprets a goal, makes a plan, calls tools, reads the results, and decides its next move. That means it can land on a right-looking answer through a completely broken path (it got lucky, and it’ll break tomorrow), or do everything correctly and still fail at the final step. So before you can trust an agent, you have to grade the whole trajectory, not just the last reply.

In practice that boils down to a handful of plain-English questions you should be able to ask of any agentic system:

  • Did it understand the goal? - or confidently solve the wrong problem.
  • Was the plan sensible? - reasonable steps, no aimless looping or backtracking.
  • Did it call the right tools, the right way? - correct tool, correct arguments, nothing skipped or invented.
  • Is each step actually correct? - not just the final answer, but the reasoning and intermediate results along the way.
  • Is the answer grounded? - backed by what it really retrieved, not made up.
  • Did it stay safe? - no destructive actions, policy violations, or leaked data.
  • Did it finish the job? - and at a reasonable cost, latency, and number of steps.

Once you can ask those questions consistently, evaluation stops being a vibe check and becomes a checklist — and that’s exactly what we set out to make easy.
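As a rough illustration of how some of the questions above can become repeatable checks, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `Step` and `Trajectory` shapes, the `check_trajectory` helper, and the tool names are hypothetical and are not the platform's SDK. It covers only the rule-checkable questions (tool use, looping, safety, step budget) plus a deliberately crude grounding proxy. Questions like "did it understand the goal?" usually need a rubric-based judge instead.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One tool call in an agent run (hypothetical shape, not a real SDK type)."""
    tool: str
    args: dict
    output: str


@dataclass
class Trajectory:
    """A full agent run: the goal, every step taken, and the final reply."""
    goal: str
    steps: list[Step]
    final_answer: str
    retrieved: list[str] = field(default_factory=list)


def check_trajectory(traj, allowed_tools, destructive_tools, max_steps):
    """Return a dict of check name -> bool for the rule-checkable questions."""
    tools_used = [s.tool for s in traj.steps]

    # Looping: the same tool called with identical arguments more than once.
    seen = set()
    looped = False
    for s in traj.steps:
        key = (s.tool, repr(sorted(s.args.items())))
        if key in seen:
            looped = True
        seen.add(key)

    answer = traj.final_answer.lower()
    return {
        # Right tools: nothing outside the allowed set was invented or called.
        "valid_tools": all(t in allowed_tools for t in tools_used),
        # Sensible plan: no repeated identical calls.
        "no_loops": not looped,
        # Safety: no destructive tool was invoked.
        "safe": not any(t in destructive_tools for t in tools_used),
        # Finished at a reasonable cost: step count within budget.
        "within_budget": len(traj.steps) <= max_steps,
        # Crude grounding proxy: some retrieved snippet appears in the answer.
        "grounded": any(snippet.lower() in answer for snippet in traj.retrieved),
    }


traj = Trajectory(
    goal="What is the refund window?",
    steps=[Step("search_docs", {"query": "refund window"},
                "Refunds are accepted within 30 days.")],
    final_answer="Refunds are accepted within 30 days.",
    retrieved=["within 30 days"],
)
report = check_trajectory(
    traj,
    allowed_tools={"search_docs"},
    destructive_tools={"delete_record"},
    max_steps=5,
)
print(report)
```

Returning one boolean per question, rather than a single pass/fail, keeps it visible which part of the journey broke even when the final answer looks right.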

We have built an AI evals platform that helps you evaluate your AI agents, and I’ll walk through it the way a team would actually adopt it. It starts with a small SDK you drop into your app. A couple of lines capture a full trace of everything the AI did (every prompt, model call, tool call, and retrieval step), so you finally get observability into the black box. The same SDK manages your prompts centrally, so you can version, update, and roll back prompts without redeploying code.

On top of that sit three kinds of evaluation. Prompt evals score a single prompt against known examples. Pipeline evals check a RAG system end to end: did it fetch the right context, and is the answer actually grounded in it? Agentic evals judge multi-step agents: did it choose the right tool, follow a sensible plan, and avoid unsafe actions?

The part people find most useful is that you design your own scoring, mixing simple rules with an LLM-as-a-judge that uses your own rubric, and run it two ways: offline against a curated test set before you ship, and online as live evals that continuously score real production traffic. When something does go wrong, you can chat with the live trace to do root-cause analysis in plain English instead of spelunking through logs. Safety gates and confidence thresholds tie it together, so a change only goes live when the evidence says it’s safe.
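To make the idea of custom scoring and gating more concrete, here is a small Python sketch that mixes a deterministic rule with a judge score and applies a release threshold. It is not the platform's API: `rule_score`, `combined_score`, `gate`, the equal weighting, and the 0.8 threshold are all assumptions chosen for illustration. The judge is passed in as a plain callable. In practice it would wrap a model call with your rubric, and here it is stubbed so the example runs on its own.

```python
from statistics import mean
from typing import Callable


def rule_score(output: str, required_terms: list[str]) -> float:
    """Deterministic check: fraction of required terms present in the output."""
    if not required_terms:
        return 1.0
    text = output.lower()
    hits = sum(term.lower() in text for term in required_terms)
    return hits / len(required_terms)


def combined_score(
    output: str,
    required_terms: list[str],
    judge: Callable[[str], float],
    rule_weight: float = 0.5,
) -> float:
    """Weighted mix of the rule score and a judge score clamped to [0, 1]."""
    judge_score = min(1.0, max(0.0, judge(output)))
    return rule_weight * rule_score(output, required_terms) + (1 - rule_weight) * judge_score


def gate(scores: list[float], threshold: float = 0.8) -> bool:
    """Allow a change only if the mean score over the test set meets the threshold."""
    return bool(scores) and mean(scores) >= threshold


def stub_judge(output: str) -> float:
    """Stand-in for an LLM judge (assumption): 1.0 if the answer states the window."""
    return 1.0 if "30 days" in output else 0.0


required = ["refund", "30 days"]
good = combined_score("Refunds are accepted within 30 days.", required, stub_judge)
bad = combined_score("I don't know.", required, stub_judge)
print(good, bad, gate([good, bad]))
```

The same scoring function can be run offline over a curated test set before a release, or online over sampled production traces. Only the source of the outputs changes.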

Key takeaways

  1. A mental model for evaluating any AI or agentic system. Separate the three layers (prompt → pipeline → agent), decide what “good” means at each, and combine deterministic checks with LLM-as-a-judge scoring. You’ll also learn why offline evals (before shipping) and online live evals (on real traffic) answer different questions, and why you need both.
  2. A concrete blueprint for production confidence. How tracing, prompt management, custom scoring, safety gates, and root-cause analysis fit into one workflow, so you can go from “it looked fine in the demo” to shipping AI changes you can actually defend with data.

Who should attend

  • AI/ML engineers and data scientists building LLM, RAG, or agentic features who want to measure quality, not guess at it.
  • Platform / backend engineers responsible for getting AI into production and keeping it reliable.
  • Engineering leads, PMs, and QA who need to trust AI quality and de-risk every release.
  • Anyone exploring agentic workflows who wants a clear vocabulary for “how do we know it works?”

Bio

I am Heet Gala, a software developer in the SaaS organization at Nutanix.
LinkedIn: https://www.linkedin.com/in/heet-gala-030634191/

Draft slides

https://docs.google.com/presentation/d/1pP-T3K4TErpfThuqRz_q92YqzBxRwwsI8-2HRjtOS88/edit?usp=sharing

Hosted by

Jumpstart better data engineering and AI futures