Speak at The Fifth Elephant 2026 Annual Conference
Share your work with the community
31 Jul 2026, Fri 09:00 AM – 06:00 PM IST
heetgala
@heetgala
Submitted Jun 24, 2026
Every team building with LLMs has lived the same story. The demo works beautifully, everyone’s impressed, it ships. Then a real user asks a slightly different question and the answer comes back confidently wrong.
Someone tweaks the prompt to fix it and quietly breaks three other things that nobody notices for weeks. The honest truth is that most AI features today are shipped on vibes: we skim a few outputs, they look fine, and we hope. That works right up until your single prompt grows into a retrieval (RAG) pipeline, and that pipeline grows into an agent that plans, calls tools, and makes decisions on its own. At that point “looks fine” stops being an answer.
AI evals are how you replace hope with evidence: a repeatable way to ask “is my AI doing the right thing?” and get a number you can trust before your users find out for you. This session tells the story of why that matters, and how we built a platform to make evaluation the default, not an afterthought.
Here’s the catch with agents: the answer is only half the story. A plain prompt has one input and one output, so you grade the output and you’re done. An agent takes a journey — it interprets a goal, makes a plan, calls tools, reads the results, and decides its next move. That means it can land on a right-looking answer through a completely broken path (it got lucky, and it’ll break tomorrow), or do everything correctly and still fail at the final step. So before you can trust an agent, you have to grade the whole trajectory, not just the last reply.
In practice that boils down to a handful of plain-English questions you should be able to ask of any agentic system:
Once you can ask those questions consistently, evaluation stops being a vibe check and becomes a checklist — and that’s exactly what we set out to make easy.
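To make the checklist idea concrete, here is a minimal sketch in Python. It is not the platform's API: the `Step` and `Trajectory` shapes, the `grade_trajectory` function, and the specific checks are all hypothetical, chosen only to show how a right-looking answer can still fail once the path is graded too.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """One recorded agent action (hypothetical trace shape)."""
    tool: str
    ok: bool  # whether the tool call succeeded


@dataclass
class Trajectory:
    """The full path an agent took, plus its final reply."""
    steps: list[Step]
    final_answer: str


def grade_trajectory(traj, expected_tools, expected_answer, forbidden_tools=frozenset()):
    """Return a checklist of pass/fail results for the path and the answer."""
    used = [s.tool for s in traj.steps]
    return {
        "right_tools_in_order": used == list(expected_tools),
        "no_failed_steps": all(s.ok for s in traj.steps),
        "no_forbidden_tools": not any(t in forbidden_tools for t in used),
        "final_answer_correct": traj.final_answer.strip() == expected_answer,
    }


# A "lucky" run: the answer is right, but the agent skipped the search step.
lucky = Trajectory(steps=[Step("calculator", True)], final_answer="42")
report = grade_trajectory(lucky, ["search", "calculator"], "42")
passed = all(report.values())  # False: the path check fails despite the right answer
```

Grading only `final_answer_correct` would have passed this run. The per-check dictionary also shows which part of the journey broke, not just that something did.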
We have built an AI evals platform that helps you evaluate your AI agents. I’ll walk through it the way a team would actually adopt it. It starts with a small SDK you drop into your app: a couple of lines capture a full trace of everything the AI did (every prompt, model call, tool call, and retrieval step), so you finally get observability into the black box. The same SDK manages your prompts centrally, so you can version, update, and roll back prompts without redeploying code.
On top of that sit three kinds of evaluation. Prompt evals score a single prompt against known examples. Pipeline evals check a RAG system end to end: did it fetch the right context, and is the answer actually grounded in it? Agentic evals judge multi-step agents: did it choose the right tool, follow a sensible plan, and avoid unsafe actions?
The part people find most useful: you design your own scoring, mixing simple rules with an LLM-as-a-judge that uses your own rubric, and run it two ways: offline against a curated test set before you ship, and online as live evals that continuously score real production traffic. When something does go wrong, you can chat with the live trace to do root-cause analysis in plain English instead of spelunking through logs. Safety gates and confidence thresholds tie it together, so a change only goes live when the evidence says it’s safe.
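As an illustration of mixing simple rules with a judge and gating on the result, here is a small Python sketch. Everything in it is an assumption made for the example: the scorer names, the `[doc:` citation convention, the equal weights, and the 0.8 threshold are hypothetical, and `stub_judge` is a placeholder standing in for a real LLM-as-a-judge call, not the platform's SDK.

```python
from statistics import mean
from typing import Callable

# A scorer maps (question, answer) to a score in [0, 1].
Scorer = Callable[[str, str], float]


def rule_mentions_source(question: str, answer: str) -> float:
    """Simple rule (hypothetical): the answer must carry a '[doc:' citation marker."""
    return 1.0 if "[doc:" in answer else 0.0


def stub_judge(question: str, answer: str) -> float:
    """Placeholder for an LLM-as-a-judge; a real one would send a rubric to a model."""
    return 0.0 if "not sure" in answer.lower() else 1.0


def evaluate(cases, scorers: dict[str, Scorer], weights: dict[str, float]) -> float:
    """Weighted mean score over a test set of (question, answer) pairs."""
    total_weight = sum(weights.values())
    per_case = [
        sum(weights[name] * fn(q, a) for name, fn in scorers.items()) / total_weight
        for q, a in cases
    ]
    return mean(per_case)


def gate(score: float, threshold: float = 0.8) -> bool:
    """Safety gate: allow the change only if the score clears the threshold."""
    return score >= threshold


cases = [
    ("What is the refund window?", "30 days [doc: policy.md]"),
    ("Who approves refunds?", "I'm not sure."),
]
scorers = {"cites_source": rule_mentions_source, "judge": stub_judge}
weights = {"cites_source": 1.0, "judge": 1.0}
score = evaluate(cases, scorers, weights)
ship = gate(score)  # False here: one of the two cases fails both checks
```

The same `evaluate` call could run offline over a curated test set before a release, or online over a sample of production traces; only the source of `cases` changes.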
I am Heet Gala, a software developer in the SaaS organization at Nutanix.
LinkedIn: https://www.linkedin.com/in/heet-gala-030634191/
https://docs.google.com/presentation/d/1pP-T3K4TErpfThuqRz_q92YqzBxRwwsI8-2HRjtOS88/edit?usp=sharing
{Add the link to 2-min elevator pitch video}