AI evals workshop

Submitted Jun 10, 2026

Overview

  1. Why agents make mistakes: the three gulfs (Comprehension, Specification, and Generalization) (10 min)
  2. Challenges of evaluating agent responses, and why this differs from standard software testing or ML system testing (10 min)
  3. Component-wise evaluation of agents: what is the equivalent of module-level testing for agents? (30 min)
  4. How to generate synthetic data to evaluate your agents: hands-on activity (20 min)
  5. How to come up with metrics to evaluate an agent that automatically generates LinkedIn posts: error analysis, hands-on group activity (50 min)
  6. How to deal with subjectivity among reviewers (15 min)
  7. LLM as a judge to evaluate agents at scale (30 min)
  8. Wrap-up (15 min)

Bio

Abhijith Neerkaje is a co-founder of Beyond Vectors - https://www.linkedin.com/in/abhijithneerkaje/

Slides

Incoming

Video

Incoming


Hosted by

Jumpstart better data engineering and AI futures