Beyond accuracy: the evaluation challenge in AI analytics

Workshop overview

Your AI analytics system can be confidently wrong a significant portion of the time, and you have no idea because you’re measuring the wrong things. Traditional BI has clear metrics — accuracy, latency, query performance. But AI-powered analytics systems (Text2SQL, RAG-based BI, conversational analytics) are being deployed without proper evaluation frameworks, leading to hallucinations, incorrect insights, and business decisions based on wrong data.

This masterclass is for product managers, data scientists, and engineering managers who are building or planning AI analytics systems and want to move beyond surface-level metrics to build truly trustworthy systems. Through a real production failure case study, interactive framework building, and hands-on exercises, you’ll learn a practical evaluation framework that you can begin applying to your systems.

Part 1: the evaluation challenge (15 minutes)

Module 1: when “good” metrics hide bad systems

  • Live demonstration: A Text2SQL system that looks good on the surface :)
  • Interactive exercise: Spot the failure in a “successful” AI analytics output (a hypothetical example of this kind of output follows this list).
  • Case study introduction: Production RAG-based analytics chatbot failure.
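
For a flavour of what Module 1 examines, here is a hypothetical example (not the workshop’s actual demo or case study) of Text2SQL output that passes every surface check yet quietly answers the wrong question; the table and column names are invented.

```python
# Hypothetical Text2SQL exchange; schema and values are invented for illustration.
question = "What was our shipped revenue per region last quarter?"

generated_sql = """
SELECT region, SUM(order_total) AS revenue
FROM orders
WHERE order_date BETWEEN DATE '2025-07-01' AND DATE '2025-09-30'
GROUP BY region
"""

# The SQL parses, runs, and returns plausible numbers, so accuracy-style
# metrics look fine. But it sums ordered revenue by order_date instead of
# shipped revenue by ship_date, so every downstream decision rests on the
# wrong figure. Gaps like this are what the rest of the framework targets.
```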

Part 2: the 4-layer evaluation framework (35 minutes)

Module 2: understanding the evaluation stack

Layer 1: Syntactic evaluation

  • Query correctness, format validity, schema compliance.
  • Standard metrics: Parse success rate, SQL validity.
  • What it catches: Obvious formatting and syntax errors (a parsing-check sketch follows this list).
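
As an illustration of Layer 1, here is a minimal sketch assuming the open-source sqlglot parser and an illustrative allow-list of tables; a real schema-compliance check would also cover columns and the target SQL dialect.

```python
# Minimal Layer 1 sketch: does the SQL parse, and does it only touch known tables?
# Assumes `pip install sqlglot`; the schema allow-list is illustrative.
import sqlglot
from sqlglot import exp
from sqlglot.errors import ParseError

ALLOWED_TABLES = {"orders", "customers", "products"}

def syntactic_eval(sql: str) -> dict:
    report = {"parses": False, "unknown_tables": []}
    try:
        tree = sqlglot.parse_one(sql)  # raises ParseError on invalid SQL
    except ParseError:
        return report
    report["parses"] = True
    referenced = {table.name for table in tree.find_all(exp.Table)}
    report["unknown_tables"] = sorted(referenced - ALLOWED_TABLES)
    return report

print(syntactic_eval("SELECT region, SUM(order_total) FROM orders GROUP BY region"))
# -> {'parses': True, 'unknown_tables': []}
print(syntactic_eval("SELECT foo( FROM orders"))
# -> {'parses': False, 'unknown_tables': []}
```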

Layer 2: Semantic evaluation

  • Faithfulness to data, hallucination detection.
  • Semantic entropy as an uncertainty signal.
  • Context alignment and retrieval quality in RAG systems.
  • What it catches: Correct-looking but semantically wrong outputs (a semantic-entropy sketch follows this list).
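
To make the semantic-entropy bullet concrete, here is a minimal sketch, assuming you can sample answers from your model and judge whether two answers mean the same thing; `generate_answer` and `same_meaning` are hypothetical placeholders (in practice an LLM call and an entailment or judge model), not a specific library API.

```python
# Minimal semantic-entropy sketch: sample answers, cluster them by meaning,
# and measure entropy over the clusters. High entropy = the model is unsure,
# which correlates with hallucination risk.
import math
from typing import Callable, List

def semantic_entropy(
    question: str,
    generate_answer: Callable[[str], str],     # hypothetical: one sampled LLM answer
    same_meaning: Callable[[str, str], bool],  # hypothetical: NLI / judge equivalence check
    n_samples: int = 8,
) -> float:
    answers = [generate_answer(question) for _ in range(n_samples)]

    # Greedy clustering: an answer joins the first cluster it is equivalent to.
    clusters: List[List[str]] = []
    for answer in answers:
        for cluster in clusters:
            if same_meaning(answer, cluster[0]):
                cluster.append(answer)
                break
        else:
            clusters.append([answer])

    # Shannon entropy over the empirical distribution of meaning clusters.
    probs = [len(cluster) / n_samples for cluster in clusters]
    return -sum(p * math.log(p) for p in probs)
```

Zero entropy means every sample collapsed to one meaning; entropy near log(n_samples) means almost every sample said something different, which is a useful signal to withhold or caveat the answer.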

Layer 3: Business logic evaluation

  • Domain rule conformance, edge case handling.
  • Business-specific constraints and validations.
  • What it catches: Technically correct but business-invalid results (a rule-check sketch follows this list).
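
Layer 3 checks can often be expressed as a plain list of domain rules run against the returned rows; the rules and data below are illustrative, not from the workshop.

```python
# Minimal Layer 3 sketch: domain rules applied to query results.
# The rules and rows are illustrative; real rules come from the business.
from typing import Any, Callable, Dict, List, Tuple

Rule = Tuple[str, Callable[[Dict[str, Any]], bool]]

BUSINESS_RULES: List[Rule] = [
    ("revenue is never negative",      lambda row: row.get("revenue", 0) >= 0),
    ("discount stays within policy",   lambda row: 0 <= row.get("discount_pct", 0) <= 40),
    ("region is a known sales region", lambda row: row.get("region") in {"NA", "EMEA", "APAC"}),
]

def business_logic_eval(rows: List[Dict[str, Any]]) -> List[str]:
    """Return the names of rules violated by any row."""
    violations = []
    for name, check in BUSINESS_RULES:
        if not all(check(row) for row in rows):
            violations.append(name)
    return violations

rows = [{"region": "EMEA", "revenue": -1200.0, "discount_pct": 12}]
print(business_logic_eval(rows))  # ['revenue is never negative']
```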

Layer 4: Human alignment evaluation

  • Intent matching, appropriate uncertainty expression.
  • User satisfaction and trust metrics.
  • What it catches: Correct results that don’t answer the user’s real question (an LLM-as-judge sketch follows this list).
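
Layer 4 is the hardest to automate. One common pattern, sketched below under the assumption that you have some `call_llm` client (a hypothetical placeholder, not a specific API), is an LLM-as-judge rubric for intent match and uncertainty expression, calibrated against real user feedback.

```python
import json
from typing import Callable

JUDGE_PROMPT = """You are reviewing an AI analytics answer.
Question: {question}
Answer: {answer}

Respond in JSON with:
  "answers_the_question": true/false  (does it address the user's actual intent?)
  "uncertainty_expressed": true/false (does it state caveats where warranted?)
"""

def human_alignment_eval(
    question: str,
    answer: str,
    call_llm: Callable[[str], str],  # hypothetical: prompt in, completion out
) -> dict:
    raw = call_llm(JUDGE_PROMPT.format(question=question, answer=answer))
    return json.loads(raw)  # in practice, guard against malformed JSON
```

Judge verdicts should be checked against real user signals (thumbs up/down, follow-up question rate) rather than trusted on their own.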

Part 3: framework application & design (30 minutes)

Module 3: hands-on evaluation design

Participants work in groups on real-world scenarios:

  • Scenario A: Text2SQL for sales analytics dashboard
  • Scenario B: RAG-powered financial reporting chatbot
  • Scenario C: Conversational BI for product metrics exploration

Whiteboard exercise:

  • Select one evaluation layer and design an evaluation approach for it.
  • Define specific metrics for your chosen layer.
  • Identify what “failure” looks like at this layer.
  • Design detection mechanisms that catch failures before production.
  • Present your evaluation strategy to the larger group.

Part 4: challenges & Q&A (10 minutes)

Module 4: challenges in taking it to production

  • Pragmatic implementation: start with Layers 1 & 2, then iterate toward 3 & 4 (see the pipeline sketch after this list).
  • Cost-benefit analysis: evaluation at all layers requires engineering time, compute resources, and adds latency—understand the trade-offs.
  • When to use which layer: match evaluation complexity to business risk.
  • Common pitfalls and how to avoid them.
  • Action item: identify one missing evaluation layer in your current system.
  • Resources: semantic entropy papers, evaluation framework templates.
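
Tying this back to the “start with Layers 1 & 2” point, a minimal wiring sketch might look like the following. The layer checks are passed in as callables (for example the illustrative sketches above), and the entropy threshold is an assumption to be tuned per system.

```python
# Illustrative pipeline wiring: run the cheap layers first, add the rest later.
from typing import Callable

def evaluate_response(
    sql: str,
    question: str,
    layer1: Callable[[str], dict],    # e.g. a syntactic check like syntactic_eval above
    layer2: Callable[[str], float],   # e.g. semantic entropy for this question
    entropy_threshold: float = 1.0,   # illustrative threshold, tune per system
) -> dict:
    """Run Layers 1 and 2 on every response; extend with Layers 3 and 4 later."""
    report: dict = {"verdict": "serve"}

    report["layer1"] = layer1(sql)
    if not report["layer1"].get("parses", False):
        report["verdict"] = "block"
        return report

    report["layer2_entropy"] = layer2(question)
    if report["layer2_entropy"] > entropy_threshold:
        report["verdict"] = "flag_for_review"

    # Next iterations: append business-logic (Layer 3) and human-alignment
    # (Layer 4) checks to this same report before the final verdict.
    return report
```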

Key takeaways

  1. Traditional ML metrics (accuracy, F1) are necessary but insufficient for AI analytics systems — they miss semantic failures that impact business decisions.
  2. The 4-layer evaluation framework provides a systematic approach: Syntactic → Semantic → Business Logic → Human Alignment, with each layer catching different failure modes.
  3. Evaluation must be designed before deployment, not after incidents: reactive evaluation is expensive; proactive frameworks prevent business impact.

Prerequisites

  • Basic understanding of analytics/BI systems and AI/LLM applications.
  • Familiarity with production system challenges (useful but not mandatory).
  • Bring: a laptop with Python set up to make LLM calls, and a real or hypothetical AI analytics use case you’re working on.

About the instructor

Karrtik Iyer is a Principal AI Researcher at TAILS specializing in LLM interpretability and evaluation frameworks. He heads the Data Science and AI Community at Thoughtworks India.
His research focuses on trustworthy AI for safety-critical applications. Recent publications include “GNN-RAG: Bridging Graph Reasoning and Language Understanding” (NODES 2024) and “Towards Transparent AI Grading: Semantic Entropy as a Signal for Human-AI Disagreement.” He has built production Text2SQL and RAG systems for enterprise clients, serves as a technical reviewer for “Graph Neural Networks in Action,” and was featured on the cover of Open Source Magazine for his work on ethical AI.
Technical expertise: LLM Interpretability, Advanced RAG Systems, Semantic Entropy for Evaluation, Agentic AI Frameworks, Knowledge Graphs, Trustworthy AI in Healthcare and Legal domains.

About the sponsor

This masterclass is sponsored by Thoughtworks India. Thoughtworks is a global technology consultancy that integrates strategy, design and engineering to drive digital innovation.

How to attend this workshop

This workshop is open to The Fifth Elephant annual members and to The Fifth Elephant 2025 Winter Edition ticket buyers.

This workshop is limited to 40 participants. Seats are available on a first-come, first-served basis. 🎟️

Contact information ☎️

For inquiries about the workshop, call +91-7676332020 or write to info@hasgeek.com.
