Fri, 19 Jun 2026, 02:00 PM – 06:00 PM IST
Joshua Thomas
@pilgrimjt
Submitted Jun 15, 2026
Most AI evaluation frameworks assume the AI is the actor being judged. In production healthcare, we inverted this: AI became the primary evaluator of healthcare providers against quality criteria defined by a US virtual health system. CogniSwitch deployed a Trust Layer that ingests doctor-patient conversations, mines them against medical ontologies and customer-defined QA criteria, and produces a 360-degree quality view across every actor in the system. The pipeline runs on live conversation data and integrates with the customer’s Snowflake and Postgres data stores through a versioned orchestration layer.
Getting this to production was a marathon with surprises along the way. Covering the full distribution of real-world cases in our data took longer than the model work. Schema violations cascaded through the pipeline in non-obvious ways. And the LLM Gateway we instrumented for token cost and behavior analysis surfaced model patterns that no offline evaluation had caught. This talk covers what worked, what broke and everything in between that got us to production.
Takeaways:
Audience:
ML and backend engineers building AI pipelines in regulated or high-stakes domains; anyone responsible for operating AI systems where outputs affect real decisions; teams wrestling with evaluation that goes beyond offline accuracy metrics.
Bio:
Hi, I’m Joshua, Co-Founder and CTO at CogniSwitch, a Trust Layer for Agents (AI or human, no discrimination) in Regulated Industries (Healthcare, Finance). Previously at Aikon Labs. A decade of engineering across software, data, ML, DL, CL and IR. Built iEngage.ai (a platform used by enterprises to power ~100 use cases and apps) and Ariv.ai (a knowledge bot using conversations in MS Teams and Slack, pre-GPT-3).