Fri, 31 Jul 2026, 09:00 AM – 06:00 PM IST

Building AI on Broken Data: A DataOps Playbook from Processing Millions of Corrupted Data Points

Submitted Jun 21, 2026

I am submitting for: Track 1 - Data engineering & infrastructure
Type of session: 30 mins talk

Everyone wants AI. Everyone talks about models. But in many organizations, the real battle is neither model selection nor prompt engineering; it is data quality. At Vahan, a large blue-collar recruitment marketplace, we process millions of data points daily from our vendors & external partners.

We discovered that 10–40% of incoming data was routinely corrupted before it even reached our systems. The culprit was not faulty databases or broken pipelines, but seemingly harmless operational processes involving spreadsheets, CSV exports, Google Sheets, locale settings, and manual copy-paste workflows. These processes silently introduced date corruption, data fluctuations, retractions, pullbacks, duplication, and inconsistencies that impacted analytics, forecasting, incentive calculations, and downstream AI systems.
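
To make the locale-driven date corruption concrete, here is a minimal Python sketch, not the implementation discussed in the talk. It assumes dates arrive as slash-separated strings such as 03/04/2026; the function names `interpretations` and `classify` are hypothetical. A value is flagged as ambiguous when the day-first and month-first readings are both valid calendar dates and disagree, which is the case a spreadsheet locale can silently flip.

```python
from datetime import date


def interpretations(text: str) -> set[date]:
    """Return every calendar date that an 'a/b/yyyy' string could mean."""
    a, b, year = (int(part) for part in text.split("/"))
    candidates = set()
    # Try the day-first reading (a=day, b=month), then the month-first reading.
    for day, month in ((a, b), (b, a)):
        try:
            candidates.add(date(year, month, day))
        except ValueError:
            pass  # not a real calendar date under this reading
    return candidates


def classify(text: str) -> str:
    """Label a date string as 'invalid', 'ambiguous', or 'unambiguous'."""
    found = interpretations(text)
    if not found:
        return "invalid"
    return "ambiguous" if len(found) > 1 else "unambiguous"
```

Under these assumptions, `classify("03/04/2026")` returns "ambiguous" (3 April or 4 March), while `classify("25/04/2026")` returns "unambiguous" because only the day-first reading is a valid date.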

This talk is a production engineering war story on how we built a DataOps platform to detect, prevent, correct, and monitor data quality issues at scale. I will share the surprising failure modes we uncovered, the architecture we built around validation engines, business-critical rules, automated correction workflows, and human-in-the-loop operations, and the metrics we used to continuously improve quality.

Attendees will learn how we transformed data quality from roughly 70% to 99%+, creating a trusted foundation for analytics, forecasting, incentives, and AI systems. More importantly, they will leave with a practical, battle-tested DataOps playbook that can be applied to any organization consuming large volumes of operational or third-party data.

The session covers:

Real-world data corruption patterns rarely discussed in data engineering literature
Why spreadsheets become a hidden source of data quality failures
Designing validation frameworks using business-critical rules
Building automated correction pipelines and exception workflows
Human-in-the-loop data quality operations
Data quality metrics that matter in production
Operational playbooks for organizations consuming third-party data
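
As a rough illustration of the validation-framework and exception-workflow items above, the following Python sketch expresses business rules as named checks and routes failing records to an exception queue for human review. It is a sketch under stated assumptions, not the architecture presented in the talk: the record fields `worker_id` and `orders`, the two rules, and the helper names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

Record = dict  # one incoming row, keyed by column name (assumed shape)


@dataclass(frozen=True)
class Rule:
    name: str
    check: Callable[[Record], bool]  # returns True when the record passes


# Hypothetical business-critical rules, for illustration only.
RULES = [
    Rule("has_worker_id", lambda r: bool(r.get("worker_id"))),
    Rule(
        "non_negative_orders",
        lambda r: isinstance(r.get("orders"), int) and r["orders"] >= 0,
    ),
]


def failed_rules(record: Record, rules=RULES) -> list[str]:
    """Return the names of every rule the record violates."""
    return [rule.name for rule in rules if not rule.check(record)]


def route(records):
    """Split records into clean rows and (record, failures) exceptions."""
    clean, exceptions = [], []
    for record in records:
        failures = failed_rules(record)
        if failures:
            exceptions.append((record, failures))
        else:
            clean.append(record)
    return clean, exceptions
```

Keeping each rule as a named object means an exception carries the list of rules it broke, which gives a reviewer or a correction step something specific to act on.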

Key Takeaways

Why 10–40% of production data can be wrong even when upstream systems are correct.
Real-world data corruption patterns caused by spreadsheets, CSVs, and manual workflows.
The four hidden failure modes that break analytics and AI: date corruption, fluctuations, retractions, and pullbacks.
A practical DataOps architecture involving validation engines, business-critical rules, correction workflows, and human-in-the-loop operations.
How to measure, operationalize, and continuously improve data quality using engineering metrics and processes.
Why investing in DataOps often delivers higher ROI than investing in better AI models.
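
One simple way to put a number on data quality is a per-source pass rate: the fraction of records from each vendor or partner that pass validation. The Python sketch below assumes validation results are available as (source, passed) pairs; this input shape and the function name `pass_rate_by_source` are illustrative assumptions, and the talk may define its quality metrics differently.

```python
from collections import Counter


def pass_rate_by_source(results):
    """Compute the fraction of passing records for each source.

    `results` is an iterable of (source, passed) pairs, where `passed`
    is True when the record cleared validation.
    """
    total, passed = Counter(), Counter()
    for source, ok in results:
        total[source] += 1
        if ok:
            passed[source] += 1
    return {source: passed[source] / total[source] for source in total}
```

For example, `pass_rate_by_source([("a", True), ("a", False), ("b", True)])` returns `{"a": 0.5, "b": 1.0}`. Tracking this figure per source over time shows which partners drive most of the failures.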

This session is going to be beneficial for

Data Engineers
Data Platform and Infrastructure Engineers
Analytics Engineers
Machine Learning Engineers
MLOps and DataOps Practitioners
Engineering Managers and Technical Architects
AI/ML Leaders responsible for production AI systems
Founders and CTOs building data-intensive products
Teams consuming third-party or partner-generated data

Speaker Bio

Anuj Gupta helps organizations convert AI aspirations into concrete AI systems that deliver outcomes (in the capacity of Head of AI).

Built the flagship AI system for a YC startup that:
- Impressed Sam Altman & Vinod Khosla
- Was showcased by OpenAI at its flagship events
- Helped the startup secure funding from Khosla Ventures
- Vinod Khosla spoke about this system while addressing the Hon'ble PM at the AI CEO roundtable at the recently concluded India AI Summit, 2026
Built core AI systems at one of India's earliest AI startups, acquired for its AI capabilities by a Nasdaq-listed unicorn.
Published a major book on AI with O'Reilly, US, endorsed by top names in AI, including those from CMU, UCSD, DeepMind, Google AI, and flagship YC startups like Airbnb.
Brings 20+ years of experience building 50+ AI systems across startups & Fortune 50 companies, spending the first decade as an AI researcher and the second as a senior AI leader.

More about him

Draft Slide Deck

{Add the link to 2-min elevator pitch video}

Speak at The Fifth Elephant 2026 Annual Conference