Anuj Gupta

@anuj_gupta

Building AI on Broken Data: A DataOps Playbook from Processing Millions of Corrupted Data Points

Submitted Jun 21, 2026

Everyone wants AI. Everyone talks about models. But in many organizations, the real battle is neither model selection nor prompt engineering; it is data quality. At Vahan, a large blue-collar recruitment marketplace, we process millions of data points daily from our vendors and external partners.

We discovered that 10–40% of incoming data was routinely corrupted before it even reached our systems. The culprit was not faulty databases or broken pipelines, but seemingly harmless operational processes involving spreadsheets, CSV exports, Google Sheets, locale settings, and manual copy-paste workflows. These issues silently introduced date corruption, data fluctuations, retractions, pullbacks, duplication, and inconsistencies that impacted analytics, forecasting, incentive calculations, and downstream AI systems.
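To make the date-corruption pattern concrete, here is a minimal sketch of one way a locale mismatch can be caught: a slash-separated date is read both day-first and month-first, and flagged when the two readings are valid but disagree. The function names and the `DD/MM/YYYY` versus `MM/DD/YYYY` formats are assumptions chosen for illustration, not the checks used at Vahan.

```python
from datetime import datetime


def parse_both_ways(raw: str) -> dict:
    """Parse a slash-separated date under a day-first and a month-first reading.

    Returns only the readings that are valid calendar dates.
    """
    readings = {}
    for label, fmt in (("day_first", "%d/%m/%Y"), ("month_first", "%m/%d/%Y")):
        try:
            readings[label] = datetime.strptime(raw, fmt).date()
        except ValueError:
            pass  # this reading is not a valid calendar date
    return readings


def is_ambiguous(raw: str) -> bool:
    """True when both readings are valid and they point to different days."""
    readings = parse_both_ways(raw)
    return len(readings) == 2 and readings["day_first"] != readings["month_first"]


if __name__ == "__main__":
    for value in ("03/04/2025", "25/04/2025", "04/04/2025"):
        print(value, "ambiguous" if is_ambiguous(value) else "unambiguous")
```

A value such as `03/04/2025` is a valid date under either convention, which is why a spreadsheet opened with a different locale can change it without raising any error. A value such as `25/04/2025` can only be read day-first, so it is not flagged.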

This talk is a production engineering war story on how we built a DataOps platform to detect, prevent, correct, and monitor data quality issues at scale. I will share the surprising failure modes we uncovered, the architecture we built around validation engines, business-critical rules, automated correction workflows, and human-in-the-loop operations, and the metrics we used to continuously improve quality.
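The flow described above (validate, attempt an automated correction, escalate what cannot be fixed) can be sketched roughly as follows. This is an illustrative outline under assumed names: the `Rule` class, the `process` function, and the `phone` and `joining_date` fields are hypothetical and are not taken from the platform described in this talk.

```python
from dataclasses import dataclass
from typing import Callable, Optional

Record = dict


@dataclass
class Rule:
    name: str
    check: Callable[[Record], bool]                   # True when the record passes
    fix: Optional[Callable[[Record], Record]] = None  # optional automated correction


def process(record: Record, rules: list) -> tuple:
    """Run every rule; return (status, record, names_of_rules_still_failing)."""
    current = dict(record)
    corrected = False
    failed = []
    for rule in rules:
        if rule.check(current):
            continue
        if rule.fix is not None:
            candidate = rule.fix(current)
            if rule.check(candidate):  # accept a fix only if it makes the rule pass
                current = candidate
                corrected = True
                continue
        failed.append(rule.name)
    if failed:
        return "needs_review", current, failed  # route to a human review queue
    return ("corrected" if corrected else "clean"), current, failed


def phone_ok(r: Record) -> bool:
    phone = str(r.get("phone", ""))
    return phone.isdigit() and len(phone) == 10


def strip_non_digits(r: Record) -> Record:
    digits = "".join(ch for ch in str(r.get("phone", "")) if ch.isdigit())
    return {**r, "phone": digits}


# Hypothetical rules for illustration only.
RULES = [
    Rule("phone_is_10_digits", phone_ok, strip_non_digits),
    Rule("has_joining_date", lambda r: bool(r.get("joining_date"))),
]

if __name__ == "__main__":
    print(process({"phone": "98765 43210", "joining_date": "2025-04-03"}, RULES))
    print(process({"phone": "9876543210", "joining_date": ""}, RULES))
```

In this sketch a correction is accepted only when the corrected record passes the same rule that failed, and anything still failing is handed to people rather than silently dropped.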

Attendees will learn how we transformed data quality from roughly 70% to 99%+, creating a trusted foundation for analytics, forecasting, incentives, and AI systems. More importantly, they will leave with a practical, battle-tested DataOps playbook that can be applied to any organization consuming large volumes of operational or third-party data.
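This abstract does not say how the 70% and 99%+ figures are computed. One common definition, assumed here purely for illustration, is the share of records that pass every validation check. The `quality_rate` function and the sample check below are hypothetical.

```python
from typing import Callable, Iterable, Optional


def quality_rate(records: list, checks: Iterable[Callable[[dict], bool]]) -> Optional[float]:
    """Percentage of records that pass every check; None for an empty batch."""
    if not records:
        return None
    checks = list(checks)
    passed = sum(1 for r in records if all(check(r) for check in checks))
    return 100.0 * passed / len(records)


if __name__ == "__main__":
    # Hypothetical batch: 7 of 10 records carry a non-empty date field.
    batch = [{"date": "2025-04-03"}] * 7 + [{"date": ""}] * 3
    print(quality_rate(batch, [lambda r: bool(r["date"])]))
```

Tracking a rate like this per source and per day makes it possible to see whether a fix upstream actually moved the number.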

The session covers:

  1. Real-world data corruption patterns rarely discussed in data engineering literature
  2. Why spreadsheets become a hidden source of data quality failures
  3. Designing validation frameworks using business-critical rules
  4. Building automated correction pipelines and exception workflows
  5. Human-in-the-loop data quality operations
  6. Data quality metrics that matter in production
  7. Operational playbooks for organizations consuming third-party data

Key Takeaways

  1. Why 10–40% of production data can be wrong even when upstream systems are correct.
  2. Real-world data corruption patterns caused by spreadsheets, CSVs, and manual workflows.
  3. The four hidden failure modes that break analytics and AI: date corruption, fluctuations, retractions, and pullbacks.
  4. A practical DataOps architecture involving validation engines, business-critical rules, correction workflows, and human-in-the-loop operations.
  5. How to measure, operationalize, and continuously improve data quality using engineering metrics and processes.
  6. Why investing in DataOps often delivers higher ROI than investing in better AI models.

This session will be beneficial for

  • Data Engineers
  • Data Platform and Infrastructure Engineers
  • Analytics Engineers
  • Machine Learning Engineers
  • MLOps and DataOps Practitioners
  • Engineering Managers and Technical Architects
  • AI/ML Leaders responsible for production AI systems
  • Founders and CTOs building data-intensive products
  • Teams consuming third-party or partner-generated data

Speaker Bio

Anuj Gupta helps organizations become AI-native in his capacity as Head of AI.

More about him


Draft Slide Deck
