Asif Mansoor

@asifmansoora

Why Data Quality Matters When Working with Data at Scale

Submitted May 16, 2026

Most data quality problems aren’t bugs in the data. They’re broken contracts between producers and consumers. The contract gets implicitly defined when the first staging pipeline runs, then quietly violated in production when an upstream service ships a “harmless” schema change, a field gets nullified, or volume changes by 10x without warning. By the time the dashboards look wrong, the bad data has propagated everywhere.

This lightning talk walks through a practical two-layer framework for enforcing data quality as a first-class engineering concern rather than retrofitting it as cleanup. The first layer is producer-level enforcement: strict schemas, schema registries, and Avro-formatted contracts with forward and backward compatibility checks. The second is processing-layer enforcement using the Write-Audit-Publish (WAP) pattern on Apache Iceberg, with blocking and non-blocking quality checks that run before data is committed to live tables. The talk closes with the operational reality of running this at billions of events daily across petabytes, and the one architectural decision that didn’t survive contact with production.
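To make the producer-layer idea concrete, here is a minimal sketch of a forward and backward compatibility check over simplified Avro-style record schemas. It is an illustration under stated assumptions, not the talk's implementation and not a schema registry's API: the function names (`read_problems`, `check_compat`) and the sample schemas are hypothetical, and the rules ignore Avro features such as type promotion, aliases, and nested records.

```python
# Simplified compatibility check for Avro-style record schemas.
# Assumption: a schema is a dict with a "fields" list; each field has a
# "name", a "type", and optionally a "default". Real Avro schema resolution
# also handles type promotion, aliases, and unions, which this sketch skips.

def read_problems(reader, writer):
    """List reasons `reader` could not decode data written with `writer`."""
    writer_fields = {f["name"]: f for f in writer["fields"]}
    problems = []
    for field in reader["fields"]:
        written = writer_fields.get(field["name"])
        if written is None:
            # The reader expects a field the writer never produced.
            if "default" not in field:
                problems.append(f"{field['name']}: absent in writer, no default")
        elif written["type"] != field["type"]:
            problems.append(
                f"{field['name']}: writer type {written['type']}, "
                f"reader type {field['type']}"
            )
    return problems


def check_compat(old, new):
    """Backward: new schema reads old data. Forward: old schema reads new data."""
    return {
        "backward": read_problems(reader=new, writer=old),
        "forward": read_problems(reader=old, writer=new),
    }


# Hypothetical schemas for illustration only.
v1 = {"fields": [
    {"name": "user_id", "type": "string"},
    {"name": "amount", "type": "double"},
]}
# Adds an optional field with a default: compatible in both directions.
v2 = {"fields": v1["fields"] + [
    {"name": "currency", "type": "string", "default": "USD"},
]}
# Adds a required field with no default: breaks backward compatibility.
v3 = {"fields": v1["fields"] + [
    {"name": "region", "type": "string"},
]}
# Drops a field that old readers require: breaks forward compatibility.
v4 = {"fields": [{"name": "user_id", "type": "string"}]}

for label, candidate in [("v2", v2), ("v3", v3), ("v4", v4)]:
    print(label, check_compat(v1, candidate))
```

A check like this would typically run when a producer registers a new schema version, so that a "harmless" change is rejected before any event is written with it.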

  1. A concrete two-layer architecture for enforcing data quality in production: producer-layer schema contracts plus processing-layer Write-Audit-Publish. What each layer catches, what it misses, and how they compose.

  2. The operational reality of WAP at scale: how to decide what’s a blocking check versus a warning, how audit failures get handled without paging on-call at 3am, and the architectural decision I’d change if I were doing it again today.
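The blocking-versus-warning distinction in Write-Audit-Publish can be sketched in a few lines. This is an in-memory illustration only, assuming plain Python lists stand in for the staged write and the live table; it does not use Iceberg's API, and the names `Check`, `AuditResult`, `write_audit_publish`, and the two sample checks are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Check:
    name: str
    passed: Callable[[list], bool]  # receives the staged rows
    blocking: bool                  # True: a failure stops the publish


@dataclass
class AuditResult:
    published: bool
    failures: list = field(default_factory=list)  # failed blocking checks
    warnings: list = field(default_factory=list)  # failed non-blocking checks


def write_audit_publish(live, batch, checks):
    # Write: stage the batch where consumers of `live` cannot see it.
    staged = list(batch)

    # Audit: run every check so one report lists all problems at once.
    result = AuditResult(published=False)
    for check in checks:
        if check.passed(staged):
            continue
        target = result.failures if check.blocking else result.warnings
        target.append(check.name)

    # Publish: only when no blocking check failed. Warnings do not block.
    if not result.failures:
        live.extend(staged)
        result.published = True
    return result


# Hypothetical checks: a null key is a hard failure, low volume is a warning.
checks = [
    Check("user_id_not_null",
          lambda rows: all(r["user_id"] is not None for r in rows),
          blocking=True),
    Check("min_row_count", lambda rows: len(rows) >= 2, blocking=False),
]

live = []
good = [{"user_id": "u1", "amount": 5.0}, {"user_id": "u2", "amount": 7.5}]
small = [{"user_id": "u3", "amount": 2.0}]
bad = [{"user_id": None, "amount": 1.0}]

r_good = write_audit_publish(live, good, checks)    # published, no warnings
r_small = write_audit_publish(live, small, checks)  # published with a warning
r_bad = write_audit_publish(live, bad, checks)      # held back from `live`
print(r_good, r_small, r_bad, len(live), sep="\n")
```

In this sketch a failed blocking check leaves the live table untouched, so the rejected batch can be inspected and replayed later instead of requiring an immediate fix, while warnings are recorded and the data still ships.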

Audiences:
Senior data engineers, data architects, ML platform engineers, and engineering leaders responsible for production data infrastructure at scale. The talk assumes familiarity with streaming data systems and schemas, but doesn’t assume prior hands-on experience with Iceberg or the Write-Audit-Publish pattern specifically.

Bio
Asif Mansoor Amanullah is a Lead Data Engineer at Apple, specializing in large-scale data infrastructure, real-time streaming systems, and privacy-first analytics platforms. He has over a decade of experience building data systems at some of the world’s most technically demanding companies, leading the design and implementation of revenue-critical pipelines, unified analytics platforms, and audience data systems that process billions of events daily across petabytes of data.

