Submissions for MLOps November edition

On ML workflows, tools, automation and running ML in production

Anay Nayak

@anaynayak

Monitoring Data Quality at Scale

Submitted Jul 14, 2021

Level: Beginner

Timing: 15 min

Abstract

Data drift and data cascades are real problems that can wreak havoc with any business insight. When operating with data at scale and depending on external systems, any change in incoming data can cascade through downstream pipelines; these issues are difficult to trace and costly to correct. Data quality frameworks such as Deequ and Great Expectations provide key capabilities to monitor data automatically and generate alerts, so that the team is notified proactively.

Deequ

  • DSL over Spark for unit-testing data
  • DSL abstraction for defining data quality checks
  • Why static thresholds don’t work
  • Anomaly-based checks to avoid static thresholds (see the sketch after this list)
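
To make the outline concrete, here is a minimal sketch of the Deequ check DSL together with an anomaly-based size check, assuming an existing Spark DataFrame and metrics repository; the `order_id` and `amount` columns, thresholds and check names are illustrative placeholders, not taken from the talk.

```scala
import com.amazon.deequ.VerificationSuite
import com.amazon.deequ.checks.{Check, CheckLevel, CheckStatus}
import com.amazon.deequ.analyzers.Size
import com.amazon.deequ.anomalydetection.RelativeRateOfChangeStrategy
import com.amazon.deequ.repository.{MetricsRepository, ResultKey}
import org.apache.spark.sql.DataFrame

def runDataQualityChecks(df: DataFrame, repository: MetricsRepository): Boolean = {
  val result = VerificationSuite()
    .onData(df)
    .useRepository(repository)
    .saveOrAppendResult(ResultKey(System.currentTimeMillis()))
    // Declarative constraints: each entry computes a metric and asserts on it
    .addCheck(
      Check(CheckLevel.Error, "basic integrity checks")
        .isComplete("order_id")   // no missing values
        .isUnique("order_id")     // no duplicates
        .isNonNegative("amount")) // no negative amounts
    // Anomaly check: rather than a brittle static row-count threshold,
    // flag runs where the dataset more than doubles versus the previous run
    .addAnomalyCheck(
      RelativeRateOfChangeStrategy(maxRateIncrease = Some(2.0)),
      Size())
    .run()

  // Feed failures into the alerting path discussed in the 'Ops' section
  result.status == CheckStatus.Success
}
```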

Integrating into ‘Ops’

  • Visualising metrics over time (see the sketch after this list)
  • Alerting
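
One plausible way to wire this up (a sketch under assumptions, not necessarily the setup shown in the talk): persist each run's metrics to a Deequ metrics repository and load the history back as a DataFrame that dashboards and alert rules can consume. The S3 path, dataset tag and column name below are placeholders.

```scala
import com.amazon.deequ.analyzers.{Completeness, Size}
import com.amazon.deequ.analyzers.runners.AnalysisRunner
import com.amazon.deequ.repository.ResultKey
import com.amazon.deequ.repository.fs.FileSystemMetricsRepository
import org.apache.spark.sql.{DataFrame, SparkSession}

def recordAndLoadMetrics(spark: SparkSession, df: DataFrame): DataFrame = {
  // A persistent repository so metrics accumulate across scheduled runs
  val repository = FileSystemMetricsRepository(spark, "s3://my-bucket/dq/metrics.json")
  val resultKey  = ResultKey(System.currentTimeMillis(), Map("dataset" -> "orders"))

  // Compute today's metrics and append them to the repository
  AnalysisRunner
    .onData(df)
    .useRepository(repository)
    .saveOrAppendResult(resultKey)
    .addAnalyzer(Size())
    .addAnalyzer(Completeness("order_id"))
    .run()

  // Load the metric history as a DataFrame to chart trends or drive alert rules
  repository
    .load()
    .forAnalyzers(Seq(Size(), Completeness("order_id")))
    .getSuccessMetricsAsDataFrame(spark)
}
```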

Takeaways

  • Benefits of Data Quality frameworks like Deequ
  • How to define good data quality metrics
  • Known limitations / alternatives

https://docs.google.com/presentation/d/1YcUZq3GlkDNPZEZxnoQDT3YzroiISCaSppzUuWjPGds/edit?usp=sharing

