Submissions for MLOps November edition

On ML workflows, tools, automation and running ML in production

This project is accepting submissions for the November edition of the MLOps conference.

The first edition of the MLOps conference was held on 23, 24 and 27 July. Details about the conference, including videos and blog posts, are published at https://hasgeek.com/fifthelephant/mlops-conference/

Contact information: For inquiries, contact The Fifth Elephant on fifthelephant.editorial@hasgeek.com or call 7676332020.

Hosted by

The Fifth Elephant - known as one of the best data science and Machine Learning conferences in Asia - has transitioned into a year-round forum for conversations about data and ML engineering; data science in production; data security and privacy practices.

Anay Nayak

@anaynayak

Monitoring Data Quality at Scale

Submitted Jul 14, 2021

Level: Beginner

Timing: 15 min

Abstract

Data drift and data cascades are real problems that can wreak havoc on business insights. When operating on data at scale and integrating with external systems, a change in upstream data can cascade through every downstream pipeline; such changes are difficult to trace, and correcting the affected data is costly. Data quality frameworks like Deequ and Great Expectations provide capabilities to monitor data automatically and generate alerts, so that the team is notified proactively.

Deequ

  • DSL over Spark for unit-testing data
  • DSL abstraction for defining data quality checks
  • Why static thresholds don’t work
  • Anomaly-based checks to avoid static thresholds
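
The contrast between static thresholds and anomaly-based checks can be illustrated with a minimal sketch in plain Python. This is not Deequ's API: the function names, the row counts and the threshold values below are all hypothetical, chosen only to show why a fixed minimum can go stale on a growing table while a check relative to the previous run still catches a sudden drop.

```python
from typing import List


def static_check(row_count: int, minimum: int) -> bool:
    """Static threshold: pass if the row count is at least a fixed minimum."""
    return row_count >= minimum


def relative_change_check(previous: float, current: float,
                          max_decrease: float = 0.5,
                          max_increase: float = 2.0) -> bool:
    """Anomaly-style check: pass if current/previous stays inside a ratio band."""
    if previous <= 0:
        return True  # no usable baseline yet, so nothing to compare against
    ratio = current / previous
    return max_decrease <= ratio <= max_increase


# Hypothetical daily row counts: steady growth, then a sudden drop on the last day.
daily_row_counts: List[int] = [1000, 1200, 1500, 1900, 2400, 900]

# Hypothetical fixed minimum, picked back when the table held about 1000 rows.
STATIC_MINIMUM = 800

static_results = [static_check(n, STATIC_MINIMUM) for n in daily_row_counts]

# Compare each day with the day before it (the first day has no predecessor).
anomaly_results = [
    relative_change_check(prev, cur)
    for prev, cur in zip(daily_row_counts, daily_row_counts[1:])
]

print("static: ", static_results)
print("anomaly:", anomaly_results)
```

In this made-up series the static check passes on every day, including the last one, because 900 rows is still above the stale minimum of 800. The relative check fails on the last day, since the count fell to well under half of the previous day's value.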

Integrating into ‘Ops’

  • Visualising metrics over time
  • Alerting
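
As a rough sketch of what tracking a metric over time and alerting on it might look like, the plain-Python example below keeps a history of one metric and flags points that fall far outside a rolling baseline. Everything here is an assumption made for illustration: the `MetricPoint` type, the `find_anomalies` helper, the window size, the 3-sigma rule and the sample values are hypothetical and are not taken from the talk or from any framework.

```python
from dataclasses import dataclass
from statistics import mean, stdev
from typing import List


@dataclass
class MetricPoint:
    day: str      # date the metric was computed for
    name: str     # metric name, e.g. completeness of a column
    value: float  # metric value for that day


def find_anomalies(history: List[MetricPoint], window: int = 5,
                   max_sigma: float = 3.0) -> List[MetricPoint]:
    """Return points that deviate more than max_sigma standard deviations
    from the mean of the preceding `window` points."""
    anomalies = []
    for i in range(window, len(history)):
        baseline = [p.value for p in history[i - window:i]]
        mu, sigma = mean(baseline), stdev(baseline)
        point = history[i]
        # Skip perfectly flat baselines, where sigma is zero.
        if sigma > 0 and abs(point.value - mu) > max_sigma * sigma:
            anomalies.append(point)
    return anomalies


# Hypothetical completeness of one column over a week; the last day degrades.
values = [0.98, 0.99, 0.97, 0.98, 0.99, 0.98, 0.60]
history = [
    MetricPoint(day=f"2021-07-{d:02d}", name="completeness", value=v)
    for d, v in enumerate(values, start=1)
]

alerts = find_anomalies(history)
for p in alerts:
    # A real setup would notify the team's alerting channel instead of printing.
    print(f"ALERT {p.day}: {p.name} = {p.value:.2f}")
```

The same history that drives the alert can also be plotted, so the team sees the trend leading up to an alert and not only the failing value.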

Takeaways

  • Benefits of Data Quality frameworks like Deequ
  • How to define good data quality metrics
  • Known limitations / alternatives

https://docs.google.com/presentation/d/1YcUZq3GlkDNPZEZxnoQDT3YzroiISCaSppzUuWjPGds/edit?usp=sharing
