The Fifth Elephant 2018

The seventh edition of India's best data conference

Improve data quality using Apache Airflow and check operator

Submitted by Sakshi Bansal (@sakshi28) on Wednesday, 21 March 2018

Technical level

Intermediate

Section

Crisp talk

Status

Confirmed & Scheduled

Total votes: +25

Abstract

The Data Team at Qubole collects usage and telemetry data from a million machines a month. We run many complex ETL workflows to process this data and provide reports, insights and recommendations to customers, analysts and data scientists. We use the open source distribution of Apache Airflow to orchestrate our ETLs and process more than 1 terabyte of data daily.

These ETLs differ in terms of frequencies, types of data, transformation logic and their SLAs. The volume of data and the differences amongst these ETLs make it difficult to monitor data quality. Errors can be introduced at any stage (extraction, transformation or load) and usually arise from infrastructural or logical issues.

In order to catch these errors, we came up with the idea of using assert queries, analogous to assert statements in a unit test framework. After an extraction, transformation or load step finishes, predefined diagnostic queries run on the data and their output is matched against expected values.
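
As a hypothetical illustration of the idea (the helper function, table name and threshold below are assumptions, not Qubole's actual implementation), an assert query for a load step might simply bound the row count of a freshly loaded table:

    # Minimal sketch of an assert query, assuming a hypothetical run_query()
    # helper that executes SQL against the warehouse and returns the first
    # row of the result as a tuple.
    def assert_min_row_count(run_query, table, expected_min):
        """Fail loudly if the loaded table has fewer rows than expected."""
        (actual,) = run_query("SELECT COUNT(*) FROM {}".format(table))
        assert actual >= expected_min, (
            "Data quality check failed: {} has {} rows, "
            "expected at least {}".format(table, actual, expected_min)
        )

    # Example: run right after the load step for a hypothetical usage table.
    # assert_min_row_count(run_query, "daily_usage_summary", expected_min=100000)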

In this talk, I will

  • Discuss the complexities involved in detecting discrepancies in the output of a data transformation process and in protecting downstream processes when an issue occurs.

  • Introduce the approach we have adopted for running these assert queries, based on the Check operator in Apache Airflow, to quantify data quality and alert on it (see the sketch after this list).

  • Discuss the enhancements we have made in Qubole's fork of Apache Airflow's check operator to use it at a larger scale and with a greater variety of data. We plan to contribute these enhancements back to Apache Airflow soon.

  • Talk about the lessons learnt and best practices in maintaining data sanity for data in motion.
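
For readers who have not used the operator referenced above, the sketch below shows one way a check task can sit between a load step and its downstream consumers in an Airflow 1.x DAG, so that a failed check blocks downstream processing. The DAG id, connection id, table, column and thresholds are illustrative assumptions, not Qubole's production setup:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.check_operator import CheckOperator, ValueCheckOperator
    from airflow.operators.dummy_operator import DummyOperator

    # Hypothetical DAG; the real ETLs, tables and connections differ.
    dag = DAG(
        dag_id="usage_etl_with_quality_checks",
        start_date=datetime(2018, 1, 1),
        schedule_interval="@daily",
    )

    # Stand-ins for the real load step and its downstream consumer.
    load = DummyOperator(task_id="load_usage_events", dag=dag)
    report = DummyOperator(task_id="build_customer_report", dag=dag)

    # CheckOperator fails its task if the first row of the result contains a
    # falsy value, e.g. a zero row count for today's partition.
    rows_exist = CheckOperator(
        task_id="check_rows_exist",
        sql="SELECT COUNT(*) FROM usage_events WHERE ds = '{{ ds }}'",
        conn_id="warehouse_default",
        dag=dag,
    )

    # ValueCheckOperator compares a single scalar against an expected value;
    # here the number of duplicate event ids should be exactly zero.
    no_duplicates = ValueCheckOperator(
        task_id="check_no_duplicate_events",
        sql=(
            "SELECT COUNT(*) - COUNT(DISTINCT event_id) "
            "FROM usage_events WHERE ds = '{{ ds }}'"
        ),
        pass_value=0,
        conn_id="warehouse_default",
        dag=dag,
    )

    # Reporting only runs if both checks succeed.
    load >> rows_exist
    load >> no_duplicates
    rows_exist >> report
    no_duplicates >> report

Because the checks are ordinary Airflow tasks in this sketch, a failed assertion simply fails its task, which keeps the downstream report from running and surfaces the issue through Airflow's usual alerting.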

We have integrated most of our ETLs with these data quality verification techniques, and the results look promising. We have been able to make this work across ETLs that have nothing in common except that they all run on Apache Airflow.

Outline

  1. Data quality issues we faced with data ingestion/transformation.
  2. The approach we have adopted using Apache Airflow check operators.
  3. Enhancements we had to make to the check operators.
  4. Integration of Apache Airflow check operators with our ETLs.
  5. Challenges faced in developing the alerting framework.
  6. Lessons learnt and best practices in using Apache Airflow for data quality checks.
  7. Limitations and future work.

Speaker bio

Sakshi is a graduate of BITS Pilani and has been working with Qubole for the last two years. She works with the data team at Qubole and was involved in building the company's data streaming platform and data warehouse.

Slides

https://docs.google.com/presentation/d/11FUv-UvpftgqMZOHJBuyrd7t6pdWF9Ln4uIJRLZ40XA/edit?usp=sharing

Preview video

https://youtu.be/GC1BPyiVDms

Comments

  • Venkata Pingali (@venkatapingali) 7 months ago

    Looks good.

    A few minor observations:

    1. A one-slider with a sample Airflow DAG that uses the operator will help readers understand what you mean by 'operator'.
    2. Some slides are verbose and use a small font.
  • Sakshi Bansal (@sakshi28) Proposer 7 months ago

    The first slide, “Setting the context”, does exactly what you mentioned in your first point.
    I’ll make some changes to the slides to make them less verbose.
    Thanks for the feedback!

    • Zainab Bawa (@zainabbawa) Reviewer 6 months ago

      Sakshi, also add more details on what types of assertions worked, and include some examples of how this works in production.

      • Sakshi Bansal (@sakshi28) Proposer 6 months ago

        Sure. Will do that.
