Diksuchi: A data quality monitoring platform for at-scale batch data pipelines at Walmart
We, the Customer Backbone team at Walmart, are building a customer identity and activity graph with around 20+ billion nodes and 30 billion edges that serves as the lifeline of customer data for multiple pillars such as marketing, targeting, personalization, and data science. While building the graph using Spark and Hive pipelines, we generate many intermediate tables/states and output tables.
To provide high-quality data to our teams, we have built a data quality monitoring platform, Diksuchi (meaning compass), that provides metrics, audits, monitoring, and alerting on our data pipelines for quick and easy debugging. The monitoring platform runs alongside the processing-heavy pipelines, doing the heavy lifting of calculating metrics, checking the correctness of data, detecting anomalies in the inputs and outputs, and raising alarms. Diksuchi also provides dashboards for easy navigation and debugging.
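As a rough illustration of the kind of check such a platform runs, the sketch below flags a day's row count as anomalous when it deviates from recent history by more than a few standard deviations. The function and threshold are illustrative assumptions, not Diksuchi's actual API.

```python
# Hypothetical anomaly check: flag today's metric if it is a >threshold-sigma
# outlier versus the recent history of the same metric.
from statistics import mean, stdev

def is_anomalous(history, today, threshold=3.0):
    """Return True if today's value deviates from history by > threshold sigmas."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return today != mu
    return abs(today - mu) / sigma > threshold

daily_row_counts = [10_020, 9_980, 10_050, 9_990, 10_010]
print(is_anomalous(daily_row_counts, 10_030))  # normal day -> False
print(is_anomalous(daily_row_counts, 4_000))   # sudden drop -> True, raise alarm
```

In practice the same idea would run over metrics computed by Spark jobs (row counts, null ratios, distinct counts) rather than an in-memory list.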
The platform is uniquely developed to enable any developer, data scientist, or analyst to write a simple configuration, onboard a new data processing pipeline anywhere in Walmart, and monitor its data quality and correctness.
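To make the configuration-driven onboarding concrete, here is a minimal sketch of what such a pipeline config and a sanity check on it might look like. All field names (`pipeline_name`, `metrics`, `anomaly_detection`, etc.) are hypothetical; the real Diksuchi schema is not described in this abstract.

```python
# Hypothetical onboarding configuration for one pipeline; field names are
# illustrative assumptions, not Diksuchi's real schema.
pipeline_config = {
    "pipeline_name": "customer_graph_daily",
    "tables": ["stage.customer_nodes", "stage.customer_edges"],
    "metrics": ["row_count", "null_ratio:customer_id", "distinct_count:customer_id"],
    "anomaly_detection": {"method": "stddev", "threshold": 3.0},
    "alerts": {"channel": "email", "recipients": ["team@example.com"]},
}

REQUIRED_KEYS = {"pipeline_name", "tables", "metrics"}

def validate_config(cfg):
    """Minimal sanity check before onboarding a pipeline config."""
    missing = REQUIRED_KEYS - cfg.keys()
    if missing:
        raise ValueError(f"missing required keys: {sorted(missing)}")
    return True

print(validate_config(pipeline_config))  # True
```

A declarative config like this is what lets non-owners of the platform onboard new pipelines without writing monitoring code themselves.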
This talk will cover the following topics:
1. Introduction to the problem statement
2. Why every company needs this kind of platform
3. Journey of the metrics platform
4. Demo of the platform
5. Setting alert rules & anomaly detection
6. Takeaways from the platform
Prerequisites: Basic understanding of Spark, Airflow, Elasticsearch, Hive, and Grafana
Pruthvi is a senior data engineer at WalmartLabs and has been working on the Customer Backbone team for more than a year. During this time, he and his team members developed a customer identity and activity graph platform and a data quality monitoring platform.