The Fifth Elephant 2019

Gathering of 1000+ practitioners from the data ecosystem

Diksuchi: Data quality Monitoring platform for @scale batch data pipelines at Walmart

Submitted by Pruthvi Raj (@pruthvirajeranti) on Friday, 14 June 2019

Session type: Short talk of 20 mins

Abstract

We, the Customer Backbone team at Walmart, are building a customer identity and activity graph with more than 20 billion nodes and around 30 billion edges. It serves as the lifeline of customer data for multiple pillars such as marketing, targeting, personalization, and data science. While building the graph with Spark and Hive pipelines, we generate many intermediate tables/states as well as output tables.
To provide high-quality data to our teams, we have built a data quality monitoring platform, Diksuchi (meaning compass), that provides metrics, audits, monitoring, and alerting on our data pipelines for quick and easy debugging. The monitoring platform runs alongside the processing-heavy pipelines and does the heavy lifting of calculating metrics, checking the correctness of data, detecting anomalies in pipeline inputs and outputs, and raising alarms. Diksuchi also provides dashboards for easy navigation and debugging.
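
To make the idea of metric calculation and anomaly detection concrete, here is a minimal sketch in plain Python. It is not Diksuchi's implementation: the function names, the `customer_id` column used in the usage note, and the z-score rule with a threshold of 3 are all assumptions chosen for illustration. A real deployment would compute such metrics with Spark over Hive tables.

```python
from statistics import mean, stdev


def null_rate(rows, column):
    """Fraction of rows where `column` is missing or None (hypothetical metric)."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if row.get(column) is None)
    return missing / len(rows)


def is_anomalous(history, value, z_threshold=3.0):
    """Flag `value` if it lies more than `z_threshold` sample standard
    deviations away from the mean of `history` (simple z-score rule,
    assumed here purely for illustration)."""
    if len(history) < 2:
        return False  # not enough past runs to judge
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > z_threshold


def check_batch(rows, column, row_count_history):
    """Compute metrics for one pipeline run and collect alert messages."""
    metrics = {"row_count": len(rows), "null_rate": null_rate(rows, column)}
    alerts = []
    if is_anomalous(row_count_history, metrics["row_count"]):
        alerts.append("row_count deviates from recent runs")
    return metrics, alerts
```

For example, calling `check_batch(rows, "customer_id", [1000, 1010, 990, 1005])` on a run that produced only a handful of rows would return the metrics together with a row-count alert, while a run close to the historical counts would return no alerts.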

The platform is designed so that any developer, data scientist, or analyst can write a simple configuration to onboard a new data processing pipeline anywhere in Walmart and monitor its data quality and correctness.
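
As a rough sketch of what configuration-driven onboarding could look like, the Python snippet below evaluates a set of declarative checks against the metrics of a pipeline run. The configuration schema, the pipeline name `customer_graph_edges`, the metric names, and the `evaluate` function are all hypothetical; the actual Diksuchi configuration format is not described here.

```python
import operator

# Hypothetical pipeline configuration; every key and value is illustrative.
PIPELINE_CONFIG = {
    "pipeline": "customer_graph_edges",
    "checks": [
        {"metric": "row_count", "op": ">=", "threshold": 1000},
        {"metric": "null_rate", "op": "<=", "threshold": 0.01},
    ],
}

# Map the comparison symbols allowed in the config to Python functions.
OPERATORS = {
    ">=": operator.ge,
    "<=": operator.le,
    ">": operator.gt,
    "<": operator.lt,
    "==": operator.eq,
}


def evaluate(config, metrics):
    """Return human-readable failures for every check that does not hold."""
    failures = []
    for check in config["checks"]:
        name = check["metric"]
        if name not in metrics:
            failures.append(f"{name}: metric missing")
            continue
        compare = OPERATORS[check["op"]]
        if not compare(metrics[name], check["threshold"]):
            failures.append(
                f"{name}: {metrics[name]} violates {check['op']} {check['threshold']}"
            )
    return failures
```

With this sketch, `evaluate(PIPELINE_CONFIG, {"row_count": 1500, "null_rate": 0.05})` reports a single failure for `null_rate`. Onboarding another pipeline would then only require a new configuration entry rather than new code.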

Outline

This talk will cover the following topics:
1. Introduction to the problem statement
2. Why every company needs this kind of platform
3. Journey of the metrics platform
4. Demo of the platform
5. Setting alert rules and anomaly detection
6. Takeaways from the platform

Requirements

Basic understanding of Spark, Airflow, Elasticsearch, Hive, and Grafana

Speaker bio

Pruthvi is a senior data engineer at WalmartLabs and has been working on the Customer Backbone team for more than a year. During this time, he and his teammates have developed a customer identity and activity graph platform and a data quality monitoring platform.

Comments

  • Abhishek Balaji (@booleanbalaji) Reviewer 5 months ago

    Hi Pruthvi,

    Thank you for submitting a proposal. We need to see detailed slides and a preview video to evaluate your proposal. Your slides must cover the following:

    • Problem statement/context, which the audience can relate to and understand. The problem statement has to be a problem (based on this context) that can be generalized for all.
    • What were the tools/frameworks available in the market to solve this problem? How did you evaluate these, and what metrics did you use for the evaluation? Why did you pick the option that you did?
    • Explain how the situation was before the solution you picked/built and how it changed after implementing the solution you picked and built? Show before-after scenario comparisons & metrics.
    • What compromises/trade-offs did you have to make in this process?
    • What is the one takeaway that you want participants to go back with at the end of this talk? What is it that participants should learn/be cautious about when solving similar problems?

    We need your updated slides and preview video by Jun 27, 2019 to evaluate your proposal. If we do not receive an update, we’d be moving your proposal for evaluation under a future event.

  • Pruthvi Raj (@pruthvirajeranti) Proposer 4 months ago

    Hi Abhishek,

    Thanks for the inputs. We will incorporate them soon and send you an updated version.

    Thanks

    • Abhishek Balaji (@booleanbalaji) Reviewer 4 months ago

      Marked as rejected since the proposer hasn't responded to comments or updated the content before the deadline. Will be considered for a future event if the content is updated.
