Maintaining Data Pipelines' Sanity at Scale: How Validations and Metric Visualization Came to Our Rescue!
Submitted by Akash Khandelwal (@akash099) on Monday, 15 April 2019
Session type: Full talk (40 mins)
Have you ever been through a nightmare when corrupt data from an upstream source led to a rogue index push to prod?
In this talk, I'll walk through case studies from our work at Flipkart:
1. Writing test cases for data pipelines: validating datasets and generated patterns in addition to business logic.
2. Capturing and visualizing important metrics, and alerting: in-lab and external recurring evaluation.
3. Bringing order to chaos: dealing with staleness and volume drops.
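To make the outline concrete, here is a minimal sketch of the kind of pre-push validation the talk describes: a batch check that flags missing fields, a sharp volume drop against the previous run, and stale data. This is an illustrative example, not Flipkart's actual code; the function name, record fields (`query`, `updated_at`), and thresholds are all assumptions.

```python
from datetime import datetime, timedelta

def validate_batch(records, prev_count, max_age_hours=24, max_drop=0.2):
    """Validate a batch of pipeline records before pushing downstream.

    Returns a list of human-readable failure messages (empty list = pass).
    """
    failures = []

    # Business-logic check: every record needs a non-empty 'query' field.
    bad = [r for r in records if not r.get("query")]
    if bad:
        failures.append(f"{len(bad)} records missing 'query' field")

    # Volume-drop check: alert if this batch shrank sharply vs. the last run.
    if prev_count and len(records) < prev_count * (1 - max_drop):
        failures.append(f"volume drop: {len(records)} vs previous {prev_count}")

    # Staleness check: the newest record must be recent enough.
    newest = max((r["updated_at"] for r in records if "updated_at" in r),
                 default=None)
    if newest is None or datetime.utcnow() - newest > timedelta(hours=max_age_hours):
        failures.append("stale data: newest record older than threshold")

    return failures
```

Wiring a check like this into the pipeline as a gate (and exporting each failure as a metric) is what turns a rogue index push into a blocked deploy plus an alert.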
Akash is a software developer on the Search Autosuggest team at Flipkart. Previously, he worked on building the Flipkart Recommendation System, where he designed real-time and batch pipelines to power recommendations for use cases such as product bundling, similar products, and personalisation. He is interested in applying machine learning to pattern mining and in deploying data processing pipelines at scale. He graduated with a dual degree in Computer Science & Engineering from IIT Delhi.