Designing a Data Pipeline at Scale
Fuzzy Deduplication of records at scale
Submitted by Suvrat Hiran (@suvrathiran) on Monday, 15 April 2019
Session type: Lecture (short talk of 20 mins)
The quality of stored data has significant implications for any product or system that relies on information. Unfortunately, data is often entered erroneously, creating duplicate entries. This degrades the quality of data retrieval for the product or system.
At Freshworks, we are looking at incorporating deduplication as a feature in our CRM product, Freshsales. Here, deduplication would help customers' sales teams organize their databases more efficiently. Duplicate entries can arise from spelling mistakes, abbreviations, different word order, etc. We have built a machine learning system that deduplicates over fifty million records for streaming data, as well as a Spark-based system for static data.
- Scaling the solution to work for millions of records. Tackling the n^2 problem (each record is compared with every other record).
- The duplicate-detection model should be robust enough to find duplicates despite spelling mistakes, phonetic matches, empty fields, punctuation, salutations, field mismatches, abbreviations, and variations in phone numbers (state code, area code, etc.).
- Storing duplicates efficiently while keeping storage cost effective.
- Scoring and training a model when tagged data is lacking.
- How to improve the model based on user feedback.
- Building training data without having a clean tagged dataset.
- Scaling deduplication using blocking. Records were grouped into blocks based on basic similarity.
- Model choice and feature details
- Model evaluation and metric
- Active learning
- Searching and storing duplicates for static and streaming data.
- Spark was used to find duplicates already present in the system. Kafka consumers were written to find duplicates whenever a new record was added.
- Spark and native Python compatibility, such that the core model modules work for both.
- Graph database for storage.
- There are about 100M records to be de-duplicated. We stored record IDs as vertices, and duplicate records were connected via edges. The graph holds close to 40M vertices and 80M edges.
- Fetching duplicates is an O(1) operation.
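To make the blocking and graph-storage ideas in the outline concrete, here is a minimal sketch in Python. It is not the Freshworks implementation. The blocking key (first three letters of the name), the `difflib` string similarity (standing in for the trained model), the 0.85 threshold, and the in-memory dictionary (standing in for a graph database) are all assumptions chosen for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def blocking_key(record):
    # Hypothetical blocking key: first three letters of the lowercased name.
    # Only records that share a key are ever compared with each other.
    return record["name"].lower().strip()[:3]


def similarity(a, b):
    # Stand-in for the duplicate-detection model: plain string similarity
    # on the name field, returning a score between 0 and 1.
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()


def find_duplicates(records, threshold=0.85):
    # Step 1: group records into blocks so we avoid comparing all n^2 pairs.
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec)

    # Step 2: compare pairs within each block and store matches as an
    # undirected graph (record id -> set of duplicate record ids).
    graph = defaultdict(set)
    for block in blocks.values():
        for a, b in combinations(block, 2):
            if similarity(a, b) >= threshold:
                graph[a["id"]].add(b["id"])
                graph[b["id"]].add(a["id"])
    return graph


records = [
    {"id": 1, "name": "Jonathan Smith"},
    {"id": 2, "name": "Jonathon Smith"},
    {"id": 3, "name": "Priya Raman"},
    {"id": 4, "name": "Jon Smyth"},
]
graph = find_duplicates(records)

# Looking up the direct duplicates of a record is a single hash lookup.
print(graph.get(1, set()))
```

In this sketch, only records that matched something become vertices, and fetching the direct duplicates of a record is one dictionary lookup. A simple prefix key like this one misses duplicates whose first letters differ, so a real system would likely combine several blocking keys.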
My name is Suvrat Hiran, and I have been working with Freshworks for the last 6 months. I work at the juncture of engineering and data science. During my 7+ years of industry experience, I have built various machine learning products at scale. I have previously worked with adtech, marketing automation, and enterprise AI companies, helping them build machine learning products. I graduated from IIT Kharagpur in Statistics and Informatics in 2011.