25 Jul 2019 (Thu), 09:15 AM – 05:45 PM IST
26 Jul 2019 (Fri), 09:20 AM – 05:30 PM IST
Suvrat Hiran
The quality of stored data has significant implications for any product or system that relies on that information. Unfortunately, data is often entered erroneously, creating duplicate entries. These duplicates degrade the quality of data retrieval for the product or system.
At Freshworks, we are looking at incorporating deduplication as a feature in our CRM product, Freshsales, where it would help customers' sales teams organize their databases more efficiently. Duplicate entries can arise from spelling mistakes, abbreviations, different word order, and so on. We have built a machine learning system that deduplicates over fifty million records of streaming data, as well as a Spark-based system for static data.
Problem overview
Challenges
- Scaling the solution to work for millions of records, and tackling the n^2 problem (naively, each record is compared with every other record).
- Making the duplicate-detection model robust enough to find duplicates despite spelling mistakes, phonetic matches, empty fields, punctuation, salutations, field mismatches, abbreviations, and variations in phone numbers (state code, area code, etc.).
- Storing duplicates efficiently while keeping the solution cost-effective.
- Scoring and training the model when tagged data is scarce.
- Improving the model based on user feedback.
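To make the first two challenges concrete: comparing every record with every other record means n(n-1)/2 comparisons, which for fifty million records is roughly 1.25 × 10^15 pairs. A common way around this is blocking, where records are grouped by cheap keys and only records that share a key are compared. The sketch below is illustrative only and is not the system described in this talk. The field names (`name`, `phone`), the blocking keys, the salutation list, the 0.7/0.3 weights and the 0.85 threshold are all assumptions, and a plain string-similarity score stands in for a trained model.

```python
import re
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Assumed salutation list, for illustration only.
SALUTATIONS = {"mr", "mrs", "ms", "dr", "prof"}


def normalize_name(name):
    """Lowercase, replace punctuation with spaces, drop salutations, sort tokens."""
    tokens = re.sub(r"[^a-z0-9\s]", " ", (name or "").lower()).split()
    return " ".join(sorted(t for t in tokens if t not in SALUTATIONS))


def normalize_phone(phone):
    """Keep digits only and retain the last 10, ignoring country or area prefixes."""
    digits = re.sub(r"\D", "", phone or "")
    return digits[-10:]


def blocking_keys(record):
    """Hypothetical cheap keys: 2-letter prefix of each name token, last 4 phone digits."""
    keys = {"n:" + token[:2] for token in normalize_name(record.get("name")).split()}
    phone = normalize_phone(record.get("phone"))
    if phone:
        keys.add("p:" + phone[-4:])
    return keys


def candidate_pairs(records):
    """Return index pairs that share at least one blocking key."""
    blocks = defaultdict(set)
    for idx, record in enumerate(records):
        for key in blocking_keys(record):
            blocks[key].add(idx)
    pairs = set()
    for members in blocks.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs


def similarity(a, b):
    """Weighted score: fuzzy name match, plus exact phone match when both are present."""
    name_score = SequenceMatcher(
        None, normalize_name(a.get("name")), normalize_name(b.get("name"))
    ).ratio()
    phone_a, phone_b = normalize_phone(a.get("phone")), normalize_phone(b.get("phone"))
    if phone_a and phone_b:
        return 0.7 * name_score + 0.3 * (1.0 if phone_a == phone_b else 0.0)
    return name_score  # empty phone field: fall back to the name alone


def find_duplicates(records, threshold=0.85):
    """Score only the candidate pairs and keep those at or above the threshold."""
    return sorted(
        (i, j)
        for i, j in candidate_pairs(records)
        if similarity(records[i], records[j]) >= threshold
    )


records = [
    {"name": "Dr. John Smith", "phone": "+91 98765 43210"},
    {"name": "Smith, Jon", "phone": "9876543210"},
    {"name": "Alice Kumar", "phone": ""},
    {"name": "Bob Lee", "phone": "044-1234-5678"},
]
print(find_duplicates(records))
```

With these four sample records, only the first two share a blocking key, so a single pair is scored instead of all six. Blocking trades recall for speed: a typo in the first two letters of every name token, combined with a missing phone number, would hide a true duplicate, which is why the choice of keys matters.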
Modeling methodology
Production deployment
My name is Suvrat Hiran, and I have been working with Freshworks for the last 6 months. I work at the juncture of engineering and data science. During my 7+ years of industry experience, I have built various machine learning products at scale. I have previously worked with adtech, marketing automation, and enterprise AI companies, helping them build machine learning products. I graduated from IIT Kharagpur in Statistics and Informatics in 2011.