The Fifth Elephant 2019

Gathering of 1000+ practitioners from the data ecosystem


Fuzzy Deduplication of records at scale

Submitted by Suvrat Hiran (@suvrathiran) on Monday, 15 April 2019

Session type: Short talk of 20 mins Status: Rejected


The quality of the data stored in a system has significant implications for any product that relies on that information. Unfortunately, data is often entered erroneously, creating duplicate entries. This degrades the quality of data retrieval for any product or system built on top of it.
At Freshworks in particular, we are looking at incorporating deduplication as a feature in our CRM product, Freshsales. Here deduplication would help our customers' sales teams organize their databases more efficiently. Duplicate entries can arise from spelling mistakes, abbreviations, different word orders, etc. We have built a machine learning system that deduplicates over fifty million records, handling streaming data as well as static data via a Spark-based system.
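The kinds of noise listed above (spelling mistakes, word order, salutations) can be tolerated by normalizing records before comparing them. As a minimal sketch (not the Freshsales implementation), using only Python's standard-library `difflib` and a hypothetical `normalize` helper:

```python
import re
from difflib import SequenceMatcher

SALUTATIONS = {"mr", "mrs", "ms", "dr"}  # illustrative, not exhaustive

def normalize(name: str) -> str:
    # Lowercase, strip punctuation, drop salutations, and sort tokens
    # so "Dr. Suvrat Hiran" and "hiran, suvrat" compare as equals.
    name = re.sub(r"[^\w\s]", " ", name.lower())
    tokens = [t for t in name.split() if t not in SALUTATIONS]
    return " ".join(sorted(tokens))

def similarity(a: str, b: str) -> float:
    # Character-level similarity in [0, 1]; 1.0 means identical
    # after normalization.
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

print(similarity("Dr. Suvrat Hiran", "hiran, suvrat"))  # → 1.0
```

A production model would combine several such signals (phonetic codes, field-level comparisons, phone-number normalization) as features rather than relying on a single string ratio.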


  • Problem overview

  • Challenges

    • Scaling the solution to millions of records: tackling the n² problem (naively, each record is compared with every other record).
    • The duplicate-detection model must be robust enough to find duplicates despite spelling mistakes, phonetic matches, empty fields, punctuation, salutations, field mismatches, abbreviations, and variations in phone numbers (state code, area code, etc.).
    • Storing duplicates efficiently while keeping the system cost-effective.
    • Scoring and training the model when tagged data is scarce.
    • How to improve the model based on user feedback.
  • Modeling methodology

    • Building training data without a clean tagged dataset.
    • Scaling deduplication using blocking: records are grouped into blocks based on basic similarity.
    • Model choice and feature details
    • Model evaluation and metric
    • Active learning
  • Production deployment

    • Searching and storing duplicates for static and streaming data.
    • Spark was used to find duplicates already present in the system; Kafka consumers were written to detect duplicates whenever a new record is added.
    • Spark and native python compatibility such that core model modules work for both.
    • Graph database for storage.
    • There are ~100M records to be de-duplicated. We store record ids as vertices, with duplicate records connected via edges. The graph holds close to ~40M vertices and ~80M edges.
    • Fetching duplicates is an O(1) operation.
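The blocking step mentioned in the outline is what defuses the n² problem: pairwise comparison is restricted to records that share a cheap blocking key, so the expensive model only scores candidate pairs. A minimal sketch (the key function here is hypothetical, not the one used at Freshworks):

```python
from collections import defaultdict
from itertools import combinations

def blocking_key(record: dict) -> str:
    # Hypothetical cheap key: first letter of the name plus the last
    # four digits of the phone number. Real systems derive several
    # such keys so a typo in one field cannot hide a duplicate.
    name = (record.get("name") or " ").lower()
    phone = "".join(c for c in record.get("phone", "") if c.isdigit())
    return name[0] + phone[-4:]

def candidate_pairs(records):
    blocks = defaultdict(list)
    for r in records:
        blocks[blocking_key(r)].append(r)
    # Only pairs inside a block are compared: cost is the sum of the
    # squared block sizes, not len(records) ** 2.
    for block in blocks.values():
        yield from combinations(block, 2)

records = [
    {"name": "suvrat", "phone": "+91 98765 43210"},
    {"name": "Suvrat H.", "phone": "9876543210"},
    {"name": "zainab", "phone": "011 2345 6789"},
]
pairs = list(candidate_pairs(records))
print(len(pairs))  # → 1: only the two "suvrat" records share a block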
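The proposal stores duplicates in a graph database, with record ids as vertices and duplicate links as edges. An in-memory adjacency map illustrates why fetching duplicates is O(1) (this is a sketch of the idea, not the graph database itself):

```python
from collections import defaultdict

class DuplicateGraph:
    # Record ids are vertices; an edge marks a detected duplicate pair.
    def __init__(self):
        self.adj = defaultdict(set)

    def add_duplicate(self, a: str, b: str) -> None:
        # Undirected edge between the two record ids.
        self.adj[a].add(b)
        self.adj[b].add(a)

    def duplicates_of(self, record_id: str) -> set:
        # A single hash lookup, independent of the total number of
        # records stored, hence O(1).
        return self.adj.get(record_id, set())

g = DuplicateGraph()
g.add_duplicate("rec:1", "rec:2")
g.add_duplicate("rec:1", "rec:3")
print(sorted(g.duplicates_of("rec:1")))  # → ['rec:2', 'rec:3']
```

A graph database gives the same adjacency lookup while persisting tens of millions of vertices and edges, which is what makes it cost-effective at the ~100M-record scale described above.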

Speaker bio

My name is Suvrat Hiran, and I have been working with Freshworks for the last 6 months. I work at the juncture of engineering and data science. During my 7+ years of industry experience I have built various machine learning products at scale, previously working with adtech, marketing automation, and enterprise AI companies to help them build machine learning products. I graduated from IIT Kharagpur in Statistics and Informatics in 2011.




  •   Suvrat Hiran (@suvrathiran) Proposer 10 months ago

    @zainabbaw Can you confirm if this submission needs to add video or slides?

    •   Zainab Bawa (@zainabbawa) Reviewer 9 months ago

      Every submission has to have slides and preview video. Preview video hasn’t been added to this proposal.

      •   Zainab Bawa (@zainabbawa) Reviewer 9 months ago (edited 9 months ago)

        Here is the feedback from the review:

        1. The slides are very thin and don’t cover much details. The abstract is more well defined.
        2. Based on the outline mentioned in the abstract, the proposed talk is geared heavily towards describing the problem and the solution. The problem of de-duplication being well known, what will be more interesting is to share why you chose this modelling and engineering approach? What are the pitfalls of this approach?
  •   Abhishek Balaji (@booleanbalaji) Reviewer 8 months ago

    Marking this as rejected since the proposer hasn't submitted a preview video or updated slides.
