Find Near Duplicate records in your Data
Submitted by Mahesh Tiyyagura (@tmahesh) on Saturday, 18 May 2013
- Find customers with multiple mobile numbers in a dataset of 2 million customer records
- Identify different editions of a book in a catalogue of 10 million ISBNs
- Work out how many unique customers placed these 6 million orders at your call center
I will explain an elegant solution that uses a search index (Solr/Lucene), simple shell scripts, and Union-Find to group similar records in datasets of this size.
Detecting similar or near-duplicate records is a common, recurring problem across domains. But it is not something we can solve just by spinning up a Hadoop cluster on EC2 and running a MapReduce job.
The solution relies on the simplicity of a search index (Solr/Lucene) and a few shell scripts that glue together tools available in open source. I will also explain how you can extend this solution to run faster if you have multiple nodes, and point out where it can be customized for your specific problem.
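To make the Union-Find step concrete, here is a minimal Python sketch of how pairs of similar records could be merged into groups. It is an illustration under assumptions, not the talk's actual implementation: the names `UnionFind` and `group_records` and the sample record IDs are hypothetical, and the input pairs are assumed to have been produced already (for example, by querying the search index for records similar to each record).

```python
from collections import defaultdict


class UnionFind:
    """Disjoint-set structure with path compression and union by size."""

    def __init__(self):
        self.parent = {}
        self.size = {}

    def find(self, x):
        # Register unseen records as their own singleton group.
        if x not in self.parent:
            self.parent[x] = x
            self.size[x] = 1
            return x
        # Walk up to the root of x's group.
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        # Attach the smaller group under the larger one.
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]


def group_records(similar_pairs):
    """Merge (record_a, record_b) pairs into sorted groups of similar records."""
    uf = UnionFind()
    for a, b in similar_pairs:
        uf.union(a, b)
    groups = defaultdict(list)
    for record in uf.parent:
        groups[uf.find(record)].append(record)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical candidate pairs; in practice these would come from the index.
pairs = [("c1", "c2"), ("c2", "c3"), ("c4", "c5")]
print(group_records(pairs))
```

Because similarity is treated as transitive here, `c1` and `c3` end up in the same group even though they were never compared directly. Whether that is desirable depends on the dataset, and it is one of the places where a solution like this could be customized.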
Curiosity and Modesty
BigData Analytics is no BigDeal. What we will focus on is understanding the core abstractions and tools available in open source (search, MapReduce, NoSQL, etc.) to solve these kinds of problems.
Mahesh loves technology and hacking code. His curiosity keeps him up to date on the best and latest in tech. He currently heads the engineering group that builds order processing, warehousing, and fulfillment systems at Homeshop18. Before this, he was an entrepreneur working on large-scale crawling and extraction of structured data from the web. He is an ex-Yahoo! and an alumnus of IIT Kanpur.