Find Near Duplicate records in your Data

Jul 2013

11 Thu 09:30 AM – 04:30 PM IST

12 Fri 10:15 AM – 05:30 PM IST

13 Sat 10:15 AM – 05:30 PM IST

Nimhans Convention Centre

Submitted May 19, 2013

Section: Workshops
Technical level: Intermediate

Find customers with multiple mobile numbers in a dataset of 2 million customer records.
Identify different editions of a book in a catalogue of 10 million ISBNs.
Count how many unique customers placed the 6 million orders at your call center.

This workshop explains an elegant solution that uses a search index (Solr/Lucene), simple shell scripts, and Union-Find to group similar records in large datasets like these.
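To make the grouping step concrete, here is a minimal Python sketch of Union-Find applied to pairs of records already judged similar. It is an illustration under assumptions, not the workshop's actual code: the names `UnionFind` and `group_records` and the sample record ids are hypothetical, and the pairs are assumed to come from an earlier matching step such as search-index queries.

```python
from collections import defaultdict


class UnionFind:
    """Disjoint-set forest with path compression and union by size."""

    def __init__(self):
        self.parent = {}
        self.size = {}

    def find(self, x):
        # Register unseen items as their own singleton set.
        if x not in self.parent:
            self.parent[x] = x
            self.size[x] = 1
        # Walk up to the root of x's set.
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path at the root.
        while self.parent[x] != root:
            nxt = self.parent[x]
            self.parent[x] = root
            x = nxt
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        # Attach the smaller tree under the larger one.
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]


def group_records(pairs):
    """Merge similar pairs (hypothetical input) into groups of record ids."""
    uf = UnionFind()
    for a, b in pairs:
        uf.union(a, b)
    groups = defaultdict(set)
    for rec in list(uf.parent):
        groups[uf.find(rec)].add(rec)
    return sorted(sorted(g) for g in groups.values())


# Made-up ids: c1~c2 and c2~c3 collapse into one group, c4~c5 into another.
pairs = [("c1", "c2"), ("c2", "c3"), ("c4", "c5")]
groups = group_records(pairs)
```

Because similarity is merged transitively, `c1` and `c3` end up in the same group even though they were never compared directly; whether that is desirable depends on how strict the pairwise matching is.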

Outline

Detecting similar or near-duplicate records is a common, recurring problem across domains. But it is not something we can solve just by spinning up a Hadoop cluster on EC2 and running a MapReduce job.

The solution relies on the simplicity of a search index (Solr/Lucene) and a few shell scripts that glue together existing open-source tools. The workshop will also explain how to extend the solution to run faster on multiple nodes, and the points where it can be customized for your specific problem.
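As a rough illustration of why a search index helps, the sketch below uses a toy in-memory inverted index as a stand-in for Solr/Lucene: only records that share a token are paired, so there is no need to compare every record with every other. The function name `candidate_pairs`, the `max_bucket` cutoff, and the sample records are assumptions made for this example; a real setup would query the index instead and would likely apply a stricter similarity score.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(records, max_bucket=1000):
    """Return sorted pairs of record ids that share at least one token.

    records: dict mapping record id -> free-text field (hypothetical shape).
    max_bucket: skip tokens shared by more records than this, since very
    common tokens produce huge numbers of low-value pairs.
    """
    # Build the inverted index: token -> set of record ids containing it.
    index = defaultdict(set)
    for rec_id, text in records.items():
        for token in text.lower().split():
            index[token].add(rec_id)

    # Every pair of ids within one posting list is a candidate match.
    pairs = set()
    for ids in index.values():
        if 2 <= len(ids) <= max_bucket:
            pairs.update(combinations(sorted(ids), 2))
    return sorted(pairs)


# Made-up sample data: records 1 and 2 share a surname and a mobile number.
records = {
    1: "Ravi Kumar 9845012345",
    2: "R Kumar 9845012345",
    3: "Anita Rao 9900112233",
}
pairs = candidate_pairs(records)
```

The resulting pairs are the kind of input a grouping step such as Union-Find can then merge into clusters of near-duplicate records.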

Requirements

Curiosity and Modesty

BigData analytics is no BigDeal. We will focus on understanding the core abstractions and open-source tools (search, MapReduce, NoSQL, etc.) available for solving these kinds of problems.

Speaker bio

Mahesh loves technology and hacking code. His curiosity keeps him up to date on the best and latest in tech. He currently heads the engineering group that builds order processing, warehousing, and fulfillment systems at Homeshop18. Before that, he was an entrepreneur working on large-scale crawling and extraction of structured data from the web. He is an ex-Yahoo! employee and an alumnus of IIT Kanpur.

The Fifth Elephant 2013