The Fifth Elephant 2013

An Event on Big Data and Cloud Computing

Find Near Duplicate records in your Data

Submitted by Mahesh Tiyyagura (@tmahesh) on Saturday, 18 May 2013

Technical level

Intermediate

Section

Workshops

Status

Submitted

Objective

  1. Find customers with multiple mobile numbers in a dataset of 2 million customer records.
  2. Identify different editions of a book in a catalogue of 10 million ISBNs.
  3. Count how many unique customers placed the 6 million orders taken at your call center.

I will explain an elegant solution that uses a search index (Solr/Lucene), simple shell scripts and Union-Find to group similar records in data sets of this size.
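
To make the Union-Find step concrete, here is a minimal sketch in Python. It assumes that pairs of similar record ids have already been produced by some earlier matching step; the names UnionFind, group_records and the sample ids are hypothetical and only illustrate the idea, not the workshop's actual code.

```python
class UnionFind:
    """Disjoint-set structure over arbitrary hashable record ids."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        # Register unseen ids as their own singleton set.
        self.parent.setdefault(x, x)
        # Walk up to the root of x's set.
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        # Merge the sets containing a and b (no-op if already merged).
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra


def group_records(record_ids, similar_pairs):
    """Group ids so that any chain of similar pairs ends up in one group."""
    uf = UnionFind()
    for rid in record_ids:
        uf.find(rid)
    for a, b in similar_pairs:
        uf.union(a, b)
    groups = {}
    for rid in record_ids:
        groups.setdefault(uf.find(rid), []).append(rid)
    return sorted(sorted(g) for g in groups.values())


if __name__ == "__main__":
    # Hypothetical data: c1~c2 and c2~c3 were flagged as similar.
    ids = ["c1", "c2", "c3", "c4", "c5"]
    pairs = [("c1", "c2"), ("c2", "c3")]
    print(group_records(ids, pairs))
```

Because similarity is treated as transitive here, c1 and c3 land in the same group even though they were never compared directly. Whether that is desirable depends on the data set.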

Description

Detecting similar or near-duplicate records is a common, recurring problem across domains. But it is not something we can solve just by spinning up a Hadoop cluster on EC2 and running MapReduce code.

The solution relies on the simplicity of a search index (Solr/Lucene) and a few shell scripts that glue together existing open-source tools. I will also explain how you can extend this solution to run faster if you have multiple nodes, and point out where it can be customized for your specific problem.
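
As a rough illustration of why a search index helps, the sketch below uses a tiny in-memory inverted index as a stand-in for Solr/Lucene. It is not Solr's API, and the tokenizer, the Jaccard score, the 0.5 threshold and the sample titles are all assumptions made for the example. The point is that records are only compared when they share at least one token, instead of comparing every record against every other.

```python
from collections import defaultdict
from itertools import combinations


def tokenize(text):
    # Assumed tokenizer: lowercase, split on whitespace.
    return set(text.lower().split())


def candidate_pairs(records, threshold=0.5):
    """Return sorted (id_a, id_b) pairs whose token Jaccard score >= threshold.

    records maps a record id to its text. Only records that share at
    least one token are ever compared.
    """
    tokens = {rid: tokenize(text) for rid, text in records.items()}

    # Inverted index: token -> ids of the records containing it.
    index = defaultdict(set)
    for rid, toks in tokens.items():
        for tok in toks:
            index[tok].add(rid)

    # Every pair of ids that co-occurs in some posting list is a candidate.
    seen = set()
    for ids in index.values():
        for a, b in combinations(sorted(ids), 2):
            seen.add((a, b))

    # Score candidates and keep the ones above the threshold.
    pairs = []
    for a, b in sorted(seen):
        ta, tb = tokens[a], tokens[b]
        jaccard = len(ta & tb) / len(ta | tb)
        if jaccard >= threshold:
            pairs.append((a, b))
    return pairs


if __name__ == "__main__":
    # Hypothetical catalogue entries.
    books = {
        "b1": "Learning Python 4th edition",
        "b2": "Learning Python 5th edition",
        "b3": "Cooking for beginners",
    }
    print(candidate_pairs(books))
```

In this toy version a very common token produces a long posting list and many candidate pairs. A real search index ranks matches and lets you cap the number of results per query, which is part of what keeps the approach workable on millions of records.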

Requirements

Curiosity and Modesty

BigData Analytics is no BigDeal. What we will focus on is understanding the core abstractions and tools available in open source (search, MapReduce, NoSQL, etc.) to solve these kinds of problems.

Speaker bio

Mahesh loves technology and hacking code. His curiosity keeps him up to date on the best and latest in tech. He currently heads the engineering group that builds order processing, warehousing and fulfillment systems at Homeshop18. Before this, he was an entrepreneur working on large-scale crawling and extraction of structured data from the web. He is an ex-Yahoo! and an alumnus of IIT Kanpur.
