Jul 2013
11 Thu 09:30 AM – 04:30 PM IST
12 Fri 10:15 AM – 05:30 PM IST
13 Sat 10:15 AM – 05:30 PM IST
Mahesh Tiyyagura
This talk explains an elegant solution that uses a search index (Solr/Lucene), simple shell scripts, and Union-Find to help you group similar records in large datasets.
Detecting similar or near-duplicate records is a common, recurring problem across domains. But it is not something we can solve just by spinning up a Hadoop cluster on EC2 and running MapReduce code.
The solution relies on the simplicity of a search index (Solr/Lucene) and a few shell scripts that glue together available open-source tools. The talk will also explain how you can extend this solution to run faster if you have multiple nodes, and the points where it can be customized for your specific problem.
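To make the idea concrete, here is a minimal sketch of the two stages, not the speaker's actual implementation. A small in-memory inverted index stands in for Solr/Lucene (an assumption made so the example is self-contained), Jaccard token overlap stands in for whatever similarity scoring the real system uses, and Union-Find merges the matching pairs into groups. The names `tokens`, `candidate_pairs`, `UnionFind`, `group_similar`, and the `threshold` parameter are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def tokens(text):
    """Lowercased set of whitespace-separated words in a record."""
    return set(text.lower().split())


def candidate_pairs(records, threshold=0.5):
    """Yield (a, b) id pairs whose Jaccard token overlap meets the threshold.

    The inverted index is a stand-in for a search index: only records that
    share at least one token are ever compared with each other.
    """
    index = defaultdict(set)
    for rid, text in records.items():
        for tok in tokens(text):
            index[tok].add(rid)

    seen = set()
    for ids in index.values():
        for a, b in combinations(sorted(ids), 2):
            if (a, b) in seen:
                continue
            seen.add((a, b))
            ta, tb = tokens(records[a]), tokens(records[b])
            if len(ta & tb) / len(ta | tb) >= threshold:
                yield a, b


class UnionFind:
    """Disjoint-set structure with path halving."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            # Point x at its grandparent, then step up.
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra


def group_similar(records, threshold=0.5):
    """Return sorted groups of record ids linked by similar pairs."""
    uf = UnionFind()
    for rid in records:
        uf.find(rid)  # register every record, including singletons
    for a, b in candidate_pairs(records, threshold):
        uf.union(a, b)
    groups = defaultdict(list)
    for rid in records:
        groups[uf.find(rid)].append(rid)
    return sorted(sorted(g) for g in groups.values())
```

Because Union-Find merges transitively, records A and C end up in the same group when A matches B and B matches C, even if A and C do not match directly. For example, calling `group_similar` with a threshold of 0.5 on the three records "a b c", "b c d", and "c d e" places all of them in a single group.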
Curiosity and Modesty
BigData analytics is no BigDeal. We will focus on understanding the core abstractions and tools available in open source (search, MapReduce, NoSQL, etc.) for solving these kinds of problems.
Mahesh loves technology and hacking code. His curiosity keeps him up to date on the best and latest in tech. He currently heads the engineering group that builds order processing, warehousing, and fulfillment systems at Homeshop18. Before that, he was an entrepreneur working on large-scale crawling and extraction of structured data from the web. He is an ex-Yahoo! and an alumnus of IIT Kanpur.