De-dup on Hadoop

Jul 2014

23 Wed 09:30 AM – 05:00 PM IST

24 Thu 09:45 AM – 05:00 PM IST

25 Fri 08:30 AM – 07:15 PM IST

26 Sat 08:30 AM – 07:15 PM IST

NIMHANS Convention Centre, Bangalore


This submission has been added to the schedule


Submitted Jun 12, 2014

Section: Crisp talk

Technical level: Beginner

In this talk, I wish to share the experiences we had at Intuit in building a Master Data Management (MDM) solution on the Hadoop platform. At its core, an MDM solution consists of fuzzy matching, entity resolution and de-duplication. Solving these problems on a big data platform like Hadoop is the focus of this discussion.

Outline

In many enterprises, business data includes many client, customer, vendor or product lists in different formats and systems, and many of their entries are near duplicates. MDM solutions on RDBMS have been prominent in almost every enterprise for many years. They support master data management by removing duplicates, standardizing data and applying rules that keep incorrect data out of the system, in order to create an authoritative source of master data. MDM on big data platforms like Hadoop has benefits as well as its own set of challenges compared with its RDBMS counterparts. I will cover them in detail, focusing primarily on building this solution on Hadoop.
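To make the idea of near-duplicate detection concrete, here is a minimal sketch in plain Python. It is not the solution built at Intuit, which this abstract does not describe. It assumes a common approach: a map step assigns each record a blocking key, and a reduce step compares only the records that share a key. The function names, the three-character blocking key, the sample records and the 0.85 threshold are all hypothetical choices made for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def normalize(name):
    # Standardize: lowercase, replace punctuation with spaces, collapse whitespace.
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in name.lower())
    return " ".join(cleaned.split())


def map_phase(records):
    # Emit (blocking_key, value) pairs. The key here is an assumed, very simple
    # choice: the first three characters of the normalized name.
    for rec_id, name in records:
        norm = normalize(name)
        yield norm[:3], (rec_id, norm)


def reduce_phase(block, threshold):
    # Compare every pair inside one block and emit the pairs that look alike.
    for (id_a, a), (id_b, b) in combinations(block, 2):
        if SequenceMatcher(None, a, b).ratio() >= threshold:
            yield id_a, id_b


def find_near_duplicates(records, threshold=0.85):
    # Local stand-in for the shuffle step: group mapped values by blocking key.
    blocks = defaultdict(list)
    for key, value in map_phase(records):
        blocks[key].append(value)
    pairs = []
    for block in blocks.values():
        pairs.extend(reduce_phase(block, threshold))
    return sorted(pairs)


# Hypothetical sample data, not taken from the talk.
records = [
    (1, "Acme Corp."),
    (2, "ACME Corp"),
    (3, "Globex Inc"),
    (4, "Acme Corporation"),
]

print(find_near_duplicates(records))  # [(1, 2)]
```

Blocking keeps the number of pairwise comparisons manageable, since only records within the same block are compared. The trade-off is that records whose keys differ are never compared, so the choice of blocking key matters as much as the similarity threshold.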

Speaker bio

I am a Data Architect at Intuit with 13+ years of experience in BI and Data Analytics. Prior to Intuit, I worked at Intel, Oracle and EMC, applying BI in the Manufacturing, Finance and Storage Analytics domains.