De-dup on Hadoop
Submitted by Neeta Pande (@neetapande) on Thursday, 12 June 2014
In this talk, I wish to share the experiences we had at Intuit in building a Master Data Management (MDM) solution on the Hadoop platform. At its core, an MDM solution consists of fuzzy matching, entity resolution and de-duplication. Solving these patterns on a Big Data platform like Hadoop is the focus of this discussion.
In many enterprises, business data commonly contains many client, customer, vendor or product lists in different formats and systems, many of which are near duplicates. MDM solutions on RDBMS have been prominent for many years in almost every enterprise. They support master data management by removing duplicates, standardizing data and incorporating rules that keep incorrect data from entering the system, in order to create an authoritative source of master data. MDM on Big Data platforms like Hadoop has benefits as well as its own set of challenges when compared with its RDBMS counterparts. I will cover them in detail, focusing primarily on building this solution on Hadoop.
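To make the near-duplicate problem concrete, here is a minimal sketch of one common way to structure fuzzy de-duplication in a map/reduce style: group records by a cheap blocking key, then compare records only within each group. This is an illustration, not the solution built at Intuit. The function names, the three-character blocking key, the 0.85 threshold and the sample records are all assumptions made for the example, and the grouping that Hadoop's shuffle would perform is simulated in memory.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def normalize(name):
    """Lowercase and replace punctuation with spaces so trivial variants compare equal."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in name.lower())
    return " ".join(cleaned.split())


def map_phase(records):
    """Emit (blocking_key, record) pairs.

    Assumption: the blocking key is the first 3 characters of the
    normalized name. A real system would choose keys suited to its data.
    """
    for rec_id, name in records:
        norm = normalize(name)
        yield norm[:3], (rec_id, norm)


def reduce_phase(block, threshold):
    """Compare records within one block and yield the id pairs that look alike."""
    for (id_a, a), (id_b, b) in combinations(block, 2):
        if SequenceMatcher(None, a, b).ratio() >= threshold:
            yield id_a, id_b


def find_duplicates(records, threshold=0.85):
    # Simulate the shuffle step: group mapper output by blocking key.
    blocks = defaultdict(list)
    for key, value in map_phase(records):
        blocks[key].append(value)
    pairs = []
    for block in blocks.values():
        pairs.extend(reduce_phase(block, threshold))
    return pairs


# Hypothetical sample data, invented for this example.
records = [
    (1, "Acme Corporation"),
    (2, "ACME Corporation."),
    (3, "Acme Corp"),
    (4, "Zenith Supplies"),
]
duplicates = find_duplicates(records)
print(duplicates)
```

Records 1 and 2 normalize to the same string and are reported as a candidate pair, while record 4 lands in a different block and is never compared with the others. Blocking keeps the number of pairwise comparisons manageable at the cost of missing duplicates whose keys differ, which is one of the trade-offs such a design has to weigh.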
I am a Data Architect at Intuit with 13+ years of experience in BI and Data Analytics. Prior to Intuit, I worked at Intel, Oracle and EMC, applying BI in the Manufacturing, Finance and Storage Analytics domains.