The Fifth Elephant 2014

A conference on big data and analytics

Neeta Pande


De-dup on Hadoop

Submitted Jun 12, 2014

In this talk, I wish to share the experiences we had at Intuit in building a Master Data Management (MDM) solution on the Hadoop platform. At its core, an MDM solution consists of fuzzy matching, entity resolution and de-duplication. Solving these problems on a big data platform like Hadoop is the focus of this discussion.


In many enterprises, business data commonly includes many client, customer, vendor or product lists in different formats and systems, many of which are near duplicates. MDM solutions on RDBMS have been prominent for many years in almost every enterprise. They support master data management by removing duplicates, standardizing data and incorporating rules that keep incorrect data from entering the system, in order to create an authoritative source of master data. MDM on big data platforms like Hadoop has benefits as well as its own set of challenges when compared with its RDBMS counterparts. I will cover them in detail, focusing primarily on building this solution on Hadoop.
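As a rough illustration of the de-duplication pattern mentioned above, here is a minimal single-machine sketch in Python. It is not the implementation built at Intuit; it only assumes a common approach in which a "map" step groups records by a cheap blocking key and a "reduce" step fuzzy-matches records within each group. The function names, the three-character blocking key and the 0.85 similarity threshold are all hypothetical choices made for this example.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def normalize(name):
    # Lowercase, replace punctuation with spaces, collapse repeated spaces.
    cleaned = "".join(ch if ch.isalnum() else " " for ch in name.lower())
    return " ".join(cleaned.split())


def map_phase(records):
    # "Map": group record ids by a cheap blocking key (assumed here to be
    # the first three characters of the normalized name) so that only
    # records sharing a key are compared with each other.
    blocks = defaultdict(list)
    for rec_id, name in records.items():
        blocks[normalize(name)[:3]].append(rec_id)
    return blocks


def reduce_phase(blocks, records, threshold=0.85):
    # "Reduce": fuzzy-match every pair inside a block and merge matching
    # records into clusters with a small union-find structure.
    parent = {rec_id: rec_id for rec_id in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for ids in blocks.values():
        for a, b in combinations(ids, 2):
            score = SequenceMatcher(
                None, normalize(records[a]), normalize(records[b])
            ).ratio()
            if score >= threshold:
                parent[find(a)] = find(b)

    clusters = defaultdict(set)
    for rec_id in records:
        clusters[find(rec_id)].add(rec_id)
    return list(clusters.values())


def dedupe(records, threshold=0.85):
    # records: dict mapping a record id to a name string.
    return reduce_phase(map_phase(records), records, threshold)
```

Calling `dedupe` on a dictionary of record ids and names returns a list of sets, each set holding the ids judged to refer to the same entity. On Hadoop, the grouping by blocking key would typically be handled by the shuffle between mappers and reducers, and the blocking key and similarity measure would need tuning for the data at hand.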

Speaker bio

I am a Data Architect at Intuit with 13+ years of experience in BI and Data Analytics. Prior to Intuit, I worked at Intel, Oracle and EMC, applying BI in the Manufacturing, Finance and Storage Analytics domains.



