The Fifth Elephant 2024 Annual Conference (12th &13th July)

Maximising the Potential of Data — Discussions around data science, machine learning & AI

Harini Anand

@inirah02

Sonal Goyal

@sonalgoyal

Getting dimensions right! A sneak peak at entity resolution in the warehouse and datalake

Submitted Jun 22, 2024

Real world data contains multiple records belonging to the same customer. These records can be in single or multiple systems and they have variations across fields, which makes it hard to combine them together, especially with growing data volumes. This hurts customer analytics - establishing lifetime value, loyalty programs, or marketing channels is impossible when the base data is not linked. No AI algorithm for segmentation can produce the right results when there are multiple copies of the same customer lurking in the data. No warehouse can live up to its promise if the dimension tables have duplicates.

With a modern data stack and DataOps, we have established patterns for E and L in ELT for building data warehouses, datalakes and deltalakes. However, the T - getting data ready for analytics still needs a lot of effort. Modern tools like dbt are actively and successfully addressing this. What is also needed is a quick and scalable way to build the single source of truth of core business entities post Extraction and pre or post Loading.

This talk will cover the problem of entity resolution, its challenges and practical approaches to overcome them.

Outline

  • The data stack tools and technologies
  • The problem of entity resolution
  • Business impact of entity resolution
  • Challenges in building ER systems
  • Some approaches to solve ER challenges

Impact

The talk will equip the data and egineer and data scientist to think about entity resolution holistically as part of their data stack and use the appropriate tools and techniques to build scalable and accurate ER systems.

Comments

{{ gettext('Login to leave a comment') }}

{{ gettext('Post a comment…') }}
{{ gettext('New comment') }}
{{ formTitle }}

{{ errorMsg }}

{{ gettext('No comments posted yet') }}

Hosted by

Jump starting better data engineering and AI futures

Supported by

Gold Sponsor

Atlassian unleashes the potential of every team. Our agile & DevOps, IT service management and work management software helps teams organize, discuss, and compl

Silver Sponsor

Together, we can build for everyone.

Workshop sponsor

Datastax, the real-time AI Company.

Lanyard Sponsor

We reimagine the way the world moves for the better.

Sponsor

MonsterAPI is an easy and cost-effective GenAI computing platform designed for developers to quickly fine-tune, evaluate and deploy LLMs for businesses.

Community Partner

FOSS United is a non-profit foundation that aims at promoting and strengthening the Free and Open Source Software (FOSS) ecosystem in India. more

Beverage Partner

BONOMI is a ready to drink beverage brand based out of Bangalore. Our first segment into the beverage category is ready to drink cold brew coffee.