The Fifth Elephant 2018

The seventh edition of India's best data conference

Soumya Shukla


Segmenting 500 million users using Airflow + Hive

Submitted Mar 31, 2018

Walmart is the largest retail company in the US, with both an online and an offline presence. It reaches millions of users through many channels: physical stores, an ecommerce website, and the members-only Sam's Club, to name a few.

As a marketing team, you need one holistic platform for all customers so that you can provide them with the latest and best offers. You need all the information you can get about a user and their journey across all your platforms. If you want to promote a new iPhone, you need to reach everyone associated with Walmart who has shown an active interest in Apple products on any of these platforms.

This becomes a gigantic problem when there are 500 million customers and their daily activity is spread across 20+ different systems. Additionally, it's not just customer data but their purchase and browsing history as well. Making sense of terabytes of data becomes a colossal task.

In this talk, I will speak about how we took this huge set of data from multiple sources, joined user history across platforms, sanitized the data, and published a single source that gives a bird's-eye view of each customer. I will also talk about how we made it easy for users to create segments on top of this data and target their audience better.


  • Problem Statement - Segmenting 500 million Users using data from 20+ different sources.
  • Generating the customer data
  • Joining the customer data from multiple sources
  • Data sanitization and reliability checks
  • Publishing the data for easy use
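
To make the outline concrete, here is a minimal sketch of the join, sanitize, and segment steps in plain Python. It is not the pipeline described in the talk, which uses Airflow and Hive at a far larger scale; the two sources, the field names (`customer_id`, `brand`, `action`), and the helper functions are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical per-source event records; the field names are illustrative only.
STORE_EVENTS = [
    {"customer_id": "c1", "brand": "Apple", "action": "purchase"},
    {"customer_id": "c2", "brand": "Samsung", "action": "purchase"},
]
WEB_EVENTS = [
    {"customer_id": "c2", "brand": "apple ", "action": "browse"},
    {"customer_id": None, "brand": "Apple", "action": "browse"},  # unusable row
    {"customer_id": "c3", "brand": "Sony", "action": "browse"},
]


def sanitize(event):
    """Return a cleaned copy of the event, or None if it lacks a customer id."""
    if not event.get("customer_id"):
        return None
    cleaned = dict(event)
    # Normalize brand spelling so "Apple" and "apple " match across sources.
    cleaned["brand"] = event["brand"].strip().lower()
    return cleaned


def build_profiles(sources):
    """Join events from every source into one profile per customer."""
    profiles = defaultdict(lambda: {"brands": set(), "sources": set()})
    for source_name, events in sources.items():
        for event in events:
            cleaned = sanitize(event)
            if cleaned is None:
                continue  # drop rows that cannot be tied to a customer
            profile = profiles[cleaned["customer_id"]]
            profile["brands"].add(cleaned["brand"])
            profile["sources"].add(source_name)
    return dict(profiles)


def segment(profiles, brand):
    """Return the sorted customer ids that showed interest in the brand."""
    return sorted(cid for cid, p in profiles.items() if brand in p["brands"])


profiles = build_profiles({"store": STORE_EVENTS, "web": WEB_EVENTS})
apple_segment = segment(profiles, "apple")
print(apple_segment)
```

In this toy version the segment is a simple filter over the joined profiles. In a warehouse setting the same idea would more plausibly be a query over a published table, with a scheduler running the join and checks on a regular cadence.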

Speaker bio

I, Soumya Shukla, have been a software developer for 6+ years. I have worked for Amazon and I’m now working as a senior software developer at WalmartLabs, India.



