The Fifth Elephant 2018

The seventh edition of India's best data conference

Segmenting 500 million users using Airflow + Hive

Submitted by Soumya Shukla (@soumyashukla22) on Saturday, 31 March 2018

Section: Crisp talk
Technical level: Intermediate
Status: Confirmed & Scheduled


Walmart is the largest retail company in the US, with both an online and an offline presence. It reaches millions of users through many channels: physical stores, an ecommerce website, and the exclusive Sam's Club, to name a few.

As a marketing team, you need one holistic platform covering all customers so that you can offer them the latest and best deals. You need all the information you can get about a user and their journey across all your platforms. If you want to promote a new iPhone, you need to reach everyone who is associated with Walmart and has shown an active interest in Apple products on any of these platforms.

This becomes a gigantic problem when there are 500 million customers and their daily activity is spread across 20+ different systems. Additionally, it's not just customer data but their purchase and browsing history as well. Making sense of terabytes of data becomes a colossal task.

In this talk, I will speak about how we took this huge set of data from multiple sources, joined user history across platforms, sanitized the data, and published a single source with a bird’s-eye view. I will also talk about how we made it easy for users to create segments on top of this data and target their audience better.


  • Problem statement - Segmenting 500 million users using data from 20+ different sources
  • Generating the customer data
  • Joining the customer data from multiple sources
  • Data sanitization and reliability checks
  • Publishing the data for easy use
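
The steps in this outline can be illustrated with a minimal sketch in plain Python. It is not the Airflow + Hive pipeline covered in the talk; it only shows the shape of the work on a toy scale: sanitize records from each source, join them on a shared customer id, and select a segment from the joined profiles. The source names, field names, and the helper functions `sanitize`, `build_profiles`, and `segment` are all hypothetical and chosen purely for illustration.

```python
# Hypothetical per-source extracts. The field names are illustrative
# assumptions, not the schema used in the actual system.
STORE_PURCHASES = [
    {"customer_id": "C1", "email": "A@Example.com ", "brand": "Apple"},
    {"customer_id": "C2", "email": "b@example.com", "brand": "Samsung"},
]
WEB_BROWSING = [
    {"customer_id": "C1", "email": "a@example.com", "brand": "Apple"},
    {"customer_id": "C3", "email": None, "brand": "Apple"},
    {"customer_id": "", "email": "x@example.com", "brand": "Apple"},  # no id
]


def sanitize(record):
    """Return a cleaned copy of the record, or None if it has no customer id."""
    customer_id = (record.get("customer_id") or "").strip()
    if not customer_id:
        return None
    email = record.get("email")
    email = email.strip().lower() if email else None
    brand = (record.get("brand") or "").strip().lower()
    return {"customer_id": customer_id, "email": email, "brand": brand}


def build_profiles(sources):
    """Join records from all sources into one profile per customer id.

    `sources` maps a source name to a list of raw records.
    """
    profiles = {}
    for source_name, records in sources.items():
        for raw in records:
            rec = sanitize(raw)
            if rec is None:
                continue  # drop records that cannot be joined
            profile = profiles.setdefault(
                rec["customer_id"],
                {"email": None, "brands": set(), "sources": set()},
            )
            if profile["email"] is None and rec["email"]:
                profile["email"] = rec["email"]
            if rec["brand"]:
                profile["brands"].add(rec["brand"])
            profile["sources"].add(source_name)
    return profiles


def segment(profiles, predicate):
    """Return the sorted customer ids whose profile satisfies the predicate."""
    return sorted(cid for cid, p in profiles.items() if predicate(p))


profiles = build_profiles({"store": STORE_PURCHASES, "web": WEB_BROWSING})
apple_fans = segment(profiles, lambda p: "apple" in p["brands"])
print(apple_fans)
```

At 500 million customers, each of these functions would presumably become a Hive query over partitioned tables, with Airflow scheduling them in dependency order; that mapping is an assumption drawn from the talk title, not a description of the real system.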

Speaker bio

I, Soumya Shukla, have been a software developer for 6+ years. I have worked for Amazon and I’m now working as a senior software developer at WalmartLabs, India.




  • Hari C M (@haricm) 2 years ago

    In your proposal, you have described the problem in detail, but you haven’t properly touched upon how you solved it. I can see you mentioning Airflow and Hive in the title, but they are not mentioned anywhere else. Please update your proposal with more details.

  • Zainab Bawa (@zainabbawa) Crew 2 years ago

    Soumya, without slides and preview video, we won’t evaluate this proposal.

    • Zainab Bawa (@zainabbawa) Crew 2 years ago

      Thanks for the slides. Awaiting preview video.

  • Zainab Bawa (@zainabbawa) Crew 2 years ago

    The slides mention why you used Airflow. What other options did you consider apart from Airflow? How did you compare these options? Why did you finally choose Airflow over other options?

  • Venkata Pingali (@pingali) 2 years ago

    The talk could be interesting to all of us, but it could be about more than using Airflow to schedule Hive jobs. A few aspects that would make it interesting are:

    1. How does the system handle recovery from inconsistent and partial data from each of the sources?
    2. How does it handle full and incremental computation, and failures in the computation?
    3. How is it able to standardize the output schemas of intermediate stages and handle the evolution of these schemas?
