The Fifth Elephant 2018

The seventh edition of India's best data conference

Segmenting 500 million users using Airflow + Hive

Submitted by Soumya Shukla (@soumyashukla22) on Saturday, 31 March 2018

Technical level

Intermediate

Section

Crisp talk

Status

Confirmed & Scheduled

Vote on this proposal

Total votes: +4

Abstract

Walmart is the largest retail company in the US, with both an online and an offline presence. It reaches millions of users through every possible channel: physical stores, an ecommerce website, the members-only Sam's Club, and jet.com, to name a few.

As a marketing team, you need one holistic platform covering all customers so that you can reach them with the latest and best offers. You need all the information you can get about a user and their journey across all your platforms. If you want to promote a new iPhone, you need to reach everyone associated with Walmart who has shown an active interest in Apple products, on any of these platforms.

This becomes a gigantic problem when there are 500 million customers whose daily activities are spread across 20+ different systems. Moreover, it is not just customer data but their purchase and browsing history as well. Making sense of terabytes of data becomes a colossal task.

In this talk, I will describe how we took this humongous data set from multiple sources, joined user history across platforms, sanitized the data, and published a single source offering a bird's-eye view. I will also talk about how we made it easy for users to create segments on top of this data and target their audience better.
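The join step described above can be sketched in miniature. This is a hypothetical illustration, not Walmart's actual code: in the real pipeline each stage would be a Hive job orchestrated by an Airflow task, but the logic is shown in plain Python for clarity, and all names here are invented.

```python
# Hypothetical sketch: joining per-platform user activity into one profile,
# keyed by a common customer ID.

def join_user_history(*sources):
    """Merge activity records from several platforms into one view per user."""
    profiles = {}
    for source in sources:
        for record in source:
            user = profiles.setdefault(record["customer_id"], {"events": []})
            # Keep everything except the join key as event attributes.
            user["events"].append(
                {k: v for k, v in record.items() if k != "customer_id"}
            )
    return profiles

# Toy records standing in for two of the 20+ source systems.
store = [{"customer_id": "c1", "platform": "store", "item": "iPhone case"}]
web = [{"customer_id": "c1", "platform": "web", "item": "iPhone 8"},
       {"customer_id": "c2", "platform": "web", "item": "TV"}]

profiles = join_user_history(store, web)
```

Here `c1`'s store and web activity land in a single profile, which is the shape a segmentation query (e.g. "active interest in Apple products") would run against.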

Outline

  • Problem statement - segmenting 500 million users using data from 20+ different sources
  • Generating the customer data
  • Joining the customer data from multiple sources
  • Data sanitization and reliability checks
  • Publishing the data for easy use
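The sanitization and reliability step in the outline could, for example, include checks like the following. This is a minimal sketch with invented rules (drop rows missing a customer ID, deduplicate exact repeats, refuse to publish if too many rows were dropped), not the actual checks used in the pipeline.

```python
# Hypothetical sanitization / reliability checks run before publishing.

def sanitize(records, min_retention=0.5):
    """Return cleaned records; fail loudly if the data shrinks suspiciously."""
    seen, clean = set(), []
    for r in records:
        key = tuple(sorted(r.items()))
        if r.get("customer_id") and key not in seen:
            seen.add(key)
            clean.append(r)
    # Reliability check: if less than min_retention of the input survives,
    # something is likely wrong upstream, so refuse to publish.
    if records and len(clean) / len(records) < min_retention:
        raise ValueError("too many rows dropped; refusing to publish")
    return clean

raw = [
    {"customer_id": "c1", "item": "iPhone 8"},
    {"customer_id": "c1", "item": "iPhone 8"},  # exact duplicate
    {"customer_id": None, "item": "TV"},        # missing customer ID
    {"customer_id": "c2", "item": "TV"},
]
clean = sanitize(raw)
```

Failing the publish step outright, rather than silently emitting a shrunken table, is one simple way to make downstream segment counts trustworthy.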

Speaker bio

I, Soumya Shukla, have been a software developer for 6+ years. I previously worked at Amazon and am now a senior software developer at WalmartLabs, India.

Slides

https://drive.google.com/file/d/1Engmj9ZY0B8lPNuF-6HZb_uAR5QUpsHA/view?usp=sharing

Preview video

https://www.youtube.com/watch?v=cIzKwQmKW8I

Comments

  • Hari C M (@haricm), Reviewer, 8 months ago

    In your proposal, you have described the problem in detail, but you haven't properly touched upon how you solved it. I can see you mention Airflow and Hive in the title, but they aren't mentioned anywhere else. Please update your proposal with more details.

  • Zainab Bawa (@zainabbawa), Reviewer, 8 months ago

    Soumya, without slides and preview video, we won’t evaluate this proposal.

    • Zainab Bawa (@zainabbawa), Reviewer, 7 months ago

      Thanks for the slides. Awaiting preview video.

  • Zainab Bawa (@zainabbawa), Reviewer, 7 months ago

    The slides mention why you used Airflow. What other options did you consider apart from Airflow? How did you compare these options? Why did you finally choose Airflow over other options?

  • Venkata Pingali (@venkatapingali), 7 months ago

    The talk could be interesting to all of us, but it could be more than using Airflow to schedule Hive jobs. A few aspects that would make it interesting are:

    1. How does the system handle recovery from inconsistent and partial data from each of the sources?
    2. How does it handle full and incremental computation, and failures in the computation?
    3. How is it able to standardize output schemas of intermediate stages, and how does it handle the evolution of these schemas?
