The Fifth Elephant 2019

Gathering of 1000+ practitioners from the data ecosystem


Building a customer identity graph platform on Hadoop that handles 20+ Billion vertices & 30+ Billion edges at Walmart

Submitted by Albin Kuruvilla (@albinkuruvilla) on Thursday, 13 June 2019



Session type: Full talk of 40 mins

Abstract

Walmart generates millions of customer activity events per second through various channels and business platforms, across different customer identity spaces (such as cookies, email IDs, Walmart IDs, third-party IDs, etc.). Identifying and linking a user across channels helps us better understand the customer persona and engage them better with Walmart.

Creating a customer identity and activity graph of connected identities was therefore critical. In terms of scale, there are 20+ billion identities to ingest into the graph pipeline, and an incremental 200 million new linkages arrive every day.

In order to provide fresh linkage data as events happen, we wanted a lean and efficient algorithm to build connected components from linkages. We evaluated existing frameworks such as Spark GraphX and distributed graph databases, but ultimately built our own graph processing framework for scale and performance. In this talk, we would like to present the journey of building the graph platform. We incorporated optimization strategies such as bucketing and ID locality that brought down the processing time significantly. Currently we can run a full graph build (from scratch) for 20+ billion nodes and linkages in 6-8 hours, and daily incremental linkage updates in around 5 hours.
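The abstract names the core technique only via its keywords (connected components, union find), so as a rough single-machine illustration only: connected components over linkage pairs can be built with a union-find (disjoint set) structure. The identifier names below are made up, and the real system described in the talk runs distributed over Hive tables and Spark, not in memory.

```python
class UnionFind:
    """Disjoint-set with path compression and union by size."""

    def __init__(self):
        self.parent = {}
        self.size = {}

    def find(self, x):
        # Register an unseen identifier as its own component.
        if x not in self.parent:
            self.parent[x] = x
            self.size[x] = 1
        # Walk up to the root, then compress the path.
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        # Attach the smaller component under the larger one.
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]


# Linkages observed across channels (hypothetical identifiers).
linkages = [
    ("cookie:42", "email:a@x.com"),
    ("email:a@x.com", "wmt:1001"),
    ("cookie:99", "wmt:2002"),
]

uf = UnionFind()
for a, b in linkages:
    uf.union(a, b)

# cookie:42 and wmt:1001 end up in the same identity component.
print(uf.find("cookie:42") == uf.find("wmt:1001"))  # True
print(uf.find("cookie:42") == uf.find("cookie:99"))  # False
```

At Walmart's stated scale the interesting work is not this algorithm but distributing it, which is where the bucketing and ID-locality optimizations mentioned above come in.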

Keywords: Connected components, Graph, Union Find, GraphX, customer identity mapping

Outline

This talk will cover the following topics:

  1. Graph input - how we find linkages based on co-occurrence of customer identifiers, and the different types of identifiers in the graph
  2. Use cases that we are looking to solve
  3. Building blocks of our graph - key tables & traversal mechanism
  4. Steps involved in graph generation
  5. Initial solution based on Hive and Spark GraphX
  6. How we scaled the graph to support 4 billion vertices
  7. Scaling the graph to support 20 billion vertices
  8. Graph data quality challenges and how we handle them
  9. Why we call it a ‘graph platform’
  10. Next steps/Upcoming features for the graph platform
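The co-occurrence idea in item 1 can be sketched with a toy example (the event shapes and identifier names here are hypothetical, not Walmart's actual schema): identifiers that appear together in the same activity event become candidate linkage edges for the identity graph.

```python
from itertools import combinations

# Hypothetical activity events, each carrying the customer
# identifiers observed together in that event.
events = [
    {"cookie:42", "email:a@x.com"},
    {"email:a@x.com", "wmt:1001", "3p:77"},
]

# Co-occurrence linkage: every pair of identifiers seen in the
# same event becomes a candidate edge (normalized and deduplicated).
edges = sorted(
    {tuple(sorted(pair)) for ev in events for pair in combinations(ev, 2)}
)
print(len(edges))  # number of candidate linkage edges
```

Because one identifier (here `email:a@x.com`) recurs across events, these pairwise edges chain identities together, which is what the connected-components step then resolves.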

For more information, please refer to https://www.slideshare.net/secret/1ax3wM60pgN3Ea

Requirements

Basic understanding of the big data ecosystem and graphs

Speaker bio

Albin is a senior data engineer at WalmartLabs and has been working on customer identity graph projects for more than a year. Along with his team members, he developed a high-scale graph processing engine using Hive tables as the storage layer and Spark as the processing engine. He is interested in solving complex problems and working with huge datasets.

Links

Slides

https://www.slideshare.net/secret/1ax3wM60pgN3Ea

Preview video

https://www.youtube.com/watch?v=8NyZ47FeTW8

Comments

  • jaya lekshmi (@jayalekshmi) 5 months ago

    very good proposal albin

  • aby mani (@abymani) 5 months ago

    👍 good job buddy !!

  • Abhishek Balaji (@booleanbalaji) Reviewer 4 months ago

    Hi Albin,

    Thank you for submitting a proposal. We need to see more detailed slides to evaluate your proposal. Your slides must cover the following:

    • Problem statement/context, which the audience can relate to and understand. The problem statement has to be a problem (based on this context) that can be generalized for all.
    • What were the tools/frameworks available in the market to solve this problem? How did you evaluate these, and what metrics did you use for the evaluation? Why did you pick the option that you did?
    • Explain how the situation was before the solution you picked/built and how it changed after implementing the solution you picked and built? Show before-after scenario comparisons & metrics.
    • What compromises/trade-offs did you have to make in this process?
    • What is the one takeaway that you want participants to go back with at the end of this talk? What is it that participants should learn/be cautious about when solving similar problems?

    We need your updated slides and preview video by Jun 27, 2019 to evaluate your proposal. If we do not receive an update, we’d be moving your proposal for evaluation under a future event.

  • Albin Kuruvilla (@albinkuruvilla) Proposer 4 months ago

    Hi Abhishek,
    Thanks for the review comments. I will work on these and submit the updated slides & preview video within a couple of days.
