Building a customer identity graph platform on Hadoop that handles 20+ Billion vertices & 30+ Billion edges at Walmart
Submitted by Albin Kuruvilla (@albinkuruvilla) on Thursday, 13 June 2019
Session type: Full talk of 40 mins
Walmart generates millions of customer activity events per second across various channels and business platforms, in different customer identity spaces (such as cookies, email IDs, Walmart IDs, 3P IDs, etc.). Identifying and linking a user across channels helps in better understanding the customer persona and engaging them more effectively with Walmart.
Therefore, creating a customer identity and activity graph of connected identities was critical. In terms of scale, there are over 20 Billion identities to ingest into the graph pipeline, plus an incremental 200 Million new linkages every day.
In order to provide fresh linkage data as events happen, we wanted a lean and efficient algorithm to build connected components from linkages. We evaluated existing frameworks such as Spark GraphX and distributed graph databases, but ultimately built our own graph processing framework for scale and performance. In this talk, we would like to present the journey of building the graph platform. We incorporated optimization strategies such as bucketing and ID locality that brought down the processing time significantly. Currently we can run the graph build (from scratch) for 20+ Billion nodes and linkages in 6-8 hours, and incremental linkage updates in around 5 hours on a daily basis.
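The keywords below mention Union Find, the classic way to collapse pairwise linkages into connected components. As a minimal, single-machine sketch of that idea (the talk's actual engine distributes this over Hive tables and Spark; the identifier names here are purely illustrative):

```python
# Sketch: weighted union-find with path compression, the in-memory
# version of the connected-components step. Identifier strings like
# "cookie:c1" are hypothetical examples, not real data.

class UnionFind:
    def __init__(self):
        self.parent = {}  # identity -> parent identity
        self.size = {}    # root identity -> component size

    def find(self, x):
        # Register unseen identities lazily as singleton components.
        if x not in self.parent:
            self.parent[x] = x
            self.size[x] = 1
        # Locate the root, then compress the path behind us.
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return ra
        # Union by size: attach the smaller tree under the larger one.
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]
        return ra

# Linkages observed between identifiers (cookie, email, Walmart ID, ...).
linkages = [("cookie:c1", "email:e1"), ("email:e1", "wmt:u1"),
            ("cookie:c9", "email:e7")]
uf = UnionFind()
for a, b in linkages:
    uf.union(a, b)

# cookie:c1 and wmt:u1 land in the same component; cookie:c9 does not.
assert uf.find("cookie:c1") == uf.find("wmt:u1")
assert uf.find("cookie:c1") != uf.find("cookie:c9")
```

Incremental updates fall out naturally: each day's new linkages are further `union` calls against the existing component state, which is why a daily delta is much cheaper than a full rebuild.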
Keywords: Connected components, Graph, Union Find, GraphX, customer identity mapping
This talk will cover the following topics:
- Graph input - how we find linkages based on co-occurrence of customer identifiers, and the different types of identifiers in the graph
- Use cases that we are looking to solve
- Building blocks of our graph - key tables & traversal mechanism
- Steps involved in graph generation
- Initial solution based on Hive and Spark GraphX
- How we scaled the graph to support 4 Billion vertices
- Scaling the graph to support 20 Billion vertices
- Graph data quality challenges and how we handle them
- Why we call it a ‘graph platform’
- Next steps/Upcoming features for the graph platform
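The first bullet above (finding linkages from co-occurrence of identifiers) can be sketched as follows. This is a hypothetical illustration, not the actual Walmart pipeline: the event fields, identifier formats, and the `MIN_SUPPORT` threshold are all assumptions.

```python
# Sketch: derive candidate linkage edges from identifier co-occurrence.
# Identifiers that appear together in the same event (e.g. a logged-in
# session carrying both a cookie and a Walmart ID) become edge candidates.
from collections import Counter
from itertools import combinations

# Hypothetical customer activity events, keyed by identifier type.
events = [
    {"cookie": "c1", "email": "e1"},
    {"cookie": "c1", "email": "e1", "wmt_id": "u1"},
    {"cookie": "c2"},  # a lone identifier yields no edge
]

pair_counts = Counter()
for event in events:
    ids = sorted(f"{k}:{v}" for k, v in event.items())
    # Every pair of identifiers co-occurring in one event is a candidate.
    for a, b in combinations(ids, 2):
        pair_counts[(a, b)] += 1

# Keep pairs seen at least MIN_SUPPORT times to filter noisy co-occurrence.
MIN_SUPPORT = 1
linkages = [pair for pair, n in pair_counts.items() if n >= MIN_SUPPORT]
```

The resulting `linkages` list is exactly the edge input that a connected-components step would consume.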
For more information, please refer to https://www.slideshare.net/secret/1ax3wM60pgN3Ea
Basic understanding of the big data ecosystem and graph concepts
Albin is a senior data engineer at WalmartLabs and has been working on customer identity graph projects for more than a year. Along with his team members, he developed a high-scale graph processing engine using Hive tables as the storage layer and Spark as the processing engine. He is interested in solving complex problems and working with huge datasets.