Managing Infrastructure for Machine Learning Platform at Walmart scale - Using Kubernetes as the backbone
Submitted by Ravishankar Suribabu (@ravishankarks) on Wednesday, 27 March 2019
Session type: Full talk of 40 mins
One of the most critical challenges in bringing Machine Learning into practice is avoiding the technical debt traps that data science teams deal with in their day-to-day jobs. The Machine Learning Platform at Walmart has a single agenda: to make it easy for data scientists to use the company’s data to train and build new ML models at scale, and to make the “single click” deployment experience seamless. However, this experience is possible only with a robust infrastructure back-end for the platform.
I would like to share the learnings from setting up the infrastructure back-end for the Machine Learning Platform at Walmart. I will elaborate primarily on how we have used Kubernetes as the container management solution for the platform. The talk will cover key features such as dynamic scaling in Kubernetes, going hybrid-cloud with Kubernetes, managing very large Kubernetes clusters, managing security, resource scheduling and priority, the CI/CD Kubernetes deployment pipeline, supporting heterogeneous VMs (both CPU and GPU) in a single Kubernetes cluster, and Kubernetes monitoring.
I will also compare our container management solution with Amazon EKS, Google GKE and Azure AKS, and list the challenges.
This talk reflects our journey over the past 14 months, starting from a small infrastructure setup on a private cloud and going hybrid with 4000+ cores in use for ML workloads, while keeping the various DevOps and infrastructure aspects abstracted from data scientists and our platform users.
Brief Content Flow
Introduction [3 mins]
A short introduction to the topics to be covered. Beginning with an introduction to the Machine Learning Platform, this section sets the base for discussing the infrastructure needs in later slides.
Infrastructure needs for Machine Learning Platform - Challenges [5 mins]
We will elaborate on the infrastructure needed to support such a platform.
Deep-dive into platform infrastructure layers [8 mins]
We will go through a deep-dive into the infrastructure layers of the platform. We will look at the various tools we used, and the choices we had.
Infrastructure Tech stack for container management
Learnings at Walmart scale [12 mins]
Here, we will extend the previous discussion and look in detail at how the entire infrastructure back-end works on Kubernetes. We will look at the challenges/requirements below and how we addressed them.
Topics covered: important aspects of our learnings from the Kubernetes setup will be discussed:
managing very large Kubernetes clusters,
billing our tenants/users,
supporting heterogeneous workloads (both CPU/GPU), and
Lessons Learnt [5 mins]
We will also discuss some interesting issues we ran into, and how leaving them unresolved for too long can mean trouble in waiting.
Conclusion [3 mins]
We will summarize the points covered above and present a few concluding remarks.
Overall Q&A [5 mins]
Reserved for Q&A.
Knowledge of the Kubernetes and Docker technical stack is a must.
Aimed at machine learning and DevOps enthusiasts who wish to get started with understanding infrastructure for building platforms at scale with Kubernetes.
RaviShankar is a member of the Machine Learning Platform team at Walmart. He has ~14 years of experience in IT. He completed his master's at BITS Pilani and the Executive Management Programme (EGMP) course at IIM, Bangalore. Before joining Walmart, he worked with IBM Labs and Yahoo. He has rich experience working at various levels of the application stack. In his current portfolio, Ravi manages the end-to-end infrastructure of the Machine Learning platform.