Managing Infrastructure for Machine Learning Platform at Walmart scale - Using Kubernetes as the backbone

Jul 25–26, 2019: Thu 25, 09:15 AM – 05:45 PM IST; Fri 26, 09:20 AM – 05:30 PM IST

NIMHANS Convention Centre, Bengaluru

Submitted Mar 27, 2019

Session type: Full talk of 40 mins

One of the most critical challenges in bringing Machine Learning into practice is avoiding the technical debt traps that data science teams face in their day-to-day jobs. The Machine Learning Platform at Walmart has a single agenda: to make it easy for data scientists to use the company’s data to train and build new ML models at scale, and to make the “single click” deployment experience seamless. This experience is possible only with a robust infrastructure back-end for the platform.

I would like to share the learnings from setting up the infrastructure back-end for the Machine Learning Platform at Walmart. I will elaborate primarily on how we have used Kubernetes as the container management solution for the platform. Key aspects to be discussed include dynamic scaling in Kubernetes, going hybrid-cloud with Kubernetes, managing very large Kubernetes clusters, managing security, resource scheduling and priority, the CI/CD deployment pipeline for Kubernetes, supporting heterogeneous VMs (both CPU and GPU) in a single Kubernetes cluster, and Kubernetes monitoring.
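To make "dynamic scaling" concrete, here is a minimal sketch of the replica calculation documented for the Kubernetes Horizontal Pod Autoscaler (desired = ceil(current replicas × current metric / target metric), with a default tolerance of 0.1). The proposal does not say which autoscaling mechanism the platform uses, so treat this as an illustration of the general idea only. The function name and parameters are hypothetical.

```python
import math


def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float,
                     tolerance: float = 0.1) -> int:
    """Suggest a replica count using the documented HPA scaling rule.

    Assumption: a single metric (e.g. average CPU utilisation in percent).
    Min/max replica bounds and stabilisation windows are left out.
    """
    if current_replicas <= 0 or target_metric <= 0:
        raise ValueError("current_replicas and target_metric must be positive")

    ratio = current_metric / target_metric
    # Inside the tolerance band, keep the current size to avoid flapping.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


if __name__ == "__main__":
    # 4 pods at 90% average CPU with a 60% target.
    print(desired_replicas(4, 90, 60))
```

For example, 4 replicas running at 90% utilisation against a 60% target scale out, while a reading of 62% stays within tolerance and leaves the replica count unchanged.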

I will also evaluate our container management solution against Amazon EKS, Google GKE, and Azure AKS, and list the challenges.

This talk reflects our journey over the past 14 months, from a small infrastructure setup on a private cloud to a hybrid setup with 4000+ cores in use for ML workloads, while keeping the various DevOps and infrastructure aspects abstracted from data scientists and our platform users.

Outline

Brief Content Flow

Introduction [3 mins]
A short introduction to the topics to be covered. Beginning with an introduction to the Machine Learning Platform, this section sets the base for discussing the infrastructure needs in later slides.

Infrastructure needs for Machine Learning Platform - Challenges [5 mins]
We will elaborate on the infrastructure needed to support such a platform.

Deep-dive into platform infrastructure layers [8 mins]
We will go through a deep-dive into the infrastructure layers of the platform. We will look at the various tools we used, and the choices we had.
Topics covered:
Infrastructure Tech stack for container management

Learnings at Walmart scale [12 mins]
Here, we extend the previous discussion to cover in detail how the entire infrastructure back-end works on Kubernetes. We will look at the challenges and requirements below and how we addressed them.
Topics covered (key learnings from our Kubernetes setup):
managing very large Kubernetes clusters,
dynamic scaling,
going hybrid-cloud,
managing security,
billing our tenants/users,
supporting heterogeneous workloads (both cpu/gpu), and
monitoring.
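As an illustration of the heterogeneous CPU/GPU item above, the sketch below shows the resource-fit idea behind placing workloads in a mixed cluster: a pod can only land on nodes with enough free capacity for every resource it requests. In Kubernetes, GPUs are typically exposed as an extended resource such as `nvidia.com/gpu`. This is not the real scheduler and not the platform's implementation. The `Node` class, node names, and capacities are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Node:
    """A node with its currently free resources (hypothetical model)."""
    name: str
    free: Dict[str, float]  # resource name -> free amount


def feasible_nodes(nodes: List[Node], requests: Dict[str, float]) -> List[str]:
    """Return names of nodes that can satisfy every requested resource.

    A resource missing from a node counts as zero, so a GPU request
    never matches a CPU-only node.
    """
    return [
        node.name
        for node in nodes
        if all(node.free.get(res, 0) >= qty for res, qty in requests.items())
    ]


# Illustrative cluster: one CPU-only node and one GPU node.
CLUSTER = [
    Node("cpu-node-1", {"cpu": 16}),
    Node("gpu-node-1", {"cpu": 8, "nvidia.com/gpu": 2}),
]

if __name__ == "__main__":
    # A training job that needs 4 cores and 1 GPU.
    print(feasible_nodes(CLUSTER, {"cpu": 4, "nvidia.com/gpu": 1}))
```

A real cluster would layer node selectors, taints, tolerations, and priorities on top of this basic fit check.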

Lessons Learnt [5 mins]
We discuss some interesting issues we ran into, and how leaving them unresolved can be trouble in waiting.

Conclusion [3 mins]
A summary of the points covered above, followed by a few concluding remarks.

Overall Q&A [5 mins]
Reserved for Q&A.

Requirements

Knowledge of the Kubernetes and Docker technical stack is a must.
Aimed at machine learning and DevOps enthusiasts who wish to get started with understanding infrastructure for building platforms at scale with Kubernetes.

Speaker bio

RaviShankar is a member of the Machine Learning Platform team at Walmart. He has ~14 years of experience in IT. He completed his master's at BITS Pilani and the Executive Management Programme (EGMP) at IIM, Bangalore. Before joining Walmart, he worked with IBM Labs and Yahoo. He has rich experience working at various levels of the application stack. In his current role, Ravi manages the end-to-end infrastructure of the Machine Learning Platform.

Slides

https://www.slideshare.net/ravishankarks71/managing-infrastructure-for-machine-learning-platform-at-walmart-scale-using-kubernetes-as-the-backbone

The Fifth Elephant 2019