The Fifth Elephant 2019

The eighth edition of India's best data conference


Managing Infrastructure for Machine Learning Platform at Walmart scale - Using Kubernetes as the backbone

Submitted by Ravishankar Suribabu (@ravishankarks) on Wednesday, 27 March 2019


Session type: Full talk of 40 mins


One of the most critical challenges in bringing Machine Learning into practice is avoiding the technical debt traps that data science teams end up dealing with in their day-to-day jobs. The Machine Learning Platform at Walmart has a single agenda: to make it easy for data scientists to use the company’s data to train and build new ML models at scale, and to make the “single click” deployment experience seamless. However, this experience is possible only with a robust infrastructure back-end for the platform.

I would like to share the learnings from building the infrastructure back-end for the Machine Learning Platform at Walmart. I will elaborate primarily on how we have used Kubernetes as the container management solution for the platform. Key topics to be discussed: dynamic scaling in Kubernetes, going hybrid-cloud with Kubernetes, managing very large Kubernetes clusters, managing security, resource scheduling and priority, the CI/CD Kubernetes deployment pipeline, supporting heterogeneous VMs (both CPU and GPU) in a single Kubernetes cluster, and Kubernetes monitoring.

I will also compare our container management solution with Amazon EKS, Google GKE, and Azure AKS, and list the challenges.

This talk reflects our journey over the past 14 months, from a small infrastructure setup on a private cloud to a hybrid setup with 4000+ cores in use for ML workloads, while keeping the various DevOps and infrastructure aspects abstracted from data scientists and our platform users.


Brief Content Flow

Introduction [3 mins]
A short introduction to the topics to be covered. Beginning with an introduction to the Machine Learning Platform, this section sets the base for discussing the infrastructure needs in later slides.

Infrastructure needs for Machine Learning Platform - Challenges [5 mins]
We will elaborate on the infrastructure needed to support such a platform.

Deep-dive into platform infrastructure layers [8 mins]
We will go through a deep-dive into the infrastructure layers of the platform. We will look at the various tools we used, and the choices we had.
Topics covered:
Infrastructure Tech stack for container management

Learnings at Walmart scale [12 mins]
Here, we will extend the previous discussion to cover in detail how the entire infrastructure back-end works on Kubernetes. We will look at the challenges and requirements below and how we addressed them.
Topics covered (important learnings from the Kubernetes setup):
managing very large Kubernetes clusters,
dynamic scaling,
going hybrid-cloud,
managing security,
billing our tenants/users,
supporting heterogeneous workloads (both cpu/gpu), and

Lessons Learnt [5 mins]
Here, we discuss some interesting issues we worked through, and how not resolving them in time can mean trouble later.

Conclusion [3 mins]
A summary of the points covered above, followed by a few concluding remarks.

Overall Q&A [5 mins]
Reserved for Q&A.


Knowledge of the Kubernetes and Docker technical stack is a must.
Aimed at machine learning and DevOps enthusiasts who want to understand infrastructure for building platforms at scale with Kubernetes.

Speaker bio

RaviShankar is a member of the Machine Learning Platform team at Walmart. He has ~14 years of experience in IT. He completed his master's at BITS Pilani and the Executive Management Programme (EGMP) course at IIM, Bangalore. Before joining Walmart, he worked at IBM Labs and Yahoo. He has rich experience working at various levels of the application stack. In his current role, Ravi manages the end-to-end infrastructure of the Machine Learning Platform.




  • Anwesha Sarkar (@anweshaalt) Reviewer 2 months ago

    Thank you for submitting the proposal. Please submit your slides and preview video by 20th April (latest); it helps us close the review process.

    • Ravishankar Suribabu (@ravishankarks) Proposer 2 months ago

      Please find the video uploaded as well.
      Kindly let us know the next steps.

      • Ravishankar Suribabu (@ravishankarks) Proposer a month ago

        Kindly let us know when we can expect a response from you.

  • Zainab Bawa (@zainabbawa) Reviewer a month ago

    The proposal looks interesting, Ravishankar. Some questions:

    1. Why was there a need for Kubernetes as the container management solution for the platform? The context for this seems to be missing.
    2. Participants at The Fifth Elephant want to learn about general problems, rather than company-specific problems and solutions. Therefore, you either have to turn this talk around to share a war story about managing infrastructure for a machine learning platform at Walmart scale: what was the initial problem? What were the challenges you encountered? How did the platform evolve at different stages in the ML model’s evolution? How did teams adjust to the various iterations of the platform? This will shift the focus away from the story of why/how Kubernetes forms the backbone of this platform to taking participants through the journey of the conceptualization, design and architecture, and evolution of the platform itself, which they may find interesting.
    3. Or, you have to turn the story around to Kubernetes being the backbone to explain: what was the problem which led you to choose Kubernetes? What other solutions did you evaluate before finalizing on Kubernetes? Share implementation details. What tradeoffs did you have to make with Kubernetes? Are there use-cases and situations where making Kubernetes the backbone of the ML infrastructure will not work? How did the situation change – whether for the better or the worse – with this technical choice? Show us data about the before-after scenario: before adoption of Kubernetes and after adoption.

    We’ll need to see the revised slides by 27 May to make a decision on the proposal.

    • Ravishankar Suribabu (@ravishankarks) Proposer 23 days ago (edited 23 days ago)

      Thanks Zainab for the interesting thought-process. Makes sense.

      I have modified the content flow accordingly:
      Point 1: I have added content to slides 5-7 to debate the tech stack choices across a range of design considerations. As we go deeper into later slides, the options become a lot clearer.
      Points 2 and 3: Slides 9-17 are reordered to tell the evolution story of the ML Platform at Walmart: how we learned to use it better, and how our users' use-cases evolved.
      This also explains how we were able to keep the platform core robust while continuing to add and explore new use-cases.

      Hope to hear from you soon.

  • Ravishankar Suribabu (@ravishankarks) Proposer 20 days ago

    Hi Zainab,
    I would appreciate an update on this.

  • Ravishankar Suribabu (@ravishankarks) Proposer 13 days ago

    Kindly let us know when we can expect a response from you.

  • Ravishankar Suribabu (@ravishankarks) Proposer 10 days ago

    Hi Zainab,
    Kindly let us know when we can expect a response from you.

  • Ravishankar Suribabu (@ravishankarks) Proposer 6 days ago

    I would appreciate it if you could let us know of any updates here.

  • Abhishek Balaji (@booleanbalaji) Reviewer 2 days ago

    Thanks, we’ve moved this to evaluation and will let you know the next steps.
