Rootconf Delhi edition

On network engineering, infrastructure automation and DevOps



Nikunj Jain

@nikunj492

Real time Machine Learning Inference Platform @ Zomato

Submitted Nov 27, 2019

The main problem we were facing at Zomato was that it took 1-2 months to take an ML model live. Data scientists and ML engineers at Zomato work on a variety of problems, such as predicting kitchen preparation time (the time a restaurant takes to prepare the food, given the live order state of the kitchen), predicting rider assignment time (the time to assign a free rider to pick up the order, given real-time rider availability), personalised ranking of restaurants for a user, and so on. I will go into detail about the platform we built to cater to these use cases, which made it easy for anyone to take a model live in less than a week.
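The kitchen-preparation-time use case above hinges on features computed from a live event stream. As a rough, purely illustrative sketch of that idea (the real pipeline uses Flink; the window size, function and variable names here are invented), a per-restaurant rolling-average feature might look like:

```python
from collections import defaultdict, deque

# Hypothetical stand-in for a streaming feature job: keep a sliding window
# of recent kitchen-preparation observations per restaurant and emit a
# rolling-average feature on each new event.
WINDOW_SIZE = 5  # keep the last 5 observations per restaurant (invented)

windows = defaultdict(lambda: deque(maxlen=WINDOW_SIZE))

def update_prep_time_feature(restaurant_id, prep_time_minutes):
    """Ingest one event and return the updated rolling-average feature."""
    window = windows[restaurant_id]
    window.append(prep_time_minutes)  # deque evicts the oldest entry itself
    return sum(window) / len(window)

# Example: three orders observed for the same restaurant
update_prep_time_feature("r1", 10)
update_prep_time_feature("r1", 20)
avg = update_prep_time_feature("r1", 30)  # rolling average over {10, 20, 30}
```

A real Flink job would express the same logic with keyed state and event-time windows, and would also have to handle late and out-of-order events, which this sketch ignores.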

Key takeaways:

  • Learn about the main components of a real-time ML inference platform
  • Build a production-ready real-time ML pipeline for low response time and high reliability at scale
  • Build a platform where data scientists and engineers across the company can build and deploy ML models at scale using a standardised workflow and deployment process

Target Audience:

Software Engineers, ML Engineers and DevOps Engineers

Outline

  • Challenges and Problems at Zomato
  • Requirements of the ML platform
  • Overall Architecture of the ML Platform
  • Case study: Predicting kitchen preparation time
  • Real-time feature computation pipeline - why did we choose Flink?
  • Platform for data scientists to develop and log their models independently - why did we choose MLFlow?
  • Platform for model deployment - why AWS SageMaker?
  • Real-time feature store backed by Redis
  • Non-real-time feature store backed by Cassandra
  • ML Gateway to fetch features from the feature store and call SageMaker for inference
  • Workflow to deploy a new model
  • Future work
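To make the gateway step in the outline concrete, here is a minimal, hypothetical sketch of the flow: features are merged from a real-time store (Redis in the talk) and a non-real-time store (Cassandra), then passed to a model endpoint (SageMaker). Plain dicts and a toy linear model stand in for all three; every name and number below is invented for illustration.

```python
# Dict stand-ins for the two feature stores described in the outline.
realtime_store = {"restaurant:r1:avg_prep_time": 18.5}   # Redis stand-in
batch_store = {"restaurant:r1:historical_rating": 4.2}   # Cassandra stand-in

def fetch_features(restaurant_id):
    """Merge real-time and non-real-time features into one feature vector."""
    return {
        "avg_prep_time": realtime_store[f"restaurant:{restaurant_id}:avg_prep_time"],
        "historical_rating": batch_store[f"restaurant:{restaurant_id}:historical_rating"],
    }

def invoke_model(features):
    """Stub for the SageMaker endpoint call; here, a toy linear model."""
    return 0.5 * features["avg_prep_time"] + 2.0 * features["historical_rating"]

def predict_prep_time(restaurant_id):
    """The gateway path: fetch features, then call the model endpoint."""
    return invoke_model(fetch_features(restaurant_id))

prediction = predict_prep_time("r1")
```

The design point this illustrates is that the caller only supplies an entity id; the gateway owns feature lookup and endpoint invocation, so data scientists can swap models without touching client code.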

Speaker bio

I have been working with data scientists and ML engineers at Zomato for more than 3 years, solving various user-facing problems with machine learning, such as personalised ranking, predicting kitchen preparation time and predicting rider assignment time. Being a software engineer at heart, I understand the problems faced in taking any complex real-time machine learning model live in production at scale. I have deployed all of the above models, which serve 100k rpm at peak time.

Comments


  • Zainab Bawa

    @zainabbawa Editor & Promoter

    Thank you for the proposal, Nikunj. The current proposal reads as heavily geared towards machine learning engineers and gives the talk an overall ML feel, rather than a talk for operations engineers and software developers. It will help to emphasize why DevOps and software programmers should listen to this talk.

    A couple of questions came up during the review:

    1. What was the rationale for the overall architecture of Zomato's ML Platform? Which factors and decisions influence this choice of architecture?
    2. Why were Flink, MLFlow and SageMaker chosen as the building blocks for the ML platform? Which other tools/building blocks did you consider when building the data inference platform?
    3. Were there any cost, performance and other optimizations or trade-offs that resulted from choosing Flink, SageMaker and MLFlow?
    4. When you mention, "Build a platform where data scientists and engineers in the company can build and deploy ML models at scale using standardise workflow and deployment" -- were there any anti-patterns that you can share with developers trying to use this approach? What was compromised as a result of this approach? What adjustments did your team have to make to work with this platform?

    Look forward to your responses.

    Posted 5 years ago
    • Nikunj Jain

      @nikunj492 Submitter

      Actually, the talk is not about machine learning algorithms. It is aimed at DevOps engineers and software programmers who want to design reliable systems at scale that cater to the real-time feature engineering and inference needs of machine learning in production.

      Answers to the questions:

      1. The main factors that went into the design were:
        • ease of creation and deployment of models by data scientists, with minimal support from the engineering team
        • the ability to process real-time events and produce features with minimal (less than 1 minute) lag
        • less than 100 ms p99 latency for inference
      2. The reasons for choosing them were:
        • We needed an open-source platform that manages the entire lifecycle of machine learning models. The two main projects in this space are MLFlow and KubeFlow. KubeFlow is built entirely on Kubernetes, and since nobody at the company had worked with Kubernetes, it would have required a lot of extra effort to set up compared to MLFlow, which is very simple to set up. This was the main reason we picked MLFlow.
        • SageMaker was chosen for its ease of deployment, built-in auto-scaling, logging and monitoring, and its ability to roll out new models incrementally. We also evaluated Elastic Beanstalk and ECS, but both require heavy work from the engineering team. Moreover, MLFlow provides direct integration with SageMaker, which was a big plus for us.
        • We needed a real-time streaming platform to calculate features on the fly. We evaluated Spark Streaming, Kafka Streams and Flink. The main reasons for choosing Flink were its strong community, ease of setup and job-level isolation.
      3. One anti-pattern we saw was that every team was developing its own solution to take models live: one team would start using Spark and Cassandra, another plain Python. Developers were reinventing the wheel every time. The main adjustment we had to make to work with this platform was a lot of delegation and coordination among multiple teams to take a model live.
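      As a side note on the "less than 100 ms p99" requirement above, a percentile check over recorded inference latencies can be done with Python's standard library alone; the sample numbers here are made up for illustration.

```python
import statistics

# Made-up batch of recorded inference latencies, in milliseconds.
latencies_ms = [12, 15, 18, 22, 25, 30, 35, 40, 55, 95] * 10

# statistics.quantiles with n=100 returns the 1st..99th percentile cut
# points; index 98 is the 99th percentile.
p99 = statistics.quantiles(latencies_ms, n=100)[98]

meets_slo = p99 < 100  # the target stated above
```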
      Posted 5 years ago