Rootconf Delhi edition

On network engineering, infrastructure automation and DevOps


Real time Machine Learning Inference Platform @ Zomato

Submitted by Nikunj Jain (@nikunj492) on Wednesday, 27 November 2019

Section: Full talk (40 mins) Category: Systems engineering Status: Confirmed & Scheduled



The main problem we were facing at Zomato was that it took 1-2 months to take an ML model live. Data scientists and ML engineers work on a variety of problems at Zomato, such as predicting kitchen preparation time (the time a restaurant takes to prepare the food, given the live order state of the kitchen), predicting rider assignment time (the time to assign a free rider to pick up the order, given the real-time availability of riders), personalised ranking of restaurants for a user, etc. I will go into detail about the platform we built to cater to these use cases, which made it very easy for anyone to take a model live in less than a week.

Key takeaways:

  • Learn about the main components of a real-time ML inference platform
  • Build a production-ready real-time ML pipeline for low response time and high reliability at scale
  • Build a platform where data scientists and engineers across the company can build and deploy ML models at scale using standardised workflows and deployments

Target Audience:

Software Engineers, ML Engineers and DevOps Engineers


  • Challenges and Problems at Zomato
  • Requirements of the ML Platform
  • Overall Architecture of the ML Platform
  • Case Study: Predicting Kitchen Preparation Time
  • Real-time feature computation pipeline - why did we choose Flink?
  • Platform for data scientists to develop and log their models independently - why did we choose MLFlow?
  • Platform for model deployment - why AWS SageMaker?
  • Real-time Feature Store backed by Redis
  • Non-real-time Feature Store backed by Cassandra
  • ML Gateway to fetch features from the Feature Stores and call SageMaker for inference
  • Workflow to deploy a new model
  • Future work
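To make the gateway flow in the outline concrete, here is a minimal sketch of an ML Gateway that assembles features from the two stores and calls the model. This is an illustrative assumption, not Zomato's actual code: plain dicts stand in for the Redis (real-time) and Cassandra (non-real-time) feature stores, and `invoke_model` stands in for the SageMaker endpoint call; all names and feature values are hypothetical.

```python
# Sketch of the ML Gateway: merge real-time and batch features for a
# key, then call the model endpoint. Dicts stand in for Redis/Cassandra
# and `invoke_model` stands in for a SageMaker InvokeEndpoint call.

# Real-time features, continuously updated by the streaming pipeline.
realtime_store = {
    "restaurant:42": {"open_orders": 7, "avg_prep_time_last_15m": 11.2},
}

# Non-real-time features, refreshed by batch jobs.
batch_store = {
    "restaurant:42": {"avg_prep_time_30d": 13.5, "cuisine_id": 3},
}

def invoke_model(features: dict) -> float:
    """Stand-in for the SageMaker endpoint; returns a dummy linear score."""
    return 0.5 * features["open_orders"] + 0.7 * features["avg_prep_time_last_15m"]

def predict_prep_time(restaurant_id: int) -> float:
    """Fetch features from both stores, merge, and run inference."""
    key = f"restaurant:{restaurant_id}"
    # Real-time features take precedence over stale batch values.
    features = {**batch_store.get(key, {}), **realtime_store.get(key, {})}
    return invoke_model(features)

print(round(predict_prep_time(42), 2))  # → 11.34
```

A real gateway would also handle missing features, timeouts, and fallbacks, which is where most of the reliability work in such a platform tends to live.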

Speaker bio

I have been working with data scientists and ML engineers at Zomato for more than 3 years, solving various user-facing problems such as personalised ranking, predicting kitchen preparation time and predicting rider assignment time using machine learning. Being a software engineer at heart, I understand the problems faced in taking any complex real-time machine learning model live in production at scale. I have deployed all the above models, serving 100k requests per minute at peak.



  •   Zainab Bawa (@zainabbawa) Reviewer 2 months ago

    Thank you for the proposal, Nikunj. The current proposal reads as heavily geared towards Machine Learning engineers, giving the talk an overall ML feel rather than one aimed at operations engineers and software developers. It will help to emphasize why DevOps engineers and software programmers should listen to this talk.

    A couple of questions came up during the review:

    1. What was the rationale for the overall architecture of Zomato’s ML Platform? Which factors and decisions influence this choice of architecture?
    2. Why were Flink, MLFlow and SageMaker chosen as the building blocks for the ML platform? Which other tools/building blocks did you consider when building the data inference platform?
    3. Were there any cost, performance and other optimizations or trade-offs that resulted from choosing Flink, SageMaker and MLFlow?
    4. When you mention, “Build a platform where data scientists and engineers in the company can build and deploy ML models at scale using standardise workflow and deployment” – were there any anti-patterns that you can share with developers trying to use this approach? What was compromised as a result of this approach? What adjustments did your team have to make to work with this platform?

    Look forward to your responses.

    •   Nikunj Jain (@nikunj492) Proposer 2 months ago

      Actually, the talk is not about machine learning algorithms. It is aimed at DevOps engineers and software programmers: how to design reliable systems at scale that cater to the real-time feature engineering and inference needs of machine learning in production.

      Answers to the questions:
      1. The main factors that went into the design were:
      * ease of creation and deployment of models by data scientists, with minimal support from the engineering team
      * the ability to process real-time events and produce features with minimal (less than one minute) lag
      * less than 100 ms p99 latency for inference
      2. The reasons for choosing them:
      * We needed an open-source platform that manages the entire lifecycle of machine learning models. There are two main projects in this space, MLFlow and KubeFlow. KubeFlow is entirely based on Kubernetes, and since no one in the company had worked with Kubernetes, it would have required a lot of extra effort to set up compared to MLFlow, which is very simple to set up. This was the main reason we picked MLFlow.
      * SageMaker was chosen for its ease of deployment, built-in auto scaling, logging and monitoring, and its ability to take new models live incrementally. We also evaluated Elastic Beanstalk and ECS, but both require heavy work from the engineering team. Plus, MLFlow provides direct integration with SageMaker, which was a big plus for us.
      * We needed a real-time streaming platform to compute features on the fly. We evaluated Spark Streaming, Kafka Streams and Flink. The main reasons for choosing Flink were its strong community, ease of setup and job-level isolation.
      3. One anti-pattern we saw was that every team tried to develop its own solution to take models live: one team would start using Spark and Cassandra, another would start using Python. Developers were reinventing the wheel every time. One main adjustment we had to make to work with this platform was a lot of delegation and coordination among multiple teams to take a model live.
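As a rough illustration of the on-the-fly feature computation discussed above, here is a plain-Python sketch of a keyed sliding-window aggregation of the kind a Flink job would run. The event schema, 15-minute window, and all names here are assumptions for illustration, not the production job.

```python
from collections import defaultdict, deque

# Illustrative stand-in for a Flink keyed sliding-window job: average
# kitchen preparation time per restaurant over the last 15 minutes of
# order events. Event times are in seconds.

WINDOW_SECONDS = 15 * 60

class PrepTimeFeature:
    def __init__(self):
        # restaurant_id -> deque of (event_time, prep_time_minutes)
        self.windows = defaultdict(deque)

    def on_event(self, restaurant_id, event_time, prep_time):
        window = self.windows[restaurant_id]
        window.append((event_time, prep_time))
        # Evict events that have fallen out of the 15-minute window.
        while window and window[0][0] < event_time - WINDOW_SECONDS:
            window.popleft()

    def feature(self, restaurant_id):
        window = self.windows[restaurant_id]
        if not window:
            return None
        return sum(p for _, p in window) / len(window)

agg = PrepTimeFeature()
agg.on_event(42, event_time=0, prep_time=10.0)
agg.on_event(42, event_time=600, prep_time=14.0)
agg.on_event(42, event_time=1200, prep_time=12.0)  # evicts the t=0 event
print(agg.feature(42))  # → 13.0
```

In production, Flink's keyed windows, checkpointing and watermarks handle the state management and late events that this sketch ignores, with the result written to the Redis-backed real-time feature store.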

  •   Anwesha Sarkar (@anweshaalt) Reviewer 2 months ago

    Hello Nikunj,

    I have the following questions/concerns for your proposal:

    • How do you think the learnings from the “problems at Zomato” will be helpful for people who are not in a similar kind of business? Especially since you use “Predicting Kitchen Preparation Time” as your case study, it is not clear how the solution will translate elsewhere.
    • Can you specify the key learnings and takeaways for the attendees?
    • Can you incorporate your replies to Zainab’s concerns into your proposal, to give the audience a clearer view of what to expect?

    Look forward to your reply.


    •   Anwesha Sarkar (@anweshaalt) Reviewer 2 months ago

      Hello Nikunj,

      Submit your response to the feedback by 4th December, so we can close the decision on the proposal.


    •   Nikunj Jain (@nikunj492) Proposer 2 months ago

      Hi Anwesha,

      Following are the replies:

      1. I think every other company that deals with deploying ML models faces the same problems we face at Zomato, and people can easily relate to them. Even people who are not deploying ML models can relate these problems to other, non-ML problems, as they are quite generic in nature. Also, I am not going to talk about things very specific to predicting kitchen preparation time; I am just taking it as an example so that people can understand things better in a practical context. Even a non-technical person would be able to relate to it.

      2. I have already specified the key takeaways in the proposal itself.

      3. I have made changes to the proposal in view of the above replies.
