Real time Machine Learning Inference Platform @ Zomato
Submitted by Nikunj Jain (@nikunj492) on Wednesday, 27 November 2019
Section: Full talk (40 mins) Category: Systems engineering Status: Confirmed & Scheduled
The main problem we faced at Zomato was that it took 1-2 months to take an ML model live. Data scientists and ML engineers at Zomato work on a variety of problems, such as predicting kitchen preparation time (the time a restaurant takes to prepare the food given the live order state of its kitchen), predicting rider assignment time (the time to assign a free rider to pick up the order given the real-time availability of riders), and personalised ranking of restaurants for a user. I will go into detail about the platform we built to cater to these use cases, which makes it easy for anyone to take a model live in less than a week.
- Learn about the main components of a real-time ML inference platform
- Build a production-ready real-time ML pipeline for low response time and high reliability at scale
- Build a platform where data scientists and engineers in the company can build and deploy ML models at scale using a standardised workflow and deployment process
Software Engineers, ML Engineers and DevOps Engineers
- Challenges and Problems at Zomato
- Requirements of the ML platform
- Overall Architecture of the ML Platform
- Case Study: Predicting Kitchen Preparation Time
- Real-time feature computation pipeline - why did we choose Flink?
- Platform for data scientists to develop and log their models independently - why did we choose MLflow?
- Platform for model deployment - why AWS SageMaker?
- Real-time Feature Store backed by Redis
- Non-real-time Feature Store backed by Cassandra
- ML Gateway to fetch features from the Feature Store and call SageMaker for inference
- Workflow to deploy a new model
- Future work
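To make the ML Gateway item in the outline above more concrete, here is a minimal sketch of the request path it describes: look up features for an entity in both feature stores, assemble them into a vector, and pass that vector to the model. This is an illustration under stated assumptions, not the platform's actual code. The class `MLGateway`, the parameter names, the feature names, and the toy model are all hypothetical; plain dictionaries stand in for the Redis-backed and Cassandra-backed stores, and a local function stands in for the call to a SageMaker endpoint.

```python
from typing import Callable, Dict, List, Mapping

# Hypothetical type: entity id -> {feature name -> value}.
FeatureStore = Mapping[str, Dict[str, float]]


class MLGateway:
    """Hypothetical gateway: merge features from two stores, then call a model."""

    def __init__(
        self,
        realtime_store: FeatureStore,  # stand-in for the Redis-backed store
        batch_store: FeatureStore,     # stand-in for the Cassandra-backed store
        invoke_model: Callable[[List[float]], float],  # stand-in for the endpoint call
        feature_order: List[str],      # order in which the model expects features
        default: float = 0.0,          # assumed fallback for missing features
    ) -> None:
        self.realtime_store = realtime_store
        self.batch_store = batch_store
        self.invoke_model = invoke_model
        self.feature_order = feature_order
        self.default = default

    def predict(self, entity_id: str) -> float:
        # Start from the slower-changing features, then let real-time
        # values override them when both stores define the same name.
        features = dict(self.batch_store.get(entity_id, {}))
        features.update(self.realtime_store.get(entity_id, {}))
        # Build the vector in the fixed order the model was trained on.
        vector = [features.get(name, self.default) for name in self.feature_order]
        return self.invoke_model(vector)


# Toy data and model, invented purely for illustration.
realtime = {"restaurant:42": {"live_orders": 7.0}}
batch = {"restaurant:42": {"avg_prep_minutes": 18.0}}


def toy_model(vector: List[float]) -> float:
    live_orders, avg_prep_minutes = vector
    return avg_prep_minutes + 1.5 * live_orders


gateway = MLGateway(realtime, batch, toy_model, ["live_orders", "avg_prep_minutes"])
kpt = gateway.predict("restaurant:42")
print(kpt)
```

Keeping the stores and the model call behind simple interfaces like this means the gateway logic can be tested without any running infrastructure; in a real deployment those stand-ins would be replaced by actual store clients and an endpoint invocation.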
I have been working with data scientists and ML engineers at Zomato for more than 3 years, using machine learning to solve various user-facing problems such as personalised ranking, predicting kitchen preparation time, and predicting rider assignment time. Being a software engineer at heart, I understand the problems involved in taking a complex real-time machine learning model live in production at scale. I have deployed all of the above models, which serve 100k rpm at peak.