Running ML Workflows using Airflow @ Walmart
One of the most critical challenges in bringing machine learning into practice is avoiding the technical-debt traps that data science teams run into in their day-to-day work. The Machine Learning Platform at Walmart has a single agenda: to make it easy for data scientists to use the company’s data to train and build new ML models at scale, and to make the deployment experience seamless.
The machine learning platform has many components: data connectors (for creating and managing data connections), notebooks (the development-time environment), workflows (to stitch the notebooks together), models (to save and load models), jobs (workflow batch and scheduled jobs, distributed TensorFlow and RAPIDS jobs) and deployments (R Shiny, Python, tensor serve).
In this talk, I will cover the workflow component, which is built on top of Apache Airflow and is one of the most used components in Walmart’s Machine Learning Platform. At the time of writing, 1000+ DAGs are running in production.
I would like to share what we learned from setting up the Airflow cluster in Kubernetes and from the workflow service written on top of it. From a workflow, we can execute various batch jobs, including distributed/non-distributed TensorFlow jobs, distributed/non-distributed RAPIDS jobs, Jupyter notebooks (Python/Scala/Spark), RStudio jobs, etc. Custom Airflow plugins give us the capability to launch these notebooks and jobs, and we have built support for launching them with parameters from a workflow. We have abstracted the complete workflow creation process by providing a GUI to create the workflow definition (DAG) and internally generating the Python code that creates the Airflow DAGs.
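To make the last point concrete, here is a minimal sketch of how a GUI-exported workflow definition could be rendered into Airflow DAG source code. It is not Walmart's implementation: the `WORKFLOW` schema and the `render_dag` helper are hypothetical, and the generated code assumes Airflow 2.4+ and uses `BashOperator` with the `papermill` CLI as a stand-in for the platform's custom notebook operators, which the abstract does not describe.

```python
# Hypothetical workflow definition, shaped like something a drag-and-drop
# designer might export. The schema is an assumption for illustration.
WORKFLOW = {
    "dag_id": "train_and_score",
    "tasks": [
        {"id": "prepare", "notebook": "prepare.ipynb", "upstream": []},
        {"id": "train", "notebook": "train.ipynb", "upstream": ["prepare"]},
        {"id": "score", "notebook": "score.ipynb", "upstream": ["train"]},
    ],
}

# Header of the generated file (assumes Airflow 2.4+ for the `schedule` argument).
HEADER = '''\
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(dag_id={dag_id!r}, start_date=datetime(2020, 1, 1), schedule=None) as dag:
'''


def render_dag(workflow: dict) -> str:
    """Render a workflow definition into the source text of an Airflow DAG file."""
    lines = [HEADER.format(dag_id=workflow["dag_id"])]

    # One operator per task; task ids become Python variable names.
    for task in workflow["tasks"]:
        if not task["id"].isidentifier():
            raise ValueError(f"task id {task['id']!r} is not a valid Python name")
        # papermill stands in for the platform's notebook launcher.
        cmd = f"papermill {task['notebook']} out/{task['notebook']}"
        lines.append(
            f"    {task['id']} = BashOperator("
            f"task_id={task['id']!r}, bash_command={cmd!r})\n"
        )

    # One dependency line per edge drawn in the designer.
    for task in workflow["tasks"]:
        for parent in task["upstream"]:
            lines.append(f"    {parent} >> {task['id']}\n")

    return "".join(lines)


if __name__ == "__main__":
    print(render_dag(WORKFLOW))
```

The generator itself needs only the standard library; Airflow is required only where the generated file is deployed. Generating plain source text keeps the GUI and the scheduler decoupled, at the cost of having to validate task ids before they become variable names.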
While building these components, the goal was to provide a platform where users can create notebooks and stitch these parameterized notebooks together using a GUI-based workflow. Creating a workflow is a simple drag and drop of various notebook types, and the designer allows users to set local and global parameters. It also allows values to be passed from one notebook to another within a workflow. I will elaborate primarily on how we built the workflow system and how it interacts with the notebook system to schedule the notebooks.
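The parameter model described above can be sketched in plain Python. This is an illustration only: the function names, the step layout, and the precedence order (global values, then upstream outputs, then local values) are assumptions, not the platform's documented behavior. In Airflow itself, small values are commonly handed between tasks with XComs; the abstract does not say which mechanism the platform uses.

```python
from typing import Any, Callable, Dict, List

Params = Dict[str, Any]


def resolve_params(global_params: Params, local_params: Params,
                   upstream_outputs: Params) -> Params:
    """Merge parameters for one notebook.

    Assumed precedence, lowest to highest:
    global parameters < values produced upstream < local parameters.
    """
    merged = dict(global_params)
    merged.update(upstream_outputs)
    merged.update(local_params)
    return merged


def run_workflow(steps: List[Dict[str, Any]], global_params: Params) -> Params:
    """Run steps in order, feeding each step's outputs to the steps after it."""
    outputs: Params = {}
    for step in steps:
        params = resolve_params(global_params, step["local"], outputs)
        run: Callable[[Params], Params] = step["run"]
        outputs.update(run(params))
    return outputs


# Two stand-in "notebooks", written as functions for the sake of the example.
def prepare(p: Params) -> Params:
    return {"dataset_path": f"/data/{p['env']}/{p['table']}.parquet"}


def train(p: Params) -> Params:
    return {"model_name": f"{p['algo']}-on-{p['dataset_path']}"}


STEPS = [
    {"run": prepare, "local": {"table": "sales"}},
    {"run": train, "local": {"algo": "xgb"}},
]

if __name__ == "__main__":
    print(run_workflow(STEPS, {"env": "dev"}))
```

Here `env` is a global parameter visible to every step, `table` and `algo` are local to one step each, and `dataset_path` is a value passed from the first notebook to the second.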
This talk reflects our journey over the past 1.5 years, from a single notebook type and a simple workflow to a workflow system with operators that execute Jupyter notebooks, RStudio notebooks, and distributed/non-distributed RAPIDS and TensorFlow jobs.
Overview - Machine Learning Platform @ Walmart
Workflow - Requirements
What options did we consider for the workflow framework?
Workflow - DAG Designer
High Level Architecture: Workflow Service with Airflow Cluster (Celery Executor)
Configure Airflow for High Performance
High Level Architecture: Workflow Notebook Integration
Airflow and Kubernetes basics
Sachin Parmar is a Senior Architect with the GDAP (Global Data and Analytics Platforms) Group at Walmart Labs. For the past 2 years, Sachin has been a member of the team that builds the Machine Learning Platform for Walmart Labs. At Walmart, Sachin is responsible for building components such as workflow and tools on top of TensorFlow for the Machine Learning Platform. Before joining Walmart, Sachin worked at Yahoo! for around 8.5 years and at IBM Labs for around 1.5 years.