The Fifth Elephant 2019

Gathering of 1000+ practitioners from the data ecosystem

Running ML Workflows using Airflow @ Walmart

Submitted by Sachin Parmar (@sachinparmar) on Saturday, 6 April 2019


Session type: Full talk of 40 mins

Abstract

One of the most critical challenges in bringing machine learning to practice is avoiding the various technical-debt traps that data science teams encounter in their day-to-day work. The Machine Learning Platform at Walmart has a single agenda: to make it easy for data scientists to use the company’s data to train and build new ML models at scale, and to make the deployment experience seamless.

The machine learning platform has many components: data connectors (for creating and managing data connections), notebooks (the development-time environment), workflows (to stitch the notebooks together), models (to save and load models), jobs (batch and scheduled workflow jobs, distributed TensorFlow and RAPIDS jobs), and deployments (R Shiny, Python, TensorFlow Serving).

In this talk, I will cover the workflow component, which is built on top of Apache Airflow and is one of the most used components in Walmart’s Machine Learning Platform. At the time of writing, 1000+ DAGs are running in production.

I would like to share the learnings from setting up the Airflow cluster on Kubernetes and the workflow service written on top of it. From a workflow, we can execute various batch jobs, including distributed/non-distributed TensorFlow jobs, distributed/non-distributed RAPIDS jobs, Jupyter notebooks (Python/Scala/Spark), RStudio jobs, and more. Custom Airflow plugins give us the capability to launch these notebooks/jobs, including parameterized notebooks/jobs. We have abstracted the entire workflow creation step by providing a GUI to create the workflow definition (DAG) and internally generating the Python code for the Airflow DAGs.
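To make the "GUI in, Python DAG out" idea concrete, here is a minimal sketch of turning a designer-style workflow definition into Airflow DAG source code. The JSON shape, the `NotebookOperator` name, and the templates are illustrative assumptions, not the actual Walmart implementation:

```python
# Hypothetical sketch: render Airflow DAG source from a GUI workflow
# definition. Everything here (field names, NotebookOperator) is assumed
# for illustration.

DAG_TEMPLATE = """\
from airflow import DAG
from datetime import datetime

dag = DAG('{dag_id}', start_date=datetime(2019, 1, 1), schedule_interval='{schedule}')
"""

TASK_TEMPLATE = """\
{task_id} = NotebookOperator(task_id='{task_id}', notebook='{notebook}', params={params}, dag=dag)
"""


def generate_dag_source(definition: dict) -> str:
    """Emit Python source for an Airflow DAG from a workflow definition."""
    lines = [DAG_TEMPLATE.format(dag_id=definition["dag_id"],
                                 schedule=definition["schedule"])]
    # Each node in the designer becomes one task launching a notebook.
    for node in definition["nodes"]:
        lines.append(TASK_TEMPLATE.format(task_id=node["id"],
                                          notebook=node["notebook"],
                                          params=node.get("params", {})))
    # Each edge drawn in the designer becomes an Airflow dependency.
    for upstream, downstream in definition["edges"]:
        lines.append(f"{upstream} >> {downstream}\n")
    return "".join(lines)


definition = {
    "dag_id": "churn_training",
    "schedule": "@daily",
    "nodes": [
        {"id": "prep", "notebook": "prep.ipynb", "params": {"date": "{{ ds }}"}},
        {"id": "train", "notebook": "train.ipynb"},
    ],
    "edges": [("prep", "train")],
}
print(generate_dag_source(definition))
```

The generated file can then be dropped into the Airflow DAGs folder, so end users never write DAG code by hand.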

While building these components, the goal was to provide a platform where users can create notebooks and stitch these parameterized notebooks together using a GUI-based workflow. Workflow creation is a simple drag-and-drop of various notebook types, and it allows setting local and global parameters. It also allows passing values from one notebook to another within a workflow. I will elaborate primarily on how we built the workflow system and how it interacts with the notebook system to schedule the notebooks.
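Passing values between notebooks in a workflow is in the spirit of Airflow's XCom mechanism. The dependency-free sketch below stands in for that idea: the notebook "bodies" are plain functions and the store is a toy dict, not the platform's actual implementation:

```python
# Toy illustration of passing a value from one notebook task to the next,
# similar in spirit to Airflow's XCom. All names here are assumptions.

class XComStore:
    """Minimal key-value store standing in for Airflow's XCom table."""

    def __init__(self):
        self._data = {}

    def push(self, task_id, key, value):
        self._data[(task_id, key)] = value

    def pull(self, task_id, key):
        return self._data[(task_id, key)]


def prep_notebook(store, params):
    # Upstream "notebook": publishes its output path for downstream tasks.
    output_path = f"/data/features/{params['date']}.parquet"
    store.push("prep", "features_path", output_path)


def train_notebook(store, params):
    # Downstream "notebook": reads the upstream value rather than
    # hard-coding a path, so the pipeline stays parameterized.
    features = store.pull("prep", "features_path")
    return f"training on {features}"


store = XComStore()
prep_notebook(store, {"date": "2019-04-06"})
print(train_notebook(store, {}))
```

In the real platform, the same hand-off happens between parameterized Jupyter/R notebooks scheduled by the workflow service.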

This talk reflects our journey over the past 1.5 years: starting from just one notebook type and a simple workflow, to a workflow system with operators to execute Jupyter notebooks, RStudio notebooks, and distributed/non-distributed RAPIDS and TensorFlow jobs.

Outline

Agenda:
Overview - Machine Learning Platform @ Walmart
Workflow - Requirements
What options did we consider for the workflow framework?
Workflow – DAG Designer
High Level Architecture: Workflow Service with Airflow Cluster (Celery Executor)
Configure Airflow for High Performance
High Level Architecture: Workflow Notebook Integration
Key Learnings

Requirements

Airflow and Kubernetes basics

Speaker bio

Sachin Parmar is a Senior Architect with the GDAP (Global Data and Analytics Platforms) group in Walmart Labs. For the past 2 years, Sachin has been a member of the team building the Machine Learning Platform for Walmart Labs. At Walmart, Sachin is responsible for building components like workflow and tools on top of TensorFlow for the Machine Learning Platform. Before Walmart, Sachin worked at Yahoo! for around 8.5 years and at IBM Labs for around 1.5 years.

Slides

https://drive.google.com/file/d/19Jcd6VhFgoONWcvVpgcHHkJuIg6ITpW1/view?usp=sharing

Preview video

https://youtu.be/QsZjCyxx4Zg

Comments

  • Amit Ghosh (@aghosh) 5 months ago

    looking forward to your presentation…sounds interesting.

  • Anwesha Sarkar (@anweshaalt) Reviewer 7 months ago

    Thank you for submitting the proposal. Submit your slides and preview video by 20th April (latest); it helps us close the review process.

  • Zainab Bawa (@zainabbawa) Reviewer 6 months ago

    Here is the feedback on the draft slides:

    1. What is the problem that is being solved with this tooling? Why is this problem important for participants who don’t have the scale, use case and challenges as that of Walmart’s? The problem statement has to be general, and not specific to a company’s problems.
    2. There is a slide mentioning a workflow options comparison, but it has no information on what metrics/parameters the comparison was based on, or why Airflow emerged as the superior option.
    3. Show before-after scenarios with data points.
    4. Explain how this workflow was adapted and integrated into the existing system, and what were the challenges faced during the implementation stages? What were the points of failure and trade-offs?
    5. The workflow architecture is very high-level. You have to show details and also explain why this architecture was chosen. What are the costs of implementing this architecture?
    6. The key learnings’ slides are only describing outcomes for Walmart. They don’t have anything concrete in terms of learnings for participants, such as patterns for problem solving.

    Share revised slides incorporating the above feedback by or before 21 May to close the decision on your proposal.

  • Sachin Parmar (@sachinparmar) Proposer 5 months ago (edited 5 months ago)

    I have added my comments. Please let me know if additional information is needed.

    1.What is the problem that is being solved with this tooling? Why is this problem important for participants who don’t have the scale, use case and challenges as that of Walmart’s? The problem statement has to be general, and not specific to a company’s problems.

    - A tool to simplify data science pipelines for training and prediction. The tool has a very simple DAG designer interface that data scientists and ML engineers can use to quickly set up a data science pipeline. They can take their design-time artefacts and set up a production-grade pipeline using this tool.
    - In most companies, building a data science pipeline is a two-step process: data scientists build the training/prediction code, then hand the work to data engineers to create a workflow. We are putting the power of Airflow in the hands of data scientists and removing the extra setup and understanding needed to build ML pipelines. We abstract the complexity of pipelines, alerting, monitoring, SLAs, and retries, and provide all of this as a managed service. Users can leverage these features with very minimal configuration.
    - Data science pipelines are the need of the hour, as most companies are adopting data science, bringing it into the mainstream of their ecosystem and pipelines, and driving critical KPIs with ML models.
    - Every company needs to reduce the time between training a model and deploying a production-grade pipeline for continuous training/prediction. A similar approach can help organizations expedite their ML work, irrespective of their scale or ecosystem.

    2.There is a slide mentioning workflow options comparison, but it has no information of what metrics/parameters was the comparison done with and why Airflow emerged as a superior option?

    Will add additional slide for this.

    3.Show before-after scenarios with data points.

    Before this platform, building ML pipelines was ad hoc work across the company. Different teams built their own solutions, and we had multiple schedulers and pipeline tools within the company. It used to take a lot of time to rewrite ML code to fit the format required by the workflow tools. Currently we have 50+ production workflows running on this platform, and we have observed a 30-70% time-to-market improvement on these projects, bringing pipeline development effort down from weeks to days.

    4.Explain how this workflow was adapted and integrated into the existing system, and what were the challenges faced during the implementation stages? What were the points of failure and trade-offs?

    Element is the centralized machine learning platform within Walmart, used across multiple pillars for various data science projects. Workflow/pipeline is a key component of the platform, enabling users to set up production pipelines based on their workspace/notebooks. It significantly reduces the time to move ML work from design to production, from weeks to days, and also reduces the rework needed to move design-time code into the final production pipeline.
    Challenges:
    - Building a multi-tenant ML pipeline for Walmart brings its own challenges due to the diversity of runtimes, use cases, and scale.
    - Scheduling delays and DagBag optimization: we scaled the number of workers, set up HA for Airflow, handled a large number of concurrent workloads across Walmart, and added REST API support.
    - The scheduler in Airflow is a single point of failure. We were able to mitigate this limitation because we use Kubernetes as the cluster manager: it maintains a minimum pod replica count, and in case of failure it brings up another scheduler instance in no time, reducing failures and downtime.
    - Each task takes one slot on a workflow worker, which becomes a challenge with too many long-running tasks: you are limited by the number of slots, and waiting jobs are starved. We scaled the number of workers to allow more concurrent tasks, and we are also working on worker autoscaling based on load and the number of pending tasks.
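    The slot and concurrency behaviour described above is governed by a handful of Airflow settings. An illustrative airflow.cfg fragment follows; the values are placeholders for discussion, not Walmart's actual configuration:

    ```ini
    [core]
    # Upper bound on tasks running concurrently across the whole cluster.
    parallelism = 128
    # Max concurrent tasks per DAG; keeps one noisy workflow from starving others.
    dag_concurrency = 16
    # Limit concurrent runs of the same DAG.
    max_active_runs_per_dag = 4

    [celery]
    # Task slots per worker; total cluster slots = worker_concurrency * workers.
    worker_concurrency = 16
    ```

    Adding workers raises total slot capacity, while the per-DAG limits keep long-running workloads from monopolizing it.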

    5.The workflow architecture is very high-level. You have to show details and also explain why this architecture was chosen? What are the costs of implementing this architecture?

    Airflow by itself is quite open, leaving end users to design their own setup for HA and fault tolerance. I have added a detailed architecture showing how Walmart set up Airflow for multi-tenancy, high availability, and fault tolerance. This architecture achieves all these aspects of a multi-tenant workflow engine using various open-source technologies. There is no additional licensing cost, as we used open-source components like Kubernetes, Redis, GlusterFS, and RabbitMQ; the only cost is the compute VMs running the workflow components.

    The slides cover the Airflow architecture and how the Walmart machine learning platform architecture was built on top of it. It will be explained in low-level detail: how each component works and interacts with the others. I have 3 slides on the architecture diagram, covering every component of the system. Please let me know if any specific diagram needs more information.

    6.The key learnings’ slides are only describing outcomes for Walmart. They don’t have anything concrete in terms of learnings for participants, such as patterns for problem solving.

    Key learnings:
    - A two-step data science process, where data scientists build models and data engineers/DevOps scale and deploy them as pipelines, is not efficient. It delays data science projects, incurs extra cost, and breeds multiple ad hoc solutions. The best way for an organization to avoid this is a managed platform for building and deploying ML pipelines.
    - The platform should abstract complexity and give data scientists the power to build and manage their own pipelines. This enables them to run multiple experiments and iterate faster toward optimal ML models.
    - Airflow is a very powerful workflow tool, and being open source, it enabled easy integration with the ML platform. But open source brings its own challenges, and one should be ready to invest in the platform, go into the code, and build the features necessary for one's use cases. A feature critical for your organization may not be a community priority, so you should be able to build it yourself and contribute it back to the community. Airflow's architecture is very flexible, and it provides a plugin interface to extend its capabilities.
    - Multi-tenancy at Walmart's scale brings its own challenges. We had to extend Airflow's capabilities for security, scalability, centralized logging, etc.
    - Apart from that, there are key learnings related to Airflow cluster setup.

    I will update the slide today and upload it.

  • Sachin Parmar (@sachinparmar) Proposer 5 months ago

    Hi Zainab,
    Would appreciate an update on this.
    Thanks.

  • Sachin Parmar (@sachinparmar) Proposer 5 months ago

    Hi Zainab,
    Would appreciate an update from your side on this proposal.
    Thanks.
