The Fifth Elephant 2019

Gathering of 1000+ practitioners from the data ecosystem

Raman Gupta

@ramandumcs

How we build highly scalable and multi-tenant orchestration service using Apache Airflow on Kubernetes

Submitted May 30, 2019

We have several use cases that require workflow management and scheduling: generating scheduled reports, authoring and managing multi-step ML workflows, running ETL jobs, and so on.
Currently, teams manage their own schedulers, such as cron or a workflow manager, to meet these use cases. Some teams have also set up Apache Airflow for their requirements.
Our team started building a fully managed, highly scalable, multi-tenant orchestration service that different teams can use to meet their requirements.
In this talk I will cover how we solved the challenges we faced while building a managed, scalable, multi-tenant service on Apache Airflow.

Some of the high-level requirements we considered while building the orchestration service are:
-> Abstraction of Apache Airflow from users, for ease of use.
-> Support for dynamic (on-demand) and static workflows.
-> Support for RESTful APIs to author and manage workflows.
-> Support for different scheduling requirements, such as run-once or daily runs.
-> Support for running thousands of concurrent workflows.
-> Support for an Airflow Operator Store that can be shared by different teams.
-> Support for customizing Airflow config parameters, such as parallelism and the directory refresh interval, to meet different use cases.
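
As an illustration of the last requirement, here is a minimal sketch of how per-tenant overrides could be turned into Airflow settings. It relies on Airflow's documented convention that any option can be set through an environment variable named AIRFLOW__{SECTION}__{KEY}. The helper airflow_env and the tenant_overrides values are hypothetical, and it assumes that "dir refresh interval" refers to the scheduler option dag_dir_list_interval; the proposal does not say how the service actually applies these settings.

```python
def airflow_env(overrides):
    """Map {(section, key): value} to AIRFLOW__SECTION__KEY environment variables."""
    return {
        f"AIRFLOW__{section.upper()}__{key.upper()}": str(value)
        for (section, key), value in overrides.items()
    }


# Hypothetical per-tenant overrides; the numbers are placeholders, not recommendations.
tenant_overrides = {
    ("core", "parallelism"): 64,
    ("scheduler", "dag_dir_list_interval"): 60,
}

env = airflow_env(tenant_overrides)
for name, value in env.items():
    print(f"{name}={value}")
```

A mapping like this could be injected into the scheduler and worker containers of a tenant's Airflow deployment, which keeps tenant-specific tuning out of the shared airflow.cfg.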

We defined a JSON-based DSL to simplify authoring multi-step workflows. Users create a new workflow by invoking the orchestration service's REST API. Behind the scenes, the service converts the JSON workflow definition into an Airflow-compatible Python DAG, validating the definition as it generates the DAG. It also provides APIs to read, update, and delete a workflow.
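
The JSON-to-DAG conversion could look roughly like the sketch below. The DSL schema here (name, schedule, tasks, depends_on, command) is hypothetical, since the proposal does not publish the real one, and the generated file uses BashOperator with the Airflow 1.10 import path purely as a stand-in for whatever operators the service supports. The sketch validates the definition first and only then emits Python source for a DAG file.

```python
import json


def validate(spec):
    """Validate the (hypothetical) DSL: required fields, unique ids, known deps, no cycles."""
    for key in ("name", "schedule", "tasks"):
        if key not in spec:
            raise ValueError(f"missing field: {key}")
    for task in spec["tasks"]:
        if "id" not in task or "command" not in task:
            raise ValueError("every task needs an 'id' and a 'command'")
    ids = [t["id"] for t in spec["tasks"]]
    if len(ids) != len(set(ids)):
        raise ValueError("duplicate task id")
    deps = {t["id"]: set(t.get("depends_on", [])) for t in spec["tasks"]}
    for tid, parents in deps.items():
        unknown = parents - set(ids)
        if unknown:
            raise ValueError(f"{tid} depends on unknown task(s): {sorted(unknown)}")
    # Cycle check: keep resolving tasks whose parents are already resolved.
    resolved = set()
    while len(resolved) < len(ids):
        ready = [t for t in ids if t not in resolved and deps[t] <= resolved]
        if not ready:
            raise ValueError("dependency cycle detected")
        resolved.update(ready)


def to_dag_source(spec):
    """Return Python source for an Airflow DAG file built from the JSON definition."""
    validate(spec)
    lines = [
        "from datetime import datetime",
        "from airflow import DAG",
        "from airflow.operators.bash_operator import BashOperator  # Airflow 1.10 path",
        "",
        f"dag = DAG({spec['name']!r}, schedule_interval={spec['schedule']!r},",
        "          start_date=datetime(2019, 1, 1), catchup=False)  # assumed start date",
        "tasks = {}",
    ]
    for t in spec["tasks"]:
        lines.append(
            f"tasks[{t['id']!r}] = BashOperator(task_id={t['id']!r}, "
            f"bash_command={t['command']!r}, dag=dag)"
        )
    for t in spec["tasks"]:
        for parent in t.get("depends_on", []):
            lines.append(f"tasks[{parent!r}] >> tasks[{t['id']!r}]")
    return "\n".join(lines) + "\n"


# Example request body, as a user of the REST API might submit it.
spec = json.loads("""
{
  "name": "daily_report",
  "schedule": "@daily",
  "tasks": [
    {"id": "extract", "command": "echo extract"},
    {"id": "load", "command": "echo load", "depends_on": ["extract"]}
  ]
}
""")
source = to_dag_source(spec)
print(source)
```

Generating source text rather than DAG objects fits how Airflow discovers workflows: the scheduler picks up Python files from its DAG directory, so the service only has to write the file where the target cluster can read it.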
To meet our scale requirements we started running Apache Airflow on Kubernetes, but we found that a single Airflow cluster could not handle the required scale. In this talk I will cover how we solved this by setting up multiple Airflow clusters and building the right abstraction on top of them, so that users remain unaware of the underlying clusters.
Along the way we have also contributed multiple fixes to Apache Airflow in the areas of error handling, the Kubernetes executor, REST APIs, resiliency, etc.
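
To make the multi-cluster abstraction concrete, here is a minimal sketch of a placement layer that hides the clusters behind workflow ids. Everything in it is an assumption for illustration: the proposal does not describe the placement policy, so the sketch uses a simple least-loaded rule, and the Cluster and Router classes, cluster names, and capacities are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    name: str
    capacity: int  # assumed upper bound on workflows hosted by this cluster
    workflows: set = field(default_factory=set)


class Router:
    """Maps workflow ids to Airflow clusters so callers never name a cluster."""

    def __init__(self, clusters):
        self.clusters = {c.name: c for c in clusters}
        self.placement = {}  # workflow id -> cluster name

    def place(self, workflow_id):
        # Placement is sticky: an existing workflow stays on its cluster.
        if workflow_id in self.placement:
            return self.placement[workflow_id]
        candidates = [
            c for c in self.clusters.values() if len(c.workflows) < c.capacity
        ]
        if not candidates:
            raise RuntimeError("all clusters are at capacity")
        # Least-loaded policy (an assumption): lowest fill ratio wins.
        target = min(candidates, key=lambda c: len(c.workflows) / c.capacity)
        target.workflows.add(workflow_id)
        self.placement[workflow_id] = target.name
        return target.name

    def locate(self, workflow_id):
        return self.placement[workflow_id]

    def remove(self, workflow_id):
        name = self.placement.pop(workflow_id)
        self.clusters[name].workflows.discard(workflow_id)


router = Router([Cluster("airflow-a", capacity=2), Cluster("airflow-b", capacity=2)])
for wf in ("wf1", "wf2", "wf3"):
    print(wf, "->", router.place(wf))
```

With a layer like this, the REST API can accept create, read, update, and delete calls keyed only by workflow id and forward each one to whichever cluster holds the workflow.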

Outline

https://www.slideshare.net/RamanGupta17/orchestration-service-v2

Speaker bio

I work as a Sr. Computer Scientist at Adobe and have been working on the Orchestration Service from the very beginning. I contributed to its design and played a key role in making it highly scalable on Apache Airflow. Along the way I have also made a few contributions to Apache Airflow to improve its performance and resiliency.

Slides

https://www.slideshare.net/RamanGupta17/orchestration-service-v2

Comments

    Abhishek Balaji

    @booleanbalaji

    Hi Raman,

    Thank you for submitting a proposal. We need to see detailed slides and a preview video to evaluate your proposal. Your slides must cover the following:

    • Problem statement/context, which the audience can relate to and understand. The problem statement has to be a problem (based on this context) that can be generalized for all.
    • What were the tools/frameworks available in the market to solve this problem? How did you evaluate these, and what metrics did you use for the evaluation? Why did you pick the option that you did?
    • Explain what the situation was before the solution you picked/built and how it changed after you implemented it. Show before-after scenario comparisons and metrics.
    • What compromises/trade-offs did you have to make in this process?
    • What is the one takeaway that you want participants to go back with at the end of this talk? What is it that participants should learn/be cautious about when solving similar problems?

    We need your updated slides and preview video by Jun 10, 2019 to evaluate your proposal. If we do not receive an update, we'd be moving your proposal for evaluation under a future event.

    Posted 5 years ago