How we built a highly scalable, multi-tenant orchestration service using Apache Airflow on Kubernetes
We have several use cases that require workflow management and scheduling. Some generate scheduled reports, some are ML use cases that need to author and manage multi-step workflows, and others are ETL jobs.
Currently, teams manage their own schedulers, such as cron or a workflow manager, to meet these needs. Some teams have also set up Apache Airflow themselves.
Our team set out to provide a fully managed, highly scalable, multi-tenant orchestration service that different teams can use to meet these requirements.
In this talk I will cover how we solved the challenges we faced while building a managed, scalable, multi-tenant service on Apache Airflow.
Some of the high-level requirements we considered while building the orchestration service were:
-> Abstraction of Apache Airflow from users, for ease of use.
-> Support for both dynamic (on-demand) and static workflows.
-> Support for RESTful APIs to author and manage workflows.
-> Support for different scheduling requirements, such as run-once or daily runs.
-> Support for running thousands of concurrent workflows.
-> Support for an Airflow operator store that can be shared by different teams.
-> Support for customizing Airflow configuration parameters, such as parallelism and the directory refresh interval, to meet different use cases.
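To make the last requirement concrete, here is a minimal sketch of how per-tenant overrides could be turned into Airflow settings. It relies on Airflow's `AIRFLOW__{SECTION}__{KEY}` environment-variable convention. The allow-list, the function name `to_airflow_env`, and the mapping of "directory refresh interval" to the Airflow 2.x `[scheduler] dag_dir_list_interval` option are assumptions made for illustration, not a description of the service itself.

```python
# Hypothetical allow-list: override name -> (Airflow config section, option).
# Mapping "dir refresh interval" to dag_dir_list_interval is an assumption.
ALLOWED_OVERRIDES = {
    "parallelism": ("core", "parallelism"),
    "dag_dir_list_interval": ("scheduler", "dag_dir_list_interval"),
}


def to_airflow_env(overrides: dict) -> dict:
    """Translate validated overrides into AIRFLOW__SECTION__KEY variables."""
    env = {}
    for key, value in overrides.items():
        if key not in ALLOWED_OVERRIDES:
            raise ValueError(f"override not allowed: {key}")
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
            raise ValueError(f"{key} must be a positive integer")
        section, option = ALLOWED_OVERRIDES[key]
        env[f"AIRFLOW__{section.upper()}__{option.upper()}"] = str(value)
    return env


if __name__ == "__main__":
    print(to_airflow_env({"parallelism": 64, "dag_dir_list_interval": 30}))
```

The resulting dictionary could then be injected into the scheduler and worker pods of the relevant Airflow deployment. How the values are actually delivered is left open here.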
We defined a JSON-based DSL to simplify authoring multi-step workflows. Users create a new workflow by invoking the orchestration service's REST API. Behind the scenes, the orchestration service validates the JSON workflow definition and converts it into an Airflow-compatible Python DAG. It also provides APIs to read, update, and delete a workflow.
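The conversion step can be pictured with a small sketch. This is not the service's actual DSL or generator. The JSON field names (`name`, `schedule`, `tasks`, `id`, `command`, `depends_on`), the use of `BashOperator`, and the fixed `start_date` are all hypothetical, and the generated code assumes Airflow 2.x (`schedule_interval`, `airflow.operators.bash`). The sketch validates the definition (legal ids, known dependencies, no cycles) and then renders Python source for a DAG file.

```python
import re
from graphlib import CycleError, TopologicalSorter

# Assumed id rule: letters, digits, underscores, dots, and dashes.
_ID_RE = re.compile(r"^[A-Za-z0-9_.-]+$")


def validate(spec: dict) -> list[str]:
    """Check the workflow definition; return task ids in dependency order."""
    if not _ID_RE.match(str(spec.get("name", ""))):
        raise ValueError("invalid workflow name")
    tasks = spec.get("tasks", [])
    ids = [t["id"] for t in tasks]
    if not ids:
        raise ValueError("workflow needs at least one task")
    if len(ids) != len(set(ids)):
        raise ValueError("duplicate task id")
    graph = {}
    for t in tasks:
        if not _ID_RE.match(t["id"]):
            raise ValueError(f"invalid task id: {t['id']}")
        deps = set(t.get("depends_on", []))
        if deps - set(ids):
            raise ValueError(f"unknown dependency in task {t['id']}")
        graph[t["id"]] = deps
    try:
        return list(TopologicalSorter(graph).static_order())
    except CycleError as exc:
        raise ValueError("workflow contains a dependency cycle") from exc


def render_dag(spec: dict) -> str:
    """Render Python source for an Airflow 2.x DAG file from the definition."""
    validate(spec)
    lines = [
        "from datetime import datetime",
        "from airflow import DAG",
        "from airflow.operators.bash import BashOperator",
        "",
        "with DAG(",
        f"    dag_id={spec['name']!r},",
        "    start_date=datetime(2024, 1, 1),  # placeholder start date",
        f"    schedule_interval={spec.get('schedule', '@once')!r},",
        "    catchup=False,",
        ") as dag:",
        "    tasks = {}",
    ]
    for t in spec["tasks"]:
        lines.append(
            f"    tasks[{t['id']!r}] = BashOperator("
            f"task_id={t['id']!r}, bash_command={t['command']!r})"
        )
    for t in spec["tasks"]:
        for dep in t.get("depends_on", []):
            lines.append(f"    tasks[{dep!r}] >> tasks[{t['id']!r}]")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    example = {
        "name": "daily_report",
        "schedule": "@daily",
        "tasks": [
            {"id": "extract", "command": "echo extract"},
            {"id": "load", "command": "echo load", "depends_on": ["extract"]},
        ],
    }
    print(render_dag(example))
```

Generating source text rather than DAG objects keeps the sketch independent of an Airflow installation. The output would be written to the DAG folder that the target Airflow scheduler scans.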
To meet our scale requirements we started running Apache Airflow on Kubernetes, but in practice we found that a single Airflow cluster was not sufficient for the scale we needed. In this talk I will cover how we solved the scale problem by setting up multiple Airflow clusters and building the right abstraction on top of them, so that users do not need to know which cluster runs their workflows.
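One way to picture such an abstraction is a small routing layer that places each workflow on one of several Airflow clusters and remembers the placement. The sketch below is an assumption for illustration only: the abstract does not describe the actual placement policy, and the `Cluster` fields, the capacity limit, and the least-loaded rule are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    name: str
    base_url: str   # hypothetical Airflow endpoint for this cluster
    capacity: int   # assumed maximum number of workflows per cluster (> 0)
    workflows: set = field(default_factory=set)


class ClusterRouter:
    """Assigns workflows to clusters and keeps each assignment sticky."""

    def __init__(self, clusters):
        self._clusters = {c.name: c for c in clusters}
        self._placement = {}  # (tenant, workflow_id) -> cluster name

    def assign(self, tenant: str, workflow_id: str) -> Cluster:
        key = (tenant, workflow_id)
        # A workflow that was already placed always maps to the same cluster.
        if key in self._placement:
            return self._clusters[self._placement[key]]
        candidates = [
            c for c in self._clusters.values() if len(c.workflows) < c.capacity
        ]
        if not candidates:
            raise RuntimeError("all Airflow clusters are at capacity")
        # Pick the cluster with the lowest relative load.
        target = min(candidates, key=lambda c: len(c.workflows) / c.capacity)
        target.workflows.add(key)
        self._placement[key] = target.name
        return target


if __name__ == "__main__":
    router = ClusterRouter([
        Cluster("airflow-a", "http://airflow-a.example.internal", capacity=2),
        Cluster("airflow-b", "http://airflow-b.example.internal", capacity=1),
    ])
    print(router.assign("team1", "daily_report").name)
```

Callers only ever see the tenant and workflow id. The service resolves the cluster internally, which is what keeps users unaware of how many Airflow clusters exist. A real deployment would persist the placement map instead of holding it in memory.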
Along the way, we have also contributed multiple fixes to Apache Airflow in areas such as error handling, the Kubernetes executor, the REST APIs, and resiliency.
I am a Sr. Computer Scientist at Adobe and have worked on the Orchestration Service from the very beginning. I contributed to its design and played a key role in making it highly scalable on Apache Airflow. Along the way, I have also made a few contributions to Apache Airflow to improve its performance and resiliency.