What I learnt by running Apache Airflow @Scale
Submitted by Sreenath S Kamath (@sree92) on Monday, 26 March 2018
In the world of data-driven applications, workflow management systems play a central role. At Qubole we use Apache Airflow to orchestrate our complex and time-critical big data ETL jobs. Though Airflow has helped us tremendously, there are certain areas where all the major workflow systems fall short on lights-out operations. The following are some open questions that keep showing up on Airflow user and developer forums.
- How do we upgrade to a newer version of an ETL, and how do we achieve continuous integration and continuous deployment for our ETLs?
- How do we make ETLs aware of data warehouse migrations?
- How do we effectively manage the configurations of our ETL jobs when they are deployed across multiple environments?
In this talk, I will:
- Discuss the experience of the data team at Qubole in using Apache Airflow as its workflow management system.
- Introduce DataApp, a tool developed to help with the operational challenges involved in managing a big data pipeline. DataApp is under active development at Qubole and we plan to open source it soon.
This talk is targeted towards data engineers who work with ETL on a day-to-day basis and have faced operational challenges in managing their ETLs.
- Our experience of using Apache Airflow.
- Challenges in managing a set of ETLs across multiple Airflow installations.
- The journey of creating DataApp and how we went about solving the above-mentioned challenges.
- Limitations and future work.
Sreenath has been working at Qubole for over a year as a Data Engineer. He is mainly involved in setting up the company's data warehouse, which powers the AIR (Alerts, Insights, and Recommendations) platform. He has over 4 years of overall experience, primarily in the ETL world.