End-to-end automated data science process using Airflow.
Submitted by Keerthi Prasad (@keerthi17394) on Monday, 15 October 2018
Evive is a data-driven benefit navigator. We provide our 25+ million users with personalised recommendations on their health and wealth, backed by 50+ models that run daily. Every day we receive 500+ gigabytes of data from 30+ different sources.
For the data science team, validating this data at every transformation is essential. The team's goals are simple: integration, validation, automation and modelling. A significant amount of time and resources was spent before we even reached our core problem, i.e. modelling. And the job doesn’t end at modelling: a series of tasks has to be performed after it.
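A minimal sketch of what validating at every transformation could look like, assuming each batch is a list of dictionaries. The function names, the field names and the specific checks (row count preserved, required fields non-empty) are hypothetical illustrations, not Evive's actual rules.

```python
def validate_rows(rows, required_fields, min_rows=1):
    """Return a list of readable problems; an empty list means the batch passed."""
    problems = []
    if len(rows) < min_rows:
        problems.append(f"expected at least {min_rows} rows, got {len(rows)}")
    for i, row in enumerate(rows):
        for field in required_fields:
            # Treat a missing key, None, or an empty string as a missing value.
            if row.get(field) in (None, ""):
                problems.append(f"row {i}: missing {field}")
    return problems


def run_step(rows, transform, required_fields):
    """Apply a row-wise transform, then validate its output before handing it on."""
    out = [transform(row) for row in rows]
    # A row-wise transform should not lose rows, so require the same row count.
    problems = validate_rows(out, required_fields, min_rows=len(rows))
    if problems:
        raise ValueError("; ".join(problems))
    return out
```

Raising inside the step means a bad batch stops the pipeline at the transformation that produced it, rather than surfacing later during modelling.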
Airflow is our core infrastructure for the data science life cycle. We use it for automated data fetching, data versioning, task scheduling, alerting, task monitoring and running various modelling techniques. We also use Airflow to send targeted notifications: different errors are handled by different members of the team, and Airflow helps channel each one to the right person.
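The routing of errors to different team members can be sketched in plain Python. Everything here is an assumption for illustration: the error categories, the addresses and the `PipelineError` class are hypothetical. In Airflow, a function like `on_failure` could be passed as a task's `on_failure_callback`, which is called with a context dictionary; the sketch reads only the `task_instance` and `exception` entries from it.

```python
# Hypothetical mapping from error category to the person who handles it.
ROUTES = {
    "fetch": "ingestion-owner@example.com",
    "validation": "data-quality-owner@example.com",
    "model": "modelling-owner@example.com",
}
DEFAULT_OWNER = "data-science-team@example.com"


class PipelineError(Exception):
    """An error tagged with a category so it can be routed to the right owner."""

    def __init__(self, category, message):
        super().__init__(message)
        self.category = category


def route_failure(task_id, error):
    """Build a notification addressed to whoever owns this kind of error."""
    category = getattr(error, "category", None)
    return {
        "to": ROUTES.get(category, DEFAULT_OWNER),
        "subject": f"[{category or 'unknown'}] task {task_id} failed",
        "body": str(error),
    }


def on_failure(context):
    """Callback-shaped wrapper: pull the task id and exception from the context."""
    task_id = context["task_instance"].task_id
    return route_failure(task_id, context.get("exception"))
```

Errors without a known category fall back to the whole team, so an unclassified failure is never dropped silently. Actually sending the message (email, chat, pager) is left out of the sketch.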
In this talk, I’ll present how we set up the infrastructure, the challenges we faced and how we solved them. I’ll also discuss how we applied the general paradigms and principles of data pipelines to set up this system.
Intro to Evive and the data engineering team
Infrastructure and architecture
Airflow features incorporated
Challenges and solutions
Data sanitization and reliability checks
No prior knowledge of Airflow is required. A basic understanding of data pipelines is expected.
Keerthi is a graduate of NITK-Surathkal. He has been working at Evive for 3 years as a Jr. Data Scientist. He is part of the data science team, building machine learning models while also setting up the architecture the team requires.