Submit a talk on data

Submit talks on data engineering, data science, machine learning, big data and analytics throughout 2019

## This space is open for submitting proposals on data engineering, data science, machine learning, big data and analytics throughout 2019.

We will host data events throughout 2019. Talks for these events will be selected from the proposals submitted here. You can submit a proposal at any time.

## If you have queries, write to us at fifthelephant.editorial@hasgeek.com or call us on 7676332020.

Hosted by

The Fifth Elephant - known as one of the best data science and machine learning conferences in Asia - has transitioned into a year-round forum for conversations about data and ML engineering, data science in production, and data security and privacy practices.

Keerthi Prasad

@keerthi17394

End-to-end automated data science process using Airflow

Submitted Oct 15, 2018

Evive is a data-driven benefits navigator. We provide our 25+ million users with personalised recommendations on their health and wealth. We run 50+ models every day to produce these recommendations, and we receive upwards of 500 gigabytes of data from 30+ different sources daily.

For the data science team, validating this data at every transformation is essential. The team's goals are simple: integration, validation, automation and modelling. A significant amount of time and resources was being spent before we even reached our core problem, i.e. modelling. And the job doesn't end at modelling: a series of tasks has to be performed after it.
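To make "validate at every transformation" concrete, here is a minimal sketch in plain Python. It is an illustration only, not Evive's implementation: the field names (`member_id`, `amount`), the checks and the `run_pipeline` helper are all hypothetical, and a real pipeline would run such checks inside its scheduler's tasks.

```python
class ValidationError(Exception):
    """Raised when a check fails after a transformation."""


def check_not_empty(rows):
    # A transformation that silently drops every row is almost always a bug.
    if not rows:
        raise ValidationError("no rows left after transformation")


def check_required_fields(rows, required=("member_id", "amount")):
    # Hypothetical required fields; replace with the real schema.
    for i, row in enumerate(rows):
        missing = [f for f in required if row.get(f) is None]
        if missing:
            raise ValidationError(f"row {i} is missing {missing}")


def drop_negative_amounts(rows):
    # Example transformation: keep only rows with a non-negative amount.
    return [r for r in rows if r["amount"] >= 0]


def run_pipeline(rows, steps):
    """Run each (transform, checks) pair in order, validating after every step."""
    for transform, checks in steps:
        rows = transform(rows)
        for check in checks:
            check(rows)
    return rows


# One step here; a longer pipeline would list more (transform, checks) pairs.
STEPS = [(drop_negative_amounts, [check_not_empty, check_required_fields])]

if __name__ == "__main__":
    sample = [
        {"member_id": 1, "amount": 10.0},
        {"member_id": 2, "amount": -5.0},
    ]
    print(run_pipeline(sample, STEPS))
```

Pairing each transformation with its own checks means a bad batch fails at the step that produced it rather than surfacing later during modelling.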

Airflow is our core infrastructure for the data science life cycle. We use it for automated data fetching, data versioning, task scheduling, alerting, task monitoring and running various modelling techniques. We also use Airflow to send targeted notifications: different errors are handled by different members of the team, and Airflow helps channel each error to the right person.
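The idea of routing different errors to different team members can be sketched in plain Python as follows. This is an assumption-laden illustration, not the setup described in the talk: the exception classes, the owner names and the `build_alert` helper are hypothetical, and no Airflow code is shown.

```python
class SourceFetchError(Exception):
    """Hypothetical error: an upstream data source could not be read."""


class DataValidationError(Exception):
    """Hypothetical error: a batch failed a validation check."""


class ModelTrainingError(Exception):
    """Hypothetical error: a model failed to train or score."""


# Hypothetical mapping from error type to the team that should be notified.
OWNERS = {
    SourceFetchError: "data-engineering",
    DataValidationError: "data-quality",
    ModelTrainingError: "modelling",
}
DEFAULT_OWNER = "on-call"


def owner_for(exc):
    """Return the owner for an exception, falling back to a default."""
    for exc_type, owner in OWNERS.items():
        if isinstance(exc, exc_type):
            return owner
    return DEFAULT_OWNER


def build_alert(task_id, exc):
    """Build a notification payload addressed to the responsible owner."""
    return {
        "to": owner_for(exc),
        "subject": f"[{task_id}] {type(exc).__name__}",
        "body": str(exc),
    }


if __name__ == "__main__":
    print(build_alert("fetch_claims", SourceFetchError("source timed out")))
```

In Airflow, a function like `build_alert` could plausibly be called from a task's `on_failure_callback`, which receives a context dictionary describing the failed task; that wiring is left out here because it depends on the deployment.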

In this talk, I’ll present how we set up the infrastructure, the challenges we faced and how we solved them. I’ll also discuss how we applied the general paradigms and principles of data pipelines to build this system.

Outline

Intro to Evive and the data engineering team
Problem Statement
Infrastructure and architecture
Airflow features incorporated
Challenges and solutions
Data sanitization and reliability checks

Requirements

No prior knowledge of Airflow is required. A basic understanding of data pipelines is expected.

Speaker bio

Keerthi is a graduate of NITK Surathkal. He has been working with Evive for 3 years as a Jr. Data Scientist. He is part of the data science team, building machine learning models while also setting up the architecture the team requires.

Slides

https://speakerdeck.com/keerthi/end-to-end-automated-data-science-process-using-airflow

