Plumbing data science pipelines
Submitted by Krishnapriya Satagopan (@kpsatagopan) on Monday, 22 May 2017
Section: Crisp talk for data engineering track Technical level: Intermediate
Data - There is a lot of it . But organizing it can be challenging, and analysis/consumption cannot begin until data is aggregated and massaged into compatible formats. These challenges grow more difficult as your dataset increases and as your needs approach the fabled “real time” status. Here, we’ll talk about how Python can be leveraged to collect data that is organized from many sources, standardized for analysis and consumption, and parallelized to scale with volume.
The topics covered will be Machine Learning, Pipelines and Monitoring. So, here we are going to look at an example of an ETL (Extract, Transform, Load) platform using Celery Pipelines and ELK Stack.
The talk begins with a brief of Machine Learning and the common problems faced. Then we progress further to explain how we tackled the machine learning problem using celery pipelines and monitoring strategies.
There will be a basic showcase from our ETL workflow and some dashboards to explain monitoring using the ELK stack (Elasticsearch, Logstash,Kibana) and Monit.
We will be learning and understanding the performances of the following tech stack.
Simple Queue Service (SQS)
Elastic Cache (Redis)
AWS technologies - Redshift, S3
ELK Stack (Elastic Search, Logstash, Kibana)
Krishnapriya (KP) is a hardcore Data Engineer with over 5 years of experience in the Data Engineering Space and the AWS stack. At Mad Street Den, she is part of the data science team and works closely with Data Scientists to build cost-effective cutting edge data products. She enables them to get their hands on all kinds of data sources in different forms and fidelities using scalable and robust data pipelines and workflows.