The Fifth Elephant 2017

On data engineering and application of ML in diverse domains

Plumbing data science pipelines

Submitted by Krishnapriya Satagopan (@kpsatagopan) on Monday, 22 May 2017

Technical level

Intermediate

Section

Crisp talk for data engineering track

Status

Confirmed & Scheduled

Total votes:  +11

Abstract

Data: there is a lot of it. But organizing it can be challenging, and analysis and consumption cannot begin until the data is aggregated and massaged into compatible formats. These challenges grow harder as your datasets increase and as your needs approach the fabled “real-time” status. Here, we’ll talk about how Python can be leveraged to collect data from many sources, standardize it for analysis and consumption, and parallelize processing to scale with volume.

The topics covered are machine learning, pipelines, and monitoring. We will look at an example of an ETL (Extract, Transform, Load) platform built with Celery pipelines and the ELK stack.
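As a rough illustration of the pattern the talk is built around (a minimal sketch, not the speaker's actual code), an ETL flow chains an extract, a transform, and a load stage; in the real platform each stage would be a Celery task composed over a broker, but the shape is the same in plain Python:

```python
# Minimal ETL sketch. In a Celery deployment each function would be a
# @app.task, composed as extract.s() | transform.s() | load.s().

def extract(source_rows):
    """Collect raw records from a source (here, an in-memory list)."""
    return [row for row in source_rows if row]  # drop empty records

def transform(rows):
    """Massage records into a compatible, standardized format."""
    return [{"name": r["name"].strip().lower(), "value": float(r["value"])}
            for r in rows]

def load(rows, warehouse):
    """Append cleaned records to the target store (a stand-in for Redshift/S3)."""
    warehouse.extend(rows)
    return len(rows)

warehouse = []
raw = [{"name": " Alice ", "value": "3.5"}, {}, {"name": "BOB", "value": "2"}]
loaded = load(transform(extract(raw)), warehouse)
print(loaded)  # number of records loaded
```

The names `extract`, `transform`, `load`, and the record shapes are illustrative assumptions; the point is only the staged, composable structure that Celery pipelines generalize.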

Outline

The talk begins with a brief overview of machine learning and the common problems faced. We then explain how we tackled our machine learning problems using Celery pipelines and monitoring strategies.

There will be a short showcase of our ETL workflow, along with some dashboards that illustrate monitoring using the ELK stack (Elasticsearch, Logstash, Kibana) and Monit.
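As one hedged example of the kind of hook such monitoring relies on (an assumption about the setup, not the talk's actual configuration), pipeline workers can emit one structured JSON line per event, which Logstash can ingest with its `json` codec and Kibana can then chart:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line, the shape
    Logstash's json codec expects to parse."""
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "task": getattr(record, "task", "unknown"),
            "message": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("etl")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# A worker would log one line per pipeline stage; Logstash tails the stream.
logger.info("transform finished", extra={"task": "transform"})
```

The field names (`level`, `task`, `message`) are illustrative; any consistent schema works, since Kibana dashboards are built over whatever fields the logs carry.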

We will examine the performance of the following tech stack:

Celery
RabbitMQ
Simple Queue Service (SQS)
ElastiCache (Redis)
AWS technologies - Redshift, S3
ELK Stack (Elasticsearch, Logstash, Kibana)
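To make the role of Celery with RabbitMQ or SQS concrete, here is a stdlib-only sketch (an illustration under stated assumptions, not the production architecture) of the broker-worker pattern those components implement: a shared queue stands in for the broker, and adding workers — threads here, machines in practice — scales throughput horizontally:

```python
import queue
import threading

broker = queue.Queue()           # stands in for RabbitMQ / SQS
results = []
results_lock = threading.Lock()

def worker():
    """Pull tasks from the broker until a None sentinel arrives."""
    while True:
        item = broker.get()
        if item is None:
            broker.task_done()
            break
        with results_lock:
            results.append(item * 2)   # a trivial "transform" task
        broker.task_done()

# Scale horizontally by starting more workers.
workers = [threading.Thread(target=worker) for _ in range(4)]
for w in workers:
    w.start()

for n in range(10):
    broker.put(n)
for _ in workers:
    broker.put(None)                   # one sentinel per worker
for w in workers:
    w.join()

print(sorted(results))  # each task processed exactly once
```

The doubling "task" and worker count are arbitrary placeholders; what Celery adds on top of this shape is serialization, retries, routing, and result backends over a real broker.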

Speaker bio

Krishnapriya (KP) is a hardcore data engineer with over 5 years of experience in the data engineering space and the AWS stack. At Mad Street Den, she is part of the data science team and works closely with data scientists to build cost-effective, cutting-edge data products. She enables them to get their hands on all kinds of data sources, in different forms and fidelities, using scalable and robust data pipelines and workflows.

Slides

https://docs.google.com/a/madstreetden.com/presentation/d/15AlvFxbAeD-Dg4WWLO3zNjIlDzZpbiWzuhxV9foB90k/edit?usp=sharing

Preview video

https://youtu.be/gzVtJAmMN_E

Comments

  • 2
    Krishnapriya Satagopan (@kpsatagopan) Proposer a year ago

    Thanks for your feedback.
    1. Yes. This is the plan for the final presentation. We do plan to give our load testing numbers.
    2. We will only be covering our lessons learnt and experiences in the talk. The goal is to not cover operational experience platform by platform as it is widely available on the internet.
    3. These are just meant to be draft slides. We will be adding more content on the outline and slides as time permits.
    4. The plan is just to set context for the talk by introducing some of our AI and ML products in the retail domain that depend on our data stack.
    5. We are happy to discuss this offline or during Q&A, but a comparison of true streaming vs. mini-batch processing vs. other tools and platforms is not the goal of the talk. These kinds of benchmarks are widely available on the internet.

    We will discuss our lessons learnt in orchestrating and operationalizing a data platform based on a horizontally scaling broker-worker architecture with Python as a first-class citizen. Celery and RabbitMQ are battle-tested frameworks that fit our current needs.

  • 1
    Vinayak Hegde (@vin) a year ago

    There are many ways in which this talk can be improved:
    1. Mention scale with hard numbers and why it works at this scale. Load testing details if any.
    2. Compare this with alternatives, with specific pros/cons and operational experience. Also cover failure modes and limitations.
    3. The slides are quite weak and could do with a lot more details of the problem trying to solve / solution with tradeoffs.
    4. Skip the machine learning part, as there is no time for it in a crisp talk, or make this a longer talk.
    5. Add context on why real-time is important (and quantify it). Also explain why something like micro-batching (as with Spark) or topologies with Storm would not work.
