The Fifth Elephant 2019

Gathering of 1000+ practitioners from the data ecosystem

Designing a Data Pipeline at Scale

Submitted by Sarthak Dev (@sarthakdev) on Apr 15, 2019

Session type: Full talk (40 mins)
Status: Rejected

Abstract

At Freshworks, we deal with petabytes of data every day. For our data science teams to read online data, run ETL jobs and push out relevant predictions quickly, a robust and efficient data pipeline is essential. In this talk, we'll go through best practices for designing and architecting such pipelines.

Outline

  • The role of a data engineer

    • Evolution of the role
    • Working closely with adjacent teams
  • Architecture

    • Designing the data science pipeline
    • Feature engineering
    • Pre-processing
    • R vs Python vs Scala
    • Training vs Serving (see the pipeline sketch after this outline)
  • Scale by design

    • Batch vs Stream
      • Leveraging streaming services such as Kafka (see the consumer sketch after this outline)
      • Dealing with online event data
      • Batch processing (see the Spark sketch after this outline)
    • Storage
      • Working with data at rest vs real-time data
  • Building for Freshworks

    • Numbers
    • Complete architecture walkthrough
    • Scaling
  • A quick view of monitoring

    • Monitoring your ETL
    • Health of data
    • Optimising your alerts
      • Webhook alert systems (see the alert sketch after this outline)
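
On training vs serving: one way to avoid skew between the two is to ship pre-processing and model as a single artefact. A minimal sketch in Python, assuming scikit-learn and synthetic stand-in data (the talk does not prescribe a specific library):

    import numpy as np
    import joblib
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LogisticRegression

    # Synthetic stand-ins for a real feature matrix and labels.
    X_train = np.random.rand(1000, 4)
    y_train = np.random.randint(0, 2, 1000)

    # Pre-processing and model travel together, so training and
    # serving can never disagree about feature scaling.
    pipeline = Pipeline([
        ("scale", StandardScaler()),
        ("model", LogisticRegression()),
    ])
    pipeline.fit(X_train, y_train)

    # The serving layer loads the same artefact and calls predict(),
    # guaranteeing identical pre-processing at both ends.
    joblib.dump(pipeline, "churn_model.joblib")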
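
For the streaming side, a minimal consumer loop, assuming the kafka-python client and a hypothetical "crm_events" topic; any Kafka client with a similar poll loop would do:

    import json
    from kafka import KafkaConsumer

    # Topic, broker and consumer-group names here are hypothetical.
    consumer = KafkaConsumer(
        "crm_events",
        bootstrap_servers=["localhost:9092"],
        group_id="etl-workers",
        auto_offset_reset="earliest",
        value_deserializer=lambda m: json.loads(m.decode("utf-8")),
    )

    # Each message is one online event; hand it to pre-processing.
    for message in consumer:
        event = message.value
        print(event)  # stand-in for the real handling step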
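
On the batch side, a sketch of a nightly job over data at rest, assuming PySpark and hypothetical S3 paths:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("nightly-feature-build").getOrCreate()

    # Read one day of raw events at rest (paths are hypothetical).
    events = spark.read.json("s3://example-bucket/events/dt=2019-04-14/")

    # Aggregate into per-account features for the training jobs.
    features = events.groupBy("account_id").agg(
        F.count("*").alias("daily_event_count"),
        F.countDistinct("event_type").alias("distinct_event_types"),
    )

    features.write.mode("overwrite").parquet(
        "s3://example-bucket/features/dt=2019-04-14/"
    )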
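
And for alerting, a bare-bones webhook sketch, assuming the requests library and a placeholder webhook URL (Slack-style incoming webhooks accept a JSON payload like this):

    import requests

    # Placeholder URL; point this at your team's incoming webhook.
    WEBHOOK_URL = "https://hooks.example.com/services/T000/B000/XXXX"

    def send_alert(pipeline_name, message):
        """Post an ETL or data-health alert to the team channel."""
        payload = {"text": "[{}] {}".format(pipeline_name, message)}
        response = requests.post(WEBHOOK_URL, json=payload, timeout=5)
        response.raise_for_status()

    # Example: fire when a data-health check trips.
    send_alert("nightly-feature-build", "row count fell 40% below the 7-day average")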

Requirements

Laptop

Speaker bio

I’ve been working as a Data Engineer at Freshworks for the last three years. Prior to that, I worked for four years at three early-stage startups (including Airwoot) as a backend/data engineer.
