The Fifth Elephant 2019

Gathering of 1000+ practitioners from the data ecosystem

Designing a Data Pipeline at Scale

Submitted by Sarthak Dev (@sarthakdev) on Monday, 15 April 2019

Session type: Full talk of 40 mins

Abstract

At Freshworks, we deal with petabytes of data every day. For our data science teams to read online data, run ETL jobs, and push out relevant predictions quickly, it’s imperative to run a robust and efficient data pipeline. In this talk, we’ll go through best practices for designing and architecting such pipelines.

Outline

  • The role of a data engineer

    • Evaluation of role
    • Working with corresponding teams in detail
  • Architecture

    • Designing the data science pipeline
    • Feature engineering
    • Pre-processing
    • R vs Python vs Scala
    • Training vs Serving
  • Scale by design

    • Batch vs Stream
      • Leveraging streaming services (Kafka)
      • Dealing with online event data
      • Batch processing
    • Storage
      • Data-at-rest vs Working with real-time data
  • Building for Freshworks

    • Numbers
    • Complete architecture walkthrough
    • Scaling
  • A quick view of monitoring

    • Monitoring your ETL
    • Health of data
    • Optimising your alerts
      • Webhook alert systems
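
As a rough illustration of the monitoring items above (health of data, webhook alerts), a check on one ETL batch might look like the sketch below. It is not taken from the talk or from the Freshworks pipeline: the names `BatchStats`, `check_health`, and `build_alert_payload`, the thresholds, and the payload shape are all hypothetical.

```python
import json
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BatchStats:
    """Summary of one ETL batch (hypothetical fields, for illustration only)."""
    table: str
    row_count: int
    null_count: int


def check_health(stats: BatchStats, min_rows: int = 1,
                 max_null_ratio: float = 0.05) -> List[str]:
    """Return human-readable problems found in a batch; empty list if healthy."""
    problems = []
    if stats.row_count < min_rows:
        problems.append(
            f"{stats.table}: {stats.row_count} rows, expected at least {min_rows}"
        )
    # Guard against division by zero when the batch is empty.
    if stats.row_count > 0:
        null_ratio = stats.null_count / stats.row_count
        if null_ratio > max_null_ratio:
            problems.append(
                f"{stats.table}: null ratio {null_ratio:.2f} exceeds {max_null_ratio:.2f}"
            )
    return problems


def build_alert_payload(problems: List[str]) -> Optional[str]:
    """Build a JSON body for a webhook POST, or None when there is nothing to report.

    The {"text": ...} shape is an assumption; real webhook receivers differ.
    """
    if not problems:
        return None
    return json.dumps({"text": "\n".join(problems)})
```

Returning `None` for a healthy batch keeps the alert channel quiet unless something is actually wrong, which is one simple way to approach the "optimising your alerts" item. Sending the payload to a webhook endpoint is left out so the sketch stays self-contained.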

Requirements

Laptop

Speaker bio

I’ve been working as a Data Engineer at Freshworks for the last three years. Prior to that, I worked for four years at three early-stage startups (including Airwoot) as a backend/data engineer.

Comments

  • Zainab Bawa (@zainabbawa) Reviewer 7 months ago

    Sarthak, we haven’t received the draft slides and preview video for your proposal. Both have to be uploaded here on or before 10 May.
    Also, the following points came up in the review, and you need to respond to them:

    1. How is your proposal different from this one? https://hasgeek.com/fifthelephant/2019/proposals/10-steps-to-build-your-own-data-pipeline-for-day-1-XzaQns5CSFrdp9uFEDmHrn
    2. The current proposal only describes the data pipelines at Freshworks. What is the takeaway for the audience beyond this description? For example, explain how you arrived at this approach to building data pipelines, and what patterns and anti-patterns you discovered while building pipelines at this scale. Focus either on the approach or on the learnings from the process. A description of the solution becomes uninteresting unless it is tied to a clear insight.
