Sarthak Dev

@sarthakdev

Designing a Data Pipeline at Scale

Submitted Apr 15, 2019

At Freshworks, we deal with petabytes of data every day. For our data science teams to read online data, run ETL jobs, and push out relevant predictions quickly, it's imperative to run a strong and efficient data pipeline. In this talk, we'll go through best practices for designing and architecting such pipelines.

Outline

  • The role of a data engineer

    • Evolution of the role
    • Working closely with adjacent teams
  • Architecture

    • Designing the data science pipeline
    • Feature engineering
    • Pre-processing
    • R vs Python vs Scala
    • Training vs Serving
  • Scale by design

    • Batch vs Stream
      • Leveraging streaming services (Kafka)
      • Dealing with online event data
      • Batch processing
    • Storage
      • Data at rest vs real-time data
  • Building for Freshworks

    • Numbers
    • Complete architecture walkthrough
    • Scaling
  • A quick view of monitoring

    • Monitoring your ETL
    • Health of data
    • Optimising your alerts
      • Webhook alert systems
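
To give a flavour of the monitoring topics above, here is a minimal sketch of a webhook-style ETL alert. The endpoint URL, field names, and threshold check are all hypothetical illustrations, not the system described in the talk:

```python
import json
import urllib.request

# Hypothetical webhook endpoint; replace with your alerting service's URL.
WEBHOOK_URL = "https://alerts.example.com/hooks/etl"

def build_alert(job_name, status, rows_processed, threshold):
    """Build a JSON-serialisable alert payload for an ETL job run.

    Flags the run as unhealthy when it failed, or when it processed
    fewer rows than expected (a simple data-health check).
    """
    healthy = status == "success" and rows_processed >= threshold
    return {
        "job": job_name,
        "status": status,
        "rows_processed": rows_processed,
        "healthy": healthy,
    }

def send_alert(payload, url=WEBHOOK_URL):
    """POST the alert payload to the webhook endpoint."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

# Example: a run that completed but processed suspiciously few rows.
alert = build_alert("daily_user_etl", "success", rows_processed=120, threshold=1000)
```

The point of separating `build_alert` from `send_alert` is that the health logic stays testable without any network dependency, which also makes it easy to route the same payload to multiple webhook consumers.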

Requirements

Laptop

Speaker bio

I’ve been working as a Data Engineer at Freshworks for the last three years. Prior to that, I worked for four years at three early stage startups (including Airwoot) as a backend/data engineer.
