The Fifth Elephant round the year submissions for 2019

Submit a talk on data, data science, analytics, business intelligence, data engineering and ML engineering

How to make a kickass data platform with spark and S3

Submitted by Anshul Singhle on Jul 1, 2019

Session type: Full talk of 40 mins

Status: Under evaluation

Abstract

In this talk, we will explore the advantages and challenges of running an in-house data platform on Spark and S3. We will also discuss how to add essential features such as autoscaling and access control to such a platform. The latter part of the talk covers ways to organise data in S3, storage formats for big data, and indexing to improve read performance for big-data use cases. Overall, the intention is to share the problems we faced while scaling our data platform and some of the solutions that worked for us.

Outline

  • Introduction to Spark and S3
  • Essential features of a data platform
  • Autoscaling
  • Access Control
  • Storage formats for big data
  • Improving read performance of data in S3
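
As a rough illustration of the last two outline items, one common way to organise data in S3 is a Hive-style partitioned key layout (`dt=YYYY-MM-DD` prefixes), which lets a reader skip whole prefixes instead of scanning every object. The sketch below is an assumption for illustration only, not the approach presented in the talk; the table name `events`, the partition column `dt`, and the helper names are hypothetical, and plain strings stand in for S3 object keys.

```python
from datetime import date


def partition_key(table: str, day: date, filename: str) -> str:
    """Build a Hive-style partitioned object key (hypothetical layout)."""
    return f"{table}/dt={day.isoformat()}/{filename}"


def prune(keys, table, start, end):
    """Keep only keys whose dt= partition falls within [start, end]."""
    prefix = f"{table}/dt="
    selected = []
    for key in keys:
        if not key.startswith(prefix):
            continue  # belongs to another table
        # The partition value is the text between "dt=" and the next "/".
        day_text = key[len(prefix):].split("/", 1)[0]
        if start <= date.fromisoformat(day_text) <= end:
            selected.append(key)
    return selected


# Three daily partitions; only the last two are needed by the query.
keys = [partition_key("events", date(2019, 7, d), "part-0.parquet") for d in (1, 2, 3)]
recent = prune(keys, "events", date(2019, 7, 2), date(2019, 7, 3))
```

Engines such as Spark can apply this kind of pruning themselves when data is written with a partition column, so a filter on that column only touches the matching prefixes.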

Speaker bio

I have been working on big-data pipelines for the past 5 years, first at my startup, retention.ai, and later at inShorts. I currently work as a backend engineer at Zendrive.
