How to make a kickass data platform with Spark and S3
In this talk, we will explore the advantages and challenges of running an in-house data platform on Spark and S3. We will also discuss how to add essential features to your platform, such as autoscaling and access control. The latter part of the talk covers ways to organise data in S3, storage formats for big data, and indexing to improve read performance for big-data use cases. Overall, the intention of this talk is to share the problems we faced while scaling our data platform and some of the solutions that worked for us.
- Introduction to Spark and S3
- Essential features of a data platform
- Access Control
- Storage formats for big data
- Improving read performance of data in S3
I have been working on big-data pipelines for the past 5 years, first at my startup, retention.ai, and later at inShorts. I currently work as a backend engineer at Zendrive.