Interactive data transformations at scale
Submitted by Abhilash L L (@abhilashll) on Friday, 29 April 2016
One set of ETL tools allows building pipelines for large datasets, but these tools do not provide data-level interactivity. Another set of data-prep tools allows interactive data transformations, but only for a single table (or for datasets that fit in the memory of a single machine). The challenge is to provide the best of both worlds: interactive data transformations for very large datasets involving relational / algebraic operations such as joins and aggregations. In this talk we will look at how we built an interactive, visual experience for very large datasets that has been deployed at a large enterprise.
a) Context / Use case
b) Why it's complex
High-level approaches
a) Hardware scaling
Why sampling is the right option
a) Sampling Techniques
b) How to come up with a sampling size
c) Fallbacks in case of failures
a) Make use of the relational / algebraic information for a more informed sampling strategy
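As an illustration of what using relational information for sampling could mean, here is a minimal Python sketch. It is not the approach presented in the talk, which the outline does not spell out. It assumes a hash-based sample on the join key, so that both sides of a join keep exactly the same keys and the sampled join stays consistent with the full join. The table names, the column names, and the helpers `keep_key`, `sample_by_key`, and `hash_join` are all hypothetical.

```python
import hashlib


def keep_key(key, rate):
    """Deterministically decide whether a join key belongs to the sample.

    The key is hashed to a number in [0, 1); it is kept if that number
    is below the sampling rate. The same key always gets the same answer.
    """
    digest = hashlib.md5(str(key).encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


def sample_by_key(rows, key_field, rate):
    """Keep only the rows whose join key falls inside the sample."""
    return [row for row in rows if keep_key(row[key_field], rate)]


def hash_join(left, right, key_field):
    """Plain in-memory inner join on key_field (illustration only)."""
    index = {}
    for row in right:
        index.setdefault(row[key_field], []).append(row)
    joined = []
    for left_row in left:
        for right_row in index.get(left_row[key_field], []):
            joined.append({**right_row, **left_row})
    return joined


# Hypothetical tables: 10,000 orders referencing 1,000 customers.
orders = [{"customer_id": i % 1000, "amount": i} for i in range(10000)]
customers = [{"customer_id": i, "region": "r%d" % (i % 5)} for i in range(1000)]

rate = 0.1  # assumed sampling rate, chosen only for this example
orders_sample = sample_by_key(orders, "customer_id", rate)
customers_sample = sample_by_key(customers, "customer_id", rate)
joined_sample = hash_join(orders_sample, customers_sample, "customer_id")
```

Because both tables are filtered by the same function of `customer_id`, every sampled order still finds its customer. Sampling each table independently at random would instead leave most sampled orders without a matching customer row, which is one reason a join-aware strategy can matter for interactive previews.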
Abhilash is currently a principal engineer in the Analytics team at Infoworks.io, an enterprise big data warehousing startup based in Silicon Valley. He has spent 7 years scaling applications and loves startups, open source and distributed systems. He started his exploration as an intern at Motorola Research Labs and was an early-stage team member at Capillary Technologies. For most of his career he has worked on complex OLTP and OLAP systems. He holds a master's degree from IIIT-Bangalore and a bachelor's degree from RNSIT.