The Fifth Elephant 2016

India's most renowned data science conference

Interactive data transformations at scale

Submitted by Abhilash L L (@abhilashll) on Friday, 29 April 2016

Technical level

Beginner

Section

Sponsored

Status

Submitted


Abstract

One set of ETL tools allows building pipelines for large datasets, but these tools do not provide data-level interactivity. Another set of data-prep tools allows interactive data transformations, but only for a single table (or for datasets that fit in the memory of a single machine). The challenge is to provide the best of both worlds: interactive data transformations on very large datasets involving relational / algebraic operations such as joins and aggregations. In this talk we will look at how we built an interactive, visual experience for very large datasets that has been deployed at a large enterprise.

Outline

Problem statement
a) Context / Use case
b) Why it's complex

High-level approaches
a) Hardware scaling
b) Sampling

Why sampling is the right option
a) Sampling Techniques
b) How to choose a sample size
c) Fallbacks in case of failures
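
The proposal does not say which sampling techniques the talk covers. As one illustration only, reservoir sampling draws a fixed-size uniform sample in a single pass over a dataset whose length is not known in advance, which is one way to keep an interactive preview small. The function name `reservoir_sample` and its parameters are hypothetical and are not taken from the speaker's system.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(rows: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Return up to k rows chosen uniformly at random from a stream.

    Illustrative sketch (classic reservoir sampling); not the speaker's method.
    """
    rng = random.Random(seed)  # seeded so repeated previews are reproducible
    reservoir: List[T] = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            reservoir.append(row)
        else:
            # Keep the new row with probability k / (i + 1),
            # evicting a uniformly chosen existing row.
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = row
    return reservoir


# Hypothetical usage: a 1,000-row preview of a 100,000-row table.
sample = reservoir_sample(range(100_000), k=1000, seed=42)
```

If the input has fewer than `k` rows, the function simply returns all of them, which is a natural behaviour for small tables.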

Future
a) Make use of the relational / algebraic information for a more informed sampling strategy
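
As a hedged illustration of what using relational information could mean, one known idea is to sample by join key rather than by row: if both tables keep exactly the same subset of keys, the sampled tables still join with each other, whereas independent row samples often share few keys. This is a minimal sketch under that assumption; the helper names `keep_key` and `sample_by_key` and the example tables are hypothetical, and the proposal does not state that the talk uses this approach.

```python
import hashlib
from typing import Any, Dict, Iterable, List


def keep_key(key: Any, rate: float) -> bool:
    """Deterministically decide whether a join key is in the sample."""
    digest = hashlib.sha256(repr(key).encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    # Integer comparison avoids float rounding at the edges of the range.
    return bucket < int(rate * 2**64)


def sample_by_key(rows: Iterable[Dict[str, Any]], key_field: str, rate: float) -> List[Dict[str, Any]]:
    """Keep every row whose join key falls in the sampled key set."""
    return [row for row in rows if keep_key(row[key_field], rate)]


# Hypothetical tables that share the join key "customer_id".
orders = [{"customer_id": i % 100, "amount": i} for i in range(1000)]
customers = [{"customer_id": i, "name": f"c{i}"} for i in range(100)]

orders_sample = sample_by_key(orders, "customer_id", 0.2)
customers_sample = sample_by_key(customers, "customer_id", 0.2)

# Every sampled order still has a matching sampled customer.
sampled_customer_ids = {c["customer_id"] for c in customers_sample}
joined = [o for o in orders_sample if o["customer_id"] in sampled_customer_ids]
```

Because the decision depends only on the key's hash, the same key is kept or dropped consistently in every table, so no sampled order loses its customer.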

Speaker bio

Abhilash is currently a principal engineer on the Analytics team at Infoworks.io, an enterprise big data warehousing startup based in Silicon Valley. He has spent 7 years scaling applications and loves startups, open source and distributed systems. He started his exploration as an intern at Motorola Research Labs and was an early-stage team member at Capillary Technologies. For most of his tenure he has worked on complex OLTP and OLAP systems. He holds a master's degree from IIIT-Bangalore and a bachelor's degree from RNSIT.
