The Fifth Elephant 2016

India's most renowned data science conference

Abhilash L L


Interactive data transformations at scale

Submitted Apr 29, 2016

One class of ETL tools lets you build pipelines for large datasets, but these tools offer no data-level interactivity. Another class of data-prep tools allows interactive data transformations, but only for a single table (or for datasets that fit in the memory of a single machine). The challenge is to provide the best of both worlds: interactive data transformations on very large datasets involving relational / algebraic operations such as joins and aggregations. In this talk we will look at how we built an interactive, visual experience for very large datasets that has been deployed at a large enterprise.


Problem statement
a) Context / use case
b) Why it's complex

High-level approaches
a) Hardware scaling
b) Sampling

Why sampling is the right option
a) Sampling Techniques
b) How to come up with a sampling size
c) Fallbacks in case of failures
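
To make items a) and b) concrete, here is a minimal sketch of one possible combination: a textbook sample-size estimate (Cochran's formula with a finite-population correction) feeding a single-pass reservoir sampler. The outline does not say which techniques the talk covers, so the function names, the 95% confidence level, and the 5% margin of error are illustrative assumptions, not the speaker's method.

```python
import math
import random
from statistics import NormalDist


def sample_size(population, confidence=0.95, margin=0.05, proportion=0.5):
    """Estimate rows needed to measure a proportion within +/- margin.

    Assumption: Cochran's formula with finite-population correction.
    proportion=0.5 is the most conservative (largest) choice.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # two-sided z-score
    n0 = z * z * proportion * (1 - proportion) / (margin * margin)
    n = n0 / (1 + (n0 - 1) / population)  # correct for finite population
    return min(population, math.ceil(n))


def reservoir_sample(rows, k, seed=None):
    """Uniformly sample k rows from an iterable in one pass (Algorithm R)."""
    rng = random.Random(seed)
    reservoir = []
    for i, row in enumerate(rows):
        if i < k:
            reservoir.append(row)  # fill the reservoir first
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = row  # replace with probability k / (i + 1)
    return reservoir


if __name__ == "__main__":
    k = sample_size(1_000_000_000)
    rows = reservoir_sample(range(100_000), k, seed=42)
    print(k, len(rows))
```

The required sample size barely grows with the size of the table, which is one reason sampling is attractive for interactivity: the preview cost stays roughly constant as the data grows.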

a) Make use of the relational / algebraic information for a more informed sampling strategy
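
One way to read this point: sampling two tables independently often leaves a join between the samples nearly empty, because the sampled keys rarely overlap. The sketch below samples the larger (fact) table first and then keeps only the rows of the other table whose join key appears in that sample. The table shapes, column names, and helper names are hypothetical, and this is an illustration of the general idea rather than the approach presented in the talk.

```python
import random


def join_aware_sample(fact_rows, dim_rows, key, k, seed=None):
    """Sample k fact rows, then keep only dimension rows that can join to them."""
    rng = random.Random(seed)
    fact_sample = rng.sample(fact_rows, min(k, len(fact_rows)))
    wanted = {row[key] for row in fact_sample}  # join keys present in the sample
    dim_sample = [row for row in dim_rows if row[key] in wanted]
    return fact_sample, dim_sample


def inner_join(left, right, key):
    """Hash join: index the right side by key, then probe with the left side."""
    index = {}
    for r in right:
        index.setdefault(r[key], []).append(r)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


if __name__ == "__main__":
    # Hypothetical data: 1000 orders spread over 50 customers.
    orders = [{"order_id": i, "customer_id": i % 50} for i in range(1000)]
    customers = [{"customer_id": c, "name": f"c{c}"} for c in range(50)]

    fact_s, dim_s = join_aware_sample(orders, customers, "customer_id", 20, seed=7)
    joined = inner_join(fact_s, dim_s, "customer_id")
    print(len(fact_s), len(dim_s), len(joined))
```

Because the dimension sample is derived from the fact sample, joining the two samples gives the same rows as joining the fact sample against the full dimension table.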

Speaker bio

Abhilash is currently a principal engineer in the Analytics team at, an enterprise big data warehousing startup based in Silicon Valley. He has spent 7 years scaling applications and loves startups, open source, and distributed systems. He started his exploration as an intern at Motorola Research Labs. He was an early-stage team member at Capillary Technologies. For most of his tenure he has worked on complex OLTP and OLAP systems. He holds a master's degree from IIIT-Bangalore and a bachelor's degree from RNSIT.


