The Fifth Elephant 2016

India's most renowned data science conference

Looking under the hood - demystifying data tools


Simrat Hanspal


The goal of this talk is to help build an understanding of the performance of the following packages:
R data.frame
R data.table
PySpark RDDs
PySpark DataFrames
These packages serve different but overlapping use cases. Depending on the cardinality of the data and the operations to be performed on it, some are better suited than others to the task at hand. Before taking the plunge into ‘Big Data’, it is important to understand the point at which one is trying to kill an ant with a sledgehammer. This talk outlines our attempts to find that point. We will not evaluate a plethora of tools, just the ones that we considered for our requirements.


We will cover the design and development of the experiments and present benchmark results across select tabular operations (e.g. join, aggregation) and non-tabular operations (e.g. matrix multiplication, sort/search). The code will be open-sourced soon after the talk to allow further analysis.
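
To give a feel for what such an experiment might look like, here is a minimal sketch of a timing harness. It is not the speaker's code: it uses only the Python standard library as a stand-in for the packages listed above, and every name in it (`make_rows`, `aggregate_sum`, `hash_join`, `time_op`, `run_benchmark`) is hypothetical. The idea it illustrates is simply to run the same tabular operation at increasing row counts and record a robust timing for each.

```python
import random
import statistics
import time
from collections import defaultdict


def make_rows(n, n_keys, seed=0):
    """Build n (key, value) rows with keys drawn from n_keys distinct values."""
    rng = random.Random(seed)
    return [(rng.randrange(n_keys), rng.random()) for _ in range(n)]


def aggregate_sum(rows):
    """Group-by-key sum, a simple tabular aggregation."""
    totals = defaultdict(float)
    for key, value in rows:
        totals[key] += value
    return dict(totals)


def hash_join(left, right):
    """Inner join on key: index the right side, then probe with the left."""
    index = defaultdict(list)
    for key, value in right:
        index[key].append(value)
    return [(key, lv, rv) for key, lv in left for rv in index.get(key, ())]


def time_op(fn, *args, repeats=5):
    """Return the median wall-clock seconds of fn(*args) over several runs."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def run_benchmark(sizes=(1_000, 10_000, 100_000), n_keys=100):
    """Time each operation at each row count; returns a list of result dicts."""
    results = []
    for n in sizes:
        rows = make_rows(n, n_keys)
        lookup = [(k, float(k)) for k in range(n_keys)]  # one row per key
        results.append({
            "rows": n,
            "aggregate_s": time_op(aggregate_sum, rows),
            "join_s": time_op(hash_join, rows, lookup),
        })
    return results


if __name__ == "__main__":
    for row in run_benchmark():
        print(row)
```

Taking the median over a few repeats is one way to dampen noise from warm-up and background load. A real comparison would swap the two pure-Python operations for the equivalent calls in each package under test and keep the data generation and timing loop identical across them.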

Speaker bio

Simrat is a Data Scientist, Engineering Ninja and Inspector Gadget at Mad Street Den. She builds data platforms and models to make sense of user and product data in e-commerce.