The Fifth Elephant 2016

India's most renowned data science conference


Looking under the hood - demystifying data tools

Submitted by Simrat Hanspal (@simrathanspal) on Friday, 17 June 2016

Section: Crisp talk
Technical level: Intermediate



The goal of this talk is to build an understanding of the performance of the following packages:
R data.frame
R data.table
PySpark RDDs
PySpark DataFrames
These packages serve different but overlapping use cases. Depending on the cardinality of the data and the operations to be performed on it, some are better suited to the task at hand than others. Before taking the plunge into ‘Big Data’, it is important to recognise the point at which one is trying to kill an ant with a sledgehammer. This talk outlines our attempts to find that point. We will not evaluate a plethora of tools, only the ones we considered for our own requirements.


We will cover the design and development of the experiments and present benchmark results across select tabular operations (e.g. join, aggregation) and non-tabular operations (e.g. matrix multiplication, sort/search). To allow further analysis, the code will be open-sourced soon after the talk.
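To make the shape of such an experiment concrete, here is a minimal sketch of a timing harness. It is not the speaker's benchmark code: it uses only the Python standard library as a stand-in for the packages listed above, and every name in it (`make_rows`, `aggregate_sum`, `sort_by_value`, `benchmark`, `run_experiment`) is hypothetical. It varies the number of rows and the key cardinality, then records the median wall-clock time of one tabular operation (group-by sum) and one non-tabular operation (sort).

```python
import random
import statistics
import time
from collections import defaultdict


def make_rows(n_rows, n_keys, seed=0):
    """Generate (key, value) rows; n_keys sets the cardinality of the key column."""
    rng = random.Random(seed)
    return [(rng.randrange(n_keys), rng.random()) for _ in range(n_rows)]


def aggregate_sum(rows):
    """Tabular stand-in: group by key and sum the values."""
    totals = defaultdict(float)
    for key, value in rows:
        totals[key] += value
    return dict(totals)


def sort_by_value(rows):
    """Non-tabular stand-in: sort the rows by their value column."""
    return sorted(rows, key=lambda row: row[1])


def benchmark(operation, rows, repeats=5):
    """Return the median wall-clock time in seconds over `repeats` runs."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        operation(rows)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


def run_experiment(sizes, n_keys=100):
    """Time every operation at every input size; returns a list of result dicts."""
    operations = {"aggregate_sum": aggregate_sum, "sort_by_value": sort_by_value}
    results = []
    for n_rows in sizes:
        rows = make_rows(n_rows, n_keys)
        for name, operation in operations.items():
            results.append(
                {"operation": name, "rows": n_rows, "seconds": benchmark(operation, rows)}
            )
    return results


if __name__ == "__main__":
    for result in run_experiment([1_000, 10_000, 100_000]):
        print(
            f"{result['operation']:>15}  rows={result['rows']:>7}  "
            f"median={result['seconds']:.6f}s"
        )
```

Swapping the two stand-in functions for equivalent operations in each package under test would keep the data generation and timing logic identical across tools, which is one way to keep such comparisons fair. Reporting the median rather than the mean reduces the influence of one-off slow runs.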

Speaker bio

Simrat is a Data Scientist, Engineering Ninja and Inspector Gadget at Mad Street Den. She builds data platforms and models to make sense of user and product data in e-commerce online retail.




  • Noriega (@noriega) 2 years ago (edited 2 years ago)

    Can you post any other link than your company website? Some slides or github page or your blog maybe?

    • Simrat Hanspal (@simrathanspal) Proposer 2 years ago

      I have uploaded draft slides. Please note that these are not complete; the GitHub URL will be released soon.
