The Fifth Elephant 2016

India's most renowned data science conference

Simrat Hanspal

@simrathanspal

Looking under the hood - demystifying data tools

Submitted Jun 17, 2016

The goal of this talk is to help build an understanding of the performance of the following tools:
R data.frame
R data.table
Pandas
NumPy
PySpark RDDs
PySpark DataFrames
Amazon Redshift
While these tools serve different but overlapping use cases, some are better suited than others to the task at hand, depending on the cardinality of the data and the operations to be performed on it. Before taking the plunge into ‘Big Data’, it is important to recognize the point at which one is using a sledgehammer to kill an ant. This talk outlines our attempts to find that point. We will not evaluate a plethora of tools, only the ones we considered for our own requirements.

Outline

We will cover the design and development of the experiments and present benchmark results across select tabular operations (e.g. join, aggregation) and non-tabular operations (e.g. matrix multiplication, sort/search). The code will be open-sourced soon after the talk to allow further analysis.
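To make the idea of such an experiment concrete, here is a minimal sketch of a timing harness in plain Python. It is not the speaker's benchmark code (which had not been released at the time of the proposal). The function names, data sizes, and the choice of a median over repeated runs are all assumptions made for illustration. It times one tabular-style operation (a per-key aggregation) and one non-tabular operation (a sort) at a few data sizes.

```python
import random
import statistics
import timeit
from collections import defaultdict


def make_rows(n_rows, n_keys, seed=0):
    """Generate (key, value) rows; n_keys sets the cardinality of the key column."""
    rng = random.Random(seed)
    return [(rng.randrange(n_keys), rng.random()) for _ in range(n_rows)]


def group_sum(rows):
    """Tabular-style operation: sum of values per key."""
    totals = defaultdict(float)
    for key, value in rows:
        totals[key] += value
    return dict(totals)


def sort_values(rows):
    """Non-tabular operation: sort the value column."""
    return sorted(value for _, value in rows)


def benchmark(func, data, repeats=5):
    """Run func(data) `repeats` times and return the median wall-clock seconds."""
    timings = timeit.repeat(lambda: func(data), number=1, repeat=repeats)
    return statistics.median(timings)


def run_experiments(sizes=(1_000, 10_000), n_keys=100):
    """Time each operation at each data size; returns {(op_name, size): seconds}."""
    results = {}
    for size in sizes:
        rows = make_rows(size, n_keys)
        for func in (group_sum, sort_values):
            results[(func.__name__, size)] = benchmark(func, rows)
    return results


if __name__ == "__main__":
    # Hypothetical sizes; a real comparison would sweep much larger ranges.
    for (name, size), seconds in sorted(run_experiments().items()):
        print(f"{name:12s} n={size:>7d} median={seconds:.6f}s")
```

The same `benchmark` wrapper could in principle be pointed at the equivalent call in each tool under comparison, so that every tool is measured on identical data at each size and key cardinality.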

Speaker bio

Simrat is a Data Scientist, Engineering Ninja and Inspector Gadget at Mad Street Den. She builds data platforms and models to make sense of user and product data in e-commerce.

Slides

https://docs.google.com/presentation/d/1djF_9bUfmCQT98r-nz152-e_tSzzKNDZ87jUgirLi7s/edit?usp=sharing
