The Fifth Elephant 2013

An Event on Big Data and Cloud Computing

Big Data Analytics with R

Submitted by Neeta Pande (@neetapande) on Tuesday, 23 April 2013

Technical level

Intermediate

Section

Analytics and Visualization

Status

Submitted


Total votes: +21

Objective

Attendees will gain an understanding of the High Performance and Parallel Computing landscape in R. This area of R is undergoing rapid change, and the objective of this session is to provide insight into the various active contributions being made to it. In the session, we will also delve deeper into analyzing moderately large data sets, which presents a huge opportunity today as a solution to R's "everything in memory" challenge without getting into heavy infrastructure or software setup and costs.

Description

When we hear about parallelism and big data processing in R, we think of grid computing, parallel computing with Hadoop, or Revolution Analytics, all of which require infrastructure setup and typically a skill set and programming beyond R. These may be required for analyzing really big data sets (terabytes and beyond). For handling data up to a few hundred GB, however, there are packages like ff and bigmemory in R, which can solve a large number of use cases without the need for additional memory or hardware. These techniques, though useful, are not very well known and are the primary focus of this session.
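The core idea behind ff and bigmemory is to keep the data on disk and stream it through memory in manageable chunks rather than loading everything at once. As a rough illustration of that idea, here is a minimal sketch in base R alone (no packages), computing column means of a CSV chunk by chunk; the chunk size and the temporary file are illustrative, and the real packages do this far more efficiently with memory-mapped files:

```r
# Compute column means of a CSV without loading it all into memory,
# by streaming fixed-size chunks through read.csv() on an open connection.
# This mimics the disk-backed, chunked-access idea behind ff and bigmemory.
chunked_colmeans <- function(path, chunk_rows = 10000) {
  con <- file(path, open = "r")
  on.exit(close(con))
  first <- read.csv(con, nrows = 1)            # reads the header plus one row
  sums  <- colSums(first)
  n     <- nrow(first)
  repeat {
    chunk <- tryCatch(
      read.csv(con, header = FALSE, nrows = chunk_rows,
               col.names = names(first)),
      error = function(e) NULL)                # read.csv() errors at end of file
    if (is.null(chunk) || nrow(chunk) == 0) break
    sums <- sums + colSums(chunk)              # only one chunk in memory at a time
    n    <- n + nrow(chunk)
  }
  sums / n
}

# Tiny demonstration on a temporary file
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:100, y = (1:100) * 2), tmp, row.names = FALSE)
print(chunked_colmeans(tmp, chunk_rows = 10))
```

With bigmemory the equivalent workflow would create a file-backed `big.matrix` once and then operate on it in place; the sketch above only shows why chunking keeps the memory footprint bounded by the chunk size, not the data size.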

Speaker bio

Neeta Pande, Data Architect, Intuit: Neeta has about 13 years of experience in Business Intelligence and Analytics. She has extensive experience architecting and engineering data analytics in the BFSI, manufacturing and personal finance domains. Her recent focus areas include usage behavior analysis, real-time customer behavior prediction and contextual personalization service platforms, and designing scalable, sustainable technology platforms for solving big data problems.

Comments

  • t3rmin4t0r (@t3rmin4t0r) 5 years ago

    Bigmemory might be new or unusual for R users, but using mmap to load up data for analysis is pretty much standard operating procedure for C++ folks. I don't write statistical analytics in C++ anymore (numpy has a nice memmap() for me to use), but it is a pretty simple jump from reading everything into the heap.

    What has gotten me excited recently about R is that R/Hadoop enables reading data streams off HDFS (which has become the de-facto data landfill for logging data), with the additional ability to run R as a map-reduce program on truly big data sets.

    The talk is stuck between "small data that fits in memory" and "medium data that fits on a RAID volume". It needs an extra step forward to really handle big data, which is really a large warehouse of terabytes, but might be only a few GB per metric for a normal analytics time window (a day? a week?) - it might even fit within a single R task.

    I would be very interested in learning more about R at those scales.
