The Fifth Elephant 2018

The seventh edition of India's best data conference

Business analytics on the cloud - a scalable model with R

Submitted by Praveen Chandrasekharan (@pchandra) on Sunday, 18 March 2018

Technical level

Intermediate

Section

Crisp talk

Status

Submitted

Total votes:  +30

Abstract

R is a great language for data analysis, and analysts love it, but it is inherently difficult to scale because of its single-threaded nature and its lack of libraries/web frameworks. This talk is about how we worked around those limitations to plug R into a scalable cloud platform. It also covers other design considerations that make it practical to run analytics on larger datasets on a cloud platform with point-and-click execution of functions.

Outline

  1. Problem statement : How do we build an analytics solutioning platform on the cloud with an R backend, and how can we leverage pre-built R functions to enable point-and-click function execution from the browser? The challenges include overcoming R's single-threaded execution, running analytic functions on datasets efficiently from point-and-click actions in the browser, and showing results to users, all in a performant manner
  2. Existing solutions/drawbacks : Microsoft R, Shiny etc
  3. Factors which influenced the solutioning : Preloading functions and horizontal scalability
  4. Queue based architecture with diagram
  5. Building message queue client in R
  6. Point-and-click function execution details
  7. Preload functions and writing an orchestrator in R
  8. Input and output file delivery : Using cloud storage (like Azure File Store) mounted as a local drive on the R servers, with Nginx web servers handling output
  9. Big data processing using SparkR : A separate path through SparkR clusters, chosen based on the function and data size
  10. Efficient mechanisms for showing datasets on the browser
  11. Wrapping up : How the above design considerations have helped us run analytics using R on the cloud from the browser
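
As a rough illustration of outline items 4 to 9, here is a minimal base-R sketch of a long-lived worker that preloads its functions once and then serves queue messages by function name. It is not the platform's actual code: the registry, the message fields (id, fn, dataset), the row threshold for the SparkR path, and the in-memory list standing in for the message queue are all assumptions made for illustration.

```r
# Hypothetical registry of analytics functions, loaded once at worker startup
# so that individual requests do not pay package/function load time.
registry <- new.env()
registry$col_means <- function(data) colMeans(data[sapply(data, is.numeric)])
registry$row_count <- function(data) nrow(data)

# Assumed threshold above which a job would take the SparkR path.
SPARK_ROW_THRESHOLD <- 1e6

# Choose an execution path from the data size (illustrative rule only).
choose_path <- function(n_rows) {
  if (n_rows > SPARK_ROW_THRESHOLD) "sparkr" else "local"
}

# Handle one queue message: look up the preloaded function and run it.
handle_message <- function(msg, datasets) {
  fn <- get0(msg$fn, envir = registry, mode = "function", inherits = FALSE)
  if (is.null(fn)) {
    return(list(id = msg$id, status = "error", error = "unknown function"))
  }
  data <- datasets[[msg$dataset]]
  if (is.null(data)) {
    return(list(id = msg$id, status = "error", error = "unknown dataset"))
  }
  path <- choose_path(nrow(data))
  if (path == "sparkr") {
    # A real platform would forward the job to a Spark cluster here.
    return(list(id = msg$id, status = "forwarded", path = path))
  }
  result <- tryCatch(fn(data), error = function(e) e)
  if (inherits(result, "error")) {
    return(list(id = msg$id, status = "error", error = conditionMessage(result)))
  }
  list(id = msg$id, status = "ok", path = path, result = result)
}

# In-memory stand-in for the message queue; a real worker would poll a broker.
queue <- list(
  list(id = 1, fn = "row_count", dataset = "cars"),
  list(id = 2, fn = "col_means", dataset = "cars"),
  list(id = 3, fn = "not_loaded", dataset = "cars")
)
datasets <- list(cars = mtcars)

# The worker loop: every message reuses the already-loaded registry.
responses <- lapply(queue, handle_message, datasets = datasets)
```

Because each worker is an independent process that only reads from the queue, horizontal scaling in this sketch amounts to starting more workers; the single-threaded nature of R then limits one worker, not the platform.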

Requirements

https://www.youtube.com/edit?o=U&video_id=n8NlwkAyj5M

Speaker bio

I am sharing my experience of building a cloud platform that successfully addressed challenges like scaling R and processing large datasets on the cloud.

LinkedIn profile: www.linkedin.com/in/praveencpillai

Slides

https://docs.google.com/presentation/d/1Vml_k_OXYo3Vp6Cby6PpeKhzDNlUrA3Z6IffH_pzHI4/edit#slide=id.p

Comments

  • 1
    Zainab Bawa (@zainabbawa) Reviewer 7 months ago

    We need a preview video to evaluate this talk.

  • 1
    Praveen Chandrasekharan (@pchandra) Proposer 7 months ago

    I have added the preview video (sorry for the delay, I was travelling).

  • 1
    Venkata Pingali (@venkatapingali) 7 months ago

    This talk is a bit confusing. The limitations of R are well known. Given R’s popularity, it is no surprise that a number of approaches are being experimented with, including ones that handle IO, orchestration, and deployment. The talk should discuss the landscape and the need for a new approach/solution (even if closed source).

    1. Cloud versions of rstudio/shiny (http://www.shinyapps.io/)
    2. MPI integration along with workflow manager (https://cloud.google.com/solutions/running-r-at-scale)
    3. Distributed R (https://github.com/vertica/DistributedR)
  • 1
    Praveen Chandrasekharan (@pchandra) Proposer 7 months ago

    R Server and the hosted Shiny versions are paid services. The Snow/Rslurm/Rmpi options are interesting, but Rslurm/Rmpi again seem to spawn new processes on participating nodes. The disadvantage here is the initial load time of dependent packages and functions (a few hundred extra milliseconds matter if we are looking to serve thousands of parallel executions over the web; the product we were building was a platform for analytics training for individuals and corporates across the globe). DistributedR, ParallellR and a host of other packages deal with the problem of executing costly functions on large datasets more efficiently by splitting the data across parallel threads etc. We tackled that using SparkR.
    Plugging into a queue-based architecture while running an R cluster in the background is all about extending the asynchronous execution paradigm popular in other languages to R. I have added another slide to touch upon existing solutions, please have a look (I will try to refine it further).