The Fifth Elephant 2012

Finding the elephant in the data.

Joydeep Sen Sarma

@jsensarma

The Elephant in the Cloud

Submitted May 23, 2012

How do you build a big data service in the Cloud? How can we make queries against relatively slow Cloud Storage Systems fast? How can we take real advantage of the elasticity available in the Cloud? How do you make the Cloud dead easy to use for big data processing?

At Qubole we have been searching for answers to these questions and would love to share what we have discovered and built.

Outline

Hadoop and the frameworks built on top of it, such as Hive, are popular applications to run in the Cloud. The Cloud architecture, though, differs significantly from a regular data center in its elasticity, its latency characteristics and its pricing models. It can also be daunting for a lay user to understand and set up. In this talk we will describe how Qubole Data Service has adapted Hadoop and Hive to fit and exploit the Cloud architecture and to make big data processing easy and accessible to all. The agenda will be roughly as follows:

  1. Start by covering some key characteristics of the Cloud.
  2. Describe the current state of the art in running the Big Data stack in the Cloud, along with its problems and opportunities for improvement.
  3. Describe the Qubole architecture and how we have attempted to tackle some of these problems.
  4. Demonstrate some of the usability enhancements and go over some performance comparisons.

Speaker bio

Joydeep is a co-founder at Qubole and heads their India development team. Prior to starting Qubole, Joydeep worked at Facebook, where he bootstrapped the data processing ecosystem based on Hadoop, started the Apache Hive project and led the Data Infrastructure team. Joydeep was a key contributor on the Facebook Messages architecture team that brought Apache HBase to Facebook and to the transactional and reporting backends for Facebook Credits. He has been a driver for other important sub-projects in the Hadoop ecosystem, such as the FairScheduler and RCFile. Joydeep studied Computer Science at IIT-Delhi and the University of Pittsburgh and started his career working on Oracle’s database kernel and building highly available and scalable file systems at NetApp. In between, he has played founding roles in storage and advertising startups. He cut his teeth building data-driven applications as the lead engineer on Yahoo’s in-house Recommendation Platform.

Joydeep holds numerous patents, has published many papers and has been both a speaker and a panelist at Hadoop summits and other Silicon Valley conferences.
