The Fifth Elephant 2014

A conference on big data and analytics

Hive and Presto for Big Data Analytics in the Cloud

Submitted by Vikram Agrawal (@vikram) on Tuesday, 20 May 2014

Technical level

Intermediate

Section

Full talk

Status

Submitted

Total votes:  +19

Objective

The objective of this talk is to explain how Hive and Presto are used for big data analytics. We will contrast their architectures and use cases, and describe how to take advantage of both technologies in the cloud.

Description

A big data project typically entails processing terabytes to petabytes of data to produce actionable reports and generate business insights. With the advent of public clouds, it is easy to provision machines for analytic workflows on demand. Open source projects such as Hadoop, Hive and Presto provide inexpensive big data software for building such projects and have become valuable tools for data integration and analysis. These technologies are production-ready and are running at scale in organizations like Yahoo and Facebook.

Hive provides a massive, fault-tolerant data warehouse for ad-hoc querying and analysis of very large distributed datasets. Presto, on the other hand, is emerging as an alternative to Hive for running interactive analytic queries. It was open sourced by Facebook in late 2013 and is targeted at analysts who expect response times ranging from sub-second to minutes. Since both are SQL implementations for big data, the question arises: do we need both?

At Qubole, we spend a lot of time working on the internals of both Presto and Hive. In this talk, we will use our experiences and observations to explain why both technologies are required in a big data project. We will then contrast the two technologies in terms of architecture and performance. Finally, we will touch upon best practices for running Presto and Hive side by side in a cloud environment, providing intuitive and powerful ways to interact with our data.

Requirements

Participants need a basic understanding of big data analytics.

Speaker bio

Vikram Agrawal is a hacker at Qubole. He is currently focusing on Presto internals, with an emphasis on making it behave well in the cloud. Before Qubole, he co-founded uniRow, an online video conferencing platform, where he led all R&D efforts for the company. He has Bachelor's and Master's degrees in Computer Science from IIT Delhi.

Shubham Tagra works on Presto at Qubole, with an emphasis on feature improvements and performance evaluation against Hive. Before this, he worked at NetApp on Storage Area Networks. Shubham has a Bachelor's degree in Computer Science from NIT Surathkal.

Links

Slides

http://www.slideshare.net/qubolemarketing/big-dataproposal

Comments

  • 1
    Vinayak Hegde (@vin) 4 years ago

    Hi Vikram,

    Can you add slides to your proposal? It will help us understand the topic a bit more. Please also add links to Presto. Links to docs with benchmarks and architecture would be useful.

  • 1
    Vikram Agrawal (@vikram) Proposer 4 years ago

    Sure, Vinayak. I will add the slides and the links within a couple of days.