Hive and Presto for Big Data Analytics in the Cloud
Submitted by Vikram Agrawal (@vikram) on Tuesday, 20 May 2014
The objective of this talk is to explain how Hive and Presto can be used for big data analytics. We will contrast their architectures and use cases, and describe how to take advantage of both technologies in the cloud.
A big data project typically entails processing terabytes to petabytes of data to produce actionable reports and generate business insights. With the advent of public clouds, it has become easy to provision machines for analytic workflows on demand. Open source projects such as Hadoop, Hive and Presto provide inexpensive big data software for building such projects and have become valuable tools for data integration and analysis. These technologies are production-ready and are running at scale in organizations like Yahoo and Facebook.
Hive provides a massive, fault-tolerant data warehouse for ad-hoc querying and analysis of very large distributed datasets. Presto, on the other hand, is emerging as an alternative to Hive for running interactive analytic queries. It was open sourced by Facebook in late 2013 and is targeted at analysts who expect response times ranging from sub-second to minutes. Since both are SQL implementations for big data, the question arises: do we need both?
At Qubole, we spend a lot of time working on the internals of both Presto and Hive. In this talk, we will use our experiences and observations to explain why both technologies are required in a big data project. We will then contrast the two technologies in terms of architecture and performance. Finally, we will touch upon best practices for running Presto and Hive side by side in a cloud environment, providing intuitive and powerful ways to interact with our data.
Participants need a basic understanding of big data analytics.
Vikram Agrawal is a hacker at Qubole. He is currently focussing on Presto internals, with an emphasis on making it behave well in the cloud. Before Qubole, he co-founded uniRow, an online video conferencing platform, where he led all R&D efforts for the company. He has Bachelor's and Master's degrees in Computer Science from IIT Delhi.
Shubham Tagra works on Presto at Qubole, with an emphasis on feature improvements and performance evaluation against Hive. Before this, he worked at NetApp on Storage Area Networks. Shubham has a Bachelor's degree in Computer Science from NIT Surathkal.