The Fifth Elephant 2013

An Event on Big Data and Cloud Computing

It takes two to tango! - Is SQL-on-Hadoop the next big step?

Submitted by Srihari Srinivasan (@srihari) on Friday, 12 April 2013

Section: Storage and Databases

Technical level: Intermediate


This talk explores the trend of SQL-on-Hadoop. It will focus on some of the recent attempts (both open source and commercial) to get SQL running on Hadoop.


Since its early days, the Hadoop community has made several attempts to stretch Hadoop beyond its role as a distributed programming framework. The key strength that Hadoop brings to the table is its ability to scale linearly. Can we combine this advantage of Hadoop with the efficiency of databases? What does it take to run SQL over Hadoop?

Running SQL-on-Hadoop implies accessing data from "within" Hadoop using SQL as the interface. Accomplishing this demands a significant re-architecture of the storage and compute infrastructures.

SQL-on-Hadoop shifts Hadoop's role from a technology viewed so far as complementary to databases into something that could compete with them. It is perhaps the feature that will help Hadoop find its way into more enterprises without them having to reinvent themselves as MapReduce experts. As a result, we perhaps won't need separate data stores for structured and unstructured data in the future!

Speaker bio

Srihari currently heads the technology organization for ThoughtWorks India. He has been a developer and architect for several enterprise applications, with a focus on building large-scale systems based on service-oriented architectures, domain-specific languages etc. He is quite passionate about distributed systems and databases and blogs about them on


  • Govind Kanshi (@govindsk) 6 years ago

    Thanks, Srihari, for sharing your thoughts. This is important, since mass-scale adoption of the underlying Hadoop infra requires a familiar DSL. We are seeing vendor-specific or OSS approaches (hawq, polybase, impala, hwks effort). It will also settle folks into a comfort factor as we go back to looking at familiar operator (physical or logical) cost rather than …when looking at query plans.

  • Srihari Srinivasan (@srihari) Proposer 6 years ago

    Govind, thanks for the comment. Just as an FYI, I intend to cover the architectures of Impala and Polybase in the talk. One is an example of distributed query processing and the other of split query processing.

  • t3rmin4t0r (@t3rmin4t0r) 6 years ago

    I think it would be remiss not to cover Hive, which kicked this off in a big way by enabling SQL to be translated to Map-Reduce.

    Both Polybase & Impala are pretty much SQL-off-Hadoop solutions: they just piggy-back on top of HDFS, skipping MR and just reading data from HDFS (IIRC, in Polybase, HDFS is a second-class citizen in comparison to Azure FS).

    • Srihari Srinivasan (@srihari) Proposer 6 years ago

      Absolutely right! I do cover Hive although not in the same proportion as the other two. Just did not mention it in the earlier comment :)
