The Fifth Elephant 2017

On data engineering and application of ML in diverse domains

Nishant Bangarwa

@nishantbangarwa

Unlock sub-second SQL analytics over terabytes of data with Hive and Druid

Submitted Jun 7, 2017

Druid is an open-source analytics data store designed for business intelligence (OLAP) queries on time-series data. Druid provides low-latency real-time data ingestion, flexible data exploration and fast data aggregation. Many organizations have deployed Druid to analyze ad-tech, dev-ops, network traffic, website traffic, finance, sensor and IoT data.

Druid’s strengths are compelling, but it lacks some important features, such as large joins and full SQL support. This talk will present how Druid and Apache Hive can be used together to index large amounts of data, query Druid data sources from Hive using SQL, and execute complex Hive queries on top of Druid data sources. We will walk through the architecture of the solution, which leverages Apache Calcite to transparently generate Druid JSON queries from the input Hive SQL queries. We conclude with a demo highlighting the performant and powerful integration of these projects.
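To make the translation target concrete, here is a minimal sketch of the kind of Druid native JSON query that a simple aggregating SQL statement corresponds to. It is hand-built for illustration and is not the output of the Hive and Calcite integration described in the talk. The helper `timeseries_query`, the data source `wiki_edits`, and the columns `added` and `total_added` are all hypothetical names.

```python
import json


def timeseries_query(datasource, interval, granularity, sums):
    """Build a Druid native timeseries query as a plain dict.

    `sums` maps each output column name to the metric field to sum.
    This is an illustrative helper, not part of Hive, Calcite or Druid.
    """
    return {
        "queryType": "timeseries",
        "dataSource": datasource,
        "granularity": granularity,
        "intervals": [interval],  # ISO-8601 interval: start/end
        "aggregations": [
            {"type": "longSum", "name": name, "fieldName": field}
            for name, field in sums.items()
        ],
    }


# Roughly the shape of: a daily SUM(added) AS total_added over
# January 2017 on a (hypothetical) table backed by "wiki_edits".
query = timeseries_query(
    datasource="wiki_edits",
    interval="2017-01-01/2017-02-01",
    granularity="day",
    sums={"total_added": "added"},
)

print(json.dumps(query, indent=2))
```

Writing such JSON by hand for every query is tedious, which is the gap that generating it from SQL is meant to close.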

Outline

Introduction to Hive and Druid
Why Hive + Druid
Architecture
Demo
Performance results

Speaker bio

Nishant is a Druid PMC member and a Software Engineer at Hortonworks, where he is part of the Business Intelligence team. Prior to that, he was part of the Metamarkets backend team and was responsible for analytics infrastructure, including real-time analytics in Druid. He holds a B.Tech in Computer Science from the National Institute of Technology, Kurukshetra, India.
