The Fifth Elephant 2016

India's most renowned data science conference

Hadoop & Cloud Storage: Object Store Integration in Production

Submitted by Rajesh Balamohan on Friday, 15 July 2016


Technical level

Intermediate

Section

Crisp talk

Status

Confirmed & Scheduled


Abstract

Today’s typical Apache Hadoop deployments use HDFS for persistent, fault-tolerant storage of big data files. However, emerging architectural patterns increasingly rely on cloud object stores such as Amazon S3, Azure Blob Storage and Google Cloud Storage (GCS), which are designed for cost-efficiency, scalability and geographic distribution. Hadoop supports pluggable file system implementations that enable integration with these systems for use cases such as off-site backup or even complex multi-step ETL pipelines, but applications may encounter unique challenges related to eventual consistency, performance and differences in semantics compared to HDFS.
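As an illustration of the pluggable file system mechanism mentioned above, a Hadoop deployment can address an S3 bucket through the S3A connector with a handful of core-site.xml properties. The sketch below is minimal and illustrative only: the credentials are placeholders, and a production setup would use additional properties (and a credential provider rather than plain-text keys).

```xml
<!-- core-site.xml: minimal S3A setup (illustrative placeholder values) -->
<configuration>
  <!-- Map the s3a:// URI scheme to the S3A FileSystem implementation -->
  <property>
    <name>fs.s3a.impl</name>
    <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
  </property>
  <!-- Placeholder credentials; prefer IAM roles or a credential
       provider over plain-text keys in real deployments -->
  <property>
    <name>fs.s3a.access.key</name>
    <value>YOUR_ACCESS_KEY</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>YOUR_SECRET_KEY</value>
  </property>
</configuration>
```

With this in place, applications can read and write `s3a://bucket/path` URIs through the standard Hadoop FileSystem API, which is what exposes them to the consistency and semantics differences the talk covers.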

Outline

In this session, I will explore the challenges mentioned in the abstract and present recent work to address them: a comprehensive effort spanning multiple Hadoop ecosystem components, including the object store FileSystem connectors, Hive, Tez and ORC. Our goal is to improve correctness, performance, security and operations for users who choose to integrate Hadoop with cloud storage. We use S3 and the S3A connector as a case study.
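One of the use cases from the abstract, off-site backup, can be sketched with DistCp copying a dataset from HDFS to an S3A destination. The NameNode address, dataset path and bucket name below are placeholders, not values from the talk.

```shell
# Illustrative off-site backup of an HDFS directory to S3 via the
# S3A connector; cluster address and bucket name are placeholders.
hadoop distcp \
  hdfs://namenode:8020/warehouse/events \
  s3a://example-backup-bucket/warehouse/events
```

Because DistCp goes through the same FileSystem abstraction as any other job, it is equally exposed to object-store eventual consistency and rename-semantics differences, which is why correctness work in the connector benefits the whole ecosystem.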

Speaker bio

Rajesh Balamohan is a Member of Technical Staff at Hortonworks. He has been working on Hadoop for the last couple of years, recently concentrating on Tez performance at scale. Rajesh is a committer and PMC member on the Apache Tez project.

Slides

https://drive.google.com/open?id=19snM3dNusd7TqonJXrQZ-Vtx0UmHGsRv0lVoFlITCUg
