The Fifth Elephant 2017

On data engineering and application of ML in diverse domains

Ragesh Rajagopalan

@rajagopr

Seamless Hadoop Deployments - Myth or Reality?

Submitted Apr 30, 2017

Continuous deployment of Hadoop workflows is, by and large, a distant dream for every Hadoop engineer. Reducing wasted compute resources, improving developer productivity, eliminating costly bugs and avoiding data corruption are basic goals for every deployment. Yet these goals are often not achieved because of a lack of comprehensive test coverage and standard best practices. This in turn makes critical data unavailable to downstream applications. Further, there is huge room for improving developer happiness, which is a direct function of the speed of deployments and the ease of rollback. If only Hadoop workflows could benefit from the standard best practices followed for online services, such as code reviews, unit tests and test deployments, the dream of continuous deployment could become a reality.

In this talk, we discuss in detail how we have made this dream a reality by taking every release candidate through the various phases of the software development life cycle.

Outline

We will discuss how we leverage internal and open-source tools to ensure disciplined, quick and high-quality deployments.

Discipline:
Treating every commit as a release candidate; testing with datamock (internal tool); deploying to a test cluster before deploying to prod

Quality:
Executing flows in Azkaban (open-sourced); determining the health of your flows with Dr. Elephant

Speed:
One-click deployment with CRT (internal tool)
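The outline above amounts to a gated promotion: each commit is a release candidate that has to clear every stage in order before it reaches prod. The sketch below is a minimal illustration of that idea only. `ReleaseCandidate`, `promote` and the stage names are hypothetical, and the stub checks stand in for real steps; this is not the interface of datamock, CRT, Azkaban or Dr. Elephant.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class ReleaseCandidate:
    """A single commit treated as a release candidate (hypothetical type)."""
    commit: str
    passed: List[str] = field(default_factory=list)  # stages cleared so far


# A stage is a name plus a check that returns True when the candidate passes.
Stage = Tuple[str, Callable[[ReleaseCandidate], bool]]


def promote(rc: ReleaseCandidate, stages: List[Stage]) -> bool:
    """Run stages in order and stop at the first failure.

    Because later stages never run after a failure, a candidate that
    fails on the test cluster is never deployed to prod.
    """
    for name, check in stages:
        if not check(rc):
            return False
        rc.passed.append(name)
    return True


if __name__ == "__main__":
    # Stub checks; real ones would run tests, submit flows, query health, etc.
    stages: List[Stage] = [
        ("unit tests with mock data", lambda rc: True),
        ("test cluster run", lambda rc: True),
        ("flow health check", lambda rc: True),
        ("prod deploy", lambda rc: True),
    ]
    rc = ReleaseCandidate(commit="abc123")  # placeholder commit id
    print(promote(rc, stages), rc.passed)
```

Keeping the stages as an ordered list makes the ordering explicit and makes it easy to see which stage a candidate last cleared when a promotion stops.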

Requirements

None

Speaker bio

I’m currently working as a Senior Software Engineer at LinkedIn, responsible for the development of tools and applications to improve developer productivity. I have around 12 years of experience in the software industry, with significant experience in payments and e-commerce platforms. For the last one and a half years I have been working with the data team at LinkedIn and have contributed to open-source projects like Dr. Elephant and Azkaban.

I was responsible for enabling continuous integration for Hadoop and Spark applications at LinkedIn, bringing them on par with the online services.

Slides

https://docs.google.com/presentation/d/1wbEK17IjV0hUUXbM8Ay5buhvbtWFbsK9kVi6o55EbNs/edit?usp=sharing

Hosted by

Jump starting better data engineering and AI futures