The Fifth Elephant 2017

On data engineering and application of ML in diverse domains

Seamless Hadoop Deployments - Myth or Reality?

Submitted by Ragesh Rajagopalan (@rajagopr) on Sunday, 30 April 2017

Section: Crisp talk for data engineering track
Technical level: Beginner
Status: Rejected


Continuous deployment of Hadoop workflows is by and large a distant dream for every Hadoop engineer. Reducing wasted compute resources, improving developer productivity, eliminating costly bugs and avoiding data corruption are basic goals for every deployment. Yet these goals often go unmet due to a lack of comprehensive test coverage and standard best practices, which in turn leaves critical data unavailable to downstream applications. Further, there is huge room for improving developer happiness, which is a direct function of the speed of deployments and the ease of rollback mechanisms. If only Hadoop workflows could benefit from the standard best practices followed for online services, such as code reviews, unit tests and test deployments, the dream of continuous deployment could become a reality.

In this talk, we discuss in detail how we have made this dream a reality by taking every release candidate through the various phases of the software development life cycle.


We will discuss leveraging internal and open-source tools to ensure disciplined, quick and high-quality deployments:

  • Treating every commit as a release candidate
  • Testing with datamock (internal tool)
  • Deploying to a test cluster before deploying to prod
  • Executing flows in Azkaban (open-sourced)
  • Determining the health of your flows with Dr. Elephant
  • One-click deployment with CRT (internal tool)
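To illustrate the Azkaban step above: Azkaban defines workflows through plain-text `.job` property files, where each job declares its type, a command, and its upstream dependencies. The job names and jar files below are hypothetical, just a minimal sketch of how a two-step flow is wired together:

```
# ingest.job -- hypothetical first step of the flow
type=command
command=hadoop jar ingest.jar

# aggregate.job -- hypothetical second step; Azkaban runs it
# only after the "ingest" job completes successfully
type=command
command=hadoop jar aggregate.jar
dependencies=ingest
```

Because dependencies are declared per job, Azkaban can derive the whole DAG from these files, which is also what lets a test cluster execute the exact flow that will later run in prod.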



Speaker bio

I’m currently working as a Senior Software Engineer at LinkedIn, responsible for the development of tools and applications to improve developer productivity. I have around 12 years of experience in the software industry, with significant experience in payments and e-commerce platforms. For the last year and a half I have been working with the data team at LinkedIn and have contributed to open-source projects like Dr. Elephant and Azkaban.

I was responsible for enabling continuous integration for Hadoop and Spark applications at LinkedIn, bringing them on par with the online services.



  • Sandhya Ramesh (@sandhyaramesh) 3 years ago

    Hi Ragesh, could you upload your slide deck so that we can see the contents? Thanks!

  • Zainab Bawa (@zainabbawa) 3 years ago

    Share draft slides, detailing the content you will cover + two-min preview video explaining what this talk is about and why participants should attend.

  • Ragesh Rajagopalan (@rajagopr) Proposer 3 years ago

    Ok, I have updated the link to the draft slides. Is it ok to upload the video a little later this week?
