The Fifth Elephant 2017

On data engineering and application of ML in diverse domains

Beyond unit tests: Deployment and testing for Hadoop/Spark workflows

Submitted by Anant Nag (@nntnag17) on Friday, 28 April 2017

Section: Full talk for data engineering track
Technical level: Intermediate
Status: Rejected


As a Hadoop developer, do you want to develop your Hadoop/Spark workflows quickly? Do you want to test your workflows in a sandboxed environment similar to production? Do you want to write unit tests for your workflows and add assertions on top of them?

In just a few years, the number of users writing Hadoop/Spark jobs at LinkedIn has grown from tens to hundreds, and the number of jobs running every day has grown from hundreds to thousands. With the ever-increasing number of users and jobs, it becomes crucial to reduce the development time for these jobs. It is also important to test these jobs thoroughly before they go to production.

We’ve tried to address these issues by creating a testing framework for Hadoop/Spark jobs. The framework lets users run their jobs in an environment similar to production, on data sampled from the original data. It consists of four parts: a test deployment system, a data generation pipeline that produces the sampled data, a data management system that helps users manage and search the sampled data, and an assertion engine that validates the test output.
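
To make the idea concrete, here is a minimal, purely illustrative sketch in plain Python of the sample-then-assert flow described above. It is not the LinkedIn framework or its API: the names sample_records, count_by_country and assert_output, the record layout, and the 10% sampling fraction are all assumptions made up for this example, and an in-memory list stands in for production data.

```python
import random
from collections import Counter


def sample_records(records, fraction, seed=0):
    """Hypothetical data generation step: keep each record with
    probability `fraction`, using a fixed seed so the sample is reproducible."""
    rng = random.Random(seed)
    return [r for r in records if rng.random() < fraction]


def count_by_country(records):
    """Stand-in for the job under test: count records per country."""
    return Counter(r["country"] for r in records)


def assert_output(output, sampled_input):
    """Hypothetical assertion step: basic sanity checks on the job output."""
    # Every input record must be counted exactly once.
    assert sum(output.values()) == len(sampled_input), "row count mismatch"
    # No empty keys and no non-positive counts.
    assert all(key for key in output), "empty country key in output"
    assert all(count > 0 for count in output.values()), "non-positive count"


# Assumed stand-in for the original (production) data.
production_like = [
    {"member_id": i, "country": ["IN", "US", "DE"][i % 3]} for i in range(1000)
]

sample = sample_records(production_like, fraction=0.1, seed=42)
output = count_by_country(sample)
assert_output(output, sample)
```

Fixing the seed keeps the sampled input stable between runs, so a failing assertion points to a change in the job rather than a change in the test data.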


  1. Brief overview of the problems a Big Data developer faces in testing
  2. Motivation behind the testing framework
  3. Deep dive into the design and architecture of the testing framework
  4. How data scientists and Hadoop developers can leverage the testing framework

Speaker bio

Anant Nag is a Senior Software Engineer at LinkedIn. He has worked on multiple projects involved in the Hadoop workflow lifecycle. He is one of the core developers of two popular open source projects: Dr. Elephant and the LinkedIn Gradle Plugin for Apache Hadoop.

Currently, Anant is focussing on increasing Hadoop developer productivity at LinkedIn. He is working on a testing framework for deploying and testing Hadoop workflows. Anant will also be speaking about the testing framework at DataWorks Summit San Jose 2017.





  • Zainab Bawa (@zainabbawa) 3 years ago

    Please share draft slides detailing the content you will cover, plus a two-minute preview video explaining what this talk is about and why participants should attend.

  • Neon (@neon290) 11 months ago

    Deployment and testing are good to use during a project because they help to identify problems. You can join for more guides on testing and deployment, because they are working in the field.

  • Natalie Portman (@natalie) 2 months ago

    In only a couple of years, the number of users writing Hadoop/Spark jobs at LinkedIn has grown from tens to hundreds, and the number of jobs running each day has grown from hundreds to thousands. With the ever-expanding number of users and jobs, it becomes vital to reduce the development time for these jobs. It is likewise essential to test these jobs thoroughly before they go to production.

  • Nida Amber (@nidamber5) 2 months ago

    My question is: does a logo, or any design, itself need testing? If so, how? I am working on a project where I just needed to ask my customer whether the design is perfect, and he replied, "Test it yourself." So how do I test this? I think it is all the customer's choice. Please help me out.
