Dr. Elephant - Self-Serve Performance Tuning for Hadoop and Spark
Submitted by Akshay Rai (@akshayrai) on Monday, 25 April 2016
Hadoop is a framework that facilitates the distributed storage and processing of large datasets, involving a number of components interacting with each other. Because the framework is large and complex, it is important to make sure every component performs optimally. While we can always optimize the underlying hardware resources, network infrastructure, OS, and other components of the stack, only users have control over optimizing the jobs that run on the cluster.
Dr. Elephant is a tool that helps Hadoop users easily understand, analyze, and tune their Hadoop/Spark applications, improving both their productivity and the cluster’s efficiency. It analyzes Hadoop and Spark jobs using a set of pluggable, configurable, rule-based heuristics that provide insights into how a job performed, and then uses the results to suggest how to tune the job to make it run more efficiently.
Phase 1: I’ll share our experience at LinkedIn optimizing user jobs, the challenges we faced, and how a simple self-serve tool like Dr. Elephant helped overcome them.
Phase 2: I’ll share how we integrated the tool into our development lifecycle and encouraged developers to optimize their jobs with minimal support from Hadoop experts.
Phase 3: This phase will involve discussions about the tool itself: how it analyzes a job by gathering diverse information, how to write custom heuristics and plug them into Dr. Elephant, comparing and analyzing job executions, and more.
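To give a flavor of what a rule-based heuristic looks like, here is a minimal, self-contained sketch in the same spirit: a rule that inspects job metrics and maps them to a severity level. The class, method, and severity names below are illustrative assumptions for this post, not Dr. Elephant’s actual API.

```java
// Illustrative sketch of a rule-based heuristic, similar in spirit to
// Dr. Elephant's: all names here are hypothetical, not the real API.
public class MapperSkewHeuristic {
    // Severity levels, ordered from benign to critical.
    enum Severity { NONE, LOW, MODERATE, SEVERE, CRITICAL }

    // Flag jobs whose slowest mapper runs far longer than the average
    // mapper: a common symptom of skewed input splits.
    static Severity apply(long avgMapperTimeMs, long maxMapperTimeMs) {
        if (avgMapperTimeMs == 0) {
            return Severity.NONE; // no mapper data to judge
        }
        double ratio = (double) maxMapperTimeMs / avgMapperTimeMs;
        if (ratio >= 8) return Severity.CRITICAL;
        if (ratio >= 4) return Severity.SEVERE;
        if (ratio >= 2) return Severity.MODERATE;
        return Severity.LOW;
    }

    public static void main(String[] args) {
        // A heavily skewed job: slowest mapper is 10x the average.
        System.out.println(apply(60_000, 600_000));
        // A well-balanced job: slowest mapper is close to the average.
        System.out.println(apply(60_000, 70_000));
    }
}
```

In Dr. Elephant, each such heuristic is packaged as a pluggable class and registered through configuration, so teams can add their own rules without modifying the core tool; the thresholds above would typically be configurable rather than hard-coded.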
Akshay Rai is an engineer on the Hadoop development team at LinkedIn. He has been working on Dr. Elephant for more than a year and worked extensively to help open source the tool. Since the open source announcement last week, he has been actively engaging in discussions with the community and leading the project.