Real-time Monitoring of Big Data Workflows
Submitted by Akshay Rai (@akshayrai) on Friday, 28 April 2017
Full talk for data engineering track
Do you want to know the real-time status of your big data job? Not sure how to collect all the metrics from these jobs and make sense of them? Want to track and monitor the metrics in real time? Want to track the historical performance of your job? Want to build business reporting dashboards?
The big data world has a variety of frameworks for running Hadoop and Spark jobs, and tracking these jobs in real time is a huge challenge. The number of frameworks keeps growing, but very few attempts have been made to comprehensively monitor these jobs and optimize them.
In this talk, we will discuss a framework that collects and streams big data metrics from different sources in real time, captures them in a metrics OLAP store, runs analytics on them, and supports monitoring and alerting.
For a company like LinkedIn, where we run thousands of production big data workflows, it is important to dedicate enough resources to all the critical workflows. It is also important from a business perspective that these flows don’t waste resources. In such a scenario, it is crucial to have a system that makes it easy to visualize all the workflow metrics and helps monitor, debug and optimize the workflows. In addition, it would be really cool if we as developers were automatically alerted when something goes wrong in a workflow, for example when the output records suddenly drop or the delay spikes, and could then slice and dice the individual delays to figure out the cause: was it an issue in the workflow itself, or heavy load on the cluster?
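To make the alerting idea concrete, here is a minimal sketch, not the framework presented in the talk. It assumes each workflow run is summarized as a dictionary with a hypothetical `output_records` count and a hypothetical per-component `delays` breakdown; the field names, the component names (`queue_wait`, `execution`) and the thresholds are all illustrative assumptions.

```python
from statistics import mean

def check_run(history, current, drop_ratio=0.5, spike_ratio=2.0):
    """Compare the latest run against the mean of earlier runs.

    The thresholds are illustrative assumptions, not tuned values.
    """
    alerts = []
    base_records = mean(r["output_records"] for r in history)
    base_delay = mean(sum(r["delays"].values()) for r in history)
    # Alert when the output volume falls well below the historical mean.
    if current["output_records"] < drop_ratio * base_records:
        alerts.append("output_records_drop")
    # Alert when the total delay rises well above the historical mean.
    if sum(current["delays"].values()) > spike_ratio * base_delay:
        alerts.append("delay_spike")
    return alerts

def largest_delay_increase(history, current):
    """Slice the delay by component and return the one that grew the most."""
    increases = {
        name: value - mean(r["delays"].get(name, 0) for r in history)
        for name, value in current["delays"].items()
    }
    return max(increases, key=increases.get)

# Hypothetical run summaries; delays are in seconds.
history = [
    {"output_records": 1000, "delays": {"queue_wait": 60, "execution": 300}},
    {"output_records": 1100, "delays": {"queue_wait": 80, "execution": 320}},
]
current = {"output_records": 400, "delays": {"queue_wait": 900, "execution": 310}}

alerts = check_run(history, current)
culprit = largest_delay_increase(history, current)
print(alerts, culprit)
```

In this made-up data the queue wait grows while the execution time stays flat, which would point toward cluster load rather than the workflow itself. A real system would run such checks against the metrics store rather than in-memory lists.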
We will discuss the architecture and design for building such a near-real-time framework, which includes components like Apache Kafka, Samza and OLAP stores. We will also discuss the scope of such a big data metrics store and how consumers like Dr. Elephant can read from it, run analytic processing on the metrics, and potentially auto-tune jobs based on models.
Akshay Rai is an engineer at LinkedIn working with the Grid team. He is also the lead engineer for LinkedIn's open-source Dr. Elephant project. He has been working on solutions to improve developer productivity and on building systems to monitor big data applications in real time.