Stream Processing in production: Metrics that matter
Submitted by Siddhartha Reddy (@sids) on Monday, 15 June 2015
Section: Crisp Talk
Technical level: Intermediate
Learn which metrics are useful for monitoring the health of stream processing jobs (such as Apache Storm topologies) once they are deployed in production. Also get some ideas on how to capture these metrics (including suggestions for libraries and tools), and how to proactively keep problems from escalating.
Stream processing platforms such as Apache Storm have become commonplace. They power a variety of applications, from real-time analytical dashboards to other data-driven applications such as recommendations. We have also seen Storm employed simply as a distributed, fault-tolerant runtime for applications that need to consume data from a queue and perform some operations on it.
But because these jobs are typically not in the user path, they are often monitored poorly or not at all. Put another way, the only monitoring some of them have is the business folks alerting us by shouting “hey, my analytics dashboards are stale!”
Flipkart’s Data Platform hosts hundreds of stream processing applications, and several of these are critical for our business. As such, we can’t afford not to monitor their health. So we evolved a set of metrics that we monitor for each of these jobs. These metrics appear on our platform health dashboards, which are shown on large TV screens in our team area; we have connected them to our alerting system to warn us about any mishaps; and we have even set up automated corrective actions that are triggered by some of them.
In this talk we’ll describe the metrics we monitor for each stream processing job, how we capture them, the libraries and tools we use, how we track them, and how we act on them.
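To make the idea of a job health metric concrete, here is a minimal sketch in Python of one metric a stream processing job could expose: processing lag, meaning how far behind "now" the most recently processed events are. This is an illustrative assumption and not necessarily one of the metrics covered in the talk; the class name `LagMonitor`, its methods, and the 60-second threshold are all hypothetical.

```python
import time
from collections import deque


class LagMonitor:
    """Hypothetical tracker for processing lag over the most recent events."""

    def __init__(self, max_lag_seconds=60.0, window=100):
        # Assumed alert threshold; a real value depends on the job.
        self.max_lag_seconds = max_lag_seconds
        # Keep only the last `window` lag samples.
        self.lags = deque(maxlen=window)

    def record(self, event_time, now=None):
        """Record the lag for one processed event (timestamps in seconds)."""
        now = time.time() if now is None else now
        self.lags.append(max(0.0, now - event_time))

    def worst_lag(self):
        """Largest lag in the current window, or 0.0 if nothing was recorded."""
        return max(self.lags, default=0.0)

    def is_stale(self):
        """True when the worst recent lag exceeds the alert threshold."""
        return self.worst_lag() > self.max_lag_seconds


monitor = LagMonitor(max_lag_seconds=60.0)
monitor.record(event_time=1000.0, now=1005.0)  # processed 5 s after the event
monitor.record(event_time=1010.0, now=1100.0)  # processed 90 s after the event
if monitor.is_stale():
    print(f"ALERT: job is {monitor.worst_lag():.0f}s behind")
```

In a real deployment the `print` call would presumably be replaced by a call into whatever alerting system is in use; the point of the sketch is only that a stale dashboard can be detected by the job itself rather than by its users.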
Siddhartha is an Architect at Flipkart, working on the company’s central Data Platform. He is responsible, among other things, for developing and operating a multi-tenant stream processing platform.
Aniruddha is a Software Engineer in the Data Platform team at Flipkart. He has worked on building and operating Storm topologies for various stream processing requirements.