Monitoring Distributed Systems with Riemann
Submitted by Abhishek A Amralkar (@aamralkar) on Wednesday, 26 December 2018
Full talk of 40 mins duration
Monitoring modern real-time distributed infrastructure is complex and expensive. In this talk we explore Riemann, and specifically how Riemann's low latency helped us get real-time metrics from our distributed systems.
Large-scale real-time distributed systems need to emit hundreds of thousands of metrics per second for effective monitoring. A significant portion of those metrics is either of no use or not understood by the people collecting them. As infrastructure grows rapidly, monitoring it in real time and getting accurate metrics becomes challenging, especially when you run an in-house monitoring setup.
Most monitoring systems are pull/poll based: the monitoring system queries the components being monitored. A pull-based system that checks some x values every y minutes is effectively dead for real-time use, because anything that happens between two polls goes unseen.
Riemann is a monitoring tool that aggregates events from hosts, servers and applications and feeds them into a stream processing language, where they can be manipulated, summarized or acted upon. Riemann is fast and highly configurable. Most importantly, it follows an event-centric push model.
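To make the push model concrete, here is a minimal Python sketch of the idea, not Riemann's actual API. Components push events to the monitor as they happen, and each event flows through small composable stream functions. The names `where`, `collect` and `Monitor` are illustrative assumptions, as are the event fields and host names.

```python
from typing import Callable, Dict, List

Event = Dict[str, object]
Stream = Callable[[Event], None]


def where(predicate: Callable[[Event], bool], *children: Stream) -> Stream:
    """Build a stream that forwards an event to its children only if it matches."""
    def stream(event: Event) -> None:
        if predicate(event):
            for child in children:
                child(event)
    return stream


def collect(sink: List[Event]) -> Stream:
    """Build a stream that appends every event it receives to a list."""
    def stream(event: Event) -> None:
        sink.append(event)
    return stream


class Monitor:
    """Hypothetical push receiver: clients call push(), nothing is polled."""

    def __init__(self, *streams: Stream) -> None:
        self.streams = streams

    def push(self, event: Event) -> None:
        # Every top-level stream sees every incoming event.
        for stream in self.streams:
            stream(event)


alerts: List[Event] = []
monitor = Monitor(where(lambda e: e["state"] == "critical", collect(alerts)))

# The monitored hosts send events themselves, as soon as something happens.
monitor.push({"host": "zk1", "service": "latency", "state": "ok", "metric": 12.0})
monitor.push({"host": "zk2", "service": "latency", "state": "critical", "metric": 950.0})
```

Only the second event reaches `alerts`. The point of the sketch is that filtering and reacting happen at the moment an event arrives rather than at the next polling interval.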
We use Riemann to monitor distributed systems. Catching problems in real time requires low-latency monitoring tools that let you see outages sooner, identify the problem, and check immediately whether a fix works. Riemann provides this, along with transient shared state for systems with many moving parts.
Riemann is written in Clojure and builds on Clojure's strengths, such as performance and low latency. Riemann configs are Clojure code.
We will walk through the core concepts of Riemann: events, streams, and indexes.
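As a rough illustration of the index concept named above, the following Python sketch keeps the latest event per (host, service) pair and expires entries whose TTL has passed. It is a simplified model under stated assumptions, not Riemann's implementation: the `Index` class, its method names, the default TTL of 60 seconds, and the explicit `now` argument (used to keep the example deterministic) are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Event = Dict[str, object]


class Index:
    """Toy index: transient state holding the most recent event per (host, service)."""

    def __init__(self) -> None:
        self._events: Dict[Tuple[str, str], Event] = {}

    def update(self, event: Event, now: float) -> None:
        # A newer event for the same host/service replaces the older one.
        key = (str(event["host"]), str(event["service"]))
        self._events[key] = dict(event, time=now)

    def lookup(self, host: str, service: str) -> Optional[Event]:
        return self._events.get((host, service))

    def expire(self, now: float) -> List[Event]:
        """Remove events older than their TTL and return them marked as expired."""
        expired: List[Event] = []
        for key, event in list(self._events.items()):
            ttl = float(event.get("ttl", 60))  # assumed default TTL in seconds
            if now - float(event["time"]) > ttl:
                expired.append(dict(event, state="expired"))
                del self._events[key]
        return expired


index = Index()
index.update({"host": "zk1", "service": "heartbeat", "state": "ok", "ttl": 10}, now=0)

still_fresh = index.expire(now=5)    # within the TTL, nothing expires
gone_stale = index.expire(now=11)    # no new heartbeat arrived within 10 seconds
```

In a push model, expiry is what turns silence into a signal: if a host stops sending events, its entry ages out and the expired event can be routed to alerting like any other event.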
We will also go over how to run Riemann in a production environment and how to write Riemann configs in Clojure.
We will conclude the talk with a demo of monitoring a distributed system such as Apache ZooKeeper.
There are no technical requirements.
I will talk about how Riemann works and how to monitor distributed systems using it. My team and I ran a massive distributed infrastructure in the cloud at five-nines availability with the help of Riemann monitoring and alerting, which let us catch problems in real time and react faster.
I am Abhishek Amralkar, and I lead the Cloud Infrastructure/DevSecOps team at Talentica Softwares, where I design and architect next-generation cloud infrastructure in a cost-effective and reliable manner without compromising on infrastructure and application security. I have experience across technology domains including data center security, cloud operations, cloud automation, infrastructure tooling, and cloud security.
My current focus is on cloud, security operations, Clojure, and other functional languages.