Rootconf 2018

On scaling infrastructure and operations

Building a reliable and scalable metrics aggregation and monitoring system

Submitted by Vishnu Gajendran (@ggvishnu29) on Friday, 9 March 2018

Technical level

Intermediate

Section

Full talk

Status

Confirmed & Scheduled

Total votes:  +18

Abstract

In today’s world, running hundreds of microservices on thousands of VMs that constantly interact with each other is the norm. As scale increases, ensuring that your system is healthy becomes extremely difficult. You also need important business metrics to guide further decisions. It therefore becomes crucial to collect stats about the various services and the servers on which they run. But it is not an easy task to gather the millions of metrics data points generated every minute from various sources, aggregate them, and ensure seamless querying of those metrics. In this talk, we propose a design for a highly reliable and scalable metrics aggregation system. We will also cover how to build a distributed monitoring system that queries the metrics and sends alerts to your alerting system. We have implemented the proposed solution at Exotel and have been using it for metrics aggregation & monitoring for the past year.

Outline

Why do we need a metrics aggregation & monitoring system?
Various components of a good metrics aggregation & monitoring system
Overview of available products/services for metrics aggregation & monitoring, such as Datadog
Data pipeline design & reasoning for the proposed design
Monitoring system design
How to ensure high availability of the monitoring system itself?
Findings & Future improvements based on our experience

Speaker bio

Vishnu is an SDE 3 at Exotel, a cloud telephony service company based in Bengaluru. He focuses on building a reliable & scalable data platform that serves Exotel’s various data-related products. His areas of interest are distributed database systems and big data processing. Prior to Exotel, he worked at Amazon Web Services, building systems that provide big data products such as Hadoop, HBase, and Spark as a service to customers.

Apart from work, he is passionate about teaching. He visits colleges and conducts talks & workshops for students on CS topics.

Links

Slides

https://www.slideshare.net/secret/DUGxPUPVtPEq1Y

Preview video

https://youtu.be/Mm2Nj4IjfsA

Comments

  • 1
    Pooja Shah (@p00j4) 7 months ago (edited 7 months ago)

    Hi Vishnu, aggregation and dashboarding sound very interesting, and I see a lot of potential for takeaways in this talk. I have gone through the intro video and slides, and I like that you start by addressing what didn’t work and then explain why you chose other ideas/tools.
    A few quick queries

    • How do you plan to generalise this talk for attendees at different levels? Just an example: I believe in taking a small common problem and then building a story on top of it. I hope that would make it easier for the audience to connect, instead of seeing it as only Exotel’s problems and solutions.
    • Do you plan to open-source this solution?
    • 1
      Vishnu Gajendran (@ggvishnu29) Proposer 7 months ago

      Hey Pooja,

      Thank you for reviewing my slides. I will explain all components of the metrics pipeline in detail, but I expect the audience to have some basic knowledge of components like Kafka, Elasticsearch, etc. We are using open source services (like Kafka, Elasticsearch, etc.), and there is no Exotel proprietary component in the pipeline. We will upload the configurations of each component to our GitHub repo for reference.

      • 1
        Pooja Shah (@p00j4) 7 months ago

        Great, thanks Vishnu. More open source, more good :)
