Building a reliable and scalable metrics aggregation and monitoring system

May 2018

7 Mon

8 Tue

9 Wed

10 Thu 08:15 AM – 05:25 PM IST

11 Fri 08:30 AM – 06:20 PM IST

12 Sat

13 Sun

NIMHANS Convention Centre, Bengaluru

This submission has been added to the schedule

Building a reliable and scalable metrics aggregation and monitoring system

Submitted Mar 9, 2018

Section: Full talk
Technical level: Intermediate

Running hundreds of microservices on thousands of VMs that constantly interact with each other is now the norm. As scale increases, ensuring that your system is healthy becomes extremely difficult. You also need important business metrics to help you make further decisions. It is therefore crucial to collect stats about the various services and the servers they run on. But it is not an easy task to gather the millions of metric data points generated every minute from various sources, aggregate them, and ensure seamless querying of those metrics. In this talk, we propose a design for a highly reliable and scalable metrics aggregation system. We will also cover how to build a distributed monitoring system that queries the metrics and sends alerts to your alerting system. We have implemented the proposed solution at Exotel and have been using it for metrics aggregation and monitoring for the past year.

Outline

Why do we need a metrics aggregation & monitoring system?
Various components of a good metrics aggregation & monitoring system
Insights into available products/services for metrics aggregation & monitoring, such as Datadog
Data pipeline design & the reasoning behind the proposed design
Monitoring system design
How to ensure high availability of the monitoring system itself
Findings & future improvements based on our experience

Speaker bio

Vishnu is an SDE 3 at Exotel, a cloud telephony company based in Bengaluru. He focuses on building a reliable and scalable data platform that serves Exotel's various data-related products. His areas of interest are distributed database systems and big data processing. Prior to Exotel, he worked at Amazon Web Services, building systems that provide big data products such as Hadoop, HBase, and Spark as a service to customers.

Apart from work, he is passionate about teaching. He visits colleges and conducts talks & workshops for students on CS topics.

Slides

https://www.slideshare.net/secret/DUGxPUPVtPEq1Y

Comments

Pooja Shah

@p00j4
Hi Vishnu, aggregation and dashboarding sounds very interesting, and I see a lot of potential takeaways in this talk. I have gone through the intro video and slides, and I like that it starts by addressing what didn't work and then why you chose other ideas/tools.
A few quick queries:
- How do you plan to generalise this talk for attendees who are at different levels? Just an example: I believe in taking a small common problem and then building a story on top of it. I hope that makes it easier for the audience to connect, instead of seeing it as only Exotel's problems and solutions.
- Do you plan to open-source this solution?
Posted 6 years ago (edited 6 years ago)
  
  Vishnu Gajendran
  
  @ggvishnu29 Submitter
  Hey Pooja,
  
  Thank you for reviewing my slides. I will explain all the components of the metrics pipeline in detail, but I expect the audience to have some basic knowledge of components such as Kafka and Elasticsearch. We are using open source services (such as Kafka and ES), and there is no Exotel proprietary component in the pipeline. We will upload the configuration of each component to our GitHub repo for reference.
  
  Posted 6 years ago
  
    Pooja Shah
    
    @p00j4
    
    Great, thanks Vishnu. More open source, more good :)
    
    Posted 6 years ago
    

Hosted by

Rootconf

We care about site reliability, cloud costs, security and data privacy

Rootconf 2018
