Rootconf proposals for round the year in 2018

On DevOps, security, cloud and IT infrastructure

Submit proposals on:

  1. Security
  2. DevOps
  3. Cloud
  4. Architecture
  5. Infrastructure
  6. DataOps
  7. Microservices
  8. Distributed Systems

This funnel is open round the year for meetups and smaller conferences in different cities. Submit right away!

Hosted by

Rootconf is a community-funded platform for activities and discussions on the following topics: Site Reliability Engineering (SRE); infrastructure costs, including cloud costs and optimization; and security, including cloud security.

Nandish Madhu

Monitoring - How to make it work?

Submitted Apr 12, 2018

Monitoring is a never-ending but exciting journey for all of us. In this talk, I would like to share a few effective practices that deliver significant outcomes, as measured by the Time to Detect an incident. With the vast number of monitoring tools available and the features they offer, we tend to adjust our requirements to fit what the tool can do. Staying grounded in what matters in monitoring, and developing a simple but effective strategy, will complement the features these tools offer. My presentation focuses on infrastructure monitoring, but the approach can be applied to all monitoring efforts. After covering the monitoring approach that has worked for me and my team in reducing Time to Detect, I will end the presentation by leaving the audience with a few thoughts on the next logical step, which is Time to Restore.
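To make the metric concrete, here is a minimal sketch of how Time to Detect might be computed from incident records. The record layout, the field names (`started`, `detected`) and the sample timestamps are assumptions made up for illustration; they are not taken from the talk or from any particular monitoring tool.

```python
from datetime import datetime, timedelta
from statistics import mean

# Hypothetical incident records: when the fault began and when monitoring flagged it.
incidents = [
    {"id": "INC-1", "started": datetime(2018, 4, 1, 10, 0), "detected": datetime(2018, 4, 1, 10, 12)},
    {"id": "INC-2", "started": datetime(2018, 4, 3, 22, 30), "detected": datetime(2018, 4, 3, 22, 34)},
    {"id": "INC-3", "started": datetime(2018, 4, 9, 6, 15), "detected": datetime(2018, 4, 9, 6, 29)},
]


def time_to_detect(incident):
    """Elapsed time between the start of the fault and its detection."""
    return incident["detected"] - incident["started"]


def mean_time_to_detect(records):
    """Average Time to Detect across a list of incident records."""
    seconds = [time_to_detect(r).total_seconds() for r in records]
    return timedelta(seconds=mean(seconds))


print(mean_time_to_detect(incidents))  # prints 0:10:00
```

Tracking this average over time is one possible way to check whether a change in monitoring practice actually shortens detection.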

Outline

Introduction - 5 mins
Introducing myself and setting the context for what will be covered in the presentation

Content delivery on Time to Detect - 20 mins
• In the main part of the talk, I will start with what comes to mind when we think about monitoring, then move on to how we enable monitoring, how we ensure it is working, and why it is important to spend significant time and effort on validation.
• A few key topics that will be covered:
  • Onboarding process
  • Commitment/ownership from the leaders
  • Validation, validation, validation (various aspects of validation)

Closing notes on the importance of Time to Restore - 5 mins
Effective monitoring and alerting can help improve Time to Detect. Once we know what went wrong, several factors need to be considered to restore services quickly. Reducing business impact is the ultimate goal of any monitoring effort. I would like to share my 2 cents on this as a closing note.

Speaker bio

I work for Intuit and lead the group responsible for Datacenter Network Engineering. Having iterated on multiple approaches to effective monitoring in my present and past assignments, I am passionate about sharing my experience with our friends in the industry.
