Rootconf 2017

On service reliability

Nandish Madhu

@nandishmadhu

Monitoring – Does it always work?

Submitted Feb 10, 2017

Availability and uptime of customer offerings are key to the business. Monitoring is an important part of business continuity, yet it never seems to be as complete as we would like it to be.

In this talk, I would like to share a few effective practices, with real-world examples, that we have used to significantly reduce the Time to Detect an incident.

With the large number of monitoring tools available and the features they offer, we tend to adjust our requirements to fit the capabilities of these tools. Staying grounded in what needs to be monitored and nailing the fundamentals is the key to success. My presentation focuses on infrastructure monitoring, but the approach can be applied to any monitoring effort. I will cover the monitoring approach that has worked for me and my team in reducing Time to Detect, and end by leaving the audience with a few thoughts on the next logical step: Time to Restore.

Outline

Introduction - 5 mins
Introducing myself and setting the context for what the presentation will cover

Content delivery on Time to Detect - 20 mins
• I will start by grounding the audience on why monitoring is important.
• A few key topics that will be covered:
  • Commitment/ownership from the leaders
  • Onboarding process for devices to be monitored, and workflow definition
  • Validation, validation, validation (various aspects of validation)

Closure notes with importance of Time to Restore - 5 mins
Effective monitoring and alerting can help improve Time to Detect. Once we know what went wrong, several factors need to be considered to restore services quickly. Reducing business impact is the ultimate goal of any monitoring effort. I would like to share my two cents on this as a closing note.
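As a rough illustration of the two measures discussed in this outline, the sketch below computes a mean Time to Detect and a mean Time to Restore from a list of incident records. It is only a sketch under stated assumptions: the incident data and the field names (`started`, `detected`, `restored`) are hypothetical, and it assumes Time to Detect runs from incident start to detection and Time to Restore from detection to restoration. Teams define these intervals differently, so the definitions should be adjusted to match local practice.

```python
from datetime import datetime, timedelta

# Hypothetical incident records; timestamps are made up for illustration.
incidents = [
    {
        "started": datetime(2017, 1, 5, 10, 0),
        "detected": datetime(2017, 1, 5, 10, 12),
        "restored": datetime(2017, 1, 5, 11, 0),
    },
    {
        "started": datetime(2017, 1, 20, 2, 30),
        "detected": datetime(2017, 1, 20, 2, 34),
        "restored": datetime(2017, 1, 20, 3, 0),
    },
]


def mean_delta(records, start_key, end_key):
    """Average the interval between two timestamps across all records."""
    total = sum((r[end_key] - r[start_key] for r in records), timedelta())
    return total / len(records)


# Assumption: Time to Detect = detected - started.
mean_ttd = mean_delta(incidents, "started", "detected")
# Assumption: Time to Restore = restored - detected.
mean_ttr = mean_delta(incidents, "detected", "restored")

print(mean_ttd)  # 0:08:00
print(mean_ttr)  # 0:37:00
```

Tracking the two intervals separately shows whether an improvement came from faster detection or from faster restoration.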

Requirements

Assuming my laptop can be connected to the projector, I do not foresee any other requirements.

Speaker bio

I work for Intuit and lead the group responsible for Datacenter Network Engineering. Having iterated on multiple approaches to effective monitoring in my present and past assignments, I am passionate about sharing my experience – both wins and challenges – with our friends in the industry.

Slides

https://docs.google.com/presentation/d/1uU3YBGFbV-rlPwAPmSSYWg68UXUbcrKYez-EE7Xzv0k/edit?usp=sharing

Hosted by

We care about site reliability, cloud costs, security and data privacy