Netconf 2020 edition

An unconference on the technical, economic and social aspects of network engineering and infrastructure

Eye on Infrastructure

Submitted by Swaminathan S on Mar 10, 2020

Format of the session: Full talk (40 mins) Status: Under evaluation

Abstract

In this talk, I will share my views on network monitoring.

As we all know, network monitoring is critical across our estate and the cloud.

I will explain in technical detail how we moved from standard monitoring to advanced, automated monitoring. We have developed event correlation engines that perform most of the troubleshooting and repetitive tasks a network engineer used to perform. I will demonstrate how this achieves a better Mean Time To Detect (MTTD) for failures.
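For reference, Mean Time To Detect is commonly computed as the average gap between the moment a failure starts and the moment it is detected. The following is a minimal sketch of that arithmetic; the function name and the sample timestamps are made up for illustration and are not taken from the talk.

```python
from datetime import datetime


def mean_time_to_detect(incidents):
    """Return the average detection delay in seconds.

    incidents -- list of (failure_started, failure_detected) datetime pairs
    """
    if not incidents:
        raise ValueError("at least one incident is required")
    gaps = [(detected - started).total_seconds() for started, detected in incidents]
    return sum(gaps) / len(gaps)


# Illustrative data only: two incidents detected after 60 s and 180 s.
sample_incidents = [
    (datetime(2020, 3, 1, 10, 0, 0), datetime(2020, 3, 1, 10, 1, 0)),
    (datetime(2020, 3, 2, 14, 0, 0), datetime(2020, 3, 2, 14, 3, 0)),
]
mttd_seconds = mean_time_to_detect(sample_incidents)
```

The same calculation applies to Mean Time To Restore if the pairs hold detection and restoration times instead.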

Network monitoring is implemented with Telegraf using SNMP, Amazon Elastic Container Service (ECS), advanced Python scripts and AWS native services such as Lambda. Network elements are monitored using Wavefront, and alerts are sent to customized Slack channels together with runbooks.
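As a rough illustration of the alerting step, the sketch below builds a Slack incoming-webhook message that carries a runbook link. It is not the implementation used in the talk: the alert names, the runbook URL, and both function names are hypothetical, and it assumes a standard Slack incoming webhook that accepts a JSON body with a `text` field.

```python
import json
import urllib.request

# Hypothetical mapping from alert name to runbook URL (illustrative only).
RUNBOOKS = {
    "bgp_peer_down": "https://wiki.example.com/runbooks/bgp-peer-down",
}


def build_slack_alert(alert_name, device, detail, runbooks=RUNBOOKS):
    """Return a Slack incoming-webhook payload for one alert."""
    runbook = runbooks.get(alert_name, "no runbook registered")
    text = (
        f":rotating_light: {alert_name} on {device}\n"
        f"{detail}\n"
        f"Runbook: {runbook}"
    )
    return {"text": text}


def post_to_slack(webhook_url, payload):
    """POST the payload to a Slack incoming webhook and return the HTTP status."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

In a setup like the one described, a function such as `post_to_slack` could be called from a Lambda handler once the monitoring platform raises an alert; the webhook URL would come from configuration rather than from code.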

Advanced event correlation engine for network troubleshooting

The event correlation engine is an automated troubleshooting engine that improves MTTD and Mean Time To Restore (MTTR). When an alert triggers an event, a workflow logs in to the network environment, validates the failure and correlates the dependencies automatically. Network engineers now have the full analysis for an alert within seconds and do not necessarily have to log in to devices to understand the reason for a specific alert. The workflow also reports the potential impact and the action to be taken for a specific alert. It is built on AWS native services such as API Gateway, Step Functions, Lambda and databases.
The resulting conclusion is sent to the network team for remediation.
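One simple way to picture dependency correlation is to treat an alerting device as a symptom whenever something upstream of it is also alerting, and as a likely root cause otherwise. The sketch below shows only that idea; the device names, the dependency map, and the function names are hypothetical, and the engine presented in the talk may work quite differently.

```python
def upstream_devices(device, depends_on):
    """Return every device that `device` relies on, directly or transitively."""
    seen = set()
    stack = list(depends_on.get(device, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(depends_on.get(current, ()))
    seen.discard(device)  # guard against dependency cycles
    return seen


def correlate(alerting, depends_on):
    """Split alerting devices into likely root causes and downstream symptoms."""
    alerting = set(alerting)
    root_causes, symptoms = [], {}
    for device in sorted(alerting):
        failed_upstream = upstream_devices(device, depends_on) & alerting
        if failed_upstream:
            symptoms[device] = sorted(failed_upstream)
        else:
            root_causes.append(device)
    return {"root_causes": root_causes, "symptoms": symptoms}


# Hypothetical topology: two access switches behind one distribution switch.
DEPENDS_ON = {
    "access-sw-1": ["dist-sw-1"],
    "access-sw-2": ["dist-sw-1"],
    "dist-sw-1": ["core-rtr-1"],
}
result = correlate({"access-sw-1", "access-sw-2", "dist-sw-1"}, DEPENDS_ON)
```

With this sample input, `dist-sw-1` is reported as the single root cause and both access switches are listed as symptoms of it, which is the kind of summary that could be forwarded to the network team.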

As a next step, I propose incorporating auto-remediation, which can be reviewed and discussed with the audience.

Target Audience: Networking folks who are interested in advanced network monitoring and automated workflows using Python and cloud services.

Key Takeaways: How advanced monitoring and network troubleshooting automation can be offered as a service that improves Mean Time To Detect and Mean Time To Restore.

Outline

Introduction

Monitoring Network devices using advanced AWS native services

Automated alert notifications via Slack channels with runbooks

Event Correlation Engine

Advanced network troubleshooting automation workflow using Python 3 and advanced AWS services

Automatic incident assignment and auto-remediation: are they possible?

Questions and Answers

Requirements

AWS cloud learning account

Speaker bio

Swaminathan S works as a staff network security engineer at Intuit Technologies and has over 15 years of overall experience in networking. We run datacenter and cloud environments as a hybrid infrastructure supporting business applications, and have implemented multiple monitoring and automation projects.
