Eye on Infrastructure
In this talk, I will share my perspective on network monitoring.
We all know how important network monitoring is across our estate and in the cloud.
I will explain in technical detail how we moved from standard monitoring to advanced automated monitoring. We have developed event correlation engines that perform most of the troubleshooting and repetitive tasks that a network engineer used to carry out by hand. I will demonstrate how this helps us achieve a better Mean Time To Detect for failures.
The monitoring is implemented with Telegraf using SNMP, Amazon Elastic Container Service (ECS), advanced Python scripts, and AWS native services such as Lambda. Network elements are monitored in Wavefront, and alerts are sent to customized Slack channels together with their runbooks.
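To make the alerting step concrete, here is a minimal sketch of a Lambda-style handler that turns an alert into a Slack message carrying a runbook link. It is an illustration only, not the implementation shown in the talk: the alert fields (name, severity, device), the RUNBOOKS mapping, the example URLs, and the SLACK_WEBHOOK_URL environment variable are all assumptions made for this example.

```python
import json
import os
import urllib.request

# Hypothetical mapping from alert name to runbook link (placeholder URLs).
RUNBOOKS = {
    "interface_down": "https://wiki.example.com/runbooks/interface-down",
    "high_cpu": "https://wiki.example.com/runbooks/high-cpu",
}


def build_slack_message(alert: dict) -> dict:
    """Build a Slack incoming-webhook payload from an alert dict."""
    runbook = RUNBOOKS.get(alert["name"], "no runbook registered")
    text = (
        f"[{alert['severity'].upper()}] {alert['name']} on {alert['device']}\n"
        f"Runbook: {runbook}"
    )
    return {"text": text}


def post_to_slack(webhook_url: str, message: dict) -> int:
    """POST the payload to a Slack incoming webhook and return the HTTP status."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(message).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status


def lambda_handler(event, context):
    """Lambda entry point; assumes the event itself is the alert dict."""
    webhook_url = os.environ["SLACK_WEBHOOK_URL"]  # assumed configuration
    status = post_to_slack(webhook_url, build_slack_message(event))
    return {"statusCode": status}
```

Keeping the message builder separate from the network call means the formatting and runbook lookup can be unit tested without a Slack workspace.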
Advanced event correlation engine for network troubleshooting
The event correlation engine is an automated troubleshooting engine that improves Mean Time To Detect (MTTD) and Mean Time To Restore (MTTR). When an alert triggers an event, the workflow logs in to the network environment, validates the failure, and correlates the dependencies automatically. Network engineers now have the full analysis for an alert within seconds and do not necessarily have to log in to devices to understand the reason for a specific alert. The workflow also reports the potential impact and the action to take for a specific alert. It is built on AWS native services such as API Gateway, Step Functions, Lambda, and databases.
The resulting conclusion is sent to the network team for remediation.
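The dependency correlation step described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the engine presented in the talk: the UPSTREAM topology, the device names, and the rule "blame the highest alerting upstream device" are all hypothetical choices made for the example.

```python
from dataclasses import dataclass

# Hypothetical topology: each device maps to the device it depends on upstream.
UPSTREAM = {
    "access-sw-01": "dist-sw-01",
    "access-sw-02": "dist-sw-01",
    "dist-sw-01": "core-rtr-01",
    "core-rtr-01": None,
}


@dataclass
class Finding:
    root_cause: str   # device the alerts are attributed to
    symptoms: list    # downstream devices whose alerts are explained by it


def correlate(alerting: set) -> list:
    """Group alerting devices under their highest alerting upstream device."""
    grouped = {}
    for device in sorted(alerting):
        root = device
        parent = UPSTREAM.get(device)
        # Walk up the dependency chain and remember the highest alerting ancestor.
        while parent is not None:
            if parent in alerting:
                root = parent
            parent = UPSTREAM.get(parent)
        grouped.setdefault(root, [])
        if device != root:
            grouped[root].append(device)
    return [Finding(root, symptoms) for root, symptoms in sorted(grouped.items())]


if __name__ == "__main__":
    for finding in correlate({"access-sw-01", "access-sw-02", "dist-sw-01"}):
        print(finding)
```

With this rule, alerts from two access switches and their shared distribution switch collapse into a single finding on the distribution switch, which is the kind of summary an engineer would otherwise assemble by logging in to each device.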
As a next step, I propose incorporating auto-remediation, which can be reviewed and discussed with the audience.
Target Audience: Networking folks who are interested in advanced network monitoring and automated workflows using Python and cloud services
Key Takeaways: How advanced monitoring and network troubleshooting automation can be offered as part of a service to improve Mean Time To Detect and Mean Time To Restore.
Monitoring Network devices using advanced AWS native services
Automated alert notifications via Slack channels with runbooks
Event Correlation Engine
Advanced network troubleshooting automation workflow using Python 3 and advanced AWS services
Auto incident assignment and auto-remediation: are they possible?
Questions and Answers
AWS cloud learning account
Swaminathan S is a staff network security engineer at Intuit Technologies with over 15 years of experience in networking. We run datacenter and cloud environments as a hybrid infrastructure supporting business applications, and have implemented multiple monitoring and automation projects.