Production is Priority - Self Fix / Heal Techniques

May 2014

14 Wed 10:00 AM – 06:30 PM IST

15 Thu 10:00 AM – 06:30 PM IST

16 Fri 09:30 AM – 10:30 PM IST

17 Sat 09:30 AM – 06:15 PM IST

The Energy & Resources Institute, Bangalore

This submission has been added to the schedule

Submitted Jan 29, 2014

Section: Full talk
Technical level: Intermediate

To understand how to:

1. Monitor systems
   a. Nagios
   b. Ganglia
2. Analyse the root cause
3. Automate the fix
4. Log / record incidents
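
The steps above (monitor, analyse, fix, record) can be sketched as one small loop. This is a minimal illustration only, not the setup described in the talk: the Remediator class, the cause names and the fix callables are all hypothetical, and the health result and cause are assumed to come from a monitoring check such as one run by Nagios.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Remediator:
    """Hypothetical helper: maps a diagnosed cause to a fix and records incidents."""

    # cause name -> callable that attempts the fix and returns True on success
    fixes: Dict[str, Callable[[], bool]]
    incidents: List[dict] = field(default_factory=list)

    def handle(self, check_name: str, healthy: bool, cause: Optional[str]) -> str:
        """Run one monitor -> analyse -> fix -> record cycle for a single check."""
        if healthy:
            return "ok"  # nothing to fix, nothing to record

        fix = self.fixes.get(cause) if cause else None
        if fix is None:
            outcome = "escalated"  # no automated fix known: hand over to a human
        else:
            outcome = "fixed" if fix() else "fix_failed"

        # Record the incident so repeated failures can be analysed later.
        self.incidents.append(
            {"time": time.time(), "check": check_name, "cause": cause, "outcome": outcome}
        )
        return outcome
```

A caller would register one fix per known cause, for example Remediator(fixes={"disk_full": purge_logs}) with purge_logs being whatever function frees the disk, and pass every failing check through handle(). Anything without a registered fix is escalated rather than guessed at.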

Outline

Production systems are always P1, and keeping them up and scaling them is what keeps everyone on their toes.

However, we have cracked some important automation that could make a DevOps engineer's life drastically easier.

We let our 1000+ servers across 4 regions heal by themselves, and let the Operations team focus on bigger tasks that could have more impact on the organisation.

This ensures that we are not doing the same task over and over again, and it increases productivity and scalability across the application stacks.

A simple example is something like logrotate: instead of us cleaning up logs by hand every day, it repeats that task on our behalf so that old logs get purged every day.
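
To make the idea concrete, here is a minimal sketch of the kind of repetitive purge that a tool like logrotate takes off our hands. It is not logrotate itself and does not use its configuration format. The function name purge_old_logs, the *.log pattern and the age threshold are assumptions chosen for illustration.

```python
import time
from pathlib import Path
from typing import List, Optional


def purge_old_logs(log_dir: str, max_age_days: float, now: Optional[float] = None) -> List[str]:
    """Delete *.log files in log_dir last modified more than max_age_days ago.

    Returns the names of the files that were removed.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # 86400 seconds per day

    removed = []
    for path in sorted(Path(log_dir).glob("*.log")):
        # Only regular files older than the cutoff are purged.
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path.name)
    return removed
```

Scheduled once a day (for example from cron), a call such as purge_old_logs("/var/log/myapp", 7) would keep a week of logs; the directory here is a placeholder.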

Question: I have a use case that does not have a solution in the open source community.

Answer: Customise it. You can plug in scripts and hooks to fix the problem.
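
One place such a script can be plugged in is a Nagios event handler, which Nagios runs when a service changes state. The sketch below assumes the handler's command definition passes the $SERVICESTATE$, $SERVICESTATETYPE$ and $SERVICEATTEMPT$ macros as three arguments. The restart command, the service name and the attempt threshold are hypothetical, and this is not necessarily how the speakers do it.

```python
import subprocess
from typing import Callable, Sequence

# Hypothetical fix command; replace with whatever actually repairs the service.
RESTART_COMMAND = ["/etc/init.d/myapp", "restart"]


def should_heal(state: str, state_type: str, attempt: int, soft_attempt: int = 3) -> bool:
    """Decide whether the automated fix should run for this state change."""
    if state != "CRITICAL":
        return False  # OK, WARNING and UNKNOWN are left alone
    if state_type == "HARD":
        return True  # retries are exhausted: the problem is confirmed
    # In a SOFT state, act once on a late retry, before the state turns HARD.
    return state_type == "SOFT" and attempt == soft_attempt


def handle_event(args: Sequence[str], run: Callable = subprocess.call) -> bool:
    """args holds state, state type and attempt number, in that order.

    Returns True if the fix command was run.
    """
    state, state_type, attempt = args[0], args[1], int(args[2])
    if not should_heal(state, state_type, attempt):
        return False
    run(RESTART_COMMAND)
    return True
```

A small wrapper script would call handle_event(sys.argv[1:]) and be registered as the service's event handler. Passing run as a parameter keeps the decision logic testable without restarting anything.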

We will discuss how we do it!