Rootconf 2014

On devops and cloud infrastructure

Jabir Ahmed


Production is Priority - Self Fix / Heal Techniques

Submitted Jan 29, 2014

To understand how to:

  1. Monitor systems
    a. Nagios
    b. Ganglia
  2. Analyse Root cause
  3. Automate the fix
  4. Log / Record Incidents


Production systems are always P1, and keeping them up and scaling them is what keeps everyone on their toes.

However, we have cracked some important automation that can make a DevOps engineer's life drastically easier.

We let our 1000+ servers across 4 regions heal by themselves, and let the Operations team focus on bigger tasks that could add more impact to the organisation.

This ensures that we are not doing the same task over and over again, and it increases productivity and scalability across the application stacks.

A simple example is logrotate: instead of someone cleaning up logs by hand every day, it repeats that task on your behalf so that old logs get purged daily.
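To make the idea concrete, here is a minimal sketch of that kind of repetitive clean-up written as a script. It is not logrotate itself and not the tooling used in the talk; the function name `purge_old_logs`, the `.log` suffix, and the age threshold are assumptions chosen for illustration.

```python
import os
import time


def purge_old_logs(directory, max_age_days, now=None):
    """Delete *.log files in `directory` older than `max_age_days`.

    Returns the names of the files that were removed.
    `now` can be passed in to make the cutoff reproducible.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # age threshold in seconds
    removed = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        # Only touch regular files that look like logs (assumed naming).
        if not name.endswith(".log") or not os.path.isfile(path):
            continue
        if os.path.getmtime(path) < cutoff:
            os.remove(path)
            removed.append(name)
    return removed
```

Run from cron once a day, a script like this does the purge without anyone having to remember it, which is the same principle the self-healing automation applies to other recurring fixes.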

Question: I have a use case that does not have a solution in the open source community.

Answer: Customise it. You can plug in scripts and hooks to fix the problem.

We will discuss how we do this!
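As an illustration of what such a hook can look like, below is a minimal sketch of a Nagios-style event handler. It is not the speaker's implementation. It assumes Nagios is configured to call the script with the `$SERVICESTATE$`, `$SERVICESTATETYPE$` and `$SERVICEATTEMPT$` macros as arguments; the service name `myapp`, the restart command, and the log path are hypothetical.

```python
import subprocess
import sys
import time


def should_fix(state, state_type, attempt, soft_attempt=3):
    """Decide whether an automatic fix should be attempted.

    Acts on a HARD CRITICAL state, or on one chosen SOFT retry
    (the third by default) so the fix can run before alerts go out.
    """
    if state != "CRITICAL":
        return False
    if state_type == "HARD":
        return True
    return state_type == "SOFT" and attempt == soft_attempt


def handle(state, state_type, attempt, fix_cmd, log_path, run=subprocess.call):
    """Run the fix command when warranted and record the incident."""
    if not should_fix(state, state_type, attempt):
        return False
    rc = run(fix_cmd)  # exit code of the fix command
    with open(log_path, "a") as log:
        log.write("%s state=%s type=%s attempt=%d cmd=%s rc=%d\n" % (
            time.strftime("%Y-%m-%d %H:%M:%S"),
            state, state_type, attempt, " ".join(fix_cmd), rc))
    return True


# Hypothetical invocation by Nagios:
#   handler.py $SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$
if __name__ == "__main__" and len(sys.argv) == 4:
    handle(sys.argv[1], sys.argv[2], int(sys.argv[3]),
           ["service", "myapp", "restart"],   # assumed fix command
           "/var/log/self_heal.log")          # assumed incident log
```

The decision logic is kept separate from the side effects so it can be tested without restarting anything, and every automated fix leaves a line in the incident log for later root cause analysis.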



  1. Nagios
  2. Any flavor of Linux
  3. Bash/Shell scripting
  4. Perl / Python scripting
  5. Programming / Automation
  6. ActiveMQ (an added advantage)

Speaker bio

Jabir Ahmed.
Hadoop Big Data Platform Team @ Inmobi

Tech Lead
Hadoop System Engineer, Yahoo, Bangalore

