Preparing for failure - resilient system architecture
Submitted by Soham Chakraborty (@sohamchakraborty) on Saturday, 30 January 2016
Systems do fail. There is a multitude of components that could fail at any time. One could therefore deliberately introduce factors that might lead to failure, observe how the system responds, and thus eliminate one source of possible future failure. This talk aims to provide some such ideas.
If we host our infrastructure in the cloud, we must consider the components that are beyond our organization's control: hardware, the underlying virtualization layer, security issues, or anything else. It is nearly impossible to predict what could go wrong. If we deliberately introduce ‘agents-of-failure’, however, we get an overview of what could fail and when it could fail. This gives us a context, perhaps hitherto not discussed, for thinking about an approach that might mitigate that failure. Netflix is a pioneer in this approach, and we will pick up certain methods they used to illustrate why thinking along those lines could help others as well.
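To make the idea of an agent of failure concrete, here is a minimal sketch in Python. It is not Netflix's tooling; the `Instance` class, the `serve` and `inject_failure` functions, and the `web-N` names are all hypothetical, and the "fleet" is simulated in memory rather than running in a real cloud. It only illustrates the loop of killing something at random and checking whether the service still answers.

```python
import random


class Instance:
    """Hypothetical in-memory stand-in for a server; not a real cloud API."""

    def __init__(self, name):
        self.name = name
        self.alive = True

    def handle(self, request):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request}"


def serve(fleet, request):
    """Try each instance in turn and return the first successful response."""
    for instance in fleet:
        try:
            return instance.handle(request)
        except ConnectionError:
            continue
    raise RuntimeError("no healthy instance left")


def inject_failure(fleet, rng):
    """Agent of failure: kill one randomly chosen live instance, if any."""
    live = [i for i in fleet if i.alive]
    if not live:
        return None
    victim = rng.choice(live)
    victim.alive = False
    return victim


if __name__ == "__main__":
    rng = random.Random(2016)  # seeded so a run can be repeated
    fleet = [Instance(f"web-{n}") for n in range(3)]
    victim = inject_failure(fleet, rng)
    print("killed:", victim.name)
    print(serve(fleet, "GET /"))  # should still succeed with two instances left
```

Running the injection repeatedly until `serve` raises shows how much failure this toy fleet tolerates, which is the kind of overview the deliberate approach is meant to give.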
Along with that, preparing for such events builds the habit of thinking in terms of disposable systems: if a system is unhealthy, we replace it with a fresh one instead of trying to make it healthy again. A post-mortem of what failed can be done later.
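The disposable-systems habit can be sketched the same way. This is an illustration under stated assumptions, not a real orchestration API: `Server`, `launch`, and `replace_unhealthy` are hypothetical names, and `launch` merely creates an in-memory object where a real setup would ask the cloud provider for a new machine.

```python
import itertools
from dataclasses import dataclass


@dataclass
class Server:
    """Hypothetical record for one machine in the fleet."""
    name: str
    healthy: bool = True


_ids = itertools.count(1)


def launch():
    """Hypothetical provisioning call; here it only builds a fresh record."""
    return Server(name=f"server-{next(_ids)}")


def replace_unhealthy(fleet, quarantine):
    """Swap every unhealthy server for a fresh one.

    Failed servers are not repaired; they are moved to `quarantine`
    so that a post-mortem can be done later.
    """
    new_fleet = []
    for server in fleet:
        if server.healthy:
            new_fleet.append(server)
        else:
            quarantine.append(server)
            new_fleet.append(launch())
    return new_fleet


if __name__ == "__main__":
    fleet = [launch() for _ in range(3)]
    fleet[1].healthy = False  # simulate a failure
    quarantine = []
    fleet = replace_unhealthy(fleet, quarantine)
    print("fleet:", [s.name for s in fleet])
    print("kept for post-mortem:", [s.name for s in quarantine])
```

Keeping the failed server in a separate list rather than deleting it mirrors the point above: recovery comes first, and the investigation can wait.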
This talk will offer a few thoughts along these lines.
Soham Chakraborty is a systems operations engineer at Pythian. Prior to that, he worked at Red Hat, IPsoft and Poornam Infovision.