Tailored OS boot process to auto-recover VMs from read-only state
Submitted by Yogesh Patel (@yogeshjp) on Wednesday, 15 February 2017
Full talk of 40 mins duration
Enterprises often host internally on highly available (HA) virtualized environments built on VMware or KVM. An HA virtualized environment needs cost-effective storage to serve as the datastore, and here NFS storage (or NAS; the two terms are used interchangeably in this proposal) wins over SAN. Quite a few companies have adopted this approach. However, although NFS storage is one of the cheapest options, it comes with a risk: lower reliability compared to FC storage.
Using NAS storage renders the virtualized infrastructure susceptible to network outages. These outages have led to operating-system-level file system inconsistencies, leaving VMs in a read-only state that requires manual intervention to fix. Recovering thousands of VMs manually (logging into the management console, launching the server console, repairing the root file system and booting up) warrants a highly orchestrated effort and can be highly time consuming.
Not having a comprehensive solution to recover from such outages poses a high risk for enterprises. Such outages can easily snowball into wider customer-impacting application outages with an undefined Recovery Time Objective (RTO).
Why we want to discuss this:
• Moving to the Cloud, or to fault-tolerant infrastructure, can solve this, but that is yet to be a reality for most enterprises
• To manage current hosting solutions and their limitations, this was a problem that needed to be addressed
There can be different ways to solve this problem:
• Move to the Cloud or to fault-tolerant, auto-scaling infrastructure, which is still some time away
• Offer tiered hosting plans, i.e. Silver, Gold and Platinum plans based on cost and availability factors
• We've tried to solve the problem by letting the VM preemptively repair its file systems and auto-recover, thereby reducing MTTR and making the RTO well defined. This allows companies to keep enjoying a cost-effective HA virtualized environment while improving availability and MTTR
How we did it:
By tailoring the Linux boot process to auto-recover VMs from a read-only state. Join us to learn more.
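This proposal does not spell out the mechanism, so the following is only a minimal sketch of one piece any such approach would likely need: detecting that a file system has ended up mounted read-only. It assumes the standard Linux /proc/mounts layout (device, mount point, type, comma-separated options). The function name read_only_mounts and the watched parameter are hypothetical and are not taken from the talk.

```python
def read_only_mounts(mounts_text, watched=("/",)):
    """Return the watched mount points that are currently mounted read-only.

    mounts_text is the content of /proc/mounts (assumed format:
    device, mount point, fs type, comma-separated options, ...).
    """
    state = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point = fields[1]
        options = fields[3].split(",")
        if mount_point in watched:
            # A mount point can be listed more than once (for example
            # "rootfs" followed by the real device); the last entry wins.
            state[mount_point] = "ro" in options
    return sorted(mp for mp, is_ro in state.items() if is_ro)


def check_system(path="/proc/mounts", watched=("/",)):
    """Convenience wrapper that reads the live mount table on a Linux host."""
    with open(path) as handle:
        return read_only_mounts(handle.read(), watched)
```

On a Linux VM, check_system() would return ["/"] if the root file system is mounted read-only and an empty list otherwise. What happens next (for example, a file system check and repair during boot) is the subject of the talk and is not shown here.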
Introduction - 15 mins
Share the problem statement
Content delivery on how we solved it - 15 mins
We need to be able to project. We will have two speakers for this proposal, Neeraj Saigal and me. We will need two microphones.
Neeraj Saigal works for Intuit India and leads the service delivery function in the Product Infrastructure group. He is passionate about solving problems and coming up with effective solutions for them.
I am Yogesh Patel. I work for Intuit India as a Staff Systems Engineer in the Product Infrastructure group and am responsible for consulting, hosting and deployment strategies for the product teams.