Rootconf 2017

On service reliability

Tailored OS boot process to auto-recover VMs from read-only state

Submitted by Yogesh Patel (@yogeshjp) on Feb 15, 2017

Section: Full talk of 40 mins duration
Technical level: Intermediate
Status: Rejected


Enterprises often host internally, on HA virtualized environments running on VMware or KVM. An HA virtualized environment needs cost-effective storage to serve as the datastore, and here NFS storage (we use NFS and NAS interchangeably) wins over SAN. Quite a few companies have adopted this approach. However, although NFS storage is one of the cheapest options, it comes with a risk: lower reliability compared to FC storage.

Using NAS storage makes the virtualized infrastructure susceptible to network outages. Such outages have led to file system inconsistencies at the operating system level, leaving VMs in a read-only state that requires manual intervention to fix. Recovering thousands of VMs manually (logging into the Management Console, launching the server console, repairing the root file system and booting up) demands a highly orchestrated effort and can be very time consuming.

Not having a comprehensive solution to recover from such outages poses a high risk for enterprises. Such an outage can easily snowball into wider, customer-impacting application outages with an undefined Recovery Time Objective (RTO).

Why we want to discuss this:
• Moving to the Cloud, or to fault-tolerant infrastructure, can solve this, but that is yet to become a reality for most enterprises
• The problem had to be addressed in order to manage current hosting solutions and their limitations

There can be different ways to solve this problem:
• Move to the Cloud or to fault-tolerant, auto-scaling infrastructure, which is still some time away
• Offer tiered hosting plans, i.e. Silver, Gold and Platinum plans based on cost and availability
• Let the VM repair its own file systems during boot so that it auto-recovers, which reduces MTTR and gives a defined RTO. This is the approach we took. It lets companies keep their cost-effective HA virtualized environment while improving availability and MTTR

How we did it:
By tailoring the Linux boot process to auto-recover VMs from the read-only state. Join us to learn more.
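
The proposal does not describe the implementation, so the following is only a minimal sketch of one small piece that any such approach would likely need: detecting that the root file system is mounted read-only. It assumes mount information in the format of Linux's /proc/mounts. The function names `mount_options` and `is_read_only` are hypothetical and are not taken from the talk, and the repair step itself (for example, running a file system check during boot) is deliberately not shown.

```python
from typing import Optional, Set


def mount_options(mounts_text: str, mount_point: str = "/") -> Optional[Set[str]]:
    """Return the mount options for mount_point, or None if it is not listed.

    mounts_text is expected in /proc/mounts format, one mount per line:
    device mount_point fstype options dump pass
    """
    options = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == mount_point:
            # A mount point can appear more than once (e.g. "rootfs" followed
            # by the real root device); the last entry is the one in effect.
            options = set(fields[3].split(","))
    return options


def is_read_only(mounts_text: str, mount_point: str = "/") -> bool:
    """True if the effective mount entry for mount_point has the 'ro' option."""
    options = mount_options(mounts_text, mount_point)
    return options is not None and "ro" in options
```

On a Linux guest, passing the contents of /proc/mounts to `is_read_only` would report whether the root file system is currently read-only. Note that an option such as `errors=remount-ro` does not count as read-only by itself, because options are compared as whole comma-separated tokens.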


Introduction - 15 mins
Introduce ourselves
Share the problem statement

How we solved it - 15 mins


A projector. We will have two speakers for this proposal, Neeraj Saigal and me, so we will need two microphones.

Speaker bio

Neeraj Saigal works for Intuit India and leads the service delivery function in the Product Infrastructure group. He is passionate about solving problems and coming up with effective solutions for them.
Yogesh Patel works for Intuit India as a Staff Systems Engineer in the Product Infrastructure group and is responsible for consulting, hosting and deployment strategies for the product teams.


  • Zainab Bawa (@zainabbawa) Crew 3 years ago

    We only permit one speaker per session. If there are collaborators on the project, a collaborator can be present during the Q&A for the talk, but the talk has to be presented by only one person.

  • Zainab Bawa (@zainabbawa) Crew 3 years ago

    To complete the review, we need draft slides and a link to a self-recorded video explaining what this talk is about and why the audience should attend it. Please share this information by Wednesday, 22 Feb, at the latest.

  • Zainab Bawa (@zainabbawa) Crew 3 years ago

    Any updates on this proposal?
