Rootconf 2017

On service reliability

Yogesh Patel

@yogeshjp

Tailored OS boot process to auto-recover VMs from read-only state

Submitted Feb 15, 2017

Enterprises often host internally on highly available (HA) virtualized environments running on VMware or KVM. An HA virtualized environment needs cost-effective storage to serve as the datastore, and here NFS storage (used interchangeably with NAS in this proposal) wins over SAN. Quite a few companies have adopted this approach. However, although NFS storage is one of the cheapest options, it carries a risk: lower reliability than FC storage.

Using NAS storage leaves the virtualized infra susceptible to network outages. Such outages have caused file system inconsistencies in the guest operating systems, leaving VMs in a read-only state that requires manual intervention to fix. Recovering thousands of VMs manually (logging into the management console, launching the server console, repairing the root file system and booting up) warrants a highly orchestrated effort and can be very time-consuming.
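To make the read-only state concrete, here is a minimal sketch of how it could be detected on a Linux guest by reading the mount table in the /proc/mounts format. The function name and the sample data are illustrative assumptions, not part of the talk.

```python
def read_only_mounts(mounts_text):
    """Return the mount points whose option list contains 'ro'.

    mounts_text is expected in /proc/mounts format:
    device mount_point fs_type options dump pass
    """
    read_only = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point, options = fields[1], fields[3]
        if "ro" in options.split(","):
            read_only.append(mount_point)
    return read_only


# Hypothetical sample: the root file system has been remounted read-only.
SAMPLE = """\
/dev/sda1 / ext4 ro,relatime,data=ordered 0 0
/dev/sda2 /var ext4 rw,relatime 0 0
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
"""

if __name__ == "__main__":
    print(read_only_mounts(SAMPLE))
```

On a real VM the text would come from reading /proc/mounts. Some pseudo file systems are mounted read-only by design, so a real check would likely filter by file system type or by mount point.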

Not having a comprehensive solution to recover from such outages poses a high risk for enterprises. Such an outage can easily snowball into wider, customer-impacting application outages with an undefined Recovery Time Objective (RTO).

Why we want to discuss this:
• Moving to the cloud, or to fault-tolerant infra, can solve this, but that is not yet a reality for most enterprises
• To manage current hosting solutions within their limitations, this problem needed to be addressed

There can be different ways to solve this problem:
• Move to the cloud or to fault-tolerant, auto-scaling infra, which is still some time away
• Offer tiered hosting plans, e.g. Silver, Gold and Platinum, based on cost and availability
• Let the VM preemptively repair its file systems and auto-recover, which is the approach we took. This reduces MTTR and gives a defined RTO, so companies can keep their cost-effective HA virtualized environment while improving availability

How we did it:
By tailoring the Linux boot process to auto-recover VMs from the read-only state. Join us to learn more.
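The proposal does not describe the mechanism itself. As one possible illustration only, a boot-time hook could run a non-interactive fsck and decide what to do next from its exit status. The sketch below assumes that approach. The helper names and the device path are hypothetical, while the exit status bits are the ones documented in the fsck(8) manual page.

```python
# Exit status bits documented in fsck(8).
ERRORS_CORRECTED = 1
REBOOT_NEEDED = 2
ERRORS_UNCORRECTED = 4
OPERATIONAL_ERROR = 8
USAGE_ERROR = 16
CANCELLED = 32
LIBRARY_ERROR = 128

FAILURE_BITS = (ERRORS_UNCORRECTED | OPERATIONAL_ERROR | USAGE_ERROR
                | CANCELLED | LIBRARY_ERROR)


def repair_command(device):
    """Build a non-interactive fsck invocation (-y answers yes to prompts)."""
    return ["fsck", "-y", device]


def next_boot_action(exit_status):
    """Map an fsck exit status to a hypothetical boot decision."""
    if exit_status & FAILURE_BITS:
        return "manual"    # repair failed: fall back to human intervention
    if exit_status & REBOOT_NEEDED:
        return "reboot"    # errors fixed, but fsck asks for a reboot
    return "continue"      # clean, or errors corrected: carry on booting


if __name__ == "__main__":
    # /dev/sda1 is a placeholder device, not taken from the talk.
    print(repair_command("/dev/sda1"))
    for status in (0, 1, 3, 4, 12):
        print(status, next_boot_action(status))
```

Answering yes to every repair prompt trades safety for speed, so a real deployment would need to weigh that against the cost of manual recovery.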

Outline

Introduction - 15 mins
Introduce ourselves
Share the problem statement

Content delivery on how we solved it - 15 mins

Requirements

A projector. We will have two speakers for this proposal, Neeraj Saigal and me, so we will need two microphones.

Speaker bio

Neeraj Saigal works for Intuit India and leads the service delivery function in the Product Infrastructure group. He is passionate about solving problems and coming up with effective solutions for them.
Yogesh Patel works for Intuit India as a Staff Systems Engineer in the Product Infrastructure group and is responsible for consulting, hosting and deployment strategies for the product teams.


Hosted by

We care about site reliability, cloud costs, security and data privacy