Rootconf 2018

On scaling infrastructure and operations

##About Rootconf 2018 and who should attend:

Rootconf is India’s best conference on DevOps, SRE and IT infrastructure. Rootconf attracts systems and operations engineers to share real-world knowledge about building reliable systems.

The 2018 edition is a single-track conference. Day 1 – 10 May – features talks on security. Colin Charles (chief evangelist at Percona Foundation), Pukhraj Singh (former national cybersecurity manager at UIDAI), Shamim Reza (open source enthusiast), Alisha Gurung (network engineer at Bhutan Telecom) and Derick Thomas (former network engineer at VSNL and Airtel Bharti) will touch on important aspects of infrastructure, database, network and enterprise security.

Day 2 – 11 May – is filled with case studies and stories about legacy code, immutable infrastructure, root-cause analysis, handling dependencies and monitoring. Talks from Exotel, Kayako, Intuit, Helpshift and DigitalOcean, among others, will help you evaluate DevOps tools and architecture patterns.

If you are a:

  1. DevOps programmer
  2. Systems engineer
  3. Architect
  4. VP of engineering
  5. IT manager

you should attend Rootconf.

Birds of a Feather (BOF) sessions at Rootconf 2018 will cover the following topics:

  1. DevSecOps
  2. Microservices - tooling, architecture, costs and culture
  3. Mistakes that startups make when planning infrastructure
  4. Handling technical debt
  5. How to plan a container strategy for your organization
  6. Evaluating AWS for scale
  7. Future of DevOps

Rootconf is a conference for practitioners, by practitioners.

The call for proposals is closed. If you are interested in speaking at Rootconf events in 2018, submit a proposal here: rootconf.talkfunnel.com/rootconf-round-the-year-2018/

##Venue:

NIMHANS Convention Centre, Lakkasandra, Hombegowda Nagar, Bengaluru, Karnataka 560029.

Schedule, event details and tickets: https://rootconf.in/2018

For more information about Rootconf, sponsorships or outstation events, contact support@hasgeek.com or call 7676332020.

Hosted by

Rootconf is a community-funded platform for activities and discussions on the following topics: Site Reliability Engineering (SRE). Infrastructure costs, including Cloud Costs - and optimization. Security - including Cloud Security.

Thripthy Antony

@thripthy

Prevent Human Errors for 99.99% Availability

Submitted Mar 5, 2018

Outages caused by human error often get brushed under the carpet as rare occurrences: one overworked engineer, in the middle of the seventh activity of the day, went ahead and deleted the most crucial virtual IP configuration in your landscape. But this view is often far from the truth. More often, reliability engineers are hit from multiple sides by multiple monitoring tools and availability matrices, and their judgement goes wrong, though usually only in hindsight. At that point in time, with the information available, it was the best choice! In this session, participants will learn how human errors should be analyzed and controlled. Human errors, or handling errors, give enterprises a chance to examine systemic issues and correct them for an always-available service.

Outline

I will start the session with some famous human errors that cost the respective organizations considerable money and reputation. From there I will move on to strategies for understanding handling errors and effective methods to prevent them.

We will discuss some strategies that would help organizations and teams to analyze and prevent human errors.
• Accident prone area - Go slow
Automation fails. No matter how robust your scripts are, there is a chance they will fail, and your organization should be equipped to handle the task manually when required, without error. When working on disaster recovery or high availability setups, there is a high chance of human mistakes because of similar system names and multiple datacenters. So, add visual cues in your manuals to show that it is an accident-prone procedure, and schedule ample time.

Identifying such procedures also helps you plan for no parallel activities and a comparatively free shift for the person executing them.

• Checklists
I cannot emphasize enough how much thoughtful checklists protect your systems. Most often it is the mundane tasks that get missed and result in serious outages. Because everybody knows them and the steps are comparatively simple, they don’t get documented. This omission comes back later in the form of handling-error outages. Simple but important steps must be part of a checklist at each stage.
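
As a rough illustration of the checklist idea, here is a minimal Python sketch of a checklist that refuses to count as complete until every step, including the mundane ones, has been confirmed. The `Checklist` class and the example steps are hypothetical and only illustrate the point; they are not taken from any specific tool or from the talk itself.

```python
from dataclasses import dataclass, field


@dataclass
class Checklist:
    """A hypothetical checklist that tracks which steps were confirmed."""
    name: str
    steps: list[str]
    done: set[int] = field(default_factory=set)

    def confirm(self, index: int) -> None:
        """Mark the step at the given position as confirmed."""
        if not 0 <= index < len(self.steps):
            raise IndexError(f"no step {index} in checklist '{self.name}'")
        self.done.add(index)

    def missing(self) -> list[str]:
        """Return the steps that nobody has confirmed yet, in order."""
        return [s for i, s in enumerate(self.steps) if i not in self.done]

    def is_complete(self) -> bool:
        """The checklist is complete only when no step is missing."""
        return not self.missing()


# Example steps are invented for illustration.
failover = Checklist(
    name="failover",
    steps=[
        "Confirm the target datacenter name",
        "Confirm the system name against the change ticket",
        "Notify the shift lead before starting",
    ],
)
failover.confirm(0)
failover.confirm(2)
print(failover.missing())
```

Keeping the simple steps in the list, rather than assuming everybody knows them, is what makes the omission visible before the activity starts.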

• Fix the past, Fix the present and have monitoring in place
So, what do you do when you identify a handling error? How do you manage it so that it doesn’t occur again?
Most often people say someone missed something and move on. But that is not enough if you are targeting 99.999% availability for your services. A method that works is for the person responsible for problem management in your organization to hold an in-depth interview with the person who executed the activity. A common mistake is that the information comes from the team’s manager, along with an assurance that it will not happen again. But that only scratches the surface of the symptom. The root cause of the error lies much deeper.
Maybe there were confusing messages on the screen, maybe the person was handling 3 other activities in parallel, or maybe the tool did not throw an error message where it should have stopped the process.
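
To see why "someone missed something" is not an acceptable answer at these targets, it helps to turn an availability percentage into a yearly downtime budget. The sketch below is plain arithmetic, assuming a 365-day year; the function name is made up for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_budget_minutes(availability_percent: float) -> float:
    """Return the allowed downtime per year, in minutes, for a target availability."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for target in (99.99, 99.999):
    print(f"{target}% -> {downtime_budget_minutes(target):.2f} minutes/year")
```

Under this assumption, 99.99% leaves roughly 52.6 minutes of downtime a year and 99.999% roughly 5.3 minutes, so a single handling error can consume the whole budget.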

• Re-examine your shift handover plans
Spend time on key aspects such as:
Is there a shift lead available?
Are there other activities planned for shift lead?
Have you defined a process for shift handover?
Is there a checklist available?

The key takeaway for participants will be effective methods to handle human errors in the workplace. I will also add some real-life examples for each of the scenarios during the presentation.

Requirements

None

Speaker bio

I work in Problem Management and Change Management in one of the Cloud Units at SAP as a Process Manager. I have 13 years of industry experience with a strong background in operations. We have been running a unique initiative at my organization to control and prevent human errors and so minimize production outages. I now head the initiative, and the insights we have gained while running this project are worth sharing with other teams and organizations looking for a zero-outage service portfolio.
