SRE: Culture & Strategy

21 June 2019 (Fri), 08:45 AM – 05:40 PM IST

22 June 2019 (Sat), 09:00 AM – 05:30 PM IST

NIMHANS Convention Centre, Bangalore

##About Rootconf 2019:
The seventh edition of Rootconf is a two-track conference with:

Security talks and tutorials in audi 1 and 2 on 21 June.
Talks on DevOps, distributed systems and SRE in audi 1 and audi 2 on 22 June.

##Topics and schedule:
View full schedule here: https://hasgeek.com/rootconf/2019/schedule

Rootconf 2019 includes talks and Birds of a Feather (BOF) sessions on:

##Who should attend Rootconf?

DevOps programmers
DevOps leads
Systems engineers
Infrastructure security professionals and experts
DevSecOps teams
Cloud service providers
Companies with heavy cloud usage
Providers of the pieces on which an organization’s IT infrastructure runs: monitoring, log management, alerting, etc.
Organizations dealing with large network systems where data must be protected
VPs of engineering
Engineering managers looking to optimize infrastructure and teams

For information about Rootconf and bulk ticket purchases, contact info@hasgeek.com or call 7676332020. Only community sponsorships available.

Hosted by

Rootconf

Rootconf is a community-funded platform for activities and discussions on the following topics: Site Reliability Engineering (SRE); infrastructure costs, including cloud costs and their optimization; and security, including cloud security.

This submission has been added to the schedule

Submitted May 31, 2019

Section: Crisp talk of 20 mins duration. Technical level: Intermediate. Session type: Lecture.

The work of a Site Reliability Engineer is often misunderstood and rarely discussed in the tech community. Most SRE stories get lost in daily on-call schedules. This talk surfaces some of them by discussing the issues an SRE team faces and the solutions built around them. The intent is to give the audience a set of case studies that deal with distributed environments and scale.

Outline

An SRE team serves a single purpose: shipping code in a fast, reliable and economical manner. The four key principles of SRE are measuring risk factors, automation, visibility and simplicity.

Infrastructure Management:

Infrastructure is the entry point for deploying code to production. While cloud providers make this task easy, there were deeper risk factors we had to measure, such as versioning, locking to prevent concurrent updates to resources, and enabling webhooks. We developed a tool called Tessellate to do just this.
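
To make the three risk factors concrete, here is a minimal in-memory sketch of versioned resource updates, guarded by a per-resource lock and followed by webhook-style callbacks. It is only an illustration: the names `ResourceStore`, `ResourceLockedError`, `on_update`, `update` and `get` are hypothetical and do not reflect how Tessellate itself is built or its API.

```python
import threading
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


class ResourceLockedError(RuntimeError):
    """Raised when another update to the same resource is in progress."""


@dataclass
class ResourceStore:
    # Hypothetical illustration only; not Tessellate's actual design.
    _versions: Dict[str, List[dict]] = field(default_factory=dict)
    _locks: Dict[str, threading.Lock] = field(default_factory=dict)
    _hooks: List[Callable[[str, int, dict], None]] = field(default_factory=list)
    _guard: threading.Lock = field(default_factory=threading.Lock)

    def on_update(self, hook: Callable[[str, int, dict], None]) -> None:
        """Register a webhook-style callback fired after each update."""
        self._hooks.append(hook)

    def _lock_for(self, name: str) -> threading.Lock:
        # The guard makes lazy creation of per-resource locks thread-safe.
        with self._guard:
            return self._locks.setdefault(name, threading.Lock())

    def update(self, name: str, config: dict) -> int:
        """Store a new version of a resource; reject concurrent updates."""
        lock = self._lock_for(name)
        if not lock.acquire(blocking=False):
            raise ResourceLockedError(name)
        try:
            history = self._versions.setdefault(name, [])
            history.append(dict(config))  # copy so callers cannot mutate history
            version = len(history)        # versions are numbered from 1
            for hook in self._hooks:
                hook(name, version, config)
            return version
        finally:
            lock.release()

    def get(self, name: str, version: Optional[int] = None) -> dict:
        """Return the latest version, or a specific one if requested."""
        history = self._versions[name]
        return history[-1] if version is None else history[version - 1]
```

Rejecting a concurrent update outright, rather than blocking until the lock is free, is one possible design choice: the second caller learns immediately that someone else is changing the same resource and can retry against the new version.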

Scheduling:

Choosing the right scheduler for the workload and the types of services helps maintain simplicity across all deployments.
We discuss the risk factors involved in our initial scheduler design, and how we built a service to circumvent those risks.

Network:

We discuss network-related administrative tasks to understand why automation is an essential principle of SRE. We walk through two scenarios where we automated processes and workflows by building lightweight services.

Observability:

We discuss how observability is much more than merely gathering metrics. We look at what visibility means, and see how a good monitoring solution helps us gain the right amount of visibility into production systems.

Conclusion:

The core belief of an SRE team is to solve problems for the larger good and not restrict ourselves to the problem at hand. Every tool we use was built with this intent in mind. Each solution was a step towards better visibility into, and access to, production systems, and each made our on-call shifts more manageable. The key takeaway from this talk is to follow the same approach: take a step back and think twice before doing something manually, and ask whether the problem is repeatable, whether the solution is reusable, and whether it can be automated in a simple manner.

Requirements

N/A

Speaker bio

Talina is a software engineer at Trusting Social.
She has worked on data-intensive projects primarily written in Java, using Spark.
She was recently exposed to the world of Site Reliability Engineering, where she worked on designing and implementing monitoring and alerting systems for a large-scale infrastructure.

Slides

https://docs.google.com/presentation/d/1SZb33H2x5Y9lGisQb-lNFgHi3z1P0GaDqLCh9nTGC8k/edit?usp=sharing