SRE: Culture & Strategy
Submitted by Talina Shrotriya (@talina06) on Friday, 31 May 2019
Section: Crisp talk of 20 mins duration
Technical level: Intermediate
Session type: Lecture
The work of a Site Reliability Engineer is often misconstrued, or simply underrepresented, in the tech community. Most SRE stories get lost in daily on-call schedules. This talk opens up that gold mine by discussing the issues an SRE team faces and the solutions built around them. The intent is to provide the audience with a set of case studies that deal with distributed environments and scale.
An SRE team serves a single purpose: shipping code in a fast, reliable and economical manner. The four key principles of SRE are measuring risk factors, automation, visibility and simplicity.
Infrastructure is the entry point for deploying code to production. While cloud providers do make this task easy, there were deeper risk factors we had to measure, such as versioning, locking to prevent concurrent updates to resources, and enabling webhooks. We developed a tool called Tessellate to do just this.
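To make the versioning and locking concern concrete, here is a minimal sketch of one common way to guard a resource against concurrent updates: every change must state the version it was based on, and a stale version is rejected. This is not Tessellate's implementation; the class and method names (VersionedStore, VersionConflict, read, update) are hypothetical and the state is kept in memory purely for illustration.

```python
import threading


class VersionConflict(Exception):
    """Raised when an update is based on a stale version of a resource."""


class VersionedStore:
    """Hypothetical in-memory store that keeps every version of each resource."""

    def __init__(self):
        self._lock = threading.Lock()
        self._history = {}  # resource name -> list of configs (index = version - 1)

    def read(self, name):
        """Return (current_version, current_config); version 0 means 'not created yet'."""
        with self._lock:
            versions = self._history.get(name, [])
            if not versions:
                return 0, None
            return len(versions), versions[-1]

    def update(self, name, new_config, expected_version):
        """Apply new_config only if the caller saw the latest version."""
        with self._lock:  # serialise the check-and-write so two writers cannot interleave
            versions = self._history.setdefault(name, [])
            if len(versions) != expected_version:
                raise VersionConflict(
                    f"{name}: expected version {expected_version}, found {len(versions)}"
                )
            versions.append(new_config)
            return len(versions)
```

With this shape, two engineers who both read version 1 of a resource cannot both write version 2: the second update raises VersionConflict and has to re-read first. Keeping the full history also gives a simple basis for rollbacks.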
Choosing the right scheduler for the workload and the types of services helps maintain simplicity across all deployments.
We discuss the risk factors involved in our initial scheduler design, and how we built a service to circumvent those risks.
We discuss network-related administrative tasks to understand why automation is an essential principle of SRE. We walk through two scenarios where we automated processes and workflows by building lightweight services.
We discuss how observability is much more than merely gathering metrics. We look at what visibility means, and see how a good monitoring solution helps us gain the right amount of visibility into production systems.
The core belief of an SRE team is to solve problems for the larger good and not restrict ourselves to the problem at hand. Every tool we use was built with this intent in mind. Each solution was a step towards better visibility into, and access to, production systems, and each one made our on-call shifts more manageable. The key takeaway from this talk is to follow the same approach: take a step back and think twice before doing something manually, and ask whether the problem is repeatable, whether the solution is reusable, and whether it can be automated in a simple manner.
Talina is a software engineer at Trusting Social.
She has worked on data-intensive projects primarily written in Java, using Spark.
She was recently exposed to the world of Site Reliability Engineering, where she worked on designing and implementing monitoring and alerting systems for a large-scale infrastructure.