Rootconf 2019

On infrastructure security, DevOps and distributed systems.


SRE: Culture & Strategy

Submitted by Talina Shrotriya (@talina06) on Friday, 31 May 2019

Section: Crisp talk of 20 mins duration | Technical level: Intermediate | Session type: Lecture


Abstract

The work of a Site Reliability Engineer is often misconstrued or underrepresented in the tech community, and most SRE stories get lost in daily on-call schedules. This talk opens up a gold mine of such stories by discussing the issues an SRE team faces and the solutions built around them. The intent is to give the audience a set of case studies that deal with distributed environments and scale.

Outline

An SRE team serves a single purpose: shipping code in a fast, reliable and economical manner. The four key principles of SRE are measuring risk factors, automation, visibility and simplicity.

Infrastructure Management:

Infrastructure is the entry point for deploying code to production. While cloud providers make this task easy, there were deeper risk factors we had to measure, such as versioning, locking resources against concurrent updates, and enabling webhooks. We developed a tool called Tessellate to do just this.

Scheduling:

Choosing the right scheduler for the workload and the types of services helps maintain simplicity across all deployments.
We discuss the risk factors involved in our initial scheduler design, and how we built a service to circumvent those risks.

Network:

We discuss network-related administrative tasks to understand why automation is an essential principle of SRE. We walk through two scenarios where we automated processes and workflows by building lightweight services.

Observability:

We discuss how observability is much more than merely gathering metrics. We look at what visibility means, and see how a good monitoring solution helps us gain the right amount of visibility into production systems.

Conclusion:

The core belief of an SRE team is to solve problems for the larger good rather than restricting ourselves to the problem at hand. Every tool we use was built with this intent in mind. Each solution was a step towards better visibility into and access to production systems, and each made our on-call shifts more manageable. The key takeaway from this talk is to follow the same approach: take a step back and think twice before doing something manually, and ask whether the problem is repeatable, reusable and can be automated in a simple manner.

Requirements

N/A

Speaker bio

Talina is a software engineer at Trusting Social.
She has worked on data-intensive projects primarily written in Java, using Spark.
She was recently exposed to the world of Site Reliability Engineering, where she worked on designing and implementing monitoring and alerting systems for a large-scale infrastructure.

Links

Slides

https://docs.google.com/presentation/d/1SZb33H2x5Y9lGisQb-lNFgHi3z1P0GaDqLCh9nTGC8k/edit?usp=sharing

Comments

  • Zainab Bawa (@zainabbawa) Reviewer a month ago

    Thanks for the submission Talina. Some comments on structuring the talk:

    1. Currently, there is an anchor missing for all the stories that you are laying out. By this, I mean what is that one argument/insight by which you want to tie all these stories together.
    2. By the above logic, the talk has to be grounded in the one argument you are making. Instead of describing the stories, you have to set the context and define the problem around the takeaway you described at the end: “think(ing) whether the problem(s) (in your infra) are repeatable, reusable and can be automated.”
    3. The stories are not primary. They are the vehicles through which you make the point. Therefore, select and choose the stories such that with every story, you bring the audience back to the moot point: how do you discover (and thereafter automate) those problems which are repeatable, reusable and can be automated.
    4. Explain how this discovery happened for Trusting Social. By this, I mean that the stories have to be ordered to guide the audience to the core argument, instead of ordering the stories randomly.
    5. What does it take – in terms of team structure, roles, and proximity to problems – to discover such problems?
    6. Also, where possible, show the before-and-after situation for your infrastructure and/or processes following the war story and discoveries. Tell participants which innovation in your current processes you consider a win.
    7. Change the title to reflect the problem statement. The current title is uninteresting because for someone who does not know Trusting Social and its work, why is it interesting for them to hear a talk where you will share SRE War Stories? There has to be something deeper and more meaningful for participants which will help make the connection and get the community interested.
    • Talina Shrotriya (@talina06) Proposer a month ago

      Hi Zainab. Thank you for all the pointers, they were helpful. I’ve tried to accommodate your feedback in my slides (attached in the proposal). Hope you can go through the slides and advise further.

      -Talina

  • Zainab Bawa (@zainabbawa) Reviewer a month ago

    I love the war stories, @talina06. Very well done.

    Couple of things to add:

    1. Have an introduction slide where you walk participants through the flow and what they should expect.
    2. Your slides are not finished, or so it seems. Make sure you have a conclusion slide where you summarize the key learnings.
    3. Include a contact info slide where folks can contact you if they need to get in touch.
    4. It appears to me that your strongest argument is: “Culture eats tooling for breakfast”. If this is the case, keep this point as an anchor to reinforce it throughout your talk. If not, ignore.

    Rest, let’s take feedback in the rehearsal this evening. Best of luck.
