Virtuous Cycles: Enabling SRE via automated feedback loops

This submission has been added to the schedule

Submitted May 30, 2019

Section: Full talk of 40 mins duration
Technical level: Intermediate
Session type: Lecture

Automating common operational procedures - like increasing capacity, expiring data, or evening out load on a system - is the bread-and-butter of many SRE teams. Operator nirvana is having apps that can heal themselves without human intervention - but most SRE teams accept some toil as an inevitable part of their lives, because some procedures are too risky to automate and too costly to get wrong. How do you build the confidence that your “self-healing” system will not accidentally shoot itself in the foot in production?

Outline

In pictures, we will show a journey of instrumentation - how you can use app-level telemetry and tracing to build confidence that your auto-remediation strategies are doing the right thing. Case studies include:

Intelligent query timeouts that allow loaded workers to recover
A backoff and jitter system for controlling thundering herds on an internal service
A watermark-based quota system for shaping traffic on a multitenant cluster
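
To make the second case study concrete, here is a minimal sketch of retrying with capped exponential backoff and "full jitter". It is an illustration only, not the implementation discussed in the talk: the function names, parameters, and default values (`jittered_delay`, `call_with_retries`, `base`, `cap`, `attempts`) are all assumptions.

```python
import random
import time


def jittered_delay(attempt, base=0.1, cap=10.0, rng=random.random):
    """Return a sleep time in seconds for the given zero-based retry attempt."""
    # Exponential ceiling, capped so waits cannot grow without bound.
    ceiling = min(cap, base * (2 ** attempt))
    # "Full jitter": pick uniformly in [0, ceiling) so that clients which
    # failed together do not all retry at the same instant.
    return rng() * ceiling


def call_with_retries(operation, attempts=5, base=0.1, cap=10.0, sleep=time.sleep):
    """Call operation() up to `attempts` times, sleeping between failures."""
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the last error to the caller
            sleep(jittered_delay(attempt, base, cap))
```

Randomising the wait, rather than only growing it, is what spreads a herd of synchronized clients out over time. Passing `sleep` and `rng` as parameters keeps the sketch easy to test without real waiting.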

We will show that, using open-source tooling and good observability practices, you can turn an opaque, operationally taxing part of your system into a well-behaved component that remediates itself. We take a very visual approach to telling these stories - so expect graphs, and lots of them!

Ultimately, we want to give the audience a framework and strategy to answer these questions:

Is an ops procedure worth automating?
How do you get good feedback from internal telemetry in your application?
How do you use this feedback to drive auto-remediation?
And most importantly, how do you experiment on all this without breaking production? :)
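
One way to picture the last two questions together is a feedback loop that records the decisions it would make before it is allowed to act on them. The sketch below is a hypothetical illustration, not the system described in the talk: the class name `WatermarkThrottle`, its fields, and the idea of feeding it a utilization value between 0 and 1 are all assumptions. It uses a high and a low watermark so the decision does not flap when the signal hovers near a single threshold.

```python
from dataclasses import dataclass, field


@dataclass
class WatermarkThrottle:
    low: float                 # release throttling at or below this utilization
    high: float                # start throttling at or above this utilization
    dry_run: bool = True       # when True, decisions are logged but never applied
    decision: bool = False     # what the loop currently wants to do
    applied: bool = False      # what is actually enforced
    log: list = field(default_factory=list)

    def observe(self, utilization):
        """Feed one telemetry sample; return whether throttling is enforced."""
        previous = self.decision
        if utilization >= self.high:
            self.decision = True
        elif utilization <= self.low:
            self.decision = False
        # Between the watermarks the previous decision is kept (hysteresis).
        if self.decision != previous:
            action = "throttle" if self.decision else "release"
            self.log.append((action, utilization))
        if not self.dry_run:
            self.applied = self.decision
        return self.applied
```

Running the loop with `dry_run=True` against production telemetry yields a log of intended actions that can be graphed next to the real signal and reviewed before enforcement is switched on.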

Requirements

Some prior knowledge of operating distributed systems.

Speaker bio

Aaditya Talwai is a Site Reliability Engineer at Confluent and former Lead Software Engineer at Datadog. His work has focused on large-scale monitoring systems and the words, pictures, and tools we use to tell stories about our software systems. At Datadog, he helped architect a cloud-scale distributed tracing and APM tool, bringing together the three pillars of observability - metrics, traces, and logs. At Confluent, he works on a unified cloud platform for event streaming, including the observability and automation strategies needed to guarantee a highly-available, elastic, multitenant cluster. He is enthusiastic about helping SRE teams understand their systems, and deploy apps that heal themselves, through great observability practices and a culture of experimentation.

Zainab Bawa

@zainabbawa Editor & Promoter
This is an interesting talk! Do you have anti-patterns to share here, too?

Posted 5 years ago
  
  Aaditya Talwai
  
  @talwai Submitter
  Hi Zainab, I can certainly give some time to discussing anti-patterns or common gotchas in implementing autoremediation. Some things definitely took us a few painful tries before getting them right.
  
  Posted 5 years ago
  
  - Zainab Bawa
    
    @zainabbawa Editor & Promoter
    
    Great. I have confirmed your talk and will update you about a pre-event rehearsal by tomorrow. Meanwhile, it will help if you revise your slides to include pain points and gotchas. Participants want to learn as much from anti-patterns as from patterns.
    
    Posted 5 years ago
    
    Aaditya Talwai
    
    @talwai Submitter
    
    Great, will do. Thanks Zainab
    
    Posted 5 years ago
    

Rootconf 2019
