Rootconf 2019

On infrastructure security, DevOps and distributed systems.

##About Rootconf 2019:
The seventh edition of Rootconf is a two-track conference with:

  1. Security talks and tutorials in auditoriums 1 and 2 on 21 June.
  2. Talks on DevOps, distributed systems and SRE in auditoriums 1 and 2 on 22 June.

##Topics and schedule:
View full schedule here: https://hasgeek.com/rootconf/2019/schedule

Rootconf 2019 includes talks and Birds of a Feather (BOF) sessions on:

  1. OSINT and its applications
  2. Key management, encryption and its costs
  3. Running a bug bounty programme in your organization
  4. PolarDB architecture as Cloud Native Architecture, developed by Alibaba Cloud
  5. Vitess
  6. SRE and running distributed teams
  7. Routing security
  8. Log analytics
  9. Enabling SRE via automated feedback loops
  10. TOR for DevOps

##Who should attend Rootconf?

  1. DevOps programmers
  2. DevOps leads
  3. Systems engineers
  4. Infrastructure security professionals and experts
  5. DevSecOps teams
  6. Cloud service providers
  7. Companies with heavy cloud usage
  8. Providers of the pieces on which an organization’s IT infrastructure runs: monitoring, log management, alerting, etc.
  9. Organizations dealing with large network systems where data must be protected
  10. VPs of engineering
  11. Engineering managers looking to optimize infrastructure and teams

For information about Rootconf and bulk ticket purchases, contact info@hasgeek.com or call 7676332020. Only community sponsorships are available.

##Rootconf 2019 sponsors:

#Platinum Sponsor

CRED

#Gold Sponsors

Atlassian, Endurance, Trusting Social

#Silver Sponsors

DigitalOcean, GO-JEK, Paytm

#Bronze Sponsors

MySQL, Sumo Logic, UpCloud, Platform.sh, nilenso, CloudSEK

#Exhibition Sponsor

FreeBSD Foundation

#Community Sponsors

Ansible, PlanetScale

Hosted by

Rootconf is a community-funded platform for activities and discussions on the following topics: Site Reliability Engineering (SRE); infrastructure costs, including cloud costs and their optimization; and security, including cloud security.

Aaditya Talwai

@talwai

Virtuous Cycles: Enabling SRE via automated feedback loops

Submitted May 30, 2019

Automating common operational procedures - like increasing capacity, expiring data, or evening out load on a system - is the bread-and-butter of many SRE teams. Operator nirvana is having apps that can heal themselves, without human intervention - but most SRE teams will accept some toil as an inevitable part of their lives. This is because some procedures are too risky to automate, too costly to get wrong. How do you build the confidence that your “self-healing” system will not accidentally shoot itself in the foot, while in production?

Outline

We will show, in pictures, a journey of instrumentation: how you can use app-level telemetry and tracing to build confidence that your auto-remediating strategies are doing the right things. Case studies include:

  • Intelligent query timeouts that allow loaded workers to recover
  • A backoff and jitter system for controlling thundering-herd on an internal service
  • Watermark-based quota system for shaping traffic on a multitenant cluster
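To make the second case study concrete, here is a minimal sketch of capped exponential backoff with full jitter, a common way to spread out retries so that clients do not hit a recovering service all at once. This is not the speaker's implementation; the function names `jittered_delay` and `call_with_retries`, and the default values for `base`, `cap` and `attempts`, are assumptions chosen for illustration.

```python
import random
import time


def jittered_delay(attempt, base=0.1, cap=10.0):
    """Return a random delay in [0, min(cap, base * 2**attempt)] seconds.

    The ceiling doubles with each attempt (exponential backoff) and the
    actual delay is drawn uniformly below it ("full jitter"), so clients
    that failed at the same moment retry at different times.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0, ceiling)


def call_with_retries(operation, attempts=5, base=0.1, cap=10.0, sleep=time.sleep):
    """Call operation(), retrying on any exception up to `attempts` times.

    `sleep` is injectable so the retry loop can be exercised without
    actually waiting (hypothetical convenience for experiments and tests).
    """
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            # On the final attempt, give up and surface the error.
            if attempt == attempts - 1:
                raise
            sleep(jittered_delay(attempt, base, cap))
```

Recording the chosen delays and the retry counts as metrics is one way to check, in the spirit of the talk, that such a policy really smooths out load rather than just hiding failures.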

We will show that, using open-source tooling and good observability practices, you can turn an opaque, operationally taxing part of your system into a well-behaved component that remediates itself. We take a very visual approach to telling these stories, so expect graphs, and lots of them!

Ultimately, we want to give the audience a framework and strategy to answer these questions:

  • Is an ops procedure worth automating?
  • How to get good feedback from internal telemetry in your application?
  • How to use this feedback to drive auto-remediation?
  • And most importantly, how to experiment on all this, without breaking production :)

Requirements

Some prior knowledge of operating distributed systems.

Speaker bio

Aaditya Talwai is a Site Reliability Engineer at Confluent and former Lead Software Engineer at Datadog. His work has focused on large-scale monitoring systems and the words, pictures, and tools we use to tell stories about our software systems. At Datadog, he helped architect a cloud-scale distributed tracing and APM tool, bringing together the three pillars of observability - metrics, traces, and logs. At Confluent, he works on a unified cloud platform for event streaming, including the observability and automation strategies needed to guarantee a highly-available, elastic, multitenant cluster. He is enthusiastic about helping SRE teams understand their systems, and deploy apps that heal themselves, through great observability practices and a culture of experimentation.

Slides

https://docs.google.com/presentation/d/1kohxs_t2ZAx2ZMhaskk_5b-rxsHOrcuXI60VYtR983g/edit?usp=sharing
