Continuous Reliability - CFP

Expert Talks on all things DevOps & SRE

Arvind Saraf

@arvinds

Bringing in SRE practice - choices, stack, evaluation & metrics. An Engineering leader's perspective

Submitted Nov 21, 2023

Engineering systems need to be reliable and highly available to deliver the best customer experience. Often, these uptime commitments are part of customer contracts (SLAs). Good systems rely on adequate logs & metrics to alert promptly on situations that could escalate, and to help an engineer fix them quickly. This is constantly at odds with a startup's goal of building & shipping new features. Often, engineers or engineering leaders lack the experience to build such a practice effectively & frugally.

The talk covers the broad practice of building an SRE culture within an organization, and the choices involved, from an engineering leader’s perspective:

  1. Defining the system- and customer-facing SRE metrics (SLOs & SLIs), and mapping them to internal metrics for individual systems. The different kinds of systems & their respective metrics.
  2. SRE stack choices, comparison & evaluation, including open-source and paid options, e.g. LGTM (or with Jaeger instead of Tempo), ELK, etc.
  3. Bringing monitoring into the engineering culture - right from new design docs. On-call mechanisms & options.
  4. How engineering & overall management can help foster this culture - incident reports, dashboards. Balancing SRE vs feature building.
  5. The right skills for SRE, and hiring for or acquiring these skills.
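
To make the first point concrete, here is a minimal sketch of how an availability SLI and the error budget left against an SLO might be computed. It is an illustration only, not material from the talk: the function names, the request counts and the 99.9% target are all assumptions chosen for the example.

```python
# Illustrative sketch only: names, numbers and the 99.9% target are assumptions.

def availability_sli(good_requests: int, total_requests: int) -> float:
    """Fraction of requests served successfully in the measurement window."""
    if total_requests == 0:
        return 1.0  # no traffic: treat the window as fully available
    return good_requests / total_requests


def error_budget_remaining(sli: float, slo_target: float) -> float:
    """Share of the error budget still unspent.

    1.0 means untouched, 0.0 means fully spent, negative means overspent.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed_failure = 1.0 - slo_target  # failure ratio the SLO tolerates
    actual_failure = 1.0 - sli          # failure ratio actually observed
    return 1.0 - actual_failure / allowed_failure


def allowed_downtime_minutes(slo_target: float, window_days: int = 30) -> float:
    """Total downtime the SLO tolerates over the window, in minutes."""
    return (1.0 - slo_target) * window_days * 24 * 60


if __name__ == "__main__":
    sli = availability_sli(good_requests=999_500, total_requests=1_000_000)
    print(f"SLI: {sli:.4%}")
    print(f"Error budget remaining: {error_budget_remaining(sli, 0.999):.0%}")
    print(f"Allowed downtime (30 days): {allowed_downtime_minutes(0.999):.1f} min")
```

With these assumed numbers, half of the error budget is spent, and a 99.9% target over 30 days tolerates about 43 minutes of downtime. A figure like this is one way to frame the trade-off between reliability work and feature work.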

Hosted by

We care about site reliability, cloud costs, security and data privacy