The Urban Myth Of Full Uptime

Sat, Jan 18, 2020, 09:00 AM – 05:40 PM IST

Submitted Nov 19, 2019

Section: Full talk (40 mins)
Category: SRE

This talk presents strategies for achieving high uptime at scale. It covers the following points:

About Me, My Company and My Situation
- Set the context for the rest of the talk
- Touch on the legacy setup and infrastructure so that the audience can contrast it with the points that follow
Causes of our downtime
- Architecture
- Poor Provisioning Practices
  - Hardware
  - Configuration
- Lack of Monitoring
- Missing Backups, Disaster Recovery (DR) and Business Continuity (BC)
- Poor Technical Choices
  - Storing data on a single node
  - Scaling storage with LVM
  - Node-local cache for distributed apps
  - Cyclic API calls
- Security
  - Secrets checked in to version control
  - Publicly accessible resources
  - Outdated and vulnerable versions of tools
- Lack of Documentation and Testing
- Takeaway: typical problems faced in a poorly architected infrastructure
Architecture
- What’s wrong with it
- Designing immutable infrastructure
Poor Provisioning Practices
- What’s wrong with it
- Provisioning immutable resources with Terraform
- Deploying and configuring services in an immutable fashion
Monitoring
- What’s wrong with it
- Implementing Observability
Backups, DR and BC
- What’s wrong with it
- Automated backups with redundant copies
Poor Technical Choices
- What’s wrong with it
- Fixing the mistakes made so far
Lack of Documentation and Testing
Summary

Rootconf Delhi edition