Software/Site Reliability of Distributed Systems

Sat, 21 Sep 2019, 08:55 AM – 06:20 PM IST

Accepting submissions till 21 Aug 2019, 10:30 AM

St. Laurn Hotel, Pune

## About Rootconf Pune:

Rootconf Pune is a conference for:

- DevOps engineers
- Site Reliability Engineers (SRE)
- Security and DevSecOps professionals
- Software engineers
- Network engineers

The Pune edition will cover talks on:

- InfoSec and application security for DevOps programmers
- DNS and TLS 1.3
- SRE and distributed systems
- Containers and scaling

Speakers from Flipkart, Hotstar, Red Hat, Trusting Social, Appsecco, and InfraCloud Technologies, among others, will share case studies from their experience of building security, SRE, and DevOps practices in their organizations.

## Workshops:

Two workshops will be held before and after Rootconf Pune:

- Full-day Prometheus training workshop on 20 September, conducted by Goutham V, contributor to Prometheus and developer at Grafana Labs. Details about the workshop are available here: https://hasgeek.com/rootconf/2019-prometheus-training-pune/
- Full-day DNS deep dive workshop on 22 September by Ashwin Murali: https://hasgeek.com/rootconf/2019-dns-deep-dive-workshop-pune/

## Event venue:
Rootconf Pune will be held on 21 September at St. Laurn Hotel, Koregaon Park, Pune-411001.

# Sponsors:

Email sales@hasgeek.com for bulk ticket purchases and for sponsoring the Rootconf series.

## To know more about Rootconf, check out the following resources:

- hasgeek.com/rootconf
- hasgeek.com/rootconf/2019
- https://hasgeek.tv/rootconf/2019

For information about the event, tickets (bulk discounts apply automatically on 5+ and 10+ tickets), and speaking, call Rootconf on 7676332020 or write to info@hasgeek.com

Hosted by

Rootconf

Rootconf is a community-funded platform for activities and discussions on the following topics: Site Reliability Engineering (SRE); infrastructure costs, including cloud costs, and their optimization; and security, including cloud security.

This submission has been added to the schedule

Software/Site Reliability of Distributed Systems

Submitted Jul 6, 2019

Section: Full talk (40 mins)
Category: Distributed systems

Every product either dies a hero or lives long enough to hit reliability issues.
Whether it’s your code or a service that you connect to, there will be a disk that fails, a network that partitions, a CPU that throttles, or memory that fills up.
While you go about fixing this, what is the cost of failure, in both effort and lost business, and how much does each additional nine of reliability cost?
The talk takes a simple sample product and examines each failure point in depth. We take one fault at a time and introduce incremental changes to the architecture, the product, and supporting structures such as monitoring and logging to detect and overcome those failures.
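
To make the "cost of each nine" question concrete, here is a minimal arithmetic sketch. It is not taken from the talk: the function names and the `revenue_per_minute` parameter are assumptions for illustration, and the year is approximated as 365 days.

```python
# Yearly downtime budget implied by a given number of nines of availability.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def allowed_downtime_minutes(nines: int) -> float:
    """Downtime budget per year; nines=2 means 99%, nines=3 means 99.9%."""
    unavailability = 10 ** (-nines)
    return MINUTES_PER_YEAR * unavailability


def revenue_at_risk(nines: int, revenue_per_minute: float) -> float:
    """Revenue lost per year if the entire downtime budget is consumed.

    revenue_per_minute is a hypothetical business figure you supply.
    """
    return allowed_downtime_minutes(nines) * revenue_per_minute


if __name__ == "__main__":
    for n in range(2, 6):
        print(f"{n} nines: {allowed_downtime_minutes(n):.2f} minutes/year")
```

Each extra nine cuts the budget by a factor of ten: three nines leaves about 525.6 minutes (roughly 8.8 hours) per year, four nines about 52.6 minutes.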

Outline

Consider a sample application:
A user sends an SMS to a number in the form “Remind <date format> about <y>.” When the reminder is due, a service calls the user back. The user is charged for each SMS and for each reminder that they answer.

Where do you think this can start failing?

Static Failures:

- Disks
- Network
- CPU
- Memory

Behaviour Failures:

- Degradation
- Latency
- Freshness
- Correctness
- DDoS

What are the right tools and strategies to measure and monitor these failure points?
What is the cost of measuring them, or of leaving them unmeasured?
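
As one possible illustration of measuring two of the behaviour failures listed above (latency and correctness), the sketch below computes a nearest-rank latency percentile and a failure ratio from raw samples. The function names and the sample data are assumptions, not the tooling discussed in the talk.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile of the samples, for p in (0, 100]."""
    if not samples:
        raise ValueError("no samples to measure")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def failure_ratio(outcomes: Sequence[bool]) -> float:
    """Fraction of requests that failed; True in outcomes means success."""
    if not outcomes:
        raise ValueError("no outcomes to measure")
    failures = sum(1 for ok in outcomes if not ok)
    return failures / len(outcomes)


if __name__ == "__main__":
    # Hypothetical latencies in milliseconds with one slow outlier.
    latencies_ms = [10, 20, 30, 40, 50, 60, 70, 80, 90, 1000]
    print("p50:", percentile(latencies_ms, 50))
    print("p99:", percentile(latencies_ms, 99))
    print("failure ratio:", failure_ratio([True, True, False, True]))
```

The median hides the outlier while the high percentile exposes it, which is one reason tail percentiles are commonly tracked alongside averages.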

There are queues in the system. How do you monitor synchronous and asynchronous architectures?
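
For asynchronous paths, two commonly watched signals are queue depth and the age of the oldest waiting message. The following is a minimal in-memory sketch under that assumption; `MonitoredQueue` and its methods are hypothetical names, not a real broker API.

```python
import time
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Optional, Tuple


@dataclass
class MonitoredQueue:
    """FIFO queue that remembers when each item was enqueued."""

    _items: Deque[Tuple[float, str]] = field(default_factory=deque)

    def put(self, item: str, now: Optional[float] = None) -> None:
        # `now` can be injected so the behaviour is testable without sleeping.
        self._items.append((time.time() if now is None else now, item))

    def get(self) -> str:
        _, item = self._items.popleft()
        return item

    def depth(self) -> int:
        """Number of messages waiting to be processed."""
        return len(self._items)

    def oldest_age_seconds(self, now: Optional[float] = None) -> float:
        """How long the head of the queue has been waiting; 0.0 if empty."""
        if not self._items:
            return 0.0
        current = time.time() if now is None else now
        return current - self._items[0][0]
```

A growing depth with a stable oldest-message age suggests a burst of traffic, while a growing age suggests that consumers are stuck or too slow.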

The load has started to increase, but before we discuss strategies, let’s quickly discuss CAP.
How do we decide whether we need sharding, a better CPU, or clustering?

How do we add backups? Should they be asynchronous or synchronous?
Criteria to consider before picking a strategy.
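
The trade-off between the two backup modes can be sketched with a toy model: a synchronous write reaches the replica before it is acknowledged, while an asynchronous write is queued and shipped later. The `Primary` and `Replica` classes below are illustrative assumptions and do not model any particular database.

```python
from typing import Dict, List, Tuple


class Replica:
    def __init__(self) -> None:
        self.data: Dict[str, str] = {}

    def apply(self, key: str, value: str) -> None:
        self.data[key] = value


class Primary:
    def __init__(self, replica: Replica, synchronous: bool) -> None:
        self.data: Dict[str, str] = {}
        self.replica = replica
        self.synchronous = synchronous
        self.pending: List[Tuple[str, str]] = []  # writes not yet shipped

    def write(self, key: str, value: str) -> None:
        self.data[key] = value
        if self.synchronous:
            # The write returns only after the replica has the data.
            self.replica.apply(key, value)
        else:
            # The write returns immediately; the replica catches up later.
            self.pending.append((key, value))

    def flush(self) -> None:
        """Ship queued writes to the replica (the asynchronous catch-up)."""
        for key, value in self.pending:
            self.replica.apply(key, value)
        self.pending.clear()

    def lag(self) -> int:
        """Writes that would be lost if the primary died right now."""
        return len(self.pending)
```

In this model the synchronous primary never has lag but pays the replica's cost on every write, whereas the asynchronous one is faster per write but can lose whatever is still pending.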

So far, we have been reactive about failures. How do we move to a proactive model?
And meanwhile, could you trace that request from that particular user for me?
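
One common way to answer the "trace that request" question is to assign an ID at the edge and attach it to every log record downstream. The sketch below assumes two hypothetical services for the sample application ("sms-gateway" and "scheduler") and an in-memory log; all names are illustrative.

```python
import uuid
from typing import Dict, List

LOG: List[Dict[str, str]] = []  # stand-in for a central log store


def log(request_id: str, user: str, service: str, message: str) -> None:
    LOG.append({"request_id": request_id, "user": user,
                "service": service, "message": message})


def schedule_reminder(request_id: str, user: str, text: str) -> None:
    # The downstream service reuses the ID it was handed; it never makes its own.
    log(request_id, user, "scheduler", "reminder stored")


def receive_sms(user: str, text: str) -> str:
    """Entry point: create the request ID once and pass it along."""
    request_id = uuid.uuid4().hex
    log(request_id, user, "sms-gateway", f"received: {text}")
    schedule_reminder(request_id, user, text)
    return request_id


def trace(request_id: str) -> List[Dict[str, str]]:
    """All log records for one request, in the order they were written."""
    return [record for record in LOG if record["request_id"] == request_id]


def requests_for_user(user: str) -> List[str]:
    """Distinct request IDs seen for a user, in first-seen order."""
    seen: List[str] = []
    for record in LOG:
        if record["user"] == user and record["request_id"] not in seen:
            seen.append(record["request_id"])
    return seen
```

With this in place, finding a particular user's request becomes a lookup by user followed by a filter on the request ID.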

At what stage, and how, do we start building reliability into the software development process?

Lastly, while all of this is meant to improve and fix things, how do we prove that it does? How do you validate that MySQL replicas come back when the master dies? The only way to know is by simulating. How do we set up simulations? A decade ago it used to be called FMEA; now it’s called Chaos Engineering.
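
The shape of such a simulation can be sketched as a toy experiment: state a steady-state hypothesis (writes succeed), inject a fault (kill the primary), and check that the hypothesis still holds. The `Cluster` model below is an assumption for illustration; it does not drive MySQL or any real chaos tool.

```python
from typing import List, Optional


class Node:
    def __init__(self, name: str) -> None:
        self.name = name
        self.alive = True


class Cluster:
    def __init__(self, names: List[str]) -> None:
        self.nodes = [Node(name) for name in names]
        self.primary: Optional[Node] = self.nodes[0]

    def failover(self) -> None:
        """Promote the first healthy node if the current primary is down."""
        if self.primary is not None and self.primary.alive:
            return
        healthy = [node for node in self.nodes if node.alive]
        self.primary = healthy[0] if healthy else None

    def write(self, value: str) -> bool:
        """A write succeeds only if some healthy primary can accept it."""
        self.failover()
        return self.primary is not None


def kill_primary(cluster: Cluster) -> str:
    """Fault injection: mark the current primary as dead."""
    victim = cluster.primary
    if victim is None:
        raise RuntimeError("cluster has no primary to kill")
    victim.alive = False
    return victim.name


def experiment(cluster: Cluster) -> bool:
    """Steady-state hypothesis: writes succeed before and after the fault."""
    before = cluster.write("x")
    kill_primary(cluster)
    after = cluster.write("y")
    return before and after
```

Running `experiment(Cluster(["db1", "db2", "db3"]))` passes because a replica is promoted, while a single-node cluster fails the same experiment, which is the kind of gap a simulation is meant to expose before production does.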

And oh, we should also discuss Site vs Software Reliability.

Speaker bio

I head Site Reliability Engineering at Trustingsocial.com, where we credit-score nearly half a billion users across 3 countries, 5 datacenters, and 3 clouds, and are on our way to credit-scoring 1 billion people across South-East Asia.

I have been working on infrastructure engineering for almost a decade, since the days when things made a sound when they broke.
I have had the fortune of learning these skills from some top engineers while scaling fairly large, complex database systems like Cassandra, building an IaaS platform, and building our own microservice communication bus.

The talk is a collection of my learnings over the past 15 years, and I hope it helps other engineers and architects as well.