From outages to internals: understanding the physics of distributed systems

Hands-on workshop for SREs and senior engineers who want to understand how systems like Kubernetes and Kafka work under the hood.

Workshop overview

Every SRE has faced a mysterious outage. The root cause often isn’t in the application code, but in the fundamental physics of the underlying system. This workshop is for senior developers, SREs, and DevOps engineers who want to move from being users of systems like Kubernetes and Kafka to understanding their internals.

Part 1: The physics of systems & failure (120 Minutes)

Module 1: The four fundamental resources

  • Understanding four fundamental resources — CPU, Memory, Disk, and Network.
  • Saturation and its impact on throughput and latency.
  • Little’s law and understanding the latency / throughput impact of saturation.
  • Saturation Lab: Observe disk saturation. Do capacity planning for disk-heavy systems.
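
As a rough preview of the saturation and Little’s law topics above, the sketch below shows how latency grows as a resource approaches full utilization. It assumes a simple M/M/1 queueing model and a hypothetical disk that serves 1000 requests per second; the workshop lab may use a different model and different numbers, and the function names are illustrative only.

```python
def mm1_latency(arrival_rate: float, service_rate: float) -> float:
    """Mean time in system W for an M/M/1 queue: W = 1 / (mu - lambda)."""
    if arrival_rate >= service_rate:
        raise ValueError("saturated: arrival rate must stay below service rate")
    return 1.0 / (service_rate - arrival_rate)


def in_flight(arrival_rate: float, latency: float) -> float:
    """Little's law: mean requests in the system L = lambda * W."""
    return arrival_rate * latency


if __name__ == "__main__":
    service_rate = 1000.0  # assumption: a disk that completes 1000 requests/s
    for utilization in (0.5, 0.8, 0.9, 0.99):
        arrival_rate = utilization * service_rate
        w = mm1_latency(arrival_rate, service_rate)
        print(f"{utilization:.0%} busy: mean latency {w * 1000:.1f} ms, "
              f"{in_flight(arrival_rate, w):.1f} requests in flight")
```

Under these assumptions, mean latency is 2 ms at 50% utilization and 100 ms at 99%, which is why capacity planning usually targets a utilization well below saturation.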

Module 2: Failures in real cloud systems and their solutions

  • Failure probability calculations for cloud setups.
  • Fundamental patterns to mask failures.
  • Write Ahead Log (WAL): Implement a simple WAL.
  • Quorum Intersection: Experiment with different quorum configurations and their impact on consistency.
  • Generation Clock: Understand the rationale behind Raft’s ‘term’ or Paxos’s ‘ballot’.
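
The quorum intersection idea listed above can be tried out with a brute-force check. This is a minimal sketch rather than workshop material, and the function names are hypothetical. It enumerates every possible write quorum and read quorum in a small cluster, tests whether each pair shares at least one node, and compares the result with the common R + W > N shortcut.

```python
from itertools import combinations


def quorums_intersect(n: int, write_quorum: int, read_quorum: int) -> bool:
    """Brute force: does every write quorum share a node with every read quorum?"""
    nodes = range(n)
    return all(
        set(w) & set(r)
        for w in combinations(nodes, write_quorum)
        for r in combinations(nodes, read_quorum)
    )


def rule_of_thumb(n: int, write_quorum: int, read_quorum: int) -> bool:
    """The usual shortcut: overlap is guaranteed when R + W > N."""
    return read_quorum + write_quorum > n


if __name__ == "__main__":
    n = 5
    # The exhaustive check and the shortcut agree for every size from 1 to n.
    for w in range(1, n + 1):
        for r in range(1, n + 1):
            assert quorums_intersect(n, w, r) == rule_of_thumb(n, w, r)
    print("W=3, R=3 intersect:", quorums_intersect(n, 3, 3))
    print("W=2, R=3 intersect:", quorums_intersect(n, 2, 3))
```

With five nodes, W=3 and R=3 always overlap, so a read is guaranteed to contact at least one node that saw the latest write. With W=2 and R=3, a read can miss the write entirely.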

Part 2: Building blocks of distributed systems (120 Minutes)

We look at Kafka and Kubernetes to see what the building blocks look like.

  1. Consistent Core
  2. Leases
    • Implement group membership with ZooKeeper (similar to Kafka).
    • Implement group membership with etcd (similar to Kubernetes).
  3. State Watch
    • Implement watches for topic and node metadata changes in ZooKeeper.
    • Implement watches for node and pod metadata changes in etcd.
  4. Open Q&A & Deeper Dive (Optional Content)
    • You might be wondering how etcd and ZooKeeper guarantee consistency.
    • That’s where consensus algorithms like Raft and ZAB come in.
    • We will briefly look at what it takes to implement something like etcd with the Raft consensus algorithm.
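
To give a feel for how leases and watches combine into group membership, here is a toy in-memory model. It deliberately does not use the ZooKeeper or etcd client APIs: `LeaseRegistry`, its methods, and the member names are hypothetical, and the clock is passed in explicitly so the behavior is easy to follow. A real consistent core would also replicate this state across nodes.

```python
from typing import Callable, Dict, List

Watcher = Callable[[str, str], None]  # called with (event kind, member name)


class LeaseRegistry:
    """Toy single-process stand-in for a consistent core tracking member leases."""

    def __init__(self, ttl_seconds: float) -> None:
        self.ttl = ttl_seconds
        self._deadline: Dict[str, float] = {}
        self._watchers: List[Watcher] = []

    def watch(self, callback: Watcher) -> None:
        """Register a callback for membership changes."""
        self._watchers.append(callback)

    def _notify(self, kind: str, member: str) -> None:
        for callback in self._watchers:
            callback(kind, member)

    def heartbeat(self, member: str, now: float) -> None:
        """Grant a lease to a new member, or renew an existing one."""
        is_new = member not in self._deadline
        self._deadline[member] = now + self.ttl
        if is_new:
            self._notify("joined", member)

    def expire(self, now: float) -> List[str]:
        """Remove members whose lease has lapsed and notify watchers."""
        dead = sorted(m for m, t in self._deadline.items() if t <= now)
        for member in dead:
            del self._deadline[member]
            self._notify("left", member)
        return dead

    def members(self) -> List[str]:
        return sorted(self._deadline)


if __name__ == "__main__":
    registry = LeaseRegistry(ttl_seconds=10.0)
    registry.watch(lambda kind, member: print(f"{member} {kind}"))
    registry.heartbeat("broker-1", now=0.0)
    registry.heartbeat("broker-2", now=0.0)
    registry.heartbeat("broker-1", now=8.0)  # broker-1 renews, broker-2 goes silent
    registry.expire(now=12.0)                # broker-2's lease lapsed at t=10
    print("live members:", registry.members())
```

The member that keeps sending heartbeats stays in the group, while the silent one is dropped once its lease times out, and every watcher learns about the change without polling.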

Key takeaways

  1. System performance is governed by its most saturated resource (the bottleneck).
  2. Redundancy is the key to high availability.
  3. Consistent cores (like ZK/etcd) provide reliable building blocks — consistent view of configuration information, leases for membership and watches for notifications — that enable complex systems.
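
The second takeaway can be made concrete with a small calculation. The sketch below assumes that nodes fail independently with the same probability, which real cloud setups often violate (shared racks, zones, or deployments), and the 1% figure is purely illustrative. It computes the probability that a strict majority of a cluster is up, which is what a quorum-based system needs to keep making progress.

```python
from math import comb


def prob_majority_available(n: int, p_fail: float) -> float:
    """Probability that a strict majority of n nodes is up.

    Assumes independent failures with identical per-node probability p_fail.
    """
    majority = n // 2 + 1
    p_up = 1.0 - p_fail
    return sum(
        comb(n, k) * p_up**k * p_fail ** (n - k)
        for k in range(majority, n + 1)
    )


if __name__ == "__main__":
    p_fail = 0.01  # assumption: each node is down 1% of the time
    for n in (1, 3, 5):
        print(f"{n} node(s): majority available "
              f"{prob_majority_available(n, p_fail):.6%} of the time")
```

Under these assumptions a single node is available 99% of the time, while a three-node cluster keeps a majority about 99.97% of the time.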

About the instructor

Unmesh Joshi is a Distinguished Engineer at Thoughtworks. He is a software architecture enthusiast, who believes that understanding principles of distributed systems is as essential today as understanding web architecture or object-oriented programming was in the last decade. For the last two years he has been publishing patterns of distributed systems on martinfowler.com.
In 2023, he authored the book Patterns of Distributed Systems, published by Addison-Wesley Professional. The book is an essential catalog of patterns aimed at improving comprehension, communication, and education in distributed system design.
He has also conducted various training sessions around this topic. Twitter: @unmeshjoshi

How to attend this workshop

This workshop is open to Rootconf annual members.
It is limited to 30 participants. Seats are available on a first-come, first-served basis. 🎟️

Contact information ☎️

For inquiries about the workshop, contact +91-7676332020 or write to info@hasgeek.com

Venue

Aerospike India Private Limited

7th Floor, Indiqube Techpoint,

30, 100 Feet Rd, Srinivagilu, AVS Layout, Koramangala,

Bengaluru - 560034

Karnataka, IN

Hosted by

We care about site reliability, cloud costs, security and data privacy

Supported by

Venue host

Aerospike is a real-time, distributed NoSQL database designed for high-throughput, low-latency workloads where millisecond response times matter.