Rootconf Hyderabad edition

On SRE, systems engineering and distributed systems

Make a submission

Accepting submissions till 30 Sep 2019, 11:59 PM

T-Hub, Hyderabad

About Rootconf Hyderabad:

Rootconf Hyderabad is a platform for:

  1. DevOps engineers
  2. Site Reliability Engineers (SRE)
  3. ML and data engineers
  4. Security and DevSecOps professionals
  5. Software engineers

to discuss real-world problems around:

  1. Site Reliability Engineering (SRE)
  2. Data and AI engineering
  3. Distributed systems – observability, microservices
  4. Implementing Infrastructure as Code

Speakers from Flipkart, Hotstar, Intuit, GO-JEK, MadStreetDen and Trusting Social will share their experiences with the above challenges.

Event venue:

Rootconf Hyderabad will be held at T-Hub, IIIT-Hyderabad Campus, Gachibowli, Hyderabad, Telangana - 500032

Contact information:

For bulk ticket purchases, sponsorship and other inquiries, contact sales@hasgeek.com or call 7676332020.

Sponsors:

Rootconf Hyderabad 2019 sponsors:


Platinum Sponsor

Atlassian

Bronze Sponsors

UpCloud, Elastic, HashiCorp

For information about the event, tickets (bulk discounts automatically apply on 5+ and 10+ tickets) and speaking, call Rootconf on 7676332020 or write to info@hasgeek.com.

Hosted by

Rootconf is a forum for discussions about DevOps, infrastructure management, IT operations, systems engineering, SRE and security (from infrastructure defence perspective).

Geethanjali Eswaran

@geethanjalieswaran

Consensus problem in Distributed Systems

Submitted Sep 3, 2019

A fundamental problem in a distributed system is reaching consensus on some data value, so that the overall system stays reliable on top of unreliable components. In the real world, system components are never perfect: they are prone to hardware failures, packet drops, slow networks, clock skew and so on. In this talk, let’s walk through a few common scenarios in a distributed system where all the components must agree on the state of the system for it to be reliable.

AUDIENCE
Aspiring Distributed Systems Developers; Technical; Beginner

KEYWORDS
Distributed System, Coordination service, Consensus Problem in Distributed Systems

Outline

What is a consensus in a distributed system?

In the context of distributed systems design, the term consensus is often used loosely to mean some form of agreement. More precisely, consensus involves multiple servers agreeing on values. Once they reach a decision on a value, that decision is final. Typical consensus algorithms make progress when any majority of their servers is available; for example, a cluster of 5 servers can continue to operate even if 2 servers fail. If more servers fail, they stop making progress (but will never return an incorrect result).

That is, a cluster needs 2f+1 nodes to survive f failed nodes.
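
The majority arithmetic above can be checked with a few lines of Python. This is only a sketch of the counting rule, not part of any consensus algorithm; the function names `quorum_size`, `max_tolerated_failures` and `nodes_needed` are hypothetical and introduced here for illustration.

```python
def quorum_size(n: int) -> int:
    """Smallest number of servers that forms a strict majority of n."""
    if n < 1:
        raise ValueError("a cluster needs at least one server")
    return n // 2 + 1


def max_tolerated_failures(n: int) -> int:
    """Largest f such that the surviving n - f servers are still a majority."""
    return n - quorum_size(n)


def nodes_needed(f: int) -> int:
    """Cluster size needed to survive f failed nodes: 2f + 1."""
    if f < 0:
        raise ValueError("f must be non-negative")
    return 2 * f + 1


if __name__ == "__main__":
    for n in (3, 4, 5, 7):
        print(f"n={n} quorum={quorum_size(n)} "
              f"tolerates={max_tolerated_failures(n)}")
```

Note that a 4-node cluster tolerates only one failure, the same as a 3-node cluster, which is why cluster sizes are usually odd.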

There are a few properties we expect from a solution to consensus:
Agreement: Every correct process must agree on the same value.
Validity: If all processes propose the same value v, then all correct processes decide v.
Termination: Every correct process eventually decides some value. A protocol that never terminates satisfies agreement only vacuously, because no process ever decides anything.
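
To make the three properties concrete, here is a minimal sketch that checks them against a recorded outcome of one consensus round. Everything in it is an assumption made for illustration: the function names are hypothetical, `None` stands for "has not decided", and the termination check only inspects a finished snapshot (real termination is a statement about eventual behaviour, which a snapshot cannot prove).

```python
from typing import Dict, Hashable, Iterable, Optional

Decisions = Dict[str, Optional[Hashable]]  # process name -> decided value or None


def check_agreement(decisions: Decisions) -> bool:
    """Agreement: all processes that decided chose the same value."""
    decided = {v for v in decisions.values() if v is not None}
    return len(decided) <= 1


def check_validity(proposals: Dict[str, Hashable], decisions: Decisions) -> bool:
    """Validity: if every process proposed the same v, every decision is v."""
    proposed = set(proposals.values())
    if len(proposed) != 1:
        return True  # the property only constrains the unanimous case
    (v,) = proposed
    return all(d == v for d in decisions.values() if d is not None)


def check_termination(decisions: Decisions, correct: Iterable[str]) -> bool:
    """Termination (snapshot view): every correct process has decided."""
    return all(decisions.get(p) is not None for p in correct)


if __name__ == "__main__":
    proposals = {"a": "x", "b": "x", "c": "x"}
    decisions = {"a": "x", "b": "x", "c": None}  # "c" crashed before deciding
    correct = ["a", "b"]
    print(check_agreement(decisions),
          check_validity(proposals, decisions),
          check_termination(decisions, correct))
```

A crashed process that never decides does not violate any of the properties, because they only constrain correct processes.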

To summarize, the goal of consensus is not to negotiate an optimal value of some kind, but simply to agree collectively on some value that was previously proposed by one of the participating servers in that round of the consensus algorithm. With the help of consensus, the distributed system is made to act as though it were a single entity.

An example scenario:

For simplicity, let’s assume a distributed storage system in which 2f+1 nodes form a cluster. The participants act at their own speed, may fail at any time, and may rejoin after recovering from the failure. The nodes are connected by a network that delivers messages asynchronously, at arbitrary speed. In short, everything can fail at any time, and after a failure participants can recover and rejoin the system. In other words, we are looking at a fault-tolerant storage system. Because nodes can fail at various stages, it’s important to keep more than one copy of our data. For now, let’s assume all the data is replicated across all the cluster nodes (in reality, full replication may hurt overall performance).
We also have a client, which is not part of the cluster, requesting an operation from our distributed storage, such as a read or a write to a data file. A read can be served by any node in our cluster without any issues, but a write has to be agreed upon by all cluster members before it can be committed. If two or more nodes receive a write request for the same value at the same time, how do we determine which request to process in a distributed setup? This is an example of the consensus problem in distributed systems.
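
The concurrent-write scenario can be illustrated with a small simulation. This sketch does not implement a consensus algorithm; it only shows why one is needed. The node names, the file key, the values and the hard-coded `agreed_order` (a stand-in for whatever order a real consensus algorithm would decide) are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Write = Tuple[str, str]  # (key, value)


def apply_writes(writes: List[Write]) -> Dict[str, str]:
    """Apply writes in the given order; the last write to a key wins."""
    store: Dict[str, str] = {}
    for key, value in writes:
        store[key] = value
    return store


def replicas_agree(stores: Dict[str, Dict[str, str]]) -> bool:
    """True if every replica ended up with the same state."""
    states = list(stores.values())
    return all(state == states[0] for state in states)


# Two clients write the same key at the same time.
w_a: Write = ("data.txt", "value-from-client-A")
w_b: Write = ("data.txt", "value-from-client-B")

# Without coordination, each node applies writes in its own arrival order.
arrival_orders = {
    "node1": [w_a, w_b],
    "node2": [w_b, w_a],  # the network delivered the writes in reverse
    "node3": [w_a, w_b],
}
uncoordinated = {n: apply_writes(o) for n, o in arrival_orders.items()}

# With consensus, all nodes apply one agreed order (hard-coded here).
agreed_order = [w_a, w_b]
coordinated = {n: apply_writes(agreed_order) for n in arrival_orders}

if __name__ == "__main__":
    print("uncoordinated replicas agree:", replicas_agree(uncoordinated))
    print("coordinated replicas agree:  ", replicas_agree(coordinated))
```

Which of the two writes ends up last matters less than the fact that every node applies the same order, which matches the point that consensus is about agreeing on some proposed value rather than finding the best one.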

In this talk, I will introduce some prominent algorithms for obtaining consensus in a distributed system.

Requirements

Basic knowledge of distributed systems

Speaker bio

Geethanjali Eswaran, DevOps Engineer for Large-Scale Data Cloud at Salesforce. Passionate about distributed computing, Big Data cloud, Apache projects, the Kerberos protocol, and many more…

Links

Slides

https://docs.google.com/presentation/d/1aAY7uZms0dJZ7x1PWgGnh_ZAipJ-GA5Zcz9vAapOYpw
