Consensus problem in Distributed Systems

This submission has been added to the schedule

Submitted Sep 3, 2019

Section: Flash talk (5 mins) Category: Distributed systems

A fundamental problem in distributed systems is reaching consensus on some data value, so that the overall system stays reliable even though it is built from unreliable components. In the real world, system components are never perfect: they are prone to hardware failures, packet drops, slow networks, clock skew, and so on. This talk walks through a few common scenarios in which all the components of a distributed system must agree on the state of the system for it to be reliable.

AUDIENCE
Aspiring Distributed Systems Developers; Technical; Beginner

KEYWORDS
Distributed System, Coordination service, Consensus Problem in Distributed Systems

Outline

What is a consensus in a distributed system?

In the context of distributed systems design, the term consensus is often used loosely to mean some form of agreement. More precisely, consensus involves multiple servers agreeing on values. Once they reach a decision on a value, that decision is final. Typical consensus algorithms make progress when any majority of their servers is available; for example, a cluster of 5 servers can continue to operate even if 2 servers fail. If more servers fail, the cluster stops making progress (but it will never return an incorrect result).

That is, a cluster needs 2f+1 nodes to survive f failed nodes.
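The 2f+1 rule can be checked with a little arithmetic. The following is a minimal sketch, not taken from any particular consensus library; the function names are hypothetical and chosen only for illustration.

```python
def majority(n: int) -> int:
    """Smallest number of servers that forms a strict majority of n."""
    return n // 2 + 1


def tolerated_failures(n: int) -> int:
    """Largest f such that n >= 2f + 1."""
    return (n - 1) // 2


def cluster_size_for(f: int) -> int:
    """Number of nodes needed to survive f failed nodes."""
    return 2 * f + 1


def can_make_progress(n: int, failed: int) -> bool:
    """True while the surviving servers still form a majority."""
    return n - failed >= majority(n)


if __name__ == "__main__":
    for n in (3, 4, 5):
        print(n, "servers: majority", majority(n),
              "tolerates", tolerated_failures(n), "failures")
```

Note that, by this arithmetic, a cluster of 4 servers tolerates only 1 failure, the same as a cluster of 3, which is one reason odd cluster sizes are the usual choice.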

There are a few properties we expect from a solution to consensus:
Agreement: All correct processes decide on the same value.
Validity: If all processes propose the same value v, then all correct processes decide v.
Termination: Every correct process eventually decides some value. A protocol that never terminates satisfies agreement only vacuously, because no process ever decides anything.
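To make the three properties concrete, here is a small sketch of a checker that inspects the outcome of one finished round. It assumes a hypothetical representation that is not part of any real system: `proposals` maps each process name to the value it proposed, `decisions` maps each process name to the value it decided (or `None` if it has not decided), and `correct` is the set of processes that did not fail.

```python
from typing import Dict, Optional, Set


def check_consensus(
    proposals: Dict[str, str],
    decisions: Dict[str, Optional[str]],
    correct: Set[str],
) -> Dict[str, bool]:
    """Report which consensus properties hold for one observed round."""
    # Only the decisions of correct (non-failed) processes matter.
    decided = [decisions.get(p) for p in sorted(correct)]

    # Termination: every correct process has decided something.
    termination = all(d is not None for d in decided)

    # Agreement: correct processes that decided all chose the same value.
    values = {d for d in decided if d is not None}
    agreement = len(values) <= 1

    # Validity: if everyone proposed the same v, nobody may decide otherwise.
    proposed = set(proposals.values())
    if len(proposed) == 1:
        (v,) = proposed
        validity = all(d == v for d in decided if d is not None)
    else:
        validity = True

    return {
        "agreement": agreement,
        "validity": validity,
        "termination": termination,
    }


if __name__ == "__main__":
    result = check_consensus(
        proposals={"n1": "x", "n2": "y", "n3": "x"},
        decisions={"n1": "x", "n2": "x", "n3": None},
        correct={"n1", "n2"},  # n3 is assumed to have crashed
    )
    print(result)
```

A checker like this only observes a finished run; it says nothing about how the processes reached their decisions, which is the job of the consensus algorithm itself.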

To summarize, the goal of consensus is fundamentally not to negotiate an optimal value of some kind, but simply to agree collectively on some value that was previously proposed by one of the participating servers in that round of the consensus algorithm. With the help of consensus, the distributed system is made to act as though it were a single entity.

An example scenario:

For simplicity, let’s assume a distributed storage system in which 2f+1 nodes form a cluster. Each participant acts at its own speed, may fail at any time, and may rejoin after recovering from the failure. The nodes are connected by a network that delivers messages asynchronously, at arbitrary speed. In short, everything can fail at any time, and failed participants can recover and rejoin the system. Yes, we are looking at a fault-tolerant storage system. Because nodes can fail at various stages, it’s important to keep more than one copy of our data. For now, let’s assume all the data is replicated across all the cluster nodes (although in reality full replication may hurt overall performance).
We also have a client, which is not part of the cluster, requesting some operation from our distributed storage, such as a write to or a read from a data file. In this simplified setup, a read can be served by any node in the cluster, but a write has to be agreed upon by the cluster members before it can be committed. If two or more nodes receive write requests for the same value at the same time, how do we determine which request to process in a distributed setup? This is an example of the consensus problem in distributed systems.
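The competing-writes scenario can be simulated with a toy voting round. This is an illustrative sketch only and not one of the real consensus algorithms: it assumes each node grants a single vote to the first write request that reaches it, and that a write commits only once a majority of the cluster has voted for it. The function name `run_round` and the shape of `deliveries` are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def run_round(n_nodes: int, deliveries: List[Tuple[int, str]]) -> Optional[str]:
    """Return the write that wins a majority, or None if no write does.

    deliveries lists (node_index, write_id) pairs in arrival order.
    """
    votes: Dict[int, str] = {}
    for node, write_id in deliveries:
        # A node votes once: the first request to reach it gets its vote.
        votes.setdefault(node, write_id)

    tally: Dict[str, int] = {}
    for write_id in votes.values():
        tally[write_id] = tally.get(write_id, 0) + 1

    quorum = n_nodes // 2 + 1
    for write_id, count in tally.items():
        if count >= quorum:
            return write_id
    return None  # split vote, or too few reachable nodes


if __name__ == "__main__":
    # Writes "A" and "B" race across a 5-node cluster.
    arrivals = [(0, "A"), (1, "B"), (2, "A"), (3, "A"), (4, "B"),
                (0, "B"), (1, "A")]
    print(run_round(5, arrivals))

    # Only 4 nodes are reachable and the vote splits 2-2.
    print(run_round(5, [(0, "A"), (1, "A"), (2, "B"), (3, "B")]))
```

Because every node votes at most once, two different writes can never both collect a majority, so the round cannot commit conflicting values. The price is that a split vote decides nothing at all; a practical algorithm has to add some way of retrying so that the cluster eventually makes progress.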

In this talk, I will introduce some prominent consensus algorithms used to reach consensus in a distributed system.

https://docs.google.com/document/d/1t25KUb7Ys2JS8_pBrN32MHHDFuEiPsUF3nanWcF1Nkw/edit?usp=sharing

Slides

https://docs.google.com/presentation/d/1aAY7uZms0dJZ7x1PWgGnh_ZAipJ-GA5Zcz9vAapOYpw

Rootconf Hyderabad edition