Distributed Consensus and Data Safety: NewSQL Perspective
Submitted by Vijay Srinivas Agneeswaran, Ph.D (@vijayagneeswaran) on Tuesday, 18 April 2017
We explore data safety issues in designing large distributed systems. Though data safety issues have been addressed in traditional complex software systems such as aircraft engineering systems, ensuring data safety in distributed systems is a complex and arduous task. The complexity is due to necessity to ensure safety of various data such as configuration data, state changes at individual nodes, global state changes etc. Further, ensuring consistency of global state as well as the verification and validation of all the above data is required. We explore formal verification of the safety properties of distributed systems through recent work on IronFleet (http://sigops.org/sosp/sosp15/current/2015-Monterey/250-hawblitzel-online.pdf).
We start from distributed consensus problem and explain how it can be defined interestingly using the parable of La Tryste, leading to the Fischer, Lynch and Peterson’s impossibility result. We then illustrate conditions/assumptions under which consensus is possible. We discuss how failure detectors can be used to solve consensus. We go on to discuss Paxos algorithms and its various formulations and variations/simplifications. We talk about the CAP theorem and illustrate choices made by different NoSQL systems in this respect. We then present the commit protocols as variations of distributed consensus and illustrate their importance for data safety.
We explore the different kinds of NewSQL datastores which have emerged in the last few years and tackle data safety by providing ACID consistency of distributed state across large collection of nodes. We briefly outline systems such as Google Spanner , Clusterix, VoltDB, NimbusDB etc.
We outline how Google Spanner on the other hand provides ACID consistency across a wide-area based distributed system. It provides a strict form of consistency known as Linearizability . It is the first system to do so across a WAN. Spanner assigns global timestamps to transactions across a distributed set of nodes; timestamps reflect serialization order. The key to Spanner’s global timestamps are its TrueTime API and its implementation. The TrueTime API abstracts and exposes clock uncertainty and allows applications to reason with uncertainty, while the TrueTime API implementation in Google’s datacenters restricts the uncertainty to less than 10 milliseconds. The uncertainty is small compared to say NTP where the deltas between different clocks across a distributed system can be as high as 250 milliseconds. Google’s TrueTime API implementation has achieved that by having two physical clocks on each node: atomic and GPS. F1, an advertisement backend built by Google is the first “client” of Google Spanner and used it in production. In essence, Spanner provides the following features: semi-relational tables, query language based on SQL, elasticity and the notion of ACID transactions.
Another interesting aspect that should be kept in mind while designing large distributed systems is that most existing algorithms, including Paxos and those used in Google Spanner do not solve the Byzantine consensus problem . Byzantine consensus is a formulation of the consensus problem with extreme behavior attributed to nodes, allowing reasoning about difficult real-world conditions such as software bugs. One may have to explore block chain  kind of technologies to solve Byzantine consensus.
For more details, please see my blog in the ACM blog site:
 James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, J. J. Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, Wilson Hsieh, Sebastian Kanthak, Eugene Kogan, Hongyi Li, Alexander Lloyd, Sergey Melnik, David Mwaura, David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, Yasushi Saito, Michal Szymaniak, Christopher Taylor, Ruth Wang, and Dale Woodford. 2012. Spanner: Google’s Globally-Distributed Database. In Proceedings of the 10th USENIX conference on Operating Systems Design and Implementation (OSDI‘12). USENIX Association, Berkeley, CA, USA, 251-264.
 Maurice P. Herlihy and Jeannette M. Wing. 1990. Linearizability: A Correctness Condition for Concurrent Objects. ACM Transactions on Programming Languages and Systems 12, 3 (July 1990), 463-492.
 Seth Gilbert and Nancy Lynch. 2002. Brewer’s conjecture and the feasibility of consistent, available, partition-tolerant web services. SIGACT News 33, 2 (June 2002), 51-59. DOI: https://doi.org/10.1145/564585.564601
 Butler Lampson. 2001. The ABCD’s of Paxos. In Proceedings of the twentieth annual ACM symposium on Principles of distributed computing (PODC ‘01). ACM, New York, NY, USA, 13-. DOI=http://dx.doi.org/10.1145/383962.383969.
 Flavio P. Junqueira, Benjamin C. Reed, and Marco Serafini. 2011. Zab: High-performance broadcast for primary-backup systems. In Proceedings of the 2011 IEEE/IFIP 41st International Conference on Dependable Systems&Networks (DSN ‘11). IEEE Computer Society, Washington, DC, USA, 245-256.
 Diego Ongaro and John Ousterhout. 2014. In Search of an Understandable Consensus Algorithm. In Proceedings of the 2014 USENIX conference on USENIX Annual Technical Conference (USENIX ATC‘14), Garth Gibson and Nickolai Zeldovich (Eds.). USENIX Association, Berkeley, CA, USA, 305-320.
 Leslie Lamport, Robert Shostak, and Marshall Pease. 1982. The Byzantine Generals Problem. ACM Transactions on Programming Languages and Systems 4, 3 (July 1982), 382-401. DOI=http://dx.doi.org/10.1145/357172.357176.
 Crosby, M., Pattanayak, P., Verma, S., & Kalyanaraman, V. (2016). Blockchain technology: Beyond bitcoin. Applied Innovation, 2, 6-10.
- Data safety issues in distributed systems
- Overview of Paxos and Distribtued consensus algorithms
- NewSQL datastores - brief on GoogleSpanner, Clusterix, NimbusDB etc.
- IronFleet and formal verification of safety properties of distributed systems.
Fundamentals of distributed systems.
Dr. Vijay Srinivas Agneeswaran has a Bachelor’s degree in Computer Science & Engineering from SVCE, Madras University (1998), an MS (By Research) from IIT Madras in 2001, a PhD from IIT Madras (2008) and a post-doctoral research fellowship in the LSIR Labs, Swiss Federal Institute of Technology, Lausanne (EPFL). He has joined as Director of Technology in the data sciences team of SapientNitro. He has spent the last ten years creating intellectual property and building products in the big data area in Oracle, Cognizant and Impetus. He has built PMML support into Spark/Storm and realized several machine learning algorithms such as LDA, Random Forests over Spark. He led a team that designed and implemented a big data governance product for a role-based fine-grained access control inside of Hadoop YARN. He and his team have also built the first distributed deep learning framework on Spark. He is a professional member of the ACM and the IEEE (Senior) for the last 10+ years. He has four full US patents and has published in leading journals and conferences, including IEEE transactions. His research interests include distributed systems, data sciences as well as Big-Data and other emerging technologies. He has been an invited speaker in several national and International conferences such as O’Reilly’s Strata Big-data conference series. He will also be speaking at the Strata Big-data conference in London in May 2017. He also gave a keynote speech at the Fifth Elephant conference in 2014. He lives in Bangalore with his wife, son and daughter and enjoys researching history and philosophy of Egypt, Babylonia, Greece and India.