Paxos consensus for beginners (2)

Published in

Distributed Knowledge

5 min readNov 3, 2020

This is the second article about the introduction of Paxos — a consensus algorithm for distributed system replication; it’s highly recommended to read the first one before continuing.

In the last article, I talked about the necessary steps in Paxos(for a single value) by a series of counterexamples; but they happen mainly due to the network(like message loss, delay or out of order). In the following paragraph, I’m going to give a brief overview of what would happen in Paxos when process failures/crashes occur; it’s strongly recommended to have a read on Paul[4], which the following is based on.

Failure in Paxos

Basically, there are the following endings(cases) for proposer failures:

The client sending the request has backuped it and set a timeout; after the timeout, the client would send the request again, probably to a different proposer(typically, Paxos is used as a library to provide a fault-tolerant service to the upper layer, like a distributed key-value application, which can be regarded as “clients” to Paxos; since the server contacted could fail for an indefinite amount of time, there needs to be some cooperation in the client side);
If the proposer has stored the value and proposal number, then when it resumes, it can pick up them and run a round for it again (a value already accepted by a majority can’t be overwritten, so a duplicate proposal won’t harm);
There are other alive proposers that either propagate the value X of the failed one (hearing from acceptors that accept X) or propose their own values (when not knowing from the minority acceptors that accept X).

And when a proposer fails

before receiving the client request -> case 1;
after receiving the client request but before being accepted by any acceptor-> case 1 or case 2, since no other process can know its value;
after being accepted by at least one acceptor -> case 1, 2 or 3.

For acceptors, whenever a majority of them are still on(and not work maliciously), the Paxos algorithm could keep going.

Note that learners are not involved in the value decision process (they are to learn the value but not decide it), so a learner failure just means that it could not know the value when it’s off; when it resumes, it could either ask the proposer for the chosen value or the proposer would tell it.

Paxos for Multiple Values

Remember our ultimate goal is to let all the replicas see the same sequence of values. The extension of (basic) Paxos for this is called Multi-Paxos. Basically, it means we need a basic Paxos for each element in the sequence. Each replica has a log to store the sequence it sees so far, and the log index is attached in the prepare/promise/accept?/accepted messages to distinguish different instances of basic Paxos (e.g. in the basic Paxos for the first sequence element, log index 1 is attached to those 4 messages, and in the basic Paxos for the second sequence element, log index 2 is attached to those 4 messages). The log of one replica could be different from that of another; what a replica can know from the log are:

Whether a value is chosen for a log index. If not, then it can propose a value with this index;
If a value is chosen for the index, it can finally know the chosen value for this index.

When a proposer A receives a client request X, it would

Find the first entry known unchosen (from A’s perspective), and run basic Paxos with that index and the client request;
If the promise message indicates that there is already an accepted value Y for this index, then A writes Y to the log entry, mark it as chosen, and start from step 1 again; otherwise, A would continue proposing X.

A possible case is shown in the following:

Here X isjmp, and A isS1. Since index 1 and 2 are both known chosen by S1, they are skipped. Index 3 is the first one known unchosen; so S1 would start a basic Paxos proposing {index = 3, value = jmp} (note that S1 doesn’t immediately replace cmp with jmp). It then finds that cmp is already accepted so it would start over again; the situation is similar for index 4. Finally, S1 succeeds in proposing jmp on index 5.

As you may find out, basic Paxos instances for different indexes don’t affect each other; so a proposer can receive the next client request and start a new basic Paxos (with the next index known unchosen) before the finish of the last request; i.e. proposing request concurrently (there could be a thread/coroutine for each client request). Still, the sequence is typically an operation sequence that changes the state of a process/replica; the execution of these operations must be executed sequentially as the order of index.

Congratulations! Now you already have a basic idea of Paxos. However, there is still a long road to make Paxos useful in practice; the current version is very inefficient for choosing a sequence of values (two broadcasts) and propagating the chosen values (proposers don’t behave actively to let all learners know the chosen value). Since these two short articles only aim to provide a basic idea of Paxos, for basic efficiency issues please refer Diego[2] (starting from 33:00), for practical issues please refer to papers like Chandra[1] and Kirsch[3].

Try Yourself

There are some free, good resources for you to implement a toy demo for Paxos:

(Simpler one) A simulation of Multi-Paxos on a single machine from University of Chicago
MIT 2015 6.824 Lab 3 to 5: in lab 3, you will first need to build Multi-Paxos, and then a key-value storage service based on it (i.e. the client request is a put/get/append operation); in lab 4, you will need to extend lab 3 into a sharded key-value storage; in lab 5, you will need to add persistence into Paxos so that replicas can resume after the crash.

For the second one, you should first follow the instruction mentioned in this.

In the end, sorry for posting the second article so late, and really thank you to all the readers! Especial thanks for Dung Le.

Reference

[1] Chandra, Tushar D., Robert Griesemer, and Joshua Redstone. “Paxos made live: an engineering perspective.” Proceedings of the twenty-sixth annual ACM symposium on Principles of distributed computing. 2007.

[2] Diego Ongaro, Stanford University’s Paxos lecture, 2013

[3] Kirsch, Jonathan, and Yair Amir. “Paxos for system builders: An overview.” Proceedings of the 2nd Workshop on Large-Scale Distributed Systems and Middleware. 2008.

[4] Paul Krzyzanowski, Rutgers University CS417 Paxos note, 2018

Paxos consensus for beginners (2)

Failure in Paxos

Paxos for Multiple Values

Try Yourself

Reference

Written by Ziliang Lin