This is the second article in an introduction to Paxos, a consensus algorithm for distributed-system replication; it’s highly recommended to read the first one before continuing.
In the last article, I walked through the necessary steps of Paxos (for a single value) via a series of counterexamples, but those problems arise mainly from the network (message loss, delay, or reordering). In the following paragraphs, I’m going to give a brief overview of what happens in Paxos when processes fail or crash; it’s strongly recommended to have a read of Paul, on which the following is based.
Basically, there are the following possible outcomes (cases) for a proposer…
Since Leslie Lamport first wrote The Part-Time Parliament in 1989, Paxos has been at the core of distributed consensus algorithms, and it is notoriously difficult to understand. This article aims to overcome that with simple explanations and examples.
It all starts from wanting to make a single process fault-tolerant. Just as with preventing data loss, redundancy is the most direct approach: replicate the process with identical copies (typically distributed over a cluster), so that when some replicas fail, the others can still provide the service.
An obvious issue is that all the members of the group must see the same sequence of client requests, so that no matter which process a client contacts, the data returned is always consistent. Could we just let one process be the “leader”, i.e., responsible for receiving all client requests and broadcasting them to the others? Well, this is possible, but consider the case where the leader crashes, a new leader is elected, and then the “old” leader resumes: there could now be multiple leaders trying to convince the followers. This is why we need Paxos: to reach a consensus on what the next client request in the sequence is. …
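To make the “reach a consensus on one value” step concrete, here is a minimal sketch of single-decree Paxos in Python. All names here (Acceptor, propose, etc.) are illustrative, not from any library, and the sketch ignores networking and proposer retries; it only shows the two phases and the key safety rule that a proposer must adopt any value a majority member has already accepted.

```python
class Acceptor:
    """One member of the replica group, playing the acceptor role."""

    def __init__(self):
        self.promised = -1       # highest proposal number promised so far
        self.accepted_n = -1     # proposal number of the accepted value
        self.accepted_v = None   # accepted value, if any

    def prepare(self, n):
        # Phase 1b: promise to ignore proposals numbered below n,
        # and report any value this acceptor already accepted.
        if n > self.promised:
            self.promised = n
            return True, self.accepted_n, self.accepted_v
        return False, None, None

    def accept(self, n, v):
        # Phase 2b: accept unless a higher-numbered prepare was promised.
        if n >= self.promised:
            self.promised = n
            self.accepted_n = n
            self.accepted_v = v
            return True
        return False


def propose(acceptors, n, value):
    """One proposal round with number n; returns the chosen value or None."""
    # Phase 1a: send prepare(n) to every acceptor and collect promises.
    promises = [a.prepare(n) for a in acceptors]
    granted = [(an, av) for ok, an, av in promises if ok]
    if len(granted) <= len(acceptors) // 2:
        return None  # no majority promised; caller must retry with a higher n
    # Safety rule: if any promiser already accepted a value, adopt the
    # value with the highest proposal number instead of our own.
    prior = [(an, av) for an, av in granted if av is not None]
    if prior:
        value = max(prior)[1]
    # Phase 2a: ask the acceptors to accept (n, value); majority decides.
    accepts = sum(a.accept(n, value) for a in acceptors)
    return value if accepts > len(acceptors) // 2 else None
```

With three acceptors, once `propose(acceptors, 1, "req-A")` is chosen, a later `propose(acceptors, 2, "req-B")` returns "req-A" rather than "req-B": the second proposer learns the earlier choice in phase 1 and re-proposes it, which is exactly how Paxos stops a resumed old leader from overwriting the decision.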