This is the second article about the introduction of Paxos — a consensus algorithm for distributed system replication; it’s highly recommended to read the first one before continuing.
In the last article, I talked about the necessary steps in Paxos(for a single value) by a series of counterexamples; but they happen mainly due to the network(like message loss, delay or out of order). In the following paragraph, I’m going to give a brief overview of what would happen in Paxos when process failures/crashes occur; it’s strongly recommended to have a read on Paul, which the following is based on.
Since its first publication The part-time parliament by Leslie Lamport in 1989, Paxos has been the core of distributed consensus algorithms and is notoriously difficult to understand. This passage aims to overcome this with simple explanations and examples.
The very origin is that we want to make a single process fault-tolerant. Just as what is done to prevent data loss, redundancy is the most direct way — replicating the process with identical ones (typically distributed over a cluster), so that when some replicas fail, the others can still provide the service.
An obvious issue is that we need to make…