AetherWorks
Find out more about us on our homepage at aetherworks.com
AetherStore
Turn the spare storage space on your office workstations into a shared storage system.
Download AetherStore
Become an early adopter and download AetherStore beta at aetherstore.com!

In distributed systems, ensuring a set of machines are able to reach consensus on a decision is extremely important. Database systems, for example, need to ensure that all machines agree whether to commit or rollback a transaction, so that each machine holds a consistent view of its data (imagine the problems your bank would have if one machine thought you had $1000 in your account, but another thought that you had $0).

Consensus is difficult to achieve, because messages between machines can be lost or indefinitely delayed, or the machines themselves can fail — how does one machine know whether another machine has processed a message?

Two-phase commit is commonly used to provide consensus, but it suffers from a single point of failure — if the node coordinating a transaction fails, the system must block until this node restarts. The three-phase commit protocol removes the blocking problem, but is still reliant on a single coordinator.

This post discusses Paxos, a distributed alternative to two- and three-phase commit. Paxos guarantees that nodes will only ever choose a single value (meaning it guarantees safety), but does not guarantee that a value will be chosen if a majority of nodes are unavailable (progress). Consensus algorithms aim for safety because it doesn’t matter whether we commit or rollback — neither is more correct than the other — but it is of critical importance that only one answer is chosen.

General Approach

Paxos node can take on any or all of three roles: proposeracceptor, and learner. A proposer proposes a value that it wants agreement upon. It does this by sending a proposal containing a value to the set of all acceptors, which decide whether to accept the value. Each acceptor chooses a value independently — it may receive multiple proposals, each from a different proposer — and sends its decision to learners, which determine whether any value has been accepted. For a value to be accepted by Paxos, a majority of acceptors must choose the same value. In practice, a single node may take on many or all of these roles, but in the examples in this section each role is run on a separate node, as illustrated below.

Figure 1: Basic Paxos architecture. A number of proposers make proposals to acceptors. When an acceptor accepts a value it sends the result to learner nodes.

Paxos By Example

In the standard Paxos algorithm proposers send two types of messages to acceptors: prepare and accept requests. In the first stage of this algorithm a proposer sends a prepare request to each acceptor containing a proposed value, v, and a proposal number, n. The proposed value can be commit or rollback, as in the previous database example, or it can be any other arbitrary value. In this post the Paxos nodes are attempting to reach consensus on an integer value. Each proposer’s proposal number must be a positive, monotonically increasing, unique, natural number, with respect to other proposers’ proposal numbers[1].

In the example illustrated below, there are two proposers, both making prepare requests. The request from proposer A reaches acceptors X and Y before the request from proposer B, but the request from proposer B reaches acceptor Z first.

Figure 2: Paxos. Proposers A and B each send prepare requests to every acceptor. In this example proposer A’s request reaches acceptors X and Y first, and proposer B’s request reaches acceptor Z first.

If the acceptor receiving a prepare request has not seen another proposal, the acceptor responds with a prepare response which promises never to accept another proposal with a lower proposal number. This is illustrated in Figure 3 below, which shows the responses from each acceptor to the first prepare request they receive.

Figure 3: Paxos. Each acceptor responds to the first prepare request message that it receives.

Eventually, acceptor Z receives proposer A’s request[2], and acceptors X and Y receive proposer B’s request. If the acceptor has already seen a request with a higher proposal number, the prepare request is ignored, as is the case with proposer A’s request to acceptor Z. If the acceptor has not seen a higher numbered request, it again promises to ignore any requests with lower proposal numbers, and sends back the previous highest proposal number that it has seen along with the value of that proposal. This is the case with proposer B’s request to acceptors X and Y, as illustrated below:

Figure 4: Paxos. Acceptor Z ignores proposer A’s request because it has already seen a higher numbered proposal (4 > 2). Acceptors X and Y respond to proposer B’s request with the previous highest request that they acknowledged, and a promise to ignore any lower numbered proposals.

Once a proposer has received prepare responses from a majority of acceptors it can issue an accept request. Since proposer A only received responses indicating that there were no previous proposals, it sends an accept request to every acceptor with the same proposal number and value as its initial proposal (n=2, v=8). However, these requests are ignored by every acceptor because they have all promised not to accept requests with a proposal number lower than (in response to the prepare request from proposer B).

Proposer B sends an accept request to each acceptor containing the proposal number it previously used (n=4) and the value associated with the highest proposal number among the prepare response messages it received (v=8)[3]. Note that this is not the value that proposer B initially proposed, but the highest value from the prepare response messages it saw.

Figure 5: Paxos. Proposer B sends an accept request to each acceptor, with its previous proposal number (4), and the value of the highest numbered proposal it has seen (8, from [n=2, v=8]).

If an acceptor receives an accept request for a higher or equal proposal number than it has already seen, it accepts and sends a notification to every learner node. A value is chosen by the Paxos algorithm when a learner discovers that a majority of acceptors have accepted a value, as is illustrated below:

Figure 6.

Once a value has been chosen by Paxos, further communication with other proposers cannot change this value. If another proposer, proposer C, sends a prepare request with a higher proposal number than has previously been seen, and a different value (for example, n=6, v=7), each acceptor responds with the previous highest proposal (n=4, v=8). This requires proposer C to send an accept request containing [n=6, v=8], which only confirms the value that has already been chosen. Furthermore, if some minority of acceptors have not yet chosen a value, this process ensures that they eventually reach consensus on the same value.

Various efficiency improvements to the standard Paxos algorithm are discussed in the papers by Lamport and Baker et al.. For example, a prepare request is not necessary if the proposer knows that it is the first to suggest a value. The proposal for such a request is numbered 0, so that it will be ignored if any higher numbered requests have been received.

General Utility

Paxos has often been criticized for its complexity, particularly with respect to the challenge of implementing it in a functional form. In spite of this, it’s an interesting example of a particularly challenging distributed systems problem, and a clever, conceptually-clean solution.

If you’re interested in this topic, i’d recommend reading about real-world examples of Paxos and two- and three-phase commit, starting with the references below.

This post is an updated version of one that first appeared on angusmacdonald.me. It was in turn based on work done during my Ph.D.

References

L. Lamport, “Paxos Made Simple” in ACM SIGACT News, vol. 32, no. 4, pp. 18–25, 2001.

Baker, J., Bond, C., Corbett, J. C., Furman, J., Khorlin, A., Larson, J., Léon, J. M., Megastore: Providing Scalable, Highly Available Storage for Interactive Services in Proceedings of the Conference on Innovative Data Systems Research, pp. 223-234, 2011.

T. D. Chandra, R. Griesemer, and J. Redstone, “Paxos made live: an engineering perspective”, in Proceedings of the twenty-sixth annual ACM Symposium on Principles of Distributed Computing, 2007, pp. 398–407.

M. Burrows, “The Chubby Lock Service for Loosely-Coupled Distributed Systems”, in Proceedings of OSDI 2006.


[1] The method of ensuring the uniqueness of proposal numbers when there are multiple proposers is not specified in the Paxos algorithm itself.

[2] It may not, but the algorithm is resilient to this.

[3] Note that this is the highest proposal number that it received from prepare response messages. In this example, proposer B has a higher numbered proposal (n=4) than proposer A (n=2), but it has only received proposer A’s proposal in response to its prepare request. If no previous proposals were returned by the prepare response messages, proposer B would use its own proposal (n=4).

6 Responses to Distributed Consensus: Paxos by Example

  1. Craig Y. says:

    Hey, great post! It’s an especially timely one for me because I’ll be taking a distributed computing exam in a few hours. The diagrams in particular are wonderful and much more clear than ones I’ve seen previously.

  2. Naveen says:

    I have a query over the following step:
    …and sends back the previous highest proposal number that it has seen along with the value of that proposal

    Am I right in thinking that this step does not appear in ‘Paxos Made Simple’? In that paper, I only see mention:
    …and with the highest-numbered proposal (if any) that it has accepted.
    ‘Accepted’ being the key-word (not ‘proposed’).

    So why is the extra step included?
    I’m not seeing how or if it makes a difference to liveness or safety?
    It seems that if 3 proposals arrive in sequence – the final accept will include the value of the penultimate proposal, which seems somewhat arbitrary.

    Am I missing something?

    Thanks

    • When I say ‘the highest proposal number it has seen’, i’m referring to the ‘highest-numbererd proposal it has accepted’, so I think it says the same thing (a little ambiguous though).

      I use the word ‘proposal’ at the end of your quote because I see that as the object that ties together the number (which was accepted) and its corresponding value (which is returned). So I see that as a proposal which was subsequently accepted.

      Regarding the extra step: it is required so that once a value is chosen, it is never changed — the paragraph beginning with ‘Once a value has been chosen by Paxos’ has an example of a situation where this would occur (and where the extra step prevents a problem).

      In terms of it being ‘somewhat arbitrary’, this is true, but the purpose of the algorithm isn’t to produce a last-write-wins result, but to ensure that everyone agrees on one of the writes — the algorithm doesn’t care which it is.

      I hope that helps, Naveen. Thanks for the question!

  3. […] Questo articolo è una traduzione ed un miglioramento dell’articolo originale di Angus Macdonald. […]

  4. Eugene Von says:

    how can learner decide which is the majority?

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code class="" title="" data-url=""> <del datetime=""> <em> <i> <q cite=""> <strike> <strong> <pre class="" title="" data-url=""> <span class="" title="" data-url="">

AetherStore

Use your space.

AetherStore turns the spare storage space on your office workstations into a shared storage system.

Find out more here.