Replicated/Fault-Tolerant Atomic Storage


The cloud computing community has been showing a lot of love for replicated/fault-tolerant storage these days. Examples of replicated storage at the datacenter scale are GFS, Dynamo, and Cassandra, and at the WAN scale PNUTS, COPS, and Walter. I was searching for foundational distributed algorithms on this topic, and found this nice tutorial paper on replicated atomic storage: Reconfiguring Replicated Atomic Storage: A Tutorial, M. K. Aguilera, I. Keidar, D. Malkhi, J-P Martin, and A. Shraer, 2010.

Replication provides masking fault-tolerance to crash failures. However, this would be a limited/transient fault-tolerance unless you reconfigure your storage service to add a new replica to replace the crashed node. It turns out that this on-the-fly reconfiguration of a replicated storage service is a subtle/complicated issue due to the concurrency and fault-tolerance issues involved. This team at MSR Silicon Valley has been working on reconfiguration issues in replicated storage for some time. But, in this post I am not going to talk about their reconfiguration work, and will instead focus only on the replicated/fault-tolerant atomic storage part.

Majority replication algorithm
There is this elegant algorithm for replicated/fault-tolerant atomic storage that I think every distributed systems researcher/developer should know about. It is simple and powerful. And, it is fun; I hope your brain will feel better about itself after you learn this majority replication algorithm. The algorithm originally appeared in: Sharing memory robustly in message-passing systems, H. Attiya, A. Bar-Noy, and D. Dolev, 1995. Here I will summarize the algorithm based on the discussion provided in the tutorial paper.

The algorithm employs majority replication to provide atomic read/write operations in the presence of crash failures under an asynchronous execution model. (Of course, the FLP result states the impossibility of solving consensus under this model, but this is a weaker problem than solving consensus.) Here, atomic means that the system provides linearizability, a strong type of consistency that guarantees that a read returns the most recent version of data. This single-copy consistency is stronger than Amazon Dynamo's eventual consistency and even GFS's consistency. The algorithm is on the CP side of the CAP triangle; availability is sacrificed when a majority of replicas are unreachable.

Write operation
Each storage node keeps a local copy of what it believes to be the most recent value stored by a client, together with a timestamp indicating the freshness of the value. A vt-pair refers to a pair of (value, timestamp), which a storage node keeps. To execute a write(v) operation, the client proceeds in two phases: it executes a get phase followed by a set phase.

get phase:
vt-set = read vt pairs from a majority of storage nodes
select unique t' such that t' > max(t in vt-set)

set phase:
write_request (v, t') on storage nodes
storage nodes store (v, t') only if t' > their stored t
storage nodes send ack
when majority acks are received, return OK

(Uniqueness of t' can be ensured by appending the client-id to the timestamp, so that a timestamp consists of a pair with a number and a client-id, ordered lexicographically.)
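
To make the two phases concrete, here is a minimal in-memory Python sketch of the write path, under my own simplifying assumptions (the names Replica, majority, and write_value are mine, not from the paper): replicas are modeled as local objects rather than remote nodes, and the client contacts a fixed majority instead of whichever majority happens to respond first.

# A rough sketch of the write path of majority replication (ABD-style).
# Assumption: replicas are local objects; a real system would use RPCs and
# proceed as soon as any majority of replicas responds.

class Replica:
    """One storage node holding a single register as a (value, timestamp) pair."""
    def __init__(self):
        self.value = None
        self.ts = (0, "")              # timestamp = (number, client-id), ordered lexicographically

    def get(self):
        return (self.value, self.ts)

    def set(self, value, ts):
        # Store (v, t') only if t' is larger than the stored timestamp; ack either way.
        if ts > self.ts:
            self.value, self.ts = value, ts
        return "ack"

def majority(n):
    return n // 2 + 1

def write_value(replicas, client_id, v):
    """write(v): a get phase to pick a unique fresh timestamp, then a set phase."""
    quorum = majority(len(replicas))
    # Get phase: read vt pairs from a majority of storage nodes.
    vt_set = [r.get() for r in replicas[:quorum]]
    max_num = max(ts[0] for _, ts in vt_set)
    t_new = (max_num + 1, client_id)   # unique: includes the client-id, larger than all t seen
    # Set phase: write (v, t') to the storage nodes; return OK once a majority acks.
    acks = sum(1 for r in replicas if r.set(v, t_new) == "ack")
    assert acks >= quorum
    return "OK"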

Read operation
The read() operation is very similar to the write operation. The client also executes the get and set phases back to back. The only difference is that in the set phase, the client writes back the maximum-timestamp vt pair it learns in the get phase.

get phase:
vt-set = read vt pairs from a majority of storage nodes
select (v, t') such that t' = max(t in vt-set)

set phase:
write_request (v, t') on storage nodes
storage nodes store (v, t') only if t' > their stored t
storage nodes send ack
when majority acks are received, return v

The set phase in read() is needed to preclude oscillating reads due to storage node failures, in which successive reads oscillate between a new and an old value while a write is in progress -- which is a violation of atomicity. The set phase ensures that a subsequent read() will return a value at least as recent as the value returned by the current read(). The key intuition here is that any two majorities of storage nodes always have at least one storage node in common (for example, with 2f+1 nodes, any two majorities of size f+1 overlap in at least one node). Therefore if some client stores value v at a majority of storage nodes, then another client is guaranteed to see v when it queries any majority.
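
Continuing the same hypothetical sketch (it reuses Replica, majority, and write_value from the write sketch above), the read path below runs the get phase, picks the freshest vt pair, and writes it back to a majority before returning, which is exactly the write-back that rules out oscillating reads.

# Read path for the sketch above: pick the pair with the highest timestamp
# and write it back to a majority before returning it.

def read_value(replicas):
    quorum = majority(len(replicas))
    # Get phase: read vt pairs from a majority of storage nodes.
    vt_set = [r.get() for r in replicas[:quorum]]
    v, t = max(vt_set, key=lambda vt: vt[1])      # freshest (value, timestamp) pair
    # Set phase (write-back): propagate the freshest pair before returning v.
    acks = sum(1 for r in replicas if r.set(v, t) == "ack")
    assert acks >= quorum
    return v

# Hypothetical usage: 3 replicas tolerate 1 crash; a write followed by a read.
replicas = [Replica() for _ in range(3)]
write_value(replicas, "client-1", "hello")
print(read_value(replicas))                        # prints "hello"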

Relation to Paxos
The majority replication algorithm seems closely related to the Paxos consensus algorithm. The t in the vt-pair corresponds to the ballot number in Paxos. The get and set phases correspond to phase 1 and phase 2 of Paxos. (Of course, since majority replication is not for consensus, there is nothing corresponding to phase 3, the commit phase, of Paxos.) Finally, the read operation corresponds to the learning operation in Paxos. Now the differences. In majority replication clients do not coordinate for the write operation, whereas in Paxos, leaders are constrained to re-propose/rewrite the value with the highest t. Also, to avoid the dueling-leaders problem, Paxos relies on a leader election service so that the system eventually converges to one leader that can safely anchor/finalize a value as the decision value. (The leader election service in Paxos needs some partial synchrony to make progress, so consensus is achieved only then.) In summary, majority replication is a building block for Paxos consensus.

This relation is explained in more detail in the "Perspectives on the CAP theorem" paper.

Concluding remarks
The nice thing about this elegant algorithm is that it can tolerate/mask the crash of a minority of storage nodes and an arbitrary number of client nodes, and it works in an "asynchronous" system. That the correctness of this algorithm does not depend on a synchronous system makes this algorithm really robust for deployment in distributed systems, especially WAN systems.

Consensus/Paxos based algorithms can make reconfiguration of the replication service possible. Examples are the RAMBO algorithm, and FAB: Building Distributed Enterprise Disk Arrays from Commodity Components, which provides an implementation of these ideas. But, the reconfiguration tutorial paper explains that it is also possible to implement reconfiguration of replication under the asynchronous model (without consensus)!
