Efficient Replication of Large Data Objects
This work extends the ABD replicated atomic storage algorithm to the replication of large objects. When the objects being replicated are much larger than the metadata (such as tags or pointers), it is efficient to trade off performing cheaper operations on the metadata in order to avoid expensive operations on the data itself.
The basic idea of the algorithm is to separately store copies of the data objects in replica servers, and information about where the most up-to-date copies are located in directory servers. This Layered Data Replication (LDR) approach adopts the ABD algorithm for atomic fault-tolerant replication of the metadata, and prescribes how the replication of the data objects in the replica servers can accompany the replication of the metadata in the directory servers in a concurrent and consistent fashion: In order to read the data, a client first reads the directories to discover the set of up-to-date replicas, and then reads the data from one of those replicas. To write, a client first writes its data to a set of replicas, and then informs the directories that these replicas are now up-to-date. (A sketch of this two-layer state follows below.)
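To make the layering concrete, here is a minimal sketch (in Python; all names are illustrative, not from the paper) of the state held at the two layers: directories keep only metadata, while replicas keep the tagged values themselves.

```python
# Minimal sketch of LDR server state; class and field names are
# illustrative assumptions, not taken from the paper.

class Directory:
    """Holds only metadata: the highest tag seen and the set of
    replicas believed to hold the value written with that tag."""
    def __init__(self):
        self.tag = (0, 0)   # (sequence_number, writer_id), totally ordered
        self.utd = set()    # ids of the up-to-date replicas

class Replica:
    """Holds the actual data: possibly several tagged values of x,
    until older ones are garbage-collected after a write finishes."""
    def __init__(self):
        self.values = {}    # tag -> value of x

    def store(self, tag, value):
        self.values[tag] = value

    def read(self, tag):
        return self.values.get(tag)
```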
The LDR algorithm replicates a single data object supporting read and write operations, and guarantees that the operations appear to occur atomically. While there exist multiple physical copies of the data, users only see one logical copy, and user operations appear to execute atomically on that logical copy. As such, LDR provides linearizability, a strong type of consistency, which guarantees that a read operation returns the most recent version of the data. LDR provides single-copy consistency and is on the CP side of the CAP triangle; availability is sacrificed when a majority of replicas are unreachable.
Client Protocol
When client i does a read, it goes through four phases in order: rdr, rdw, rrr, and rok. The phase names describe what happens during each phase: read-directories-read, read-directories-write, read-replicas-read, and read-ok. During rdr, i reads (utd, tag) from a quorum of directories to discover the most up-to-date replicas; i sets its own tag and utd to the (tag, utd) pair it read with the highest tag, i.e., timestamp. During rdw, i writes (utd, tag) to a write quorum of directories, so that later reads will read i's tag or higher. During rrr, i reads the value of x from a replica in utd. Since each replica may store several values of x, i tells the replica it wants to read the value of x associated with tag. During rok, i returns the x-value it read in rrr.

When i writes a value v, it also goes through four phases in order: wdr, wrw, wdw, and wok. These phase names stand for write-directories-read, write-replicas-write, write-directories-write, and write-ok, respectively. During wdr, i reads (utd, tag) from a quorum of directories, and then sets its tag to be higher than the largest tag it read. During wrw, i writes (v, tag) to a set acc of replicas, where |acc| ≥ f + 1. Note that the set acc is arbitrary; it does not have to be a quorum. During wdw, i writes (acc, tag) to a quorum of directories, to indicate that acc is the set of most up-to-date replicas and tag is the highest tag for x. Then i sends each replica a message to tell them that its write has finished, so that the replicas can garbage-collect older values of x. Then i finishes in phase wok. (Both operations are sketched in code below.)
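Below is a minimal, single-process sketch of the two client operations, built on the Directory and Replica classes above. The quorum helpers are simplifications standing in for real message exchanges, and f is fixed as an assumption.

```python
# Sketch of the LDR client protocol over in-process Directory/Replica
# objects from the previous sketch; helper names are my own.

F = 1  # number of replica crash failures to tolerate (assumption)

def read_meta_quorum(directories):
    # Contact a majority of directories and collect their (utd, tag) pairs.
    return [(d.utd, d.tag) for d in directories[: len(directories) // 2 + 1]]

def write_meta_quorum(directories, utd, tag):
    # Install (utd, tag) at a majority of directories, unless they have newer.
    for d in directories[: len(directories) // 2 + 1]:
        if tag > d.tag:
            d.utd, d.tag = set(utd), tag

def ldr_read(directories, replicas):
    # rdr: learn the highest (utd, tag) from a read quorum of directories.
    utd, tag = max(read_meta_quorum(directories), key=lambda m: m[1])
    # rdw: write it back to a write quorum so later reads see tag or higher.
    write_meta_quorum(directories, utd, tag)
    # rrr: read the value associated with tag from one up-to-date replica;
    # rok: return it.
    return replicas[next(iter(utd))].read(tag)

def ldr_write(directories, replicas, client_id, v):
    # wdr: find the highest tag and choose a strictly larger one.
    _, (seq, _) = max(read_meta_quorum(directories), key=lambda m: m[1])
    tag = (seq + 1, client_id)
    # wrw: store (v, tag) at any set acc of at least f+1 replicas.
    acc = list(replicas)[: F + 1]
    for rid in acc:
        replicas[rid].store(tag, v)
    # wdw: tell a directory quorum that acc is now the up-to-date set; wok.
    # (The final message letting replicas garbage-collect old values is
    # sketched in the replica section below.)
    write_meta_quorum(directories, acc, tag)
```

For example, with three Directory objects in dirs and a dict of three Replica objects in reps, ldr_write(dirs, reps, 1, "blob") followed by ldr_read(dirs, reps) returns "blob".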
If you have difficulty understanding the need for 2-round directory reads/writes in this protocol, reviewing how the ABD protocol works will help.
Replica and Directory node protocol
The replicas respond to client requests to read and write values of the data object x. Replicas also garbage-collect out-of-date values of x, and gossip among themselves the latest value of x. The latter is an optimization to help spread the latest value of x, so that clients can read from a nearby replica (see the sketch below). The directories' only job is to respond to client requests to read and write utd and tag.
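A rough sketch of these server-side duties, extending the Replica class from the earlier sketch; the method names are my own, not the paper's.

```python
# Sketch of replica-side duties from the text: garbage-collecting old
# values once a write finishes, and gossiping the latest value to peers.

class GossipingReplica(Replica):
    def commit(self, tag):
        # The writer's final message makes older values obsolete:
        # keep only values at least as new as the finished write's tag.
        self.values = {t: v for t, v in self.values.items() if t >= tag}

    def gossip_to(self, peer):
        # Push our latest tagged value to a peer, so clients can read
        # from whichever replica is nearby.
        if self.values:
            tag = max(self.values)
            peer.store(tag, self.values[tag])
```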
Questions and discussion
Google File System (SOSP 2003) addressed efficient replication of large data objects for datacenter computing in practice. GFS also provides a metadata service layer and a data object replication layer. For the metadata directory service, GFS uses Chubby, a Paxos service which ZooKeeper cloned as open source. Today, if you wanted to build a consistent large-object replication storage from scratch, your architecture would most likely use ZooKeeper as the metadata directory coordination service, as GFS prescribed. ZooKeeper provides atomic consistency already, so it eliminates the 2 rounds needed for directory-reads and directory-writes in LDR.

LDR does not use a separate metadata service; instead it can scavenge raw dumb storage nodes for the directory service and achieve the same effect by using ABD replication to make the metadata directory atomic and fault-tolerant. In other words, LDR takes a fully decentralized approach, and can support loosely-connected heterogeneous wimpy devices (maybe even smartphones?). I guess that means more freedom. On the other hand, LDR is bad for performance. It requires 2 rounds of directory-writes for each write operation and 2 rounds of directory-reads for each read operation. This is a major drawback for LDR. Considering that reads typically make up 90% of the workload, supporting 1-round directory-reads would have alleviated the performance problem somewhat. Probably in normal cases (in the absence of failures), the first directory read (the rdr operation) will show that the up-to-date replica copy is present at a quorum of directory nodes, and the second round of directory access (the rdw operation) can be skipped, as sketched below.
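Here is a sketch of that hypothesized fast path, reusing the helpers from the client sketch above. This is only my reading of the optimization; whether it preserves atomicity in all corner cases would need the paper's proof obligations to be rechecked.

```python
def ldr_read_fast(directories, replicas):
    # Hypothetical 1-round-read optimization: if the highest tag is
    # already present at every directory in the contacted quorum, no
    # later read can miss it, so the rdw write-back may be skipped.
    metas = read_meta_quorum(directories)
    utd, tag = max(metas, key=lambda m: m[1])
    if not all(m[1] == tag for m in metas):
        write_meta_quorum(directories, utd, tag)  # fall back to rdw
    return replicas[next(iter(utd))].read(tag)
```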
Using ZooKeeper for the metadata directory helps a lot, but a downside can be that ZooKeeper sits at a single centralized location, which means that for some clients, access to ZooKeeper will always incur a high WAN communication penalty. Using ZooKeeper's observers reduces this cost for read operations. And as I will blog about soon, our work on WAN-Keeper reduces this cost for write operations as well. The LDR paper suggests that LDR is suitable for WANs, but LDR still incurs WAN latencies while accessing a quorum of directory nodes (twice!) across the WAN.
Another way to efficiently replicate large data objects is of course key-value stores. In key-value stores, you don't have a metadata directory, as "hashing" takes care of that. On the other hand, most key-value stores sacrifice strong consistency in favor of eventual consistency. Is it true that you can't just get away with using hashes, and need some kind of metadata service if you want to achieve consistency? The consistent key-value stores I can think of (and I can't think of too many) use either a Paxos commit on metadata or at least a chain replication approach, such as in Hyperdex and Replex. The chain replication approach uses a Paxos box only for the node replication configuration information; does that still count as a minimal and 1-level-indirect metadata service?