Don't Settle for Eventual: Scalable Causal Consistency for Wide-Area Storage with COPS


Wyatt Lloyd, Michael J. Freedman, Michael Kaminsky, David G. Andersen
Proc. 23rd ACM Symposium on Operating Systems Principles
(SOSP ’11) Cascais, Portugal, Oct 2011.

Geo-replicated, distributed data stores are all the rage now. These stores provide wide-area network (WAN) replication of all data across geographic regions, in all the datacenters. Geo-replication is very useful for global-scale web applications, especially social networking applications. In social networking, the updates to your account (tweets, posts, etc.) are replicated across several regions because you may have followers there, and they should get low-latency access to your updates. The paper refers to the properties desired of such a geo-replication service with the acronym ALPS (Availability, low Latency, Partition-tolerance, and high Scalability).

Of course, in addition to ALPS, we need to add some consistency requirement to the geo-replication service, because otherwise the updates to a record can arrive at replicas in different/arbitrary orders and expose inconsistent views of the updates to the record. The eventual consistency property that Dynamo advocates is too weak for the developer to build on to provide other services. The inconsistent views of the updates can escape and poison those other services. On the other hand, strong consistency (linearizability) is impossible due to CAP. So, causal consistency is what the COPS paper shoots for. Actually, a later paper shows that this is the best you can get under these constraints.

Yahoo!'s PNUTS provides per-record timeline consistency (or per-key sequential consistency), which is a pretty good consistency guarantee. All the writes to a record are serialized by the per-record-primary server responsible for that record, and these updates are replicated in the same order (the order the per-record primary determines) to the other replicas. PNUTS, however, does not provide any consistency across records/keys. Achieving such a consistency (more specifically, causal consistency) across records/keys introduces scalability challenges and is the aim of this paper, as we explain in the rest of this summary. (Spoiler: To achieve scalability, instead of using vector clocks, COPS explicitly encodes dependencies in metadata associated with each key's version. When keys are replicated remotely, the receiving datacenter performs dependency checks before committing the incoming version.)

Causal consistency model

Dave Andersen explains the causal consistency model nicely in his blog post.
Consider four updates by a single thread: write A=1, write B=dog, write B=cow, write A=2.
In Dynamo, the order in which these updates are applied is unconstrained. In a
remote datacenter, read(A), read(A) might return A=2, A=1.
In PNUTS, the value of “A” could never go backwards: A=1, A=1, … A=2, A=2, …
But in both Dynamo and PNUTS, the relative values of A and B are unconstrained, and each of these systems could return read(A), read(B): {2, dog}, even though the originating thread set B=cow before it set A=2.
In COPS, this latter read sequence could not happen. The allowable reads are: {1, -}, {1, dog}, {1, cow}, {2, cow} (where "-" means B has not been written yet).
In all three systems (in PNUTS, using Read-any), concurrent updates at different datacenters could cause conflicts that invoke conflict handlers, and in this respect the three systems are no different.
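
To make this concrete, here is a small Python sketch (my own illustration, not COPS code) that enumerates the views a causally consistent replica may expose. Since all four writes come from one thread, every write causally depends on all earlier ones, so a replica can only show a prefix of the history:

# Write sequence from the example above, issued by a single thread.
writes = [("A", 1), ("B", "dog"), ("B", "cow"), ("A", 2)]

def allowed_views(writes):
    """Views a causally consistent replica may expose.

    Single-thread writes are totally ordered by causality, so a replica
    applies some prefix of the sequence; each prefix is one legal view.
    """
    state = {"A": None, "B": None}
    views = [dict(state)]            # nothing applied yet
    for key, value in writes:
        state[key] = value
        views.append(dict(state))    # view after applying this prefix
    return views

for view in allowed_views(writes):
    print(view)
# Prints {A:None,B:None}, {A:1,B:None}, {A:1,B:dog}, {A:1,B:cow}, {A:2,B:cow}.
# {A:2, B:dog} never appears, although Dynamo or PNUTS could return it.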
And this brings us to the issue of conflict handling in COPS.

Convergent conflict handling

In causal consistency you can still have conflicts for concurrent events. (Inherent in a distributed system, two nodes may start updates concurrently, and this leads to a "logical" conflict as both updates are uninformed of each other.) For resolving these and to achieve convergence at all datacenters, COPS employs convergent conflict handling, which refers to using an associative and commutative handler. Thomas's last-write-wins rule satisfies this constraint and is used by default, or the developer can write application-specific conflict-resolution rules provided that they are convergent.

More concretely, convergent conflict handling requires that all conflicting puts be handled in the same manner at all replicas, using a handler function h. This handler function h must be associative and commutative, so that replicas can handle conflicting writes in the order they receive them and the results of these handlings will converge (e.g., one replica's h(a, h(b, c)) and another's h(c, h(b, a)) agree).
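
As a toy illustration (my sketch, not the COPS API), a last-writer-wins handler over (timestamp, replica-id) tags is associative and commutative, so replicas converge no matter the order or grouping in which they apply it:

def lww(v1, v2):
    """Return the version with the larger (timestamp, replica_id) tag."""
    (_, tag1), (_, tag2) = v1, v2
    return v1 if tag1 > tag2 else v2

a = ("cow", (5, "dc-east"))
b = ("dog", (5, "dc-west"))   # unique replica ids make ties impossible
c = ("pig", (4, "dc-east"))

# Any grouping/order yields the same winner, so all replicas converge.
assert lww(a, lww(b, c)) == lww(lww(c, b), a) == b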

The paper coins the term "causal+" to refer to causal consistency plus convergent conflict handling together.

Scalability problems in the causal+ model

Previous causal+ work (Bayou, PRACTI) was not scalable, because it achieved causal+ via log-based replay. Log-exchange-based serialization inhibits replica scalability, as it relies on a single serialization point in each replica to establish ordering. Thus, causal dependencies between keys are limited to the set of keys that can be stored on one node, or alternatively, a single node must provide a commit ordering and log for all operations across a cluster.

Inner workings of COPS

A get operation for a record is local at the closest datacenter and is non-blocking. Since all data is replicated at each datacenter, we can serve gets locally.

A put operation for a record:
1) is translated to a put-after-dependencies operation based on the dependencies seen at this site,
2) is queued for asynchronous replication to the other sites/replicas,
3) returns done to the client at this point (early reply), and
4) is asynchronously replicated to the other sites/datacenters.

Each operation maintains dependencies on the operations that precede it. Replication dependencies are checked at each datacenter, and when they are satisfied the value is updated there.
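
Putting steps 1-3 together, the client library's bookkeeping might look like the following sketch (the names here are my assumptions, not the actual COPS interfaces). The library carries a context of versions the client has read or written; a put ships that context as its dependency list, and the context then collapses to just the new version, since it transitively subsumes the older dependencies:

class CopsClient:
    def __init__(self, local_dc):
        self.dc = local_dc
        self.context = {}   # key -> version this session causally depends on

    def get(self, key):
        value, version = self.dc.read_local(key)   # local and non-blocking
        self.context[key] = version
        return value

    def put(self, key, value):
        deps = dict(self.context)                   # dependencies seen so far
        version = self.dc.write_local(key, value)   # step 1: commit locally
        self.dc.queue_replication(key, value, version, deps)  # step 2
        self.context = {key: version}   # new version subsumes older deps
        return version                  # step 3: early reply to the client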

Nodes in each datacenter are responsible for different partitions of the keyspace, but the system can track and enforce dependencies between keys stored on different nodes. COPS explicitly encodes dependencies in metadata associated with each key's version. When keys are replicated remotely, the receiving datacenter performs dependency checks before committing the incoming version. The paper shows that by only checking the nearest dependencies you can reduce the number of dependency checks during replication and get a faster solution.
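
On the receiving side, a corresponding sketch (again with assumed names) commits a replicated put only after its nearest dependencies are locally visible; checking just the nearest dependencies suffices because they transitively guarantee the deeper ones:

def apply_replicated_put(store, key, value, version, nearest_deps):
    # Each check may go to a different node, since the keyspace is
    # partitioned; blocking here is what enforces causal order remotely.
    for dep_key, dep_version in nearest_deps.items():
        store.wait_until_committed(dep_key, dep_version)   # dep_check
    store.commit(key, value, version)   # step 4 completes at this replica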

Subtleties in the definition of availability for the put operation

The availability of get and put operations is defined at the datacenter level and not at the whole-system level. The paper defines availability as: "All operations issued to the data store complete successfully. No operation can block indefinitely or return an error signifying that data is unavailable."

What this means is that the availability of a put is satisfied at step 3, when the early reply is returned to the client by the datacenter. However, the replication of this put operation to other datacenters can in fact be blocked until the put-after-dependencies for the operation are satisfied at the other datacenters. Consider a put operation originated at site C, which introduced put-after-dependencies to updates originated at site B. When this put operation (now a put-after operation) is asynchronously replicated to site A at step 4, this put-after will block at site A until site A receives the dependent-upon updates originated at site B. And if site B is partitioned away from site A, the put-after operation originated at C will block at A. Although sites C and A are connected, the put operation will not be able to complete step 4 as long as B is disconnected from A. I think Hybrid Logical Clocks (HLC) could help here. The loosely synchronized real-time component of HLC would help in truncating the dependency list. The logical clock component of HLC would help in maintaining this dependency list exactly even within the uncertainty window of loosely synchronized clocks.
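
For reference, here is a minimal sketch of the HLC update rules (this follows the HLC work and is not part of COPS): a timestamp is an (l, c) pair where l stays within the clock-synchronization uncertainty of real time, which is what would allow truncating dependencies older than some real-time bound, while c preserves causality inside that uncertainty window.

import time

class HLC:
    def __init__(self):
        self.l, self.c = 0, 0   # l: max physical time seen; c: logical tiebreak

    def send_or_local(self):
        """Timestamp a local or send event."""
        pt = time.time_ns()
        if pt > self.l:
            self.l, self.c = pt, 0
        else:
            self.c += 1
        return (self.l, self.c)

    def receive(self, l_m, c_m):
        """Update the clock on receiving a message timestamped (l_m, c_m)."""
        pt = time.time_ns()
        l_prev = self.l
        self.l = max(l_prev, l_m, pt)
        if self.l == l_prev == l_m:
            self.c = max(self.c, c_m) + 1
        elif self.l == l_prev:
            self.c += 1
        elif self.l == l_m:
            self.c = c_m + 1
        else:
            self.c = 0
        return (self.l, self.c)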

Finally, the COPS client-library solution also does not account for whether the client has backchannels. Using real-time windows can help account for backchannels as well.

Related links

High Scalability post on COPS and followup
COPS project page
Dave's post clarifying some issues
