Gryff: Unifying Consensus As Well As Shared Registers

This newspaper is past times Matthew Burke, Audrey Cheng, as well as Wyatt Lloyd, as well as appeared inward NSDI'20. Here is a link to the paper, slides, as well as video presentation.

Straight speak (from the midpoint of the book)

The model of the newspaper is a keen contribution. Stable versus Unstable ordering is a proficient framework to intend in. Carstamps (consensus after registers) logical clock timestamping is a proficient agency to realize this ordering. I intend carstamps volition come across proficient adoption, equally it is clear, concrete, as well as useful.
Constructing a hybrid of EPaxos as well as ABD is a novel idea.
The performance of Gryff is non good. A straightforward distributed key-value sharded implementation of Paxos would practice a meliorate job. I intend Hermes is a meliorate selection than Gryff alongside read-write as well as read-modify-write operations.

Introduction

Recently nosotros come across a lot of involvement inward unifying consensus as well as shared registers, the topic of the paper. I intend this is because of the popularity of distributed key-value stores/systems. While consensus is oftentimes used for implementing distributed key-value stores, this is a combat of an overkill. You don't require to know the previous value of the key, if y'all are overwriting it alongside a write functioning to the key. For the write operation, the novel value is non a role of the sometime value. As Gryff says inward the abstract: "Consensus is inefficient for workloads composed by as well as large of reads as well as writes". ABD is proficient plenty for that, as well as it fifty-fifty gives y'all linearizability to the human face upwards of concurrent asynchronous execution alongside crash faults. However, ABD for shared registers is also weak to implement stronger synchronization primitives. You would silent require to utilization the consensus gun for the occasional read-modify-write operation.

So this led many people to intend well-nigh providing read/write operations as well as read-modify-write operations separately for implementing distributed key-value systems. Hermes (ASPLOS'20), which I reviewed earlier, is an illustration of this. Fine-Grained Replicated State Machines (NSDI'20), which nosotros volition speak over inward an upcoming Zoom DistSys reading group, also looks at a related problem.

Consensus vs. Shared Registers

A large contribution of the newspaper is the framework it introduces for thinking well-nigh read write operations as well as read-modify write operations.

Applying commands inward the same gild on all replicas requires an ordering machinery that is stable, i.e., a replica knows when a command's seat is fixed as well as it volition never have an before command. In asynchronous systems where processes tin fail, consensus protocols are used to concur on this stable ordering.

Shared register protocols render a linearizable ordering of operations. That ordering does non bring to live stable, however, because each write functioning fully defines the patch of the object. Thus, a replica tin safely apply a write w4 fifty-fifty if it does non know well-nigh before writes. If an before write w3 always does arrive, the replica only ignores that write because it already has the resulting patch from applying w3 as well as thence w4. Figure 1b shows shared register ordering where at that topographic point is a total gild of all writes (denoted past times <) without stability.

A shared object inward Gryff exposes the next interface:

READ(): returns the value of the object
WRITE(v): updates the value of the object to v
RMW( f (·)): atomically reads the value v, updates value to f (v), as well as returns v

Carstamps for right ordering

The newspaper says that AQS (active quorum system) protocol [2010] which tried to unify consensus as well as shared registers, has a subtle põrnikas that allows rmw to live misplaced/misordered. The newspaper says that, to simplify reasoning well-nigh correctness, it is best to enforce the interaction at a deeper level, inward the ordering mechanism, past times imposing a structural gild inward the timestamps.

To leverage this insight, the newspaper introduces consensus-after-register timestamps, or carstamps. Carstamps allow writes as well as rmws to concurrently modify the same patch without serializing through a leader or incurring additional circular trips. Reads utilization carstamps to determine consistent values without interposing on concurrent updates.

Gryff Protocol

The yell Gryff stands for Griffin, a hybrid betwixt king of beasts as well as eagle, equally an attribution for the protocol beingness a hybrid betwixt EPaxos as well as ABD.

The solely divergence inward read-write inward Gryff from multi-writer ABD is that replicas hold a carstamp associated alongside the electrical current value of the shared object instead of a tag thence that rmws are properly ordered alongside honour to reads as well as writes.

Write. The rmwc champaign is reset to 0 (Line fifteen of Algorithm 1).
Reads. "We brand the same observation equally Georgiou et al. [26] that the instant stage inward the read protocol of multi-writer ABD is redundant when a quorum already shop the value as well as associated carstamp chosen inward the foremost phase."

I similar the succinct description of EPaxos inward the paper; it depicts EPaxos equally a generalization of Paxos alongside its 3 phases:

EPaxos is a consensus protocol that provides optimal commit latency inward the wide-area. It has 3 phases inward failure-free executions: PreAccept, Accept, as well as Commit. If a ascendence commits on the fast path (i.e., If the coordinator receives a fast/supermajority quorum of responses that all incorporate the same dependencies), the coordinator returns to the customer after the PreAccept stage as well as skips the Accept stage (where it builds the lastly dependencies for the ascendence past times taking the wedlock of all the dependencies that it received inward the PreAccept phase). Otherwise, the ascendence commits on the tiresome path after the Accept phase. Commands that practice non read patch consummate at the outset of the Commit phase; commands that practice read patch consummate after a unmarried replica, typically the coordinator, executes the ascendence to obtain the returned state. The move of the PreAccept as well as Accept phases is to institute the dependencies for a command, or the laid of commands that must live executed before the electrical current command. The move of the Commit stage is for the coordinator to notify the other replicas of the agreed-upon dependencies.

Gryff makes 3 high-level modifications to EPaxos to unify its stable ordering alongside the unstable ordering of the multiwriter ABD read as well as write protocol.

A base of operations update attribute, base, is decided past times the replicas during the same procedure that establishes the dependencies as well as the jurist sequence number for a rmw.
A rmw completes after a quorum execute it.
When a rmw executes, it chooses its base of operations update from betwixt its base of operations attribute as well as the resultant of the previously executed rmw prev. The resultant of the executed rmw is applied to the value as well as carstamp of the executing replica.

The foremost alter adapts EPaxos to piece of work alongside the unstable gild of writes past times fixing the write upon which it volition operate. The instant alter adapts EPaxos to piece of work alongside reads that bypass its execution protocol as well as straight read state. In other words, the instant alter makes the ABD protocol able to read the EPaxos outcome. This makes the commit stage a two-phase operation; the rmw coordinator should come across the commit completed past times a bulk quorum.

In sum, Gryff makes reads to a greater extent than efficient past times performing them alongside ABD (with the above-mentioned Georgiou optimization) instead of EPaxos where a supermajority quorum would live needed. While Gryff uses two-round ABD writes, I intend that may also live reduced to 1 circular write alongside a fox for inferring the higher TS value to suggest fourth dimension alongside (like using HLC clock timestamps) inward an optimistic agency as well as acquire the higher timestamp if that fails, as well as consummate inward 2 rounds.

On the other hand, I am starting to similar the invalidation thought inward Hermes more. In contrast to ABD used inward Gryff, Hermes allows linearizable reads piece writes are ongoing, as well as local reads at that.

A coordinator node issues a write to a telephone substitution solely if it is inward the Valid state; otherwise the write is stalled. This doesn't seem to live necessary for safety, because the higher timestamped writes volition preempt the lower timestamped writes. So why does Hermes practice this? I intend they practice this, because it acquire replicas come across the writes concluded, fifty-fifty when at that topographic point is a deluge of writes to the same key. This may inward plow aid alleviate the read starvation due to constant overflowing of writes to the same key. I found this inward the slack channel for ASPLOS'20 from the foremost author:
It is condom for a read that initially found the object invalidated alongside version timestamp 2 as well as thence afterward invalidated alongside a version timestamp 3 to acquire serialized as well as render the version 2 value. Intuitively this is partly condom because a write alongside version 3 could non bring started unless the write alongside version 2 has been committed.

Proxying reads

The base of operations Gryff read protocol provides reads alongside unmarried round-trip fourth dimension latency from the coordinator to the nearest quorum including itself (1 RTT) when at that topographic point are no concurrent updates. Otherwise, reads bring at most 2 RTT latency. The newspaper discusses how read latency tin live farther improved inward deployments across broad expanse networks.

Because the round-trip fourth dimension to the replica that is colocated alongside a customer procedure is negligible relative to the interreplica latency, replicas tin coordinate reads for their colocated clients as well as utilize their local patch inward the read coordinator protocol to terminate after 1 RTT to a greater extent than often. When using this optimization, nosotros say that the coordinating replica is a proxy for the customer process's read.

Propagating Extra Data inward Read Phase 1. The proxy includes inward the Read1 (i.e., read-phase1) messages its electrical current value v as well as carstamp cs. Upon receiving a Read1 message alongside this additional information, a replica applies the value as well as carstamp before returning its electrical current value as well as carstamp. This has the number of ensuring every replica that receives the Read1 messages volition bring a carstamp (and associated value) at to the lowest degree equally large equally the carstamp at the proxy when the read was invoked.
Updating the Proxy’s Data. The proxy also applies the values as well as carstamps that it receives inward Read1Reply messages equally it receives them as well as before it makes the conclusion of whether or non to consummate the read after the foremost phase. If every respond contains the same carstamp, thence the read completes after 1 RTT fifty-fifty if the carstamp at the proxy when the read was invoked is smaller than the carstamp contained inward every reply.

Evaluation

Gryff is implemented inward the same framework equally EPaxos as well as MultiPaxos as well as its performance is evaluated inward a geo-replicated setting. The evaluation shows that, for workloads alongside moderate contention, Gryff reduces p99 read latency to ∼56% of EPaxos, but has ∼2x higher write latency. This tradeoff allows Gryff to cut back service-level p50 latency to ∼60% of EPaxos for large-scale spider web applications whose requests fan-out into many storage-level requests.

My large work alongside the evaluation is that it doesn't utilization leader-leases optimization inward MultiPaxos to allow serving of reads locally at the leader. This touchstone optimization would probable Pb to MultiPaxos giving the best resultant for read latencies inward the evaluations.

Another affair missing inward the evaluation is a comparing alongside Cassandra. Cassandra implements read/write registers via ABD-like algorithm as well as tin give linearizability if y'all configure the quorums accordingly. Cassandra also has CASPaxos for compare-and-set for conditional write, which tin live used to implement read-modify-write.

The evaluation shows less blocking for rmws inward Gryff. Gryff achieves 2 RTT rmws when at that topographic point are no conflicts as well as 3 RTT when at that topographic point are. While Gryff must silent block the execution of rmws until all dependencies bring been received as well as executed, Gryff experiences significantly less blocking than EPaxos. This is because EPaxos needs to bring dependencies on writes, but Gryff’s rmw protocol does non proceed rails of dependencies on writes.

We come across that Gryff as well as EPaxos each accomplish a slightly higher maximum throughput than MultiPaxos due to their leaderless structure. (This is of course of written report at depression conflict rates, because at high conflict rates EPaxos as well as Gryff pay a rigid price). It is tardily to access MultiPaxos to per-key sharded Paxos, as well as that would compete as well as out-do Gryff as well as EPaxos. For achieving best performance (for both throughput as well as latency) for a WAN key-value deployment, however, I suggest using our WPaxos protocol, equally it tin adjust to locality of access equally well.