Exploiting Commutativity For Practical Fast Replication (NSDI'19)

This paper appeared in NSDI'19, and is by Seo Jin Park and John Ousterhout. You can access the paper as well as the conference presentation video here.

Here is also a YouTube link to a presentation of the paper in our Zoom DistSys reading group. This was the tenth paper we discussed in our Zoom group. We have been meeting on Zoom every Wednesday since April 1st at 15:30 EST. Join our Slack channel to get paper information, discussion, and meeting links: https://join.slack.com/t/distsysreadinggroup/shared_invite/zt-eb853801-3qEpQ5xMyXpN2YGvUoIyg

The problem

The paper considers how to perform consistent replication fast, in one round-trip time (RTT) from the client.

Of course, if we were to do asynchronous replication, we could get 1 RTT. The client would contact the leader in the replication group, and the leader would acknowledge back and start replicating to backups/followers asynchronously. The problem with this scheme is that it is not fault-tolerant. The leader may crash after acking the write, and the client would lose the read-your-own-write guarantee, let alone linearizability.

The paper introduces the Consistent Unordered Replication Protocol (CURP) to address this problem. CURP augments the above asynchronous replication scheme with a temporary stash for fault-tolerance.

Happy path

CURP supplements the primary-backup replication system with a fault-tolerance stash: witness-based, out-of-order, lightweight replication. The client replicates each operation to the witnesses in parallel while sending the request to the primary server. The primary executes the operation and returns to the client without waiting for normal replication to its backups, which happens asynchronously. When the client hears from all f witnesses and receives the speculative reply from the primary, it considers the write to be completed.

Since the master returns to clients before replication to backups, we get 1 RTT updates. The role of the witnesses is to ensure durability. Once all f witnesses have accepted the requests, clients are assured that the requests will survive master crashes, so clients complete the operations with the results returned from masters. If the master does crash, data from the witnesses is combined with that from the normal replicas to re-create a consistent server state.
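To make the flow concrete, here is a minimal sketch of the 1 RTT happy path, with in-process stand-ins for the master and the f witnesses. All class and function names are mine, not the paper's, and real CURP uses RPCs rather than method calls:

```python
class Witness:
    def __init__(self):
        self.keys = set()

    def record(self, key, op):
        # Accept only if commutative with every saved op; for a key-value
        # store, that means the op touches a key no saved op touches.
        if key in self.keys:
            return False
        self.keys.add(key)
        return True

class Master:
    def __init__(self):
        self.state = {}

    def execute(self, key, value):
        self.state[key] = value
        return value  # speculative reply, sent before backup replication

def curp_write(master, witnesses, key, value):
    # The client contacts the master and all f witnesses in parallel
    # (modeled here as sequential calls).
    acks = [w.record(key, ("put", key, value)) for w in witnesses]
    result = master.execute(key, value)
    # The write completes in 1 RTT only if the master replied and all
    # f witnesses accepted the record.
    return result, all(acks)
```

A second write to the same key would be rejected by the witnesses, which is exactly the non-commutative case discussed next.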

This 1 RTT completion only works for commutative operations, whose order of execution does not matter. It won't work for non-commutative operations, because the witnesses do not (and cannot, in a consistent way) order the operations they receive from clients: the order of arrival may differ across the witnesses.

Execution of non-commutative operations

Non-commutative operations require 2 or 3 RTTs, and work as follows.

A witness accepts a new operation-record from a client only if the new operation is commutative with all operations currently saved in the witness. If the new request doesn't commute with one of the existing requests, the witness must reject the record RPC, since the witness has no way to order the two non-commutative operations consistently with the execution order in masters. For example, if a witness already accepted "x := 1", it cannot accept "x := 5". Witnesses must be able to decide whether operations are commutative or not just from the operation parameters. For example, in key-value stores, witnesses can exploit the fact that operations on different keys are commutative.
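The witness's accept/reject decision can be sketched as follows, assuming key-value operations whose keys are visible in the request parameters (the dict shape of an op is my own choice, not the paper's wire format):

```python
def commutes(op_a, op_b):
    # Ops on different keys commute; two ops on the same key do not.
    return op_a["key"] != op_b["key"]

class Witness:
    def __init__(self):
        self.saved = []

    def record(self, op):
        # Accept only if the new op commutes with every saved op;
        # otherwise reject, since the witness cannot order them.
        if all(commutes(op, old) for old in self.saved):
            self.saved.append(op)
            return True
        return False
```

With this, recording "x := 1" succeeds, a later "x := 5" is rejected, and an operation on key "y" is still accepted.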

Therefore, for a non-commutative operation, the client will not be able to receive ACKs from all f witnesses. In this case, to ensure the durability of the operation, the client must wait for replication to backups by sending a sync RPC to the master. Upon receiving a sync RPC, the master ensures the operation is replicated to backups before returning to the client. This waiting for sync increases the operation latency to 2 RTTs in most cases, and up to 3 RTTs in the worst case, where the master hasn't started syncing until it receives the sync RPC from the client.
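The fallback path can be sketched like this: if any witness rejects, the client asks the master to sync, and the master replicates to backups before the write is considered complete. This in-process model (backups as plain dicts, my own Master class) is a simplification, not the RAMCloud implementation:

```python
class Master:
    def __init__(self, backups):
        self.state = {}
        self.backups = backups   # each backup modeled as a plain dict
        self.unsynced = []       # ops executed but not yet on backups

    def execute(self, key, value):
        self.state[key] = value
        self.unsynced.append((key, value))
        return value             # speculative reply (1st RTT)

    def sync(self):
        # Replicate every unsynced op to all backups before returning;
        # this is the extra RTT(s) that buys durability.
        for backup in self.backups:
            for key, value in self.unsynced:
                backup[key] = value
        self.unsynced = []

def write(master, witness_acks, key, value):
    result = master.execute(key, value)
    if not all(witness_acks):    # some witness rejected the record
        master.sync()            # wait for backup replication
    return result
```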

Garbage collection

To limit memory usage in witnesses and reduce possible rejections due to commutativity violations, witnesses must discard requests as soon as possible. Witnesses can drop the recorded client requests after masters make their outcomes durable in backups. In CURP, the master sends garbage collection RPCs for the synced updates to its witnesses.
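In sketch form, the witness side of garbage collection just drops records the master reports as synced. The rpc_id bookkeeping here is my own simplification of the paper's mechanism:

```python
class Witness:
    def __init__(self):
        self.saved = {}          # rpc_id -> operation

    def record(self, rpc_id, op):
        self.saved[rpc_id] = op

    def gc(self, synced_ids):
        # Drop records whose outcomes are now durable on the backups.
        for rpc_id in synced_ids:
            self.saved.pop(rpc_id, None)
```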

This is problematic. Quick communication from the master to the witnesses becomes critical for the performance of CURP. If operation records cannot be garbage-collected quickly, performance suffers.

Even if we co-locate witnesses with backups, we still need an extra commit message from the leader for garbage collection. It may be possible to piggyback this garbage-collection message onto another write, but garbage collection being tied to performance is still bad.

Crashes in addition to recovery

CURP recovers from a master's crash in two phases: (1) the new master restores data from backups, and then (2) it restores operations by replaying requests from the witnesses.

For the second part, the new master picks any available witness. After picking the witness to recover from, the new master first asks it to stop accepting more operations; this prevents clients from erroneously completing update operations after recording them in a stale witness whose requests will not be retried anymore. After making the selected witness immutable, the new master retrieves the requests recorded in the witness. Since all requests in a single witness are guaranteed to be commutative, the new master can execute them in any order.
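The witness-replay phase can be sketched as: freeze the chosen witness, then replay its (mutually commutative) saved requests in arbitrary order on top of the state restored from backups. Class and function names are mine, not the paper's:

```python
class Witness:
    def __init__(self):
        self.saved = []
        self.frozen = False

    def record(self, op):
        if self.frozen:
            return False   # recovery in progress: stop accepting records
        self.saved.append(op)
        return True

    def freeze_and_drain(self):
        self.frozen = True
        return list(self.saved)

def recover_master(backup_state, witness):
    state = dict(backup_state)              # phase 1: restore from backups
    for _op, key, value in witness.freeze_and_drain():
        state[key] = value                  # phase 2: replay in any order
    return state
```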

That was for recovering from a master crash. There is added complexity for reconfiguration operations as well. The reconfiguration section discusses recovery of a crashed backup and recovery of a crashed witness, too.

CURP doesn't change the way backup failures are handled, so a system can simply recover a failed backup as it would without CURP.

If a witness crashes or becomes non-responsive, the system configuration manager (the owner of all cluster configurations) decommissions the crashed witness and assigns a new witness for the master; then it notifies the master of the new witness list. When the master receives the notification, it syncs to backups to ensure f-fault tolerance and responds back to the configuration manager that it is now safe to recover from the new witness. After this point, clients can use the f witnesses again to record operations.

The fault-tolerance becomes messy, in my opinion. These paths are treated as exceptions, and are not tested/exercised as part of normal operation, and thus are more prone to errors/bugs than the happy path.

Read operation

It is possible to perform a read from a backup and a witness. If the witness says there is no saved request that is non-commutative with the requested item, the client can read from the backup, and this will be a linearizable read. I am not sure why a race condition is not possible in this case.

If the witness says there is a non-commutative operation with the read, the read is sent to the master to be served.
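The witness-assisted read path can be sketched as: ask a witness whether any saved request conflicts with the key; if not, a backup read is safe, otherwise fall through to the master. Names here are my own, not the paper's:

```python
class Witness:
    def __init__(self):
        self.saved = []

    def record(self, key, op):
        self.saved.append((key, op))

    def conflicts_with(self, key):
        return any(k == key for k, _ in self.saved)

def read(key, witness, backup_state, master_state):
    if witness.conflicts_with(key):
        # A not-yet-synced write may exist: only the master is current.
        return master_state.get(key)
    return backup_state.get(key)  # no conflicting op: backup read is safe
```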

Evaluation

CURP is implemented in RAMCloud and Redis to demonstrate its benefits.

Related work 

This reminds me of the "Lineage Stash: Fault-tolerance off the critical path" paper in SOSP'19. It had a similar idea: perform asynchronous replication, but keep enough of a stash to be able to recover and reconstruct the log if a failure occurs.

TAPIR is related work. I hope we will be reading the TAPIR paper in our Zoom reading group soon.

What does Mad Danny say about the paper?

I asked Mad Danny about his views on the paper. This is what he had to say.

  1. The closest thing to this is MultiPaxos. I would choose MultiPaxos over this, because it is a well-established implementation. MultiPaxos has very good throughput due to pipelining (see 2). Yes, it has 2 RTT latency according to the metric used in the CURP paper. But in reality it has 1 RTT + 1 rtt latency. The all-caps RTT is from the client to the leader and back. The lower-case rtt is from the leader to the followers and back. In practice, the client is often so much farther away from the leader that these RTTs are different.
  2. MultiPaxos has very good throughput through pipelining. The paper shows that CURP has 3.8x the throughput of primary/backup because it cheats a little. It uses a small number of clients which send the operations in a synchronous way: the clients block until the previous operation is acknowledged. When there are enough clients (even when those clients are doing synchronous operations), the system is pushed to full utilization, and MultiPaxos (or primary/backup) will give the same and perhaps even better throughput than CURP, since CURP would have to deal with conflicts and sync. Check Figure 8 in the evaluation to see how quickly the throughput plummets in the presence of conflicts. Optimistic protocols, like EPaxos, all suffer from this problem.
  3. The tail latency is bad in CURP. The client needs to hear from ALL witnesses and the master, so the operation speed is limited by the slowest node's speed. MultiPaxos waits to hear from f out of 2f+1, so the operation is as fast as the fastest half of the nodes, and is immune to stragglers. This point was mentioned and explored in Physalia.
  4. The complexity of fault-tolerance reconfiguration that CURP introduces is not acceptable for a production system. There may be several corner cases and concurrency race conditions in the recovery and reconfiguration paths. The cost of reconfiguration is high, and reconfiguration may be triggered unnecessarily, in a false-positive manner, in deployment.
