Paper Review: Implementing Linearizability at Large Scale and Low Latency (SOSP'15)

Motivation for the paper

Remember the 2-phase commit problem? The 2-phase commit blocks when the initiator crashes. Or, after a server crash, the client/initiator may not be able to determine whether the transaction completed, and the transaction blocks. Because if we retry, we may end up giving an inconsistent result or re-executing the transaction, which messes up linearizability.

You need 3-phase commit to fix this. The new-leader/recovery-agent comes in and tries to recommit things. Unfortunately, 3-phase commit solutions are complicated; there are a lot of corner cases. Lamport and Gray recommended that a Paxos consensus box be used to remember the initiator's abort/commit decision to achieve consistency, or more precisely, they recommended that the Paxos box remember each participant's decision, for the sake of shaving off a message in the critical latency path.

What this paper recommends is an alternative approach. Don't use a Paxos box to remember the decision; instead, use a durable log at the participant to remember the decision. In this log the participant stores a completion record, which includes whatever results are returned to the client. So if the initiator is confused and retries, or if the client retries, or a recovery-agent from one participating server comes and retries, the querying party is not going to get an answer/decision inconsistent with what was committed/returned earlier by the transaction.
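
Here is a minimal sketch of the completion-record idea; the durable_log interface and all names are hypothetical stand-ins for RAMCloud's replicated log, not RIFL's actual API.

```python
# Minimal sketch of completion-record deduplication; durable_log is a
# hypothetical dict-like store standing in for RAMCloud's replicated log.

class Participant:
    def __init__(self, durable_log):
        self.log = durable_log          # survives crashes via replication

    def execute(self, rpc_id, mutation):
        if rpc_id in self.log:
            # Retry of an already-completed RPC (from the client, the
            # initiator, or a recovery agent): return the recorded result
            # instead of re-executing the mutation.
            return self.log[rpc_id]
        result = mutation()
        # In RIFL the mutation and its completion record are made durable
        # atomically, by living in the same log.
        self.log[rpc_id] = result
        return result
```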

How is the log at the participant durable against the crash of the participant? In other words, how do we ensure that the completion record is preserved? This is where the assumption about fast log-based recovery and RAMCloud-specific features comes into play. RAMCloud maintains log-structured replication and quick recovery, which ensures the completion record is not lost.

The paper presents this durable log-based transaction serializability idea with a single participant, i.e., single-object transactions, and then shows that it can be extended to multi-participant transactions.

That was my plot line for motivating the approach in the paper. The paper used what I think is an indirect way to motivate the problem, by first pointing out a problem with linearizability: exactly-once semantics. The figure illustrates that "at-least-once + idempotence != exactly-once". The paper then presents the completion record idea to achieve exactly-once semantics, then builds linearizability on top of it, and in turn builds transactions on top of that.
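
To make the "at-least-once + idempotence != exactly-once" point concrete, here is my reconstruction of the kind of interleaving the figure depicts: a write is idempotent with respect to itself, yet a retry of it can still clobber another client's later write.

```python
# Why at-least-once + idempotence != exactly-once: a retried idempotent
# write can silently undo a later write from another client.

store = {}

def write(key, value):      # idempotent in isolation: repeating it is "safe"
    store[key] = value

write("x", 1)               # client A writes; the ack is lost in the network
write("x", 2)               # client B writes afterwards
write("x", 1)               # client A retries its "lost" write
assert store["x"] == 1      # B's update vanished: not linearizable
```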

Another idea in the paper is "implementing transactions on top of a lightweight linearizability layer". The paper argues that once a lightweight linearizability layer is in place, transactions in fact become easier to implement. We will revisit this idea at the end of the review to see how it holds up.

RIFL architecture

The lightweight linearizability layer the paper suggests is named RIFL (Reusable Infrastructure for Linearizability).
In order to implement exactly-once semantics, RIFL must solve four problems: RPC identification, completion record durability, retry rendezvous, and garbage collection.
1) In order to detect redundant RPCs, each RPC must have a unique identifier, which is present in all invocations of that RPC.
2) RIFL assumes that the underlying system provides durable storage for completion records, keyed with the RPC identifier.
3) If an RPC completes and is then retried at a later time, the retries must find the completion record to avoid re-executing the operation. To achieve this, each operation is associated with a particular object in the underlying system, and the completion record is stored wherever that object is stored.
4) For garbage collection, a completion record should not be reclaimed until it is certain that the corresponding request will never be retried. This can happen in two ways. First, once the client has received a response, it will never retry the request; clients provide acknowledgments to the servers about which requests have completed successfully, and RIFL uses the acknowledgments to delete completion records. Second, if a client crashes without receiving the response, its lease eventually expires, after which the request is guaranteed never to be retried and its completion records can be reclaimed.
RIFL appears to the rest of the system as three modules. The first, RequestTracker, runs on client machines to manage sequence numbers for outstanding RPCs (Figure 3). The second module, LeaseManager, runs on both clients and servers to manage client leases (Figure 4). On clients, LeaseManager creates and renews the client's lease, which also yields a unique identifier for the client. On servers, LeaseManager detects the expiration of client leases. The third module, ResultTracker, runs only on servers: it keeps track of currently executing RPCs and manages the completion records for RPCs that have finished (Figure 5).
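
To make the division of labor concrete, here is a rough sketch of the three modules; the method names are my approximations of the paper's Figures 3-5, not the literal RIFL API.

```python
# Rough sketch of the three RIFL modules; method names approximate the
# paper's Figures 3-5 and are not the actual RIFL API.

class RequestTracker:               # client side
    def __init__(self):
        self.next_seq = 0
        self.outstanding = set()    # sequence numbers of in-flight RPCs
    def new_rpc_id(self):
        self.next_seq += 1
        self.outstanding.add(self.next_seq)
        return self.next_seq
    def rpc_finished(self, seq):
        self.outstanding.discard(seq)
    def first_incomplete(self):
        # piggybacked on requests so servers can garbage-collect
        # completion records below this sequence number
        return min(self.outstanding, default=self.next_seq + 1)

class LeaseManager:                   # clients create/renew leases, which
    def get_lease_id(self): ...       # double as unique client identifiers;
    def is_expired(self, lease): ...  # servers watch for lease expirations

class ResultTracker:                # server side
    def __init__(self):
        self.records = {}           # (client_id, seq) -> completion record
    def check_duplicate(self, rpc_id):
        return self.records.get(rpc_id)   # None means a brand-new RPC
    def record_completion(self, rpc_id, result):
        self.records[rpc_id] = result
```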


Implementing transactions over RIFL

The paper shows how Sinfonia minitransactions can be implemented over the RIFL layer. You can read a summary of Sinfonia minitransactions here. The implementation of Sinfonia transactions over RIFL requires a long description, so I will avoid summarizing it myself and instead point to a couple of paragraphs verbatim from the paper to give you an idea about it.
"No side-effects" is the telephone commutation thought when implementing transactions. The Transaction object defers all updates to the key-value shop until commit is invoked. Commit must atomically verify that each object has the required version number, together with then apply all of the write together with delete operations. If whatever of the version checks fail, the commit aborts together with no updates occur.
Commit is implemented using a two-phase protocol where the client serves as coordinator. In the first phase, the client issues one prepare RPC for each object involved in the transaction (see Figure 6). The server storing the object (called a participant) locks the object and checks its version number. If it doesn't match the desired version, then the participant unlocks the object and returns ABORT; it also returns ABORT if the object was already locked by another transaction. Otherwise the participant stores information about the lock in a transaction lock table and creates a durable record of the lock in its log. It then returns PREPARED to the client. The client issues all of the prepare RPCs concurrently, and it batches requests to the same participant. If all of the prepare RPCs return PREPARED, then the commit will succeed; if any of the prepare RPCs return ABORT, then the transaction will abort. In either case, the client then enters the second phase, where it issues a decision RPC for each object. The participant for each object checks whether the RPC indicates "commit" or "abort". If the decision was to commit, it applies the mutation for that object, if any. Then, whether committing or aborting, it removes the lock table entry and adds a tombstone record to the RAMCloud log to nullify the lock record. The transaction is effectively committed once a durable lock record has been written for each of the objects. At this point, regardless of the client's actions, the mutations will eventually be applied in crash recovery.
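
Here is a sketch of the client-driven commit just described; the RPC plumbing, durability, and RIFL bookkeeping are elided, and the names are mine, not RAMCloud's.

```python
# Sketch of the client-driven two-phase commit described above.
from concurrent.futures import ThreadPoolExecutor

def commit(participants, ops):
    """participants[i] is the server stub storing the object ops[i] mutates."""
    # Phase 1: issue all prepare RPCs concurrently (lock + version check).
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(p.prepare, op)
                   for p, op in zip(participants, ops)]
        votes = [f.result() for f in futures]
    # Commit only if every participant voted PREPARED.
    decision = "commit" if all(v == "PREPARED" for v in votes) else "abort"
    # Phase 2: send the decision RPC to each participant, which applies
    # the mutation (on commit) and nullifies its durable lock record.
    for p in participants:
        p.decide(decision)
    return decision
```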

If the client is suspected to have crashed, the participant for the transaction's first object acts as recovery coordinator. The recovery coordinator executes a two-phase protocol similar to that of the client, except that its goal is to abort the transaction unless it has already committed (in general, there may not be enough information to complete an in-progress transaction, since the client may have crashed before issuing all of the prepare RPCs). In the first phase the recovery coordinator issues a requestAbort RPC for each object, whose participant will agree to abort unless it has already accepted a prepare for the object. If requestAbort returns PREPARED for every object in the transaction, then the transaction has already committed. Otherwise, the recovery coordinator will abort the transaction. In either case, the recovery coordinator then sends decision RPCs for each object. In order to provide enough information for crash recovery, the client includes identifiers for all objects in the transaction as part of each prepare. This information serves two purposes. First, it allows each participant to identify the recovery coordinator (the server for the first object in the list). Second, the object list is needed by the recovery coordinator to identify participants for its requestAbort and decision RPCs. The recovery coordinator may not have received a prepare if the client crashed, so when a participant invokes startRecovery it includes the list of objects that it received in its prepare.
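
And a matching sketch of the recovery coordinator's side; again request_abort and decide are my approximations of the RPCs the paper describes.

```python
# Sketch of the recovery coordinator's side of the protocol.

def recover(object_list, participants):
    """Run by the participant storing the transaction's first object."""
    # Phase 1: each participant agrees to abort unless it has already
    # accepted a prepare for its object.
    votes = [p.request_abort(obj)
             for p, obj in zip(participants, object_list)]
    if all(v == "PREPARED" for v in votes):
        decision = "commit"     # every object prepared: tx already committed
    else:
        decision = "abort"      # safe to abort: some prepare never landed
    # Phase 2: propagate the decision, as in the client-driven path.
    for p in participants:
        p.decide(decision)
    return decision
```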

Transactions over Linearizability vs. Linearizability over Transactions

Implementing transactions over linearizability certainly provided a lot of benefit in terms of simplifying the complex recovery protocol in the original Sinfonia system. By implementing Sinfonia over RIFL, the paper did not need to implement the recovery protocol of Sinfonia. On the other hand, we don't have results from the paper to see if implementing Sinfonia over RIFL helped improve performance. The paper does not include any comparison of performance with base Sinfonia transactions. Or the paper could have at least measured the throughput of base RAMCloud transactions and RAMCloud-with-RIFL transactions to see if there is a throughput increase. That is also sorely missing in the paper.

As far as general transactions and general linearizability are concerned: the RIFL paper doesn't compare with Calvin, a system that made the case for transactions over linearizability. This is a big omission, as the Calvin system set out to do just that, implementing transactions over a linearized log: "By first writing transaction requests to a durable, replicated log, and then using a concurrency control mechanism that emulates a deterministic serial execution of the log's transaction requests, Calvin supports strongly consistent replication and fully ACID distributed transactions, and manages to incur lower inter-partition transaction coordination costs than traditional distributed database systems."

There is also the question of throughput vs. latency for transactions. The Calvin paper suggests that transactions over linearizability improve throughput, but latency suffers. I think the reason is as follows: linearizability-first avoids coordination headaches/contention. So even though it may add to latency, it should help improve throughput overall, because coordination contention is avoided. Things are all pre-decided by the linearizability layer, aka the log.

Evaluation section

The evaluation section shows the practical implications of the lightweight shim linearizability layer and demonstrates that it provides low-overhead transactions even as the number of participants increases.
The RIFL+RAMCloud stack is compared/contrasted with the H-Store main-memory database system using the TPC-C benchmark. The RAMCloud implementation of RIFL exhibits high performance: it adds less than 4% to the 13.5 μs base cost for writes, and simple distributed transactions execute in about 20 μs. RAMCloud transactions outperform H-Store on the TPC-C benchmark, providing at least 10x lower latency and 1.35x–7x as much throughput.
To contextualize these comparison results, keep in mind that H-Store is a totally different beast than RAMCloud. H-Store is a row-store-based main-memory relational database management system for OLTP applications. So H-Store is definitely more heavyweight, as it deals with general transactions and relational databases.

Figure 12 graphs the throughput of RIFL-based transactions. For transactions involving a single participant, a cluster with 10 servers can support about 440k transactions per second. With more participants per transaction, throughput drops roughly in proportion to the number of participants: with five participants in each transaction, the cluster throughput is about 70k transactions per second. Figure 12 also shows that the single-server optimization improves throughput by about 40%.

Related links

The paper: http://dl.acm.org/citation.cfm?id=2815416
Conference video: https://www.youtube.com/watch?v=MkF2wuHaf04

The case for RAMClouds: https://christmasloveday.blogspot.com//search?q=case-for-ramclouds-scalable-high
RAMCloud Reloaded: https://christmasloveday.blogspot.com//search?q=case-for-ramclouds-scalable-high
