Paper Summary: Tango: Distributed Information Structures Over A Shared Log

This newspaper is from the Microsoft Research Silicon Valley (which unfortunately late got closed), as well as it appeared inwards SOSP'13. SOSP'13 provides opened upwardly access, hence here is the pdf for free. The verbalize video is besides on YouTube equally component of this SOSP'13 talks playlist. I intend this newspaper didn't acquire the attending it deserves. It is actually a bully slice of work.

To facilitate construction of highly available metadata services, Tango provides developers amongst the abstraction of a replicated in-memory information construction (such equally a map or a tree) backed yesteryear a shared log.

While ZooKeeper provides developers a fixed information construction (the information tree) for edifice coordination primitives, Tango enables clients to build unlike information structures based on the same single shared log. Tango besides provides transactions across information structures.

The nation of a Tango object exists inwards 2 forms. 1) a history: which is an ordered sequence of updates stored durably inwards the shared log, 2) whatever number of views: which are total or partial copies of the information construction --such equally a tree or a map-- constructed from the log as well as stored inwards RAM on clients (i.e., application servers).

A customer modifies a Tango object yesteryear appending a novel update to the history; it accesses the object yesteryear starting fourth dimension synchronizing its local persuasion amongst the history. Views are soft nation as well as are instantiated, reconstructed, as well as updated on clients yesteryear playing the shared history forward.

In Tango, the shared log provides: consistency, durability, history. Tango besides provides atomicity as well as isolation for transactions across unlike objects yesteryear multiplexing & storing them on a unmarried shared log.

Corfu shared log abstraction

Tango builds on the Corfu shared log abstraction, which employs flash disks to alleviate the concerns nigh the read from the history of the log, piece writes are going on at the caput of the log.

The CORFU interface consists of four calls:

  1. Clients tin give notice append entries to the shared log, obtaining an offset inwards return.
  2. They tin give notice banking corporation check the electrical flow tail of the log. 
  3. They tin give notice read the entry at a detail offset.
  4. Clients tin give notice cut back a detail offset inwards the log for garbage collection.
Corfu organizes a cluster of storage nodes into multiple, disjoint replica sets; for example, a 12-node cluster mightiness consist of four replica sets of size 3. Each private storage node exposes a 64-bit write-once address space, mirrored across the replica set. The cluster besides contains a dedicated sequencer node, which is essentially a networked counter storing the electrical flow tail of the shared log.

To append, a customer contacts the sequencer as well as obtains the adjacent gratuitous offset inwards the global address infinite of the shared log. It as well as then maps this offset to a local offset on 1 of the replica sets using a uncomplicated deterministic mapping (e.g., modulo function) over the membership of the cluster. The customer as well as then completes the append yesteryear straight issuing writes to the storage nodes inwards the replica laid using a client-driven variant of Chain Replication.

The sequencer is exactly an optimization to expose the tail of the log as well as non required for correctness. The Chain Replication variant used to write to the storage nodes guarantees that a unmarried customer volition "win" if multiple clients endeavour to write to the same offset. When the sequencer goes down, whatever customer tin give notice easily recover this nation using the ho-hum banking corporation check functioning on the shared log.

The Tango architecture

There are three components to a Tango object. 1) A Tango object contains the view, which is an in-memory representation of the object inwards some form, such equally a listing or a map. E.g., for TangoRegister this nation is a unmarried integer. 2) Each object implements the mandatory apply upcall which changes the persuasion when the Tango runtime calls it amongst novel entries from the log. By customizing the apply implementation, 1 customer tin give notice build a "tree view" piece some other builds a "set view" reading from the same log. 3) Each object exposes an external interface of object-specific mutator as well as accessor methods; e.g., the TangoRegister exposes read/write methods.
The object's mutators do non straight alter the in-memory nation of the object. Instead, each mutator combines its parameters into an opaque buffer --an update record-- as well as calls the update helper function of the Tango runtime, which appends it to the shared log.

Similarly, the accessors do non directly read the object's state. Each accessor starting fourth dimension calls the query helper before returning an arbitrary component over the nation of the object. The query helper plays novel update records inwards the shared log until its electrical flow tail as well as applies them to the object via the apply upcall earlier returning.

Storing multiple objects on a unmarried shared log enables strongly consistent operations across them without requiring complex distributed protocols.  The Tango runtime on each customer tin give notice multiplex the log across objects yesteryear storing as well as checking a unique object ID (OID) on each entry. Such a scheme has the drawback that every customer has to play every entry inwards the shared log, but layered partitioning, equally nosotros shall hash out soon, solves this problem. It enables strongly consistent operations across objects without requiring each object to live hosted yesteryear each client, as well as without requiring each customer to eat the entire shared log.

Transactions 

Tango implements optimistic concurrency control by appending speculative transaction commit records to the shared log.  Commit records ensure atomicity, since they determine a indicate inwards the persistent total ordering at which the changes that hap inwards a transaction tin give notice live made visible at all clients. To furnish isolation, each commit tape contains a read set: a listing of objects read yesteryear the transaction along amongst their versions, where the version is simply the finally offset inwards the shared log that modified the object. A transaction solely succeeds if none of its reads are stale when the commit tape is encountered (i.e., the objects cause got non changed since they were read).

To announce a transaction, calls to object accessors as well as mutators tin give notice live bracketed yesteryear BeginTX as well as EndTX calls. BeginTX creates a transaction context inwards thread-local storage. EndTX appends a commit tape to the shared log, plays the log forrard until the commit point, as well as and then makes a commit/abort decision.

Each customer that encounters the commit tape decides --independently but deterministically-- whether it should commit or abort yesteryear comparison the versions inwards the readset amongst the electrical flow versions of the objects. If none of the read objects cause got changed since they were read, the transaction commits as well as the objects inwards the write laid are updated amongst the apply upcall.

For read-only transactions, the EndTX telephone telephone does non insert a commit tape into the shared log; instead, it exactly plays the log forrard until its electrical flow tail earlier making the commit/abort decision. Tango besides supports fast read-only transactions from stale snapshots yesteryear having EndTX brand the commit/abort determination locally, without interacting amongst the log.

Write-only transactions require an append on the shared log but tin give notice commit directly without playing the log forward.

Layered partitions

Each customer hosts a (possibly overlapping) division of the global nation of the system, but this partitioning scheme is layered over a unmarried shared log.  To efficiently implement layered partitions without requiring each customer to play the entire shared log, Tango maps each object to a flow over the shared log.

A flow augments the conventional shared log interface (append as well as random read) amongst a streaming readnext call.  Many streams tin give notice co-exist on a unmarried shared log; calling readnext on a flow returns the adjacent entry belonging to that flow inwards the shared log, skipping over entries belonging to other streams. With this interface, clients tin give notice selectively eat the shared log yesteryear playing the streams of involvement to them (i.e., the streams of objects hosted yesteryear them).

Each customer plays the streams belonging to the objects inwards its layered partition. But, streams are non necessarily disjoint; a multiappend call allows a physical entry inwards the log to belong to multiple streams. When transactions cross object boundaries, Tango changes the conduct of its EndTX telephone telephone to multiappend the commit tape to all the streams involved inwards the write set. Multiappend ensures the following. A transaction that affects multiple objects occupies a unmarried seat inwards the global ordering; inwards other words, in that place is solely 1 commit tape per transaction inwards the raw shared log. A customer hosting an object sees every transaction that impacts the object, fifty-fifty if it hosts no other objects.

Tango transactions has the next limitation though. Remote reads at the generating customer is disallowed inwards a transaction: a customer cannot execute transactions as well as generate commit records involving remote reads. Calling an accessor on an object that does non cause got a local persuasion is problematic, since the information does non be locally; possible solutions yesteryear invoking an RPC to a unlike customer amongst a persuasion of the object is expensive as well as complicated. So, if a customer wants to do a transaction amongst reads on an object, the customer should subscribe to the flow of that object.

Streaming Corfu

When the client-side library starts up, the application provides it amongst the listing of flow IDs of involvement to it. For each such stream, the library finds the finally entry inwards the shared log belonging to that flow yesteryear asking the sequencer. The K backpointers inwards this entry allow it to build a K-sized suffix of the linked listing of offsets comprising the stream. It as well as then issues a read to the offset pointed at yesteryear the Kth backpointer to obtain the previous K offsets inwards the linked list. In this manner, the library tin give notice build the linked listing yesteryear striding backward on the log, issuing N/K reads to build the listing for a flow amongst northward entries.

Evaluation

The experimental testbed consists of 36 8-core machines inwards 2 racks, amongst gigabit NICs on each node as well as xx Gbps betwixt the top-of-rack switches.  In all the experiments, they run an 18-node Corfu deployment on these nodes inwards a 9-by-2 configuration (i.e., nine sets of 2 replicas each), such that each entry is mirrored across racks. The other eighteen nodes are used equally clients. The Corfu sequencer runs on a powerful, 32-core automobile inwards a dissever rack. They utilization 4KB entries inwards the Corfu log, amongst a batch size of four at each client.
Figure shows unmarried object serializability. Reads await the apply upcalls from the stream. If no writes, the reads are of picayune cost. As to a greater extent than writes occur, reads convey to a greater extent than fourth dimension to grab up. Probably reads may convey to a greater extent than fourth dimension than writes inwards Tango, but this is non shown inwards the graphs.
 
Figure shows performance for a primary/backup scenario where 2 nodes host views of the same object, amongst all writes directed to 1 node as well as all reads to the other. Overall throughput falls sharply equally writes are introduced, as well as and then stays constant at some 40K ops/sec equally the workload mix changes; however, average read latency goes upwardly equally writes dominate, reflecting the extra piece of occupation the read-only 'backup' node has to do to catchup amongst the primary.
Figure shows elasticity of linearizable read throughput amongst multiple views.

Figure shows transactions over layered partitions.

Tango vs. ZooKeeper.
Using Tango, the authors build ZooKeeper (TangoZK, 1K lines), BookKeeper (TangoBK, 300 lines), TreeSets as well as HashMaps (100 to 300 lines each). The performance of the resulting implementation is real similar to the TangoMap numbers inwards Figure 10; for example, amongst eighteen clients running independent namespaces, they obtain some 200K txes/sec if transactions do non bridge namespaces, as well as nearly 20K txes/sec for transactions that atomically motion a file from 1 namespace to another. The capability to motion files across unlike instances does non be inwards ZooKeeper, which supports a limited cast of transaction inside a unmarried event (i.e., a multi-op telephone telephone that atomically executes a batch of operations).

They besides implemented the single-writer ledger abstraction of BookKeeper inwards some 300 lines of Java code (again, non counting Exceptions as well as callback interfaces). To verify that their ZooKeeper as well as BookKeeper were full-fledged implementations, they ran the HDFS namenode over them (modifying it solely to instantiate our classes instead of the originals) as well as successfully demonstrated recovery from a namenode reboot equally good equally fail-over to a backup namenode.

Discussion

Tango fits inside the State Machine Replication (SMR) paradigm, replicating nation yesteryear imposing a total ordering over all updates. In the vocabulary of SMR, Tango clients tin give notice live seen equally learners of the total ordering. The storage nodes comprising the shared log play the role of acceptors.

The findings inwards the Tango newspaper that a centralized server tin give notice live made to run at real high RPC rates matches recent observations yesteryear others. The Percolator arrangement runs a centralized timestamp oracle amongst similar functionality at over 2M requests/sec amongst batching. Vasudevan et al. (SOCC'12) study achieving 1.6M submillisecond 4-byte reads/sec on a unmarried server amongst batching. Masstree is a key-value server that provides 6M queries/sec amongst batching.

Tango's biggest contribution is that it provides multiple consistent object views from the same log. Objects amongst unlike in-memory information structures tin give notice portion the same information on the log. For example, a namespace tin give notice live represented yesteryear unlike trees, 1 ordered on the filename as well as the other on a directory hierarchy, allowing applications to perform 2 types of queries efficiently (i.e., "list all files starting amongst the missive of the alphabet B" vs. "list all files inwards this directory"). Strongly consistent reads tin give notice live scaled simply yesteryear instantiating to a greater extent than views of the object on novel clients. But is this free? Is this fast?

Tango's soft-belly is that it uses a pull-based approach of constructing the persuasion from the shared log. Wouldn't a push-based approach live to a greater extent than timely? When a read comes, the pull-based approach may cause got a lot of catching upwardly to do to the electrical flow nation earlier it returns an answer. I approximate it may live possible to copy this amongst periodic pulls, fifty-fifty when no accessor component is invoked.

Tango provides a weird combination of centralized as well as decentralized. The log is centralized as well as this is exploited to furnish serialization of distributed transactions. On the other hand, non having a original node as well as using the clients equally learners is a real decentralized approach. Instead of 1 original taking decisions as well as updating the information structure, all of the clients are playing the log as well as taking decisions (in a deterministic agency ensuring that they all brand the same decisions), as well as updating their information structures. This resembles Lamport's extremely decentralized (to a fault!) implementation of the usual exclusion which maintains replicated queues of all requests at all processes. (Of course, you lot tin give notice e'er code 1 customer equally original learner/decision-maker for other clients, as well as circumvent this!)

Tango vs. ZooKeeper.
Tango provides a better/higher-level programming back upwardly than ZooKeeper. What the Tango newspaper calls equally Tango clients are servers that furnish services for application-clients. (You may fifty-fifty say a Tango-client roughly corresponds to a "customized-view" ZooKeeper observer.) So, inwards damage of programmability as well as expressivity, Tango has the upper-hand. I presume using ZooKeeper for large-scale applications may pop off intractable as well as may effect inwards spaghetti-code since ZooKeeper provides a real minimalistic/low-level-primitives for coordination. Tango, on the other hand, lets the developer build higher grade abstractions of their ain coordination services at the Tango-clients, as well as this benefits managing large projects piece keeping complexity on a leash.

Comparing the efficiency of Tango as well as ZooKeeper, it seems similar ZooKeeper would live better. In Tango, in that place are span of indirections that are non nowadays inwards ZooKeeper. In Tango, in that place is an extra mensuration for sequencer node to acquire ticket/offset number. The Tango replication tin give notice correspond to ZooKeeper/Zab replication hence they equal out there. But, Tango has some other layer of indirection, where the clients demand to read as well as larn from the log. In ZooKeeper, since the leader is besides the determination maker, the app-client's learning tin give notice live from relatively compact state, whereas inwards Tango, this volition live through replaying a sequence of commands as well as yesteryear constructing the nation itself. Again, since Tango-client is similar the ZooKeeper observer, that is some other grade of indirection earlier going to the app-client inwards Tango. So inwards total, 2 extra-levels are nowadays inwards Tango (the sequencer contacting, as well as the Tango-client learning) that are non nowadays inwards ZooKeeper. Tango provides ameliorate programmability as well as expressivity but this comes amongst a trade-off at the performance.

If your application is uncomplicated (and volition remain simple), as well as tin give notice live implemented using ZooKeeper inwards a straightforward manner, it would live best to utilization ZooKeeper. Otherwise, yesteryear using Tango, you lot tin give notice cause got a better/extendible/tractable code-base, as well as potentially write some of your services equally Tango-client that tin give notice fifty-fifty improve the performance.

Final remarks

Tango code is non opened upwardly source. That is actually unfortunate, equally it could furnish a practiced option to ZooKeeper for some applications that require coordination as well as transactions across distributed clients.

Since the sequencer is centralized Tango is non suitable for WAN deployments.

Some questions soundless remain. The flow sharing assignments seems to live done statically using the layered flow abstraction API. Can nosotros do this on demand as well as dynamically?

How is the layered flow abstraction implemented at CORFU grade over the replica groups? Would it pay to dedicate 1 grouping for 1 pop stream? This would brand mass reading possible from that replica set. (Similar to the columnar storage idea.)

0 Response to "Paper Summary: Tango: Distributed Information Structures Over A Shared Log"

Post a Comment

Iklan Atas Artikel

Iklan Tengah Artikel 1

Iklan Tengah Artikel 2

Iklan Bawah Artikel