SOSP19 Lineage Stash: Fault Tolerance Off the Critical Path

This paper is by Stephanie Wang (UC Berkeley), John Liagouris (ETH Zurich), Robert Nishihara (UC Berkeley), Philipp Moritz (UC Berkeley), Ujval Misra (UC Berkeley), Alexey Tumanov (UC Berkeley), Ion Stoica (UC Berkeley).

I really liked this paper. It has a simple idea, which has a good chance of getting adopted by real-world systems. The presentation was very well done and very informative. You can watch the presentation video here.

Low-latency processing is very important for data processing, stream processing, graph processing, and control systems. Recovering after failures is also important for them, because for systems composed of 100s of nodes, node failures are part of daily operation.

It seems like there is a tradeoff between low latency and recovery time. The existing recovery methods have either low runtime overhead or low recovery overhead, but not both.
  • The global checkpoint approach to recovery achieves low runtime overhead, because a checkpoint/snapshot can be taken asynchronously and off the critical path of the execution. On the other hand, the checkpoint approach has high recovery overhead, because the entire system needs to be rolled back to the checkpoint and then start from there again.
  • The logging approach to recovery has high runtime overhead, because it synchronously records/logs all the information about any nondeterministic execution after the last checkpoint. On the flip side of the coin, it can achieve low recovery overhead, because only the failed processes need to be rolled back a little and resume from there.


Can we have a recovery approach that achieves both low runtime overhead and low recovery overhead? The paper proposes the "lineage stash" idea to achieve that. The idea behind lineage stash is simple.

The first part of the idea is to reduce the amount of data logged by only logging the lineage. Lineage stash logs the pointers to the data messages instead of the data, and also logs task descriptions in case that data needs to be recreated by the previous operation. Lineage stash also logs the order of execution.
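To make this concrete for myself, here is a minimal sketch of what one lineage entry could hold. The field names (task_id, function_name, input_refs, sequence_no) are my own assumptions for illustration and are not taken from the paper. The point is only that the entry carries pointers, a task description, and an execution order, and never the message payload itself.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class LineageRecord:
    """One lineage entry (hypothetical layout, not the paper's format)."""
    task_id: str                 # assumed identifier for the task
    function_name: str           # task description: what to re-run on recovery
    input_refs: Tuple[str, ...]  # pointers to input messages, not the payloads
    sequence_no: int             # order of execution on the owning process


def replay_order(records: List[LineageRecord]) -> List[LineageRecord]:
    """Return records in the order the tasks originally executed."""
    return sorted(records, key=lambda r: r.sequence_no)


# The message body can be large, but only its reference is logged.
payload = "x" * 1_000_000
first = LineageRecord("t0", "filter_words", ("msg-16",), 0)
second = LineageRecord("t1", "filter_words", ("msg-17",), 1)
ordered = replay_order([second, first])
```

The log entry stays small regardless of how big the message is, which is what makes it cheap enough to forward along with messages.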


The second part of the idea is to do this lineage logging asynchronously, off the critical path of execution. The operators/processes now include a local volatile cache for lineage, which is asynchronously flushed to the underlying remote global lineage storage. The global lineage store is a sharded and replicated key-value datastore.
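Here is a rough sketch of how I picture the asynchronous flush. It assumes a plain dict stands in for the remote key-value store and a background thread stands in for the flusher. The LocalStash class and its methods are hypothetical names of mine, not the paper's API.

```python
import queue
import threading


class LocalStash:
    """Volatile local lineage cache with a background flusher (illustrative only)."""

    def __init__(self, global_store):
        self.global_store = global_store  # dict standing in for the remote store
        self.uncommitted = {}             # lineage not yet written remotely
        self._lock = threading.Lock()
        self._pending = queue.Queue()
        threading.Thread(target=self._flush_loop, daemon=True).start()

    def log(self, task_id, record):
        """Called on the critical path: a local write plus an enqueue, no remote call."""
        with self._lock:
            self.uncommitted[task_id] = record
        self._pending.put(task_id)

    def _flush_loop(self):
        """Runs off the critical path and drains pending records to the store."""
        while True:
            task_id = self._pending.get()
            with self._lock:
                record = self.uncommitted.get(task_id)
            if record is not None:
                self.global_store[task_id] = record  # the slow "remote" write
                with self._lock:
                    self.uncommitted.pop(task_id, None)
            self._pending.task_done()

    def wait_flushed(self):
        """Block until everything logged so far has reached the store."""
        self._pending.join()


store = {}
stash = LocalStash(store)
stash.log("t0", {"task": "filter_words", "inputs": ["msg-16"]})
stash.wait_flushed()  # only used here to make the example deterministic
```

In this sketch log() never waits on the store, so the task can proceed immediately. The price is a window during which the record lives only in volatile memory.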


But then the question becomes, is this still fault tolerant? If we are doing the logging to the global lineage store asynchronously, what if the process crashes before sending the message, and we lose the log information?

The final part of the idea is to use a causal logging approach, piggybacking the uncommitted lineage information onto messages to the other processes/operations so that they store it in their stashes as well. So this kind of resembles a tiny decentralized blockchain stored in the stashes of interacting processes/operations.
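The piggybacking rule can be sketched in a few lines. This is my own simplification: the Process class, the message layout, and the on_committed callback are all assumed names, and real systems would bound how much lineage gets forwarded.

```python
class Process:
    """A process that forwards uncommitted lineage with every message (sketch)."""

    def __init__(self, name):
        self.name = name
        self.stash = {}      # task_id -> lineage not yet known to be durable
        self._next_seq = 0

    def run_task(self, description):
        """Execute a task and record its lineage locally."""
        task_id = f"{self.name}-{self._next_seq}"
        self.stash[task_id] = {
            "owner": self.name,
            "seq": self._next_seq,
            "task": description,
        }
        self._next_seq += 1
        return task_id

    def send(self, receiver, payload):
        """Attach a copy of all uncommitted lineage to the outgoing message."""
        message = {"payload": payload, "lineage": dict(self.stash)}
        receiver.receive(message)

    def receive(self, message):
        """Keep the sender's uncommitted lineage alongside our own."""
        self.stash.update(message["lineage"])

    def on_committed(self, task_ids):
        """Drop entries once the global store confirms they are durable."""
        for task_id in task_ids:
            self.stash.pop(task_id, None)


filter_proc = Process("filter")
counter_proc = Process("counter")
tid = filter_proc.run_task("filter_words(batch-0)")
filter_proc.send(counter_proc, payload=["lineage", "stash"])
```

After the send, the counter holds a copy of the filter's lineage, so a crash of the filter alone does not lose it.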


In the figure, the filter process had executed some tasks and then passed messages to the counter process. Since the logging is off the critical path, the lineage for these tasks was not yet replicated to the global lineage storage. But as part of the rule, the lineage was piggybacked onto the messages sent to the counter, so the counter also had a copy of the lineage in its stash when the filter process crashed. Then in the recovery, the counter process helps by flushing this uncommitted lineage to the global lineage storage for persistence. The recovering filter process can then retrieve and replay this lineage to achieve a correct and quick (on the order of milliseconds) recovery.
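The recovery steps in this scenario can be sketched as follows. I am assuming dict-based stashes and a dict-based global store, and the recover function is a hypothetical helper of mine. The actual protocol in the paper handles many more cases than this.

```python
def recover(failed, global_store, peer_stashes):
    """Rebuild the replay plan for a failed process (illustrative sketch)."""
    # Step 1: surviving peers flush any lineage they hold for the failed process.
    for stash in peer_stashes:
        for task_id, record in stash.items():
            if record["owner"] == failed:
                global_store.setdefault(task_id, record)
    # Step 2: the restarted process fetches its lineage in execution order.
    mine = [r for r in global_store.values() if r["owner"] == failed]
    return [r["task"] for r in sorted(mine, key=lambda r: r["seq"])]


# The filter's first task was already flushed before the crash.
global_store = {
    "filter-0": {"owner": "filter", "seq": 0, "task": "filter_words(batch-0)"},
}
# Its second task only survives in the counter's stash, thanks to piggybacking.
counter_stash = {
    "filter-1": {"owner": "filter", "seq": 1, "task": "filter_words(batch-1)"},
    "counter-0": {"owner": "counter", "seq": 0, "task": "count(batch-0)"},
}
replay_plan = recover("filter", global_store, [counter_stash])
```

Only the failed process replays anything here. The counter just contributes the lineage it happened to be holding.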

The lineage stash idea was implemented and evaluated in Apache Flink for a stream processing word count application over 32 nodes. It was compared against the default global checkpoint recovery, and the default augmented with synchronous logging.


As the figure above shows, by using the asynchronous logging approach, lineage stash is able to avoid the runtime latency overhead of synchronous logging and matches that of the asynchronous checkpointing approach. Moreover, as the figure below shows, the recovery latency of checkpointing is very high. The lineage stash approach reaches a recovery latency similar to that of the synchronous logging approach.


The lineage stash looks very promising for providing lightweight (off the critical path) fault tolerance for fine-grained data processing systems. I really like the simplicity of the idea. I feel like I have seen a related idea somewhere else as well. But I can't quite remember it.
