Paper review. An Empirical Study on Crash Recovery Bugs in Large-Scale Distributed Systems

Crashes happen. In fact, they occur so commonly that they are classified as anticipated faults. In a large cluster of several hundred machines, you will have one node crashing every couple of hours. Unfortunately, as this paper shows, we are still not very competent at handling crash failures.

This paper from 2018 presents a comprehensive empirical study of 103 crash recovery bugs from 4 popular open-source distributed systems: ZooKeeper, Hadoop MapReduce, Cassandra, and HBase. For all the studied bugs, the authors analyze their root causes, triggering conditions, bug impacts, and fixing.

Summary of the findings

Crash recovery bugs are caused by 5 types of bug patterns:
  • incorrect backup (17%)
  • incorrect crash/reboot detection (18%)
  • incorrect state identification (16%)
  • incorrect state recovery (28%)
  • concurrency (21%)

Almost all (97%) of crash recovery bugs involve no more than 4 nodes. This finding indicates that we can find crash recovery bugs in a small set of nodes, rather than thousands.

A majority (87%) of crash recovery bugs require a combination of no more than 3 crashes and no more than 1 reboot. This suggests that we can systematically test almost all node crash scenarios with very limited crashes and reboots.

Crash recovery bugs are difficult to fix. 12% of the fixes are incomplete, and 6% of the fixes only reduce the possibility of bug occurrence. This indicates that new approaches to validate crash recovery bug fixes are necessary.

A generic crash recovery model 

The study uses this model for categorizing the parts of crash recovery in a system.


The study leverages the existing cloud bug study database, CBS. CBS contains 3,655 vital issues from six distributed systems (ZooKeeper, Hadoop MapReduce, Cassandra, HBase, HDFS, and Flume), reported from January 2011 to January 2014. The dataset of the 103 crash recovery bugs this paper analyzes is available at http://www.tcse.cn/~wsdou/project/CREB.


Root cause


Finding 1: In 17/103 crash recovery bugs, in-memory data is not backed up, or backups are not properly managed.

Finding 2: In 18/103 crash recovery bugs, crashes and reboots are not detected, or not detected in a timely manner. (E.g., when a node crashes, other relevant nodes may access the crashed node without perceiving that it has crashed, and then hang or throw errors. Or, if a crashed node reboots very quickly, the crash may be overlooked by a timeout-based crash detection component. The crashed node may then contain corrupted state, and the system ends up with nodes whose state is out of sync, violating invariants.)
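To make the quick-reboot pitfall concrete, here is a minimal sketch of my own (not code from the paper or from the studied systems; names like CrashDetector and incarnation are hypothetical): attaching an incarnation/epoch number to heartbeats lets peers notice a crash-and-reboot even when no heartbeat timeout ever fires.

```python
import time

class PeerView:
    """What we remember about a remote node (hypothetical bookkeeping)."""
    def __init__(self):
        self.incarnation = -1                  # last incarnation number seen
        self.last_heartbeat = time.monotonic()

class CrashDetector:
    TIMEOUT = 5.0   # seconds without a heartbeat before declaring a crash

    def __init__(self):
        self.peers = {}   # node_id -> PeerView

    def on_heartbeat(self, node_id, incarnation):
        peer = self.peers.setdefault(node_id, PeerView())
        if peer.incarnation != -1 and incarnation > peer.incarnation:
            # The node crashed and rebooted between two heartbeats: no timeout
            # fired, but its volatile state is gone, so recovery must still run.
            self.on_crash_detected(node_id)
        peer.incarnation = incarnation
        peer.last_heartbeat = time.monotonic()

    def check_timeouts(self):
        now = time.monotonic()
        for node_id, peer in self.peers.items():
            if now - peer.last_heartbeat > self.TIMEOUT:
                self.on_crash_detected(node_id)

    def on_crash_detected(self, node_id):
        print(f"node {node_id} considered crashed; triggering recovery")

# Usage: a fast reboot is still noticed because the incarnation number changed.
d = CrashDetector()
d.on_heartbeat("n1", incarnation=1)
d.on_heartbeat("n1", incarnation=2)   # prints the crash notification
```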

Finding 3: In 17/103 crash recovery bugs, the states after crashes/reboots are incorrectly identified. (E.g., the recovery process mistakenly considers wrong states as correct: a node may think it is still the leader after recovery.)

Finding 4: The states after crashes/reboots are incorrectly recovered in 29/103 crash recovery bugs. Among them, 14 bugs are caused by no handling or untimely handling of certain leftovers.

Finding 5: The concurrency caused by crash recovery processes is responsible for 22/103 crash recovery bugs. (Concurrency between a recovery process and a normal execution accounts for 13 bugs, concurrency between two recovery processes accounts for 4 bugs, and concurrency within one recovery process accounts for 5 bugs.)

Finding 6: All 7 recovery components in the crash recovery model can be wrong and introduce bugs. About one third of the bugs are caused in the crash handling component. (Overall, the crash handling component is the most error-prone (34%). The next top 3 components, backing up (19%), local recovery (17%), and crash detection (13%), also occupy a large portion.)

Finding 7: In 15/103 crash recovery bugs, new relevant crashes occur during the crash recovery process, and thus trigger failures.


So it seems that for 85 out of 103 bugs (excluding the 17 bugs from Finding 1) state inconsistency across nodes is the culprit. I don't want to be the guy who always plugs "invariant-based design" but, come on... Invariant-based design is what we need here to prevent the problems that arose from operational reasoning. In operational reasoning, you start with a "happy path", and then you try to figure out "what could go wrong?" and how to prevent it. Of course, you always fall short in that enumeration of problem scenarios and overlook corner cases, race conditions, and cascading failures. In contrast, invariant-based reasoning focuses on "what needs to go right?" and how to ensure these properties as invariants of your system at all times. Invariant-based reasoning takes a principled state-based, rather than operation/execution-based, view of your system.

To achieve invariant-based reasoning, we specify safety and liveness properties for our models. Safety properties specify "what the system is allowed to do" and ensure that "nothing bad happens". For example: at all times, all committed data is present and correct. Liveness properties specify "what the system should eventually do" and ensure that "something good eventually happens". For example: whenever the system receives a request, it must eventually respond to that request.
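To make the contrast concrete, here is a toy sketch of my own (not from the paper): a ReplicatedRegister with a deliberately weak write path, plus a safety invariant ("every acknowledged write survives on a majority of replicas") that is re-checked after every transition, including crashes. The invariant check flags the bad state directly, instead of relying on us to have enumerated the crash scenario in advance.

```python
class ReplicatedRegister:
    def __init__(self, replicas=3):
        self.acked = {}                                   # client-visible commits
        self.stores = [dict() for _ in range(replicas)]   # per-replica state

    def write(self, key, value):
        # Deliberately weak: the write is acknowledged after reaching a bare
        # majority, with no re-replication planned for later crashes.
        majority = len(self.stores) // 2 + 1
        for store in self.stores[:majority]:
            store[key] = value
        self.acked[key] = value
        self.check_invariant()

    def crash(self, replica):
        self.stores[replica] = {}     # a crash wipes in-memory state (Finding 1)
        self.check_invariant()

    def check_invariant(self):
        # Safety invariant: every acknowledged write is present on a majority.
        majority = len(self.stores) // 2 + 1
        for key, value in self.acked.items():
            copies = sum(1 for s in self.stores if s.get(key) == value)
            assert copies >= majority, f"committed {key!r} lost after crash"

r = ReplicatedRegister()
r.write("x", 1)
try:
    r.crash(0)                        # the invariant check exposes the bug here
except AssertionError as e:
    print("safety violated:", e)
```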

Here is another argument for using invariant-based design, and as an example, you can check my post on the two-phase commit protocol.

Triggering conditions

Finding 8: Almost all (97%) crash recovery bugs involve 4 nodes or fewer.

Finding 9: No more than 3 crashes can trigger almost all (99%) crash recovery bugs. No more than 1 reboot can trigger 87% of the bugs. In total, a combination of no more than 3 crashes and no more than 1 reboot can trigger 87% (90 out of 103) of the bugs.

Finding 10: 63% of crash recovery bugs require at least one client request, but 92% of the bugs require no more than 3 user requests.

Finding 11: 38% of crash recovery bugs require complicated input conditions, e.g., special configurations or background services.

Finding 12: The timing of crashes/reboots is important for reproducing crash recovery bugs.

Bug impact

Finding 13: Crash recovery bugs always have severe impacts on the reliability and availability of distributed systems. 38% of the bugs can cause node downtime, including cluster out of service and unavailable nodes.

Finding 14: Crash recovery bugs are difficult to fix. 12% of the fixes are incomplete, and 6% of the fixes only reduce the possibility of bug occurrence.

Finding 15: Crash recovery bug fixing is complicated. Considerable developer effort was spent on these fixes.


MAD questions

1. Why is this not a solved problem?
Crash faults have been with us for a long time. They have become even more relevant with the advent of containers in cloud computing, which may be shut down or migrated for resource management purposes. If so, why are we still not very competent at handling crash recovery?

It even seems like we have some tool support as well. There are plenty of write-ahead-logging implementations available to deal with the bugs in Finding 1 related to backing up in-memory data. (Well, that is assuming you have a good grasp of which data is important to back up via write-ahead-logging.) We can use ZooKeeper for keeping configuration information, so that the nodes involved don't have differing opinions about which node is down. While keeping the configuration in ZooKeeper helps alleviate some of the state-inconsistency problems, we still need invariant-based design (and model-checking the protocol) to make sure the protocol does not suffer from state-inconsistency problems.
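As an illustration of the kind of tool support I mean, here is a minimal write-ahead-log sketch (my own toy code, not from any of the studied systems; the file name kv.wal and the JSON record format are arbitrary choices): every update is made durable in the log before it touches in-memory state, and a reboot replays the log to rebuild that state.

```python
import json, os

class WalKV:
    def __init__(self, path="kv.wal"):
        self.path = path
        self.mem = {}
        self._replay()                       # recover in-memory state on reboot
        self.log = open(self.path, "a")

    def _replay(self):
        if not os.path.exists(self.path):
            return
        with open(self.path) as f:
            for line in f:
                try:
                    rec = json.loads(line)
                except ValueError:
                    break                    # torn last record from a crash
                self.mem[rec["k"]] = rec["v"]

    def put(self, key, value):
        rec = json.dumps({"k": key, "v": value})
        self.log.write(rec + "\n")
        self.log.flush()
        os.fsync(self.log.fileno())          # durable before acknowledging
        self.mem[key] = value                # only now update in-memory state

    def get(self, key):
        return self.mem.get(key)

db = WalKV()
db.put("leader", "node-2")
assert db.get("leader") == "node-2"          # survives a crash via _replay()
```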

This is a sincere question, and not a covert way to suggest that developers are being sloppy. I know better than that and I have a lot of respect for the developers. That means the problem is more hairy at the implementation level, and our high-level abstractions are leaking. There has been relevant work, "On the complexity of crafting crash-consistent applications", in OSDI'14. There is also recent follow-up work on "Protocol-Aware Recovery for Consensus-Based Distributed Storage" which I would like to read soon.

Finally, John Ousterhout did work on write-ahead-logging and recovery in in-memory systems as part of the RAMCloud project, and I should check recent work from that group.


2. How does this relate to crash-only software?
It is unfortunate that the crash-only software paper has not been cited and discussed by this paper, because I think crash-only software suggested a good, albeit radical, way to handle crash recovery. As the findings in this paper show, a large part of the reason the bugs occur is that the crash-recovery paths are not exercised/tested enough during development and even normal use. The "crash-only software" work had the insight to make crashes part of normal use: "Since crashes are unavoidable, software must be at least as well prepared for a crash as it is for a clean shutdown. But then --in the spirit of Occam's Razor-- if software is crash-safe, why support additional, non-crash mechanisms for shutting down? A crash-only system makes it affordable to transform every detected failure into component-level crashes; this leads to a simple fault model, and components only need to know how to recover from one type of failure."
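As a small sketch of that philosophy (my own toy example, not code from either paper): the component below has no clean-shutdown logic at all. Stopping it is just a crash, and every start runs the same recovery path, so that path gets exercised constantly rather than only after rare faults.

```python
import os, signal, tempfile

class CrashOnlyWorker:
    def __init__(self, state_dir):
        self.state_dir = state_dir
        self.state = self.recover()           # the ONLY initialization path

    def recover(self):
        # Rebuild state from durable storage; runs on every start, crash or not.
        path = os.path.join(self.state_dir, "state")
        if os.path.exists(path):
            with open(path) as f:
                return int(f.read() or 0)
        return 0

    def update(self, delta):
        self.state += delta
        tmp = os.path.join(self.state_dir, "state.tmp")
        with open(tmp, "w") as f:             # atomic write: tmp file + rename
            f.write(str(self.state))
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, os.path.join(self.state_dir, "state"))

    def stop(self):
        # There is no graceful shutdown: crashing *is* the shutdown protocol.
        os.kill(os.getpid(), signal.SIGKILL)

# "Rebooting" just constructs the worker again; recovery runs each time.
d = tempfile.mkdtemp()
w = CrashOnlyWorker(d)
w.update(5)
w2 = CrashOnlyWorker(d)
assert w2.state == 5
```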


3. How does this compare with the TaxDC paper?
The TaxDC paper studied distributed coordination (DC) bugs in the cloud and showed that more than 60% of DC bugs are triggered by a single untimely message delivery that commits order violation or atomicity violation with respect to other messages or computation. (Figure 1 shows possible triggering patterns.) While this claim sounds like a very striking and surprising finding, it is actually straightforward. What is a DC bug? It is a manifestation of state inconsistency across processes. What turns the state inconsistency into a bug? A communication/message exchange between the two inconsistent processes.

Compared to the TaxDC paper, this paper focuses on a smaller set of bugs, only the crash-recovery bugs. In comparison to the TaxDC paper, the paper states that crash recovery bugs are more likely to cause fatal failures than DC bugs. In contrast to 17% of DC bugs, a whopping 38% of the crash-recovery bugs caused additional node downtime, including cluster out of service and unavailable nodes.
