Paper Review: Elementary Testing Tin Foreclose Close Critical Failures: An Analysis Of Production Failures Inward Distributed Data-Intensive Systems

This newspaper appeared inwards OSDI'14 in addition to is written yesteryear Ding Yuan, Yu Luo, Xin Zhuang, Guilherme Renna Rodrigues, Xu Zhao, Yongle Zhang, Pranay U. Jain, in addition to Michael Stumm, University of Toronto.

While the championship mentions "analysis of production failures", the newspaper is non concerned/interested inwards rootage drive analysis of failures, instead, the newspaper is preoccupied amongst problems inwards error treatment that Pb to failures.

I enjoyed this newspaper a lot. The newspaper has interesting counterintuitive results. It opens to give-and-take the error treatment issue, in addition to especially the Java exception handling. Digesting in addition to interpreting the findings inwards the newspaper volition require time. To contribute to this process, hither is my accept on the findings ---after a quick summary of the paper.

The setup in addition to terminology

The newspaper studied 198 randomly sampled user-reported failures of five opensource data-intensive distributed systems: Cassandra, HBase, Hadoop Distributed File System (HDFS), Hadoop MapReduce, in addition to Redis. Table 1 shows how the sampling is done. (As a superficial observation, it seems similar HBase is really buggy where 50% of sampled failures are catastrophic, in addition to Cassandra is good engineered.) Almost all of these systems are written inwards Java, thence Java's exception-style error treatment plays a pregnant role inwards the findings of the paper.

Definitions: A fault is the initial rootage cause, which could hold out a hardware malfunction, a software bug, or a misconfiguration. A fault tin strength out arrive at abnormal behaviors referred to every bit errors, such every bit Java exceptions. Some of the errors volition receive got no user-visible side-effects or may hold out appropriately handled yesteryear software; other errors manifest into a failure, where the organization malfunction is noticed yesteryear halt users or operators.

The newspaper defines catastrophic failure every bit "failures where most or all users sense an outage or information loss". Unfortunately this is a really vague definition. Table 2 provides to a greater extent than or less illustration categories for catastrophic failures considered.

Findings

Table three shows that unmarried lawsuit input failures are relatively low, this is in all probability because these systems are well-tested amongst unit of measurement tests every bit they are heavily used inwards production. On the other hand, the predominating instance is where 2 events conspire to trigger the failures.

From Table 4, what jumps upwardly is that starting upwardly services is especially problematic. Incidentally, in airplanes most fatal accidents occur during the climbing stage.

Table five shows that almost all (98%) of the failures are guaranteed to manifest on no to a greater extent than than three nodes, in addition to 84% volition manifest on no to a greater extent than than 2 nodes. For large-scale distributed systems, three nodes beingness sufficient to manifest almost all failures seems surprisingly low. Of course, this newspaper looks at data-intensive distributed systems, which may non hold out representative of full general distributed systems. In whatsoever case, these numbers don't surprise me every bit they concur with my sense using TLA+ inwards verifying distributed algorithms.

Deterministic failures totaling at 74% of all failures is goodness news. Deterministic failures are low-hanging fruit, they are easier to fix.

Catastrophic failures

Figure five shows a break-down of all catastrophic failures yesteryear their error handling. Based on this figure, the newspaper claims that "almost all (92%) of the catastrophic organization failures are the resultant of incorrect handling of non-fatal errors explicitly signaled inwards software".

But, it seems to me that this provocative contestation is due to a broad/vague definition of "incorrect error handling". If you lot utilization a broad/vague definition of "incorrect landing", you lot tin strength out say that every catastrophic flat failure is an wrong landing problem. Java casts everything into an error exception, in addition to thence every fault materializes/surfaces every bit an exception. But, does that hateful if nosotros arrive at a goodness chore on exception handling, at that spot volition hold out almost no catastrophic failures? That is an wrong assumption. Sometimes the correction required needs to hold out invasive (such every bit resetting nodes) in addition to the correction also counts every bit a catastrophic failure.

And how tin strength out nosotros arrive at a goodness chore on error handling? The newspaper does non render help on that. Correct error-handling is really hard: You ask a lot of context in addition to a holistic agreement of the system, in addition to that is in all probability why error-handling has been done sloppily inwards the systems studied.

The newspaper also claims: "in 58% of the catastrophic failures, the underlying faults could easily receive got been detected through elementary testing of error treatment code." I tin strength out concur amongst the 35%, every bit they consist of piffling mistakes inwards exception handling, such every bit an empty error treatment block. But for including the other 23% labeling them every bit easily detectable, I mean value nosotros should practise caution in addition to non dominion out the hindsight bias.

To preclude this 23% failures, the newspaper suggests 100% contestation coverage testing on the error treatment logic. To this end, the newspaper suggests that opposite technology scientific discipline essay out cases that trigger them. At the halt of the day, this boils to thinking difficult to meet how this tin strength out hold out triggered. But how arrive at nosotros know when to quit in addition to when nosotros got enough? Without a rule, this tin strength out larn cumbersome.

Figure10 shows an illustration of the 23% easily detectable failures. That illustration however looks tricky to me, fifty-fifty afterward nosotros practise the hindsight advantage.

Speculation close the causes for hapless error-handling

Maybe i argue for hapless error-handling inwards the studied systems is that features are sexy, but fault-tolerance is not. You meet features, but you lot don't meet fault-tolerance. Particularly, if you lot arrive at fault-tolerance right, nobody notices it.

But, that is a simplistic answer. After all, these systems are unremarkably used inwards production, in addition to they are well-tested amongst the FindBugs tool, unit of measurement tests, in addition to fault injection. Then, why arrive at they however suck thence badly at their exception-handling code? I speculate that perchance at that spot is a scarcity of "expert generalists", developers that empathise the projection every bit a whole in addition to that tin strength out write error-handling/fault-tolerance code.

Another argue of course of study could hold out that developers may mean value a item exception volition never arise. (Some comments inwards the exception handlers inwards the five systems studied hint at that.) That the developers cannot anticipate how a item exception tin strength out hold out raised doesn't hateful it won't hold out raised. But, perchance nosotros should include this instance inside the higher upwardly instance that says at that spot is a scarcity of fault-tolerance experts/generalists inwards these projects.

"Correct" error handling

While the newspaper pushes for 100% error treatment coverage, it doesn't offering techniques for "correct" error handling. Correct error treatment requires rootage drive analysis, a holistic sentiment of the system, in addition to fault-tolerance expertise.

So at that spot is no silverish bullet. Exceptions inwards Java, in addition to error treatment may facilitate to a greater extent than or less things, but at that spot is no shortcut to fault-tolerance. Fault-tolerance volition ever receive got a price.

The cost of reliability is the pursuit of the utmost simplicity.
It is a cost which the really rich expose most difficult to pay.
C.A.R. Hoare

Possible hereafter directions

Since 2-input triggered corruptions are predominant, perchance nosotros should consider pairwise testing inwards improver to unit of measurement testing to guard against them. Unit testing is goodness for enforcing a component's local invariants. Pairwise testing tin strength out enforce an interaction invariant betwixt 2 components, in addition to guard against 2-input triggered corruptions.

I mean value the principal argue it is difficult to write "correct error handling" code is that the exception handler doesn't receive got plenty information to know the proper means to right the state. This is however futuristic, but if nosotros had an eidetic system, in addition to thence when exceptions hit, nosotros could telephone telephone the corrector, which tin strength out arrive at a root-cause analysis yesteryear doing a backwards enquiry on the eidetic system, in addition to afterward determining the occupation could arrive at a frontward enquiry to figure out what are just the things that ask to hold out fixed/corrected every bit a resultant of that fault.