[Paper Review] TaxDC: A Taxonomy of Non-Deterministic Concurrency Bugs in Datacenter Distributed Systems

This paper appeared in ASPLOS 2016 and the authors are Leesatapornwongsa, Lukman, Lu, and Gunawi.

The paper provides a comprehensive study of real-world distributed concurrency bugs and is a good complement to the paper I reviewed before: "Why Does the Cloud Stop Computing?". While the previous paper looked at all possible bugs that lead to service outages, this paper focuses exclusively on distributed concurrency (DC) bugs. This type of bug happens to be my favorite kind of bug. I am fascinated with "failures", and with DC bugs doubly so. DC bugs are execution timing/ordering/scheduling related, they occur nondeterministically, and they are extremely hard to detect, diagnose, and fix in production systems. I love seeing those long traces of improbable DC bugs surface when I model check distributed algorithms in TLA+. The experience keeps me humble.

The paper presents an analysis of 104 DC bugs from Cassandra, HBase, Hadoop MapReduce, and ZooKeeper. These bugs are categorized and studied according to the triggering timing condition and input preconditions, error and failure symptoms, and fix strategies, as shown in Table 1. The bug taxonomy database is publicly available.

Trigger warning

More than 60% of DC bugs are triggered by a single untimely message delivery that commits an order violation or an atomicity violation with regard to other messages or computation. Figure 1 shows possible triggering patterns.

This sounds like a very striking and surprising finding. How is it that simple? How do we reduce most DC bugs to untimely message delivery? If you think about it, it is actually not very surprising. What is a DC bug? It is a state inconsistency across processes. What turns the state inconsistency into a bug? A communication/message-exchange between the two inconsistent processes.
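To make the "single untimely message" idea concrete, here is a minimal sketch in Python. It is a toy scenario, not one of the bugs from the paper: a hypothetical master sends `assign` and then `cancel` for a task, and the network may deliver them in either order. The names `Worker`, `final_state`, and `buggy_orders` are assumptions made up for this illustration. The sketch enumerates every delivery order and reports the ones that leave the worker disagreeing with the master.

```python
from itertools import permutations

class Worker:
    """Toy node (hypothetical): tracks the tasks it believes it should run."""

    def __init__(self):
        self.running = set()

    def handle(self, msg):
        kind, task = msg
        if kind == "assign":
            self.running.add(task)
        elif kind == "cancel":
            # Naive handler: silently assumes the assign was delivered first.
            self.running.discard(task)

def final_state(order):
    """Deliver the messages in the given order and return the worker's state."""
    worker = Worker()
    for msg in order:
        worker.handle(msg)
    return worker.running

# The master sent assign, then cancel, so it believes nothing is running.
sent = [("assign", "t1"), ("cancel", "t1")]
master_view = set()

def buggy_orders(messages):
    """Return the delivery orders that leave worker and master inconsistent."""
    return [order for order in permutations(messages)
            if final_state(order) != master_view]

for order in buggy_orders(sent):
    print("order violation:", order)
```

Only the reordered delivery (cancel before assign) shows up as a violation: the worker keeps running a task the master considers cancelled, and the inconsistency becomes a bug the next time the two nodes communicate about that task.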

Of course, the root cause leading to the state inconsistency can be pretty complex. Figure 3 shows that many DC bugs need complex input preconditions, such as faults (63% in Figure 3.b), multiple protocols (80% in Figure 3.f), and background protocols (81% in Figure 3.g). As an example, consider the juicy bug described in Figure 4.


Can we fix it? Yes, we can!

The paper analyzes bug patches to understand developers' fix strategies for DC bugs. They find that DC bugs can be fixed either by disabling the triggering timing through extra synchronization, or by changing the system's handling logic so that it ignores/neutralizes the untimely message.
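As a rough illustration of the second fix style (neutralizing the untimely message), here is a minimal Python sketch. The scenario is a toy and not taken from the paper: a hypothetical master sends `assign` and then `cancel` for a task, and the network may reorder them. The class `PatchedWorker` and its `cancelled` set are assumptions invented for this example. The handler remembers cancelled tasks, so an `assign` that arrives late is simply ignored.

```python
from itertools import permutations

class PatchedWorker:
    """Toy node (hypothetical) whose handler tolerates reordered messages."""

    def __init__(self):
        self.running = set()
        self.cancelled = set()  # extra state variable introduced by the fix

    def handle(self, msg):
        kind, task = msg
        if kind == "cancel":
            # Remember the cancel even if the assign has not arrived yet.
            self.cancelled.add(task)
            self.running.discard(task)
        elif kind == "assign":
            if task in self.cancelled:
                return  # untimely assign: ignore it
            self.running.add(task)

def run(order):
    """Deliver the messages in the given order and return the running set."""
    worker = PatchedWorker()
    for msg in order:
        worker.handle(msg)
    return worker.running

messages = [("assign", "t1"), ("cancel", "t1")]
outcomes = {order: run(order) for order in permutations(messages)}

for order, state in outcomes.items():
    print(order, "->", state)
```

Both delivery orders now end with nothing running. The price is the extra `cancelled` set, which in a real system would also need garbage collection and possibly persistence across crashes.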

Yes, this sounds very simple. But what this perspective hides is that you first need to find the bugs before you can fix them. And identifying the DC bugs is the hard part, as they occur nondeterministically and are extremely difficult to detect and diagnose. For example, Figure 3 shows that 47% of DC bugs lead to silent failures and hence are hard to detect and debug in production and to reproduce offline.

This is also a losing game, as you always need to play catch-up with the new corner cases arising and haunting your system. Instead of fixing systems case by case, it is better to fix our misconceptions and our approach to designing these distributed protocols.

I like the section on Root Causes, which talks about developers' misconceptions about distributed systems that lead to these bugs. These misconceptions are simple to correct, and doing so can eradicate many DC bugs.

  • One hop is faster than two hops.
  • No hop is faster than one hop.
  • Atomic blocks cannot be broken.
  • Interactions between multiple protocols seem to be safe.
  • Enough states are maintained to detect/handle problems. (Upon observing that some fixes add new in-memory/on-disk state variables to handle untimely message and fault timings.)

Another thing to improve would be to use better testing tools. Recall that 63% of DC bugs surface in the presence of hardware faults such as machine crashes (and reboots), network delay and partition (timeouts), and disk errors. The paper doesn't cite the Jepsen testing tool, but it would be a good fit for detecting many of the DC bugs mentioned in this study.

And lastly, an ounce of prevention is worth a pound of cure. Design your distributed protocols right in the first place. You can use TLA+ to specify and model-check your distributed protocols/algorithms before you start implementing them.
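For readers who have not seen model checking, the core idea can be sketched in a few lines of Python. This is not TLA+ and not from the paper. It is a hand-rolled, brute-force state explorer for a made-up toy protocol in which each client reads a shared counter and then writes back the value plus one, as two separate steps. All names (`next_states`, `explore`) are assumptions made for this sketch. The explorer visits every interleaving and checks the invariant that the final counter equals the number of clients.

```python
from collections import deque

def next_states(state):
    """Yield every state reachable by letting one client take one step."""
    counter, procs = state
    for i, (pc, tmp) in enumerate(procs):
        if pc == 0:      # step 1: read the counter into a local copy
            new_proc, new_counter = (1, counter), counter
        elif pc == 1:    # step 2: write back local copy + 1
            new_proc, new_counter = (2, tmp), tmp + 1
        else:            # client is done
            continue
        yield (new_counter, procs[:i] + (new_proc,) + procs[i + 1:])

def explore(n_clients=2):
    """Breadth-first search over all interleavings; collect bad final states."""
    init = (0, tuple((0, 0) for _ in range(n_clients)))
    seen = {init}
    queue = deque([init])
    violations = []
    while queue:
        state = queue.popleft()
        successors = list(next_states(state))
        # A terminal state violates the invariant if an increment was lost.
        if not successors and state[0] != n_clients:
            violations.append(state)
        for succ in successors:
            if succ not in seen:
                seen.add(succ)
                queue.append(succ)
    return seen, violations

seen, violations = explore(2)
print("states explored:", len(seen))
print("invariant violations:", violations)
```

With two clients the search finds the lost-update interleaving where both clients read 0 before either writes. A real model checker such as TLC does the same kind of exhaustive search over a specification, with far better state-space handling and support for temporal properties.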

In the lessons learned section, the paper discusses what can be done to detect/prevent DC bugs, and says that "No matter how sophisticated the tools are, they are ineffective without accurate specifications. This motivates the creation or inference of local specifications that can show early errors or symptoms of DC bugs." The implication is that since the protocols do not come with good formal specifications, the tool support for testing, checking, detection, etc. becomes ineffective. This point further motivates getting the distributed protocol right first, and having the specification accompany the protocol so that the implementation can be made reliable. TLA+ provides support for writing the specifications. You can see my previous posts on TLA+ below.

Using TLA+ for teaching distributed systems

My experience with using TLA+ in distributed systems class

Modeling the hygienic dining philosophers algorithm in TLA+

There is a vibrant Google Groups forum for TLA+: https://groups.google.com/forum/#!forum/tlaplus

By clicking on the label "tla" at the end of the post, you can reach all my posts about TLA+.
