SOSP19 Day 1, Debugging Session
This was the first session after lunch, and it had four papers on debugging in large-scale systems.
CrashTuner: Detecting Crash Recovery Bugs in Cloud Systems via Meta-info Analysis
This paper is by Jie Lu (The Institute of Computing Technology of the Chinese Academy of Sciences), Chen Liu (The Institute of Computing Technology of the Chinese Academy of Sciences), Lian Li (The Institute of Computing Technology of the Chinese Academy of Sciences), Xiaobing Feng (The Institute of Computing Technology of the Chinese Academy of Sciences), Feng Tan (Alibaba Group), Jun Yang (Alibaba Group), Liang You (Alibaba Group).
Crash recovery code can be buggy and often results in catastrophic failure. Random fault injection is ineffective for detecting these bugs, as they are rarely exercised. Model checking at the code level is not viable due to the state space explosion problem. As a result, crash-recovery bugs are still widely prevalent. Note that the paper does not talk about "crash" bugs, but "crash recovery" bugs, where the recovery code interferes with normal code and causes the error.
CrashTuner introduces novel approaches to automatically find crash-recovery bugs in distributed systems. The paper observes that crash-recovery bugs involve "meta-info" variables. Meta-info variables include variables denoting nodes, jobs, tasks, applications, containers, attempts, sessions, etc. I guess these are critical metadata. The paper might include more description for them.
The insight in the paper is that crash-recovery bugs can be easily triggered when nodes crash before reading meta-info variables and/or crash after writing meta-info variables.
Using this insight, CrashTuner inserts crash points at reads/writes of meta-info variables. This results in a 99.91% reduction in crash points compared with previous testing techniques.
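To make the crash-point idea concrete, here is a minimal sketch of how I picture it. This is my own toy model, not CrashTuner's implementation: the `META_INFO` set, the `(op, variable)` trace format, and the `crash_points` helper are all assumptions made for illustration.

```python
from typing import List, Set, Tuple

# Hypothetical meta-info variable names; the real tool derives these from the system.
META_INFO: Set[str] = {"node_id", "container_id", "task_attempt"}

Access = Tuple[str, str]  # (operation, variable), e.g. ("read", "node_id")


def crash_points(trace: List[Access], meta_info: Set[str] = META_INFO) -> List[Tuple[int, str]]:
    """Pick crash points: just before a meta-info read, just after a meta-info write."""
    points = []
    for index, (op, variable) in enumerate(trace):
        if variable not in meta_info:
            continue  # accesses to ordinary variables are not crash points
        if op == "read":
            points.append((index, "before"))
        elif op == "write":
            points.append((index, "after"))
    return points


trace = [
    ("read", "buffer"),
    ("write", "container_id"),
    ("read", "counter"),
    ("read", "node_id"),
]
print(crash_points(trace))
```

Only two of the four accesses in this toy trace become crash points, which is the kind of pruning the 99.91% figure refers to at a much larger scale.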
They evaluated CrashTuner on Yarn, HDFS, HBase, and ZooKeeper and found 116 crash-recovery bugs. 21 of these were new crash-recovery bugs (including 10 critical bugs). 60 of these bugs were already fixed.
The presentation concluded by saying that meta-info is a well-suited abstraction for distributed systems. After the presentation, I still have questions about identifying meta-info variables, and about what the false-positive and false-negative rates would be for finding meta-info variables via a heuristic definition as above.
The Inflection Point Hypothesis: A Principled Debugging Approach for Locating the Root Cause of a Failure
This paper is by Yongle Zhang (University of Toronto), Kirk Rodrigues (University of Toronto), Yu Luo (University of Toronto), Michael Stumm (University of Toronto), Ding Yuan (University of Toronto).
This paper is about debugging for finding the root cause of a failure. The basic approach is to collect a large number of traces (via failure reproduction and failure execution), try to find the strongest statistical correlation with the fault, and identify this as the root cause. The paper asks the question: what is the fundamental property of a root cause that allows us to build a tool to automatically identify it?
The paper offers as the definition of root cause "the most basic reason for failure, if corrected will prevent the fault from occurring." This has two parts: "if changed would result in correct execution" and "the most basic cause".
Based on this, the paper defines the inflection point as the first point in the failure execution that differs from the instruction in the nonfailure execution, and develops Kairux, a tool for automated root cause localization. The idea in Kairux is to build the nonfailure execution that has the longest common prefix with the failure execution. To this end it uses unit tests, stitches unit tests together to build a nonfailure execution, and modifies existing unit tests for a longer common prefix. Then it uses dynamic slicing to obtain a partial order.
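The longest-common-prefix idea is easy to sketch if you flatten executions into lists of events. This is only my simplification under that assumption (Kairux works on a partial order obtained by dynamic slicing, not on flat lists), and the helper names and event strings below are made up.

```python
from typing import List, Sequence, Tuple


def common_prefix_len(a: Sequence[str], b: Sequence[str]) -> int:
    """Number of leading events the two executions share."""
    length = 0
    for x, y in zip(a, b):
        if x != y:
            break
        length += 1
    return length


def inflection_point(failure: Sequence[str], nonfailures: List[Sequence[str]]) -> Tuple[int, int]:
    """Return (index of the closest nonfailure run, index of the first diverging event)."""
    best = max(range(len(nonfailures)),
               key=lambda i: common_prefix_len(failure, nonfailures[i]))
    return best, common_prefix_len(failure, nonfailures[best])


failure = ["open", "write", "delete_block", "crash"]
passing = [
    ["open", "read", "close"],
    ["open", "write", "close"],
]
best, index = inflection_point(failure, passing)
print(best, index, failure[index])
```

Here the second passing run shares the longest prefix with the failure, so the first event after that prefix is reported as the candidate inflection point.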
The presentation gave a real-world example from HDFS 10453, delete blockthread. It took the developer one month to figure out the root cause of the bug. Kairux does this automatically.
Kairux was evaluated on 10 cases from JVM distributed systems, including HDFS, HBase, and ZooKeeper. It successfully found the root cause for 7 out of the 10 cases. For the 3 unsuccessful cases, the paper claims this was because the root cause location could not be reached by modifying unit tests.
This paper was similar to the previous paper in the session in that it had a heuristic insight with applicability in a reasonably focused, narrow domain. I think the tool support would be welcomed by developers. Unfortunately I didn't see that the code and tool are available as open source anywhere.
Finding Semantic Bugs in File Systems with an Extensible Fuzzing Framework
This paper is by Seulbae Kim (Georgia Institute of Technology), Meng Xu (Georgia Institute of Technology), Sanidhya Kashyap (Georgia Institute of Technology), Jungyeon Yoon (Georgia Institute of Technology), Wen Xu (Georgia Institute of Technology), Taesoo Kim (Georgia Institute of Technology).
The presentation opens by asking "Can file systems be bug free?" and answers this in the negative, citing that the codebases for filesystems are massive (40K-100K) and constantly evolving. This paper proposes to use fuzzing as an umbrella solution that unifies existing bug checkers for finding semantic bugs in filesystems.
The idea in fuzzing is to give crashes as feedback to the fuzzer. However, the challenge for finding semantic bugs using fuzzers is that semantic bugs are silent and won't be detected. So we need a checker to go through the test cases, check the validity of return values, and give this feedback to the fuzzer.
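Here is a toy version of that loop, to show what "checker feedback instead of crash feedback" means. Everything in it is invented for illustration: `buggy_truncate` stands in for a file system with a silent bug, `reference_truncate` stands in for the specification, and the mutation strategy is far simpler than anything Hydra does.

```python
import random
from typing import List, Tuple


def buggy_truncate(size: int, new_len: int) -> int:
    """Toy system under test: silently fails to extend a file (no crash)."""
    return min(size, new_len)


def reference_truncate(size: int, new_len: int) -> int:
    """Toy specification: truncate always sets the size to new_len."""
    return new_len


def checker(size: int, new_len: int) -> bool:
    """True when the system under test agrees with the specification."""
    return buggy_truncate(size, new_len) == reference_truncate(size, new_len)


def fuzz(iterations: int = 200, seed: int = 0) -> List[Tuple[int, int]]:
    rng = random.Random(seed)
    corpus = [(8, 4)]  # seed input: shrink an 8-byte file to 4 bytes
    violations = []
    for _ in range(iterations):
        size, new_len = rng.choice(corpus)
        mutated = (max(0, size + rng.randint(-4, 4)),
                   max(0, new_len + rng.randint(-4, 4)))
        if not checker(*mutated):
            violations.append(mutated)
            corpus.append(mutated)  # the checker signal is fed back to the fuzzer
    return violations


found = fuzz()
print(len(found), found[:3])
```

Nothing ever crashes in this toy, so a crash-only fuzzer would report nothing; the violations show up only because the checker compares return values against the specification.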
To realize this insight, they built Hydra. Hydra is available as open source at https://github.com/sslab-gatech/hydra
Hydra uses checker-defined signals and automates input space exploration, test execution, and incorporation of test cases. Hydra is extensible via pluggable checkers: a POSIX checker for spec violations (SibylFS), and checkers for logic bugs, for memory safety bugs, and for crash consistency bugs (SymC3).
So far, Hydra has discovered 91 new bugs in Linux file systems, including several crash consistency bugs. Hydra also found a bug in a verified file system, FSCQ (because it had used an unverified function in its implementation).
The presenter said that Hydra generates better test cases, and the minimizer can reduce the steps in crashes from 70s to 20s. The presentation also live demoed Hydra in action with SymC3.
Efficient and Scalable Thread-Safety Violation Detection: Finding Thousands of Concurrency Bugs During Testing
This paper is by Guangpu Li (University of Chicago), Shan Lu (University of Chicago), Madanlal Musuvathi (Microsoft Research), Suman Nath (Microsoft Research), Rohan Padhye (Berkeley).
This paper received a best paper award at SOSP19. It provides an easy push-button solution for finding concurrency bugs.
The paper deals with thread safety violations (TSVs). A thread safety violation occurs if two threads concurrently invoke two conflicting methods on the same object. For example, the C# List data structure has a contract that says two adds cannot be concurrent. Unfortunately thread safety violations still exist, and are hard to find via testing as they don't show up in most test runs. The presentation mentioned a major bug that led to Bitcoin loss.
Thread safety violations are very similar to data race conditions, and it is possible to use data-race detection tools in a manually intensive process to find some of these bugs at small scale. To reduce the manual effort, it is possible to adopt dynamic data race analysis while running the program under test inputs, but these require a lot of false-positive pruning.
At large scale, these don't work. CloudBuild at Microsoft involves 100 million tests from 4K teams and up to 10K machines. At this scale, there are three challenges: integration, overhead, and false positives.
The paper presents TSVD, a scalable dynamic analysis tool. It is push button. You provide TSVD only the thread safety contract, and it finds the results with zero false positives. TSVD was deployed in Azure, and it has found more than 1000 bugs in a short time. The tool is available as open source at https://github.com/microsoft/TSVD
To achieve zero false positives, TSVD uses a very interesting trick. A potential violation (i.e., a call site that was identified by code analysis as one that may potentially violate the thread safety contract) is retried in many test executions by injecting delays to trigger a real violation. If a real violation is found, this is a true bug. Else, it was a false positive and is ignored.
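A small sketch of the delay-injection trick, as I understand it. This is not how TSVD is implemented (it targets C# programs); the `TrapList` class, the site names, and the `trap_set` event are hypothetical, and the event exists only so that the demo interleaving is repeatable.

```python
import threading
import time
from typing import List, Optional, Tuple


class TrapList:
    """A list wrapper whose add() can pause inside the call to expose overlaps."""

    def __init__(self) -> None:
        self.items: List[int] = []
        self.violations: List[Tuple[str, str]] = []
        self.trap_set = threading.Event()          # demo-only: signals that a trap is armed
        self._guard = threading.Lock()             # protects the bookkeeping below
        self._trapped_site: Optional[str] = None   # call site currently paused in add()

    def add(self, value: int, site: str, delay_s: float = 0.0) -> None:
        with self._guard:
            if self._trapped_site is not None:
                # Another thread is paused inside add(): an observed, real overlap.
                self.violations.append((self._trapped_site, site))
            elif delay_s > 0:
                self._trapped_site = site
                self.trap_set.set()
        if delay_s > 0:
            time.sleep(delay_s)                    # injected delay widens the race window
        with self._guard:
            self.items.append(value)
            if self._trapped_site == site:
                self._trapped_site = None


shared = TrapList()
delayed = threading.Thread(target=shared.add, args=(1, "site_a", 0.3))
other = threading.Thread(target=shared.add, args=(2, "site_b"))
delayed.start()
shared.trap_set.wait()   # make sure the delayed call is in flight before the second call starts
other.start()
delayed.join()
other.join()
print(shared.violations)
```

The point of the design is that a violation is reported only when two calls are actually caught overlapping, which is why there are no false positives to prune afterwards.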
But how do we do the analysis to identify these potentially dangerous calls to insert delays? TSVD uses another interesting trick to identify them. It looks for conflicting calls with close-by physical timestamps. It flags likely racing calls, where two conflicting calls from different threads to the same object occur within a short physical time window. This way of doing things is more efficient and scalable than trying to do a happened-before analysis and finding calls with concurrent logical timestamps. Just identify likely race calls.
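The near-miss filter can be sketched over a call log. The log format, the `CONFLICTS` contract, and the 10 ms window below are assumptions of mine for illustration, not values from the paper.

```python
from typing import List, NamedTuple, Set, Tuple


class Call(NamedTuple):
    time_ms: float   # physical timestamp of the call
    thread: int
    obj: str
    method: str
    site: str        # call site label


# Hypothetical contract: an add conflicts with anything, two reads are fine.
CONFLICTS = {("add", "add"), ("add", "read"), ("read", "add")}


def near_misses(log: List[Call], window_ms: float = 10.0) -> Set[Tuple[str, str]]:
    """Flag pairs of call sites whose conflicting calls land close together in time."""
    pairs = set()
    ordered = sorted(log)  # NamedTuples sort by time_ms first
    for i, first in enumerate(ordered):
        for second in ordered[i + 1:]:
            if second.time_ms - first.time_ms > window_ms:
                break  # everything later is outside the window
            if (first.thread != second.thread
                    and first.obj == second.obj
                    and (first.method, second.method) in CONFLICTS):
                pairs.add((first.site, second.site))
    return pairs


log = [
    Call(0.0, 1, "list1", "add", "site_a"),
    Call(3.0, 2, "list1", "add", "site_b"),
    Call(4.0, 2, "list2", "add", "site_c"),
    Call(500.0, 1, "list1", "read", "site_d"),
    Call(502.0, 2, "list1", "read", "site_e"),
]
print(near_misses(log))
```

Only the two adds on `list1` are flagged: the call on `list2` touches a different object, and the two reads are close in time but do not conflict under the assumed contract.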
OK, what if there is actual synchronization between the two potentially conflicting calls within close-by physical timestamps? Why waste energy to keep testing it to break this? Due to synchronization, this won't lead to a real bug. To avoid this they use synchronization inference (another neat trick!): if m1 is synchronized before m2, a delay added to m1 leads to the same delay in m2. If this close correlation is observed in the delays, TSVD infers synchronization. This way it also infers whether a program is running sequentially or not, which calls are more likely to lead to problems, etc.
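The inference rule boils down to a comparison of two numbers. A minimal sketch, assuming we can measure how much later m2 started compared with a run without the injected delay; the function name and the 20% tolerance are my own choices, not the paper's.

```python
def infer_synchronized(injected_delay_ms: float, m2_shift_ms: float,
                       tolerance: float = 0.2) -> bool:
    """Guess that m1 happens-before m2 when m2 slips by about the injected delay.

    m2_shift_ms is how much later m2 started than in a run with no delay at m1.
    """
    return abs(m2_shift_ms - injected_delay_ms) <= tolerance * injected_delay_ms


# m2 slipped by almost exactly the 100 ms injected at m1: likely synchronized, stop testing the pair.
print(infer_synchronized(100.0, 98.0))
# m2 barely moved: no evidence of synchronization, keep the pair as a candidate.
print(infer_synchronized(100.0, 3.0))
```

Pairs judged synchronized can be dropped from the candidate set, so later runs spend their delays on pairs that can still race.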
They deployed TSVD at Microsoft for several months. It was given thread safety contracts of 14 system classes in C#, including List, Dictionary, etc. It was tested on 1600 projects, was run 1 or 2 times, and found 1134 thread safety violations. During the validation procedure, they found that 96% of the TSVs were previously unknown to developers and 47% will cause severe customer-facing issues eventually.
TSVD beats other approaches, including random, DataCollider, and happened-before (HB) tracking. 96% of all violations were captured by running TSVD 50 times. And 50% of violations were captured by running TSVD once! This beats other tools with little overhead.
One drawback of the TSVD approach is that it may cause a false negative by adding the random delay. But when you run the tool multiple times, those missed false negatives are captured due to the different random delays tried.
Yep, this paper definitely deserved a best paper award. It used three very interesting insights/heuristics to make the problem feasible/manageable, then built a tool using these insights, and showed exhaustive evaluations of this tool.