Paper Summary. Snailtrail: Generalizing Critical Paths For Online Analysis Of Distributed Dataflows
Monitoring is very important for distributed systems, and I wish it would get more attention in research conferences. There has been work on monitoring for predicate detection purposes and for performance problem detection purposes. As machine learning and big data processing frameworks are seeing more action, we have been seeing more work on the latter category. For example, in ML there has been work on how to figure out the best configuration to run. And in the context of general big data processing frameworks there has been work on identifying performance bottlenecks.
Collecting information and creating statistics about a framework to identify the bottleneck activities seems like an easy affair. However, the "Making Sense of Performance" paper (2015) showed that this is not as simple as it seems, and sophisticated techniques such as blocked time analysis are needed to get a more accurate picture of performance bottlenecks.
SnailTrail is built on the Timely Dataflow framework. It supports monitoring of applications written in Flink, Spark, Timely Dataflow, TensorFlow, and Heron.
SnailTrail overview
The SnailTrail tool operates in five stages:
- it ingests the streaming logs from the monitored distributed application,
- slices those streams into windows,
- constructs a program activity graph (PAG) for the windows,
- computes the critical path analysis of the windows, and
- outputs the summaries.
The program activity graph (PAG) is a directed acyclic graph. The vertices denote the start and end of activities, such as: data processing, scheduling, buffer management, serialization, waiting, application data exchange, control messages, or unknown activities. The edges have a type and a weight for the activities. The edges capture the happened-before relationships between the vertices. Figure 2.a shows an example. The figure also shows how to project this PAG to an interval, which is pretty straightforward and as you would expect.
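To make the PAG description concrete, here is a minimal sketch of how typed, weighted activity edges and the projection onto an interval might look. The names `Edge` and `project`, the field layout, and the activity-type strings are my own assumptions for illustration; this is not SnailTrail's actual data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    """One activity: an edge from its start event to its end event."""
    src: str      # id of the start vertex (hypothetical naming)
    dst: str      # id of the end vertex
    kind: str     # activity type, e.g. "processing" or "waiting"
    start: float  # timestamp of the start event
    end: float    # timestamp of the end event

    @property
    def weight(self) -> float:
        # The edge weight is the duration of the activity.
        return self.end - self.start


def project(edges, t_start, t_end):
    """Clip every activity to [t_start, t_end]; drop those fully outside."""
    clipped = []
    for e in edges:
        s, t = max(e.start, t_start), min(e.end, t_end)
        if s < t:
            clipped.append(Edge(e.src, e.dst, e.kind, s, t))
    return clipped


# One worker's timeline: process, wait, process.
timeline = [
    Edge("a", "b", "processing", 0.0, 4.0),
    Edge("b", "c", "waiting", 4.0, 6.0),
    Edge("c", "d", "processing", 6.0, 9.0),
]
window = project(timeline, 3.0, 7.0)
```

Activities that straddle a window boundary are simply truncated here; a fuller version would also introduce boundary vertices for the cut points.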
As output, SnailTrail provides summaries for operation, worker, and communication bottlenecks. These summaries help identify which machine is overloaded, and which operators should be re-scaled. The figure below shows examples.
Critical path analysis
Critical path analysis (CPA) is a technique originally introduced in the context of project planning. CPA produces a metric which captures the importance of each activity in the transient critical paths of the computation. The CPA score calculation in SnailTrail centers on the Transient Path Centrality concept, which corresponds to the number of paths this activity appears on. In the formula, e[w] corresponds to the weight of the edge e, which is taken as the start/end time duration of the activity.
In Figure 3, the activity (u,v) has the highest CPA score because it is involved in all of the nine paths that appear in this window.
A noteworthy thing in the PAG in Figure 3 is the wait period (denoted as dashed lines) after b in worker 1. Because worker 1 is blocked with a wait after b, that path terminates at b, and does not feed into another path. The paper explains this as follows: "Waiting in our model is always a consequence of other, concurrent, activities, and so is a key element of critical path analysis: a worker does not do anything useful while waiting, and so waiting activities can never be on the critical path." Because the two wait states turn some of the paths into dead ends, there are only nine possible paths in the interval depicted in Figure 3.
The formula above enables us to calculate the CPA scores without enumerating and materializing all the transient critical paths of the computation in each window. But how do you calculate N (which, for Figure 3, is calculated as 9) without enumerating the critical paths? It is simple really. The PAG is a directed acyclic graph (DAG). Every time there is a split in the path, you copy path_count to both edges. Every time two edges join, you set the new path_count to be the sum of the path_counts in the two incoming edges. At the end of the interval, look at how many paths are exiting; that is N. Give this a try for Figure 3 yourself.
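The split/join counting rule can be sketched in a few lines. The graph below is a made-up example, not Figure 3, and `count_paths` and `normalized_score` are hypothetical names. Two things go beyond what the text states and are my assumptions: multiplying a forward count by a backward count to get the number of paths through each edge, and normalizing the score by N and the window length.

```python
from collections import defaultdict
from graphlib import TopologicalSorter  # Python 3.9+


def count_paths(edges, sources, sinks):
    """Return (N, per_edge): N is the number of source-to-sink paths,
    per_edge[(u, v)] is how many of those paths use edge (u, v)."""
    succ, pred = defaultdict(list), defaultdict(list)
    for u, v in edges:
        succ[u].append(v)
        pred[v].append(u)
    nodes = set(sources) | set(sinks) | {x for e in edges for x in e}
    # TopologicalSorter takes a mapping of node -> predecessors.
    order = list(TopologicalSorter({n: pred[n] for n in nodes}).static_order())

    # Forward pass: a split copies the count, a join sums the incoming counts.
    fwd = {n: (1 if n in sources else 0) for n in nodes}
    for n in order:
        for p in pred[n]:
            fwd[n] += fwd[p]

    # Backward pass (assumption): paths from each vertex to the window's end.
    bwd = {n: (1 if n in sinks else 0) for n in nodes}
    for n in reversed(order):
        for s in succ[n]:
            bwd[n] += bwd[s]

    total = sum(fwd[s] for s in sinks)
    per_edge = {(u, v): fwd[u] * bwd[v] for u, v in edges}
    return total, per_edge


def normalized_score(path_count, weight, n_paths, window_len):
    # Assumed shape of the score: path count times edge weight,
    # normalized by the number of paths and the window length.
    return path_count * weight / (n_paths * window_len)


# Waiting edges are left out, so "x" is a dead end: b is followed by a wait.
edges = [("s1", "a"), ("s2", "a"), ("a", "b"), ("a", "c"),
         ("b", "t1"), ("b", "x"), ("c", "t1"), ("c", "t2")]
n_paths, per_edge = count_paths(edges, {"s1", "s2"}, {"t1", "t2"})
```

Because the dead-end vertex never reaches the end of the window, the edge into it ends up on zero paths, which mirrors how the wait after b cuts off paths in the discussion above.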
MAD questions
1) What could be some drawbacks/shortcomings of CPA? One reason CPA may not be representative is that one execution of the application may not be typical of all executions. For example, in Fig 3, w3 may take a long time in the c-d activity in the next execution, and that could become the bottleneck in another execution of the application. But it is possible to argue that since SnailTrail produces summaries, this may not be a big issue. Another reason CPA may not be representative is that the execution may be data dependent. And once again it is possible to argue that this won't be a big issue if the application uses several data in processing, and things get averaged.
Another criticism of CPA could be that it does not compose. Try combining two consecutive windows; that would lead to changing the CPA scores for the activities. Some previously low scored activities will jump up to be high, and some highly scored activities will go down. Because of this non-composition, it becomes important to decide what is the best window/interval size to calculate these CPA metrics for summarization purposes. On the other hand, it may be reasonable to expect these metrics to be non-composable, since these metrics (blocked time analysis and critical time analysis) are designed to be complex to capture inherent/deep critical activities in the computations.
Could there be another approach than the CPA scoring presented here to capture the importance of an execution activity in the paths of the computation? Maybe something that uses the semantics of activity types and their relation to each other?
2) The SnailTrail method uses snapshots to determine the start and end of the windows that the CPA scoring algorithm works on. Does time synchronization need to be perfect for SnailTrail snapshot and analysis to work? What are the requirements on the time synchronization for this to work?
It turns out this currently requires perfect clock synchronization. The evaluation experiments are run on a single machine with 32 cores. Without perfect clock synchronization, the snapshots may not be consistent, and that could break the path calculation and CPA scoring. Hybrid Logical Clocks can help deal with the uncertainty periods in NTP clocks and can make the method work. The paper addresses this issue in Appendix F: "In SnailTrail, we assume that the trend toward strong clock synchronization in datacenters means that clock skew is not, in practice, a significant problem for our analysis. If it were to become an issue, we would have to consider adding Lamport clocks and other mechanisms for detecting and correcting for clock skew."
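For readers who have not seen Hybrid Logical Clocks, here is a minimal sketch of the standard HLC update rules. It is not part of SnailTrail, and how the timestamps would be wired into its snapshotting is left open. The class name and the injectable `physical_clock` parameter are my own choices for illustration.

```python
import time


class HybridLogicalClock:
    def __init__(self, physical_clock=lambda: time.time_ns() // 1_000_000):
        self.pt = physical_clock  # physical time source (ms by default)
        self.l = 0                # largest physical time observed so far
        self.c = 0                # counter that orders events sharing one l

    def tick(self):
        """Timestamp a local or send event."""
        prev = self.l
        self.l = max(prev, self.pt())
        self.c = self.c + 1 if self.l == prev else 0
        return (self.l, self.c)

    def recv(self, m_l, m_c):
        """Timestamp the receipt of a message stamped (m_l, m_c)."""
        prev = self.l
        self.l = max(prev, m_l, self.pt())
        if self.l == prev and self.l == m_l:
            self.c = max(self.c, m_c) + 1
        elif self.l == prev:
            self.c += 1
        elif self.l == m_l:
            self.c = m_c + 1
        else:
            self.c = 0
        return (self.l, self.c)


# Worker b's physical clock lags behind worker a's.
a = HybridLogicalClock(lambda: 10)
b = HybridLogicalClock(lambda: 5)
sent = a.tick()
received = b.recv(*sent)
```

Timestamps compare as tuples, so the receive event is ordered after the send event even though the receiver's physical clock is behind. That is the property that could keep window boundaries consistent under clock skew.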
3) The paper doesn't have an example of improvement/optimization based on summary analytics. Even when the summary shows the high scored CPA activities to improve upon, it will be tricky for a developer to go in and change the code to improve things, because there is no indication of how easy it would be to improve the performance of these high CPA score activities.
One way to address this could be to extend the CPA idea as follows. After ranking the activities based on CPA scores, build another ranking of activities based on ease of optimizing/improving (I don't know how to do this, but some heuristics and domain knowledge can help). Then start the improvements/optimizations from the easiest to optimize activities that have the highest CPAs.
Another way to make the summary analytics more useful is to process it further to provide more easily actionable suggestions, such as add more RAM, more CPU, more network/IO bandwidth. Or the suggestions could also say that you can do with less of the resources of one kind, which can help save money in cloud deployments. If you can run with similar performance but on cheaper configurations, you would like to have that option. Would this extension require adapting SnailTrail monitoring and CPA scoring more towards capturing resource usage related metrics?
4) Since CPA first appeared in the project/resource management domain, could there be other techniques there that can apply to performance monitoring of distributed execution?
5) I think SnailTrail breaks new ground in the "context" or "philosophy" of monitoring: it is not trace based, it is not snapshot based, but it is window/interval-based. In our work on Retroscope, we argued it is important to aggregate the paths both spatially (across the distributed computation) and temporally (as an evolving picture of the system's performance). SnailTrail extends the snapshot view to window views.
Miscellaneous
Frank McSherry recently joined ETH Zurich's systems group and will be helping with the Strymon project and the SnailTrail work, so I'll be looking forward to more monitoring work from there.