Paper Summary: Making Sense of Performance in Data Analytics Frameworks (NSDI'15)

What constitutes the bottlenecks for big data processing frameworks? If the CPU is a bottleneck, it is easy to fix: add more machines to the computation. Of course for any analytics job, there is some amount of coordination needed across machines. Otherwise, you are only mapping and transforming, but not reducing and aggregating data. And this is where the network and the disk come into play as bottlenecks. The reason you don't get linear speedup by adding more machines to an analytics job is the network and disk bottlenecks. And a lot of research and effort is focused on trying to optimize and alleviate the network and disk bottlenecks.

OK, this sounds easy, and it looks like we understand the bottlenecks in big data analytics. But this paper argues that there is a need to put more effort into understanding the performance of big data analytics frameworks, and shows that at least for Spark, on the benchmarks and workloads they tried (see Table 1), there are some counterintuitive results. For Spark, the network is not much of a bottleneck: network optimizations can only reduce job completion time by a median of at most 2%. The disk is more of a bottleneck than the network: optimizing/eliminating disk accesses can reduce job completion time by a median of at most 19%. But most interestingly, the paper shows that CPU is often the bottleneck for Spark, so engineers should be careful about trading off I/O time for CPU time by using more sophisticated serialization and compression techniques.


This is a bit too much to digest at once, so let's start with the observations about disk versus network bottlenecks in Spark. Since shuffled data is always written to disk and read from disk, the disk constitutes more of a bottleneck than the network in Spark. (Here is a great Spark refresher. Yes, RDDs help, but pipelining is possible only within a stage. Across stages, shuffling is needed, and the intermediate shuffled data is always written to and read from disk.)
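To make the stage/shuffle distinction concrete, here is a toy word-count sketch in plain Python (my own simplification, not Spark's actual implementation): narrow map-side transformations can be pipelined lazily, one record at a time, but a wide dependency like reduceByKey is a barrier that must materialize and group all intermediate pairs before the next stage can start. In Spark, that materialized shuffle data is what hits the disk.

```python
from collections import defaultdict

def narrow_stage(records):
    # Map-side transformations pipeline per record: each record flows
    # through both steps without materializing an intermediate list.
    for r in records:
        r = r.strip().lower()   # transformation 1: normalize
        yield (r, 1)            # transformation 2: map to (key, 1) pair

def shuffle(pairs):
    # A wide dependency is a barrier: ALL pairs must be materialized
    # and grouped by key before the reduce stage can run. This grouped
    # intermediate data is what Spark writes to (and reads from) disk.
    buckets = defaultdict(list)
    for k, v in pairs:
        buckets[k].append(v)
    return buckets

def reduce_stage(buckets):
    # Runs only after the shuffle barrier completes.
    return {k: sum(vs) for k, vs in buckets.items()}

words = ["spark", "Disk ", "spark", "network"]
counts = reduce_stage(shuffle(narrow_stage(words)))
```

The generator in `narrow_stage` is the pipelining: no intermediate collection exists between the two map-side steps, whereas `shuffle` has no choice but to hold everything.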

To elaborate more on this point, the paper says: "One reason network performance has little effect on job completion time is that the data transferred over the network is a subset of data transferred to disk, so jobs bottleneck on the disk before bottlenecking on the network [even using a 1Gbps network]." While prior work has found much larger improvements from optimizing network performance, the paper argues that prior work mostly focused on workloads where shuffle data is equal to the amount of input data, which is not representative of typical workloads (where shuffle data is around one third of input data). Moreover, the paper argues, prior work used incomplete metrics, conflating CPU and network utilization. (More on this below, where we discuss the blocked time analysis introduced in this paper.)

OK, now for the CPU being the bottleneck: isn't that what we want? If the CPU becomes the bottleneck (and not the network and the disk), we can add more machines to improve processing time. (Of course there is a side effect: this will in turn create more demand for network and disk usage to coordinate the extra machines. But adding more machines is still an easy route to take, until adding machines starts to hurt.) But I guess there is good CPU utilization and not-so-good CPU utilization, and the paper takes issue with the latter. If you already have a lot of overhead/waste associated with your CPU processing, it will be easier to speed up your framework by adding more machines, but that doesn't necessarily make your framework an efficient framework, as argued in "Scalability! But at what COST?".

So I guess the main criticism of Spark in this paper is that Spark is not utilizing the CPU efficiently and leaves a lot of performance on the table. Given the simplicity of the computation in some workloads, the authors were surprised to find the computation to be CPU bound. The paper blames this CPU over-utilization on the following factors. One reason is that Spark workloads often store compressed data (in increasingly sophisticated formats, e.g. Parquet), trading CPU time for I/O time. The authors found that if they instead ran queries on uncompressed data, most queries became I/O bound. A second reason that CPU time is large is an artifact of the decision to write Spark in Scala, which runs on the JVM: "after being read from disk, data must be deserialized from a byte buffer to a Java object". They find that for some of the queries considered, as much as half of the CPU time is spent deserializing and decompressing data. Scala is a high-level language and has overheads; for one query that they rewrote in C++ instead of Scala, they found that the CPU time was reduced by a factor of more than 2x.
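The compression tradeoff is easy to see in miniature. Here is a small illustration using Python's zlib at different compression levels as a stand-in for Parquet's codecs (an assumption for illustration only, not what Spark actually runs): a higher level shrinks the bytes that must cross the disk or network, at the cost of more CPU work to encode and decode.

```python
import zlib

# Stand-in for a columnar analytics file: highly repetitive records
# compress well, which is exactly when the CPU-for-I/O trade tempts you.
data = b"user_id=42,country=US,clicks=7\n" * 10_000

results = {}
for level in (1, 6, 9):
    compressed = zlib.compress(data, level)
    # Decompression must round-trip losslessly before the trade is valid.
    assert zlib.decompress(compressed) == data
    results[level] = len(compressed)
    print(f"level={level}: {len(compressed)} bytes "
          f"({len(compressed) / len(data):.2%} of original)")
```

Wrapping the `compress`/`decompress` calls in `time.perf_counter()` shows the other side of the trade: the bytes saved are paid for in CPU time, which is the cost the paper says often dominates.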

It seems like Spark is paying a significant performance penalty for the choice of Scala as the programming language. It turns out the language choice was also a factor behind the stragglers: using their blocked time analysis technique, the authors identify the two leading causes of Spark stragglers as Java's garbage collection and the time to transfer data to and from disk. The paper also mentions that optimizing stragglers can only reduce job completion time by a median of at most 10%, and that in 75% of queries, they can identify the cause of more than 60% of stragglers.

Blocked time analysis

A major contribution of the paper is to introduce the "blocked time analysis" methodology to enable deeper analysis of end-to-end performance in data analytics frameworks.


It is too complicated to infer job bottlenecks by just looking at the logs of parallel tasks. Instead, the paper argues, we should take the resource perspective and try to infer how much faster the job would complete if tasks were never blocked on the network (or the disk). The blocked time analysis method instruments the application to measure the time each task spends blocked on a given resource, then uses simulation to find the improved completion time while taking new scheduling opportunities into account.
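A minimal sketch of the idea (my own simplification, not the paper's actual simulator): record, per task, the time spent blocked on the network; subtract it from each task's runtime; then re-run a simple greedy scheduler on the shortened tasks to estimate the new job completion time. The re-simulation step matters because shorter tasks free up slots earlier, creating the "new scheduling opportunities" the paper mentions.

```python
import heapq

def simulate(task_times, num_slots):
    """Greedy scheduler: each task goes to the earliest-free slot.
    Returns the job completion time (makespan)."""
    slots = [0.0] * num_slots
    heapq.heapify(slots)
    for t in task_times:
        start = heapq.heappop(slots)   # earliest-free slot
        heapq.heappush(slots, start + t)
    return max(slots)

def blocked_time_analysis(tasks, num_slots):
    """tasks: list of (total_runtime, time_blocked_on_network) pairs,
    as measured by instrumenting the framework."""
    baseline = simulate([total for total, _ in tasks], num_slots)
    # Hypothetically remove all network-blocked time, then re-simulate:
    # shorter tasks shift the whole schedule, not just their own slot.
    improved = simulate([total - blocked for total, blocked in tasks],
                        num_slots)
    return baseline, improved

tasks = [(10.0, 2.0), (8.0, 1.0), (6.0, 3.0), (12.0, 4.0)]
baseline, improved = blocked_time_analysis(tasks, num_slots=2)
print(f"baseline={baseline}, never-blocked-on-network={improved}")
```

The gap between `baseline` and `improved` is an upper bound on what a perfect network could buy; the paper's finding is that for Spark this bound is a median of at most 2%.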


Conclusions

In sum, this paper raised more questions than it answered for me, but that is not necessarily a bad thing. I am OK with being confused, and I am capable of holding ambivalent thoughts in my brain. These are minimal (necessary but not sufficient) requirements for being a researcher. I would rather have unanswered questions than unquestioned answers. (Aha, of course, that was a Feynman quote: "I would rather have questions that can't be answered than answers that can't be questioned." --Richard Feynman)

This analysis was done for Spark. The paper makes the analysis tools and traces available online so that others can replicate the results. The paper does not claim that these results are broadly representative and apply to other big data analytics frameworks.

Frank McSherry and the University of Cambridge Computer Lab take issue with the generalizability of the results, and ran some experiments on the timely dataflow framework. Here are their post1 and post2 on that.

The results do not generalize to machine learning frameworks, where the network is still the significant bottleneck, and optimizing the network can give up to 75% gains in performance.
