Paper Summary. SLAQ: Quality-Driven Scheduling for Distributed Machine Learning

This paper (by Haoyu Zhang, Logan Stafman, Andrew Or, and Michael J. Freedman) appeared at SoCC'17.

When you assign resources to a distributed machine learning (ML) application at the application level, those resources are allotted for many hours. However, loss improvements usually occur during the first part of the application's execution, so it is very likely that the application underutilizes the resources for the rest of the time. (Some ML jobs are retraining of an already-trained DNN, or compacting a DNN by removing unused parameters, etc., so blindly giving more resources at the start and pulling some back later may not work well.)

To avoid this, SLAQ allocates resources to ML applications at the task level, leveraging the iterative nature of ML training algorithms. Each iteration of the ML training algorithm submits tasks to the scheduler with running times of around 10ms-100ms. This is how Spark-based systems already operate anyway. (The Dorm paper criticized this iteration-based task scheduling approach, saying that it causes high scheduling overhead and introduces delays from waiting to get scheduled, but there was no analysis backing those claims.)

SLAQ collects "quality" (really measured by "loss") and resource usage information from jobs, and using these it generates quality-improvement predictions for future iterations and decides on future-iteration task scheduling based on those predictions.
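To make this concrete, here is a minimal sketch of how a scheduler might predict the next iteration's loss from observed history. This is an illustrative assumption, not the paper's exact fitting procedure: it fits a simple 1/t-style decay curve to the loss history, weighting recent iterations more heavily, and extrapolates one step ahead.

```python
# Sketch (not the paper's exact method): predict the next iteration's loss
# by fitting a 1/t-style decay curve to the observed loss history.
import numpy as np

def predict_next_loss(losses):
    """Fit loss ~ a * (1/t) + b and extrapolate one iteration ahead."""
    t = np.arange(1, len(losses) + 1, dtype=float)
    # Weight recent points more: newer iterations better reflect the trend.
    w = t / t.sum()
    a, b = np.polyfit(1.0 / t, np.asarray(losses, dtype=float), deg=1, w=w)
    return a / (len(losses) + 1) + b

history = [1 / t for t in range(1, 6)]  # a loss curve decaying like 1/t
next_loss = predict_next_loss(history)
```

The predicted loss delta (current loss minus predicted next loss) is what the scheduler would feed into its allocation decision.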

The paper equates "quality" with "loss", and justifies this by saying:
1) "quality" cannot be defined except at the application level; so to keep things general, "loss" is used instead.
2) for exploratory training jobs, reaching 90% accuracy is sufficient quality, and SLAQ enables getting there in a shorter time frame.

On the other hand, there are drawbacks to that. While delta improvements in loss may correspond to improvements in quality, the long tail of the computation may still be critical for "quality", even when the loss is decreasing very slowly. This is especially true for non-convex applications.

The paper normalizes quality/loss metrics as follows: for a given job, SLAQ normalizes the change in loss values in the current iteration with respect to the largest change it has seen for that job so far.
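A minimal sketch of that normalization, assuming a per-job tracker object (the class and method names here are made up for illustration): the current iteration's loss change is divided by the largest change seen so far, so progress becomes comparable across jobs with very different loss magnitudes.

```python
# Sketch of per-job loss-change normalization: scale the current
# iteration's loss delta by the largest delta seen so far for this job.
class LossTracker:
    def __init__(self):
        self.prev_loss = None
        self.max_delta = 0.0

    def normalized_delta(self, loss):
        """Return the current loss change on a job-relative scale."""
        if self.prev_loss is None:          # first observation: no delta yet
            self.prev_loss = loss
            return 0.0
        delta = abs(self.prev_loss - loss)
        self.prev_loss = loss
        self.max_delta = max(self.max_delta, delta)
        return delta / self.max_delta if self.max_delta > 0 else 0.0
```

Early large drops normalize to 1.0, while the slow tail of training normalizes toward 0, which is exactly the signal the scheduler uses to deprioritize jobs that have stopped improving.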


SLAQ predicts an iteration's runtime simply by how long it would take the N tasks/CPUs to process S, the size of the data processed in an iteration (the minibatch size).
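That estimate is essentially one division. A sketch, where the per-CPU processing rate is an assumed example value rather than a number from the paper:

```python
# Minimal sketch of the runtime estimate: iteration time is roughly
# S (data processed per iteration) over the aggregate throughput of
# the N assigned tasks/CPUs.
def predict_iteration_runtime(s_bytes, n_cpus, bytes_per_sec_per_cpu):
    """Estimate iteration runtime from minibatch size and allocated CPUs."""
    return s_bytes / (n_cpus * bytes_per_sec_per_cpu)

# e.g. 100 MB per iteration, 4 CPUs, an assumed 25 MB/s per CPU -> 1 second
runtime = predict_iteration_runtime(100e6, 4, 25e6)
```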


For scheduling based on quality improvements, the paper considers a couple of metrics, such as maximizing the total quality and maximizing the minimum quality. The paper includes a good evaluation section.
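For the maximize-total-quality metric, one natural realization is a greedy allocator: hand out CPUs one at a time to whichever job currently predicts the largest quality gain from an extra CPU. This is a hedged sketch under assumptions not in the paper (the per-job gain table and the halving of marginal gain per added CPU are made-up illustrations):

```python
# Sketch of a greedy "maximize total quality" allocator: repeatedly give
# one CPU to the job with the highest predicted marginal quality gain.
import heapq

def allocate(total_cpus, predicted_gain):
    """predicted_gain[job] = estimated quality gain from one extra CPU."""
    alloc = {job: 0 for job in predicted_gain}
    # Max-heap via negated gains.
    heap = [(-g, job) for job, g in predicted_gain.items()]
    heapq.heapify(heap)
    for _ in range(total_cpus):
        g, job = heapq.heappop(heap)
        alloc[job] += 1
        # Assumed diminishing returns: halve the marginal gain each time.
        heapq.heappush(heap, (g * 0.5, job))
    return alloc
```

Maximizing the minimum quality would instead pop the job with the *lowest* current quality each round; the same heap structure works with the priority flipped.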

In conclusion, SLAQ improves the overall quality of executing ML jobs faster, especially under resource contention, by scheduling at a finer, task-level granularity based on the observed loss improvements.
