SOSP19 Day 1, Machine Learning Session

In this post, I summarize the papers that appeared in the first session of the conference: machine learning. (Here is my account of Day 0 and the opening remarks if you are interested.)

I haven't read any of these papers yet, and my understanding of these papers comes from their presentations. I might have misunderstood some parts. The good news is all of these papers are available as open access, and I include links to the papers in my notes. Please check the papers when in doubt about my notes.

The machine learning session contained four papers. I found all of them very interesting. They applied principled systems design techniques to machine learning and provided results that have broader applicability than a single application. I wanted to enter the machine learning research area 3 years ago. But I was unsuccessful and concluded that the area is not very amenable to principled systems work. It looks like I had admitted defeat prematurely. After seeing the papers in this session, I am excited once again about the fusion and synergy of the machine learning and principled distributed systems areas.

PipeDream: Generalized Pipeline Parallelism for DNN Training

This paper is by Deepak Narayanan (Stanford University), Aaron Harlap (Carnegie Mellon University), Amar Phanishayee (Microsoft Research), Vivek Seshadri (Microsoft Research), Nikhil R. Devanur (Microsoft Research), Gregory R. Ganger (Carnegie Mellon University), Phillip B. Gibbons (Carnegie Mellon University), Matei Zaharia (Stanford University).

There are two prevalent approaches to parallelizing DNN training: data parallelism and model parallelism. This paper proposes pipeline-parallel training, which combines data and model parallelism with pipelining. In this approach, the model is first divided into sequential stages, and then these stages are pipelined over the workers. To optimize the pipelining over the workers, the bottleneck stages are identified, and those stages are suitably data-parallelized across multiple workers to prevent the pipeline from stalling while waiting for results from previous stages.

The authors identify three big challenges to realizing this idea and develop techniques for addressing them.
  • Partitioning and load-balancing operators across workers: They developed a profiler and optimizer to load-balance computation and reduce communication.
  • Scheduling of forward and backward passes of different inputs: They use a bi-directional pipeline, where an input minibatch proceeds through the computation pipeline first forward and then backward. Each active minibatch in the pipeline may be in a different stage, either in the forward pass or the backward pass. Once in steady state, each stage alternates between performing its forward pass for a minibatch and its backward pass for an earlier minibatch. This one-forward-one-backward (1F1B) scheduling ensures that every GPU is occupied with a minibatch in a balanced pipeline, with each stage producing outputs in aggregate at roughly the same rate.
  • Managing weight and activation versions for effective learning: The presenter mentioned that naive pipelining leads to weight version mismatches. To prevent this, they store multiple <weight, activation> versions, which they call stashed weights.
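To make the 1F1B scheduling concrete, here is a toy simulation of the schedule each stage follows (my own sketch for illustration, not PipeDream's actual scheduler; the warm-up logic is a simplification):

```python
# Toy simulation of 1F1B (one-forward-one-backward) pipeline scheduling.
# After a warm-up phase, each stage alternates between a backward pass for
# an earlier minibatch and a forward pass for a new one, keeping every
# stage busy. This is an illustrative sketch, not PipeDream code.

def one_f_one_b_schedule(num_stages, num_minibatches):
    """Return, per stage, the sequence of ('F'/'B', minibatch) work items."""
    schedule = {s: [] for s in range(num_stages)}
    for stage in range(num_stages):
        # Warm-up: deeper stages admit fewer forward passes before steady state.
        warmup = num_stages - stage
        forward, backward = 0, 0
        for _ in range(warmup):
            if forward < num_minibatches:
                schedule[stage].append(("F", forward))
                forward += 1
        # Steady state: alternate one backward, one forward.
        while backward < num_minibatches:
            schedule[stage].append(("B", backward))
            backward += 1
            if forward < num_minibatches:
                schedule[stage].append(("F", forward))
                forward += 1
    return schedule

sched = one_f_one_b_schedule(num_stages=4, num_minibatches=8)
print(sched[0][:6])  # stage 0: F0 F1 F2 F3, then alternating B0 F4 ...
print(sched[3][:4])  # stage 3 (deepest): F0 B0 F1 B1 ...
```

Note how the first stage runs several forward passes before its first backward pass, while the last stage immediately alternates, which is why multiple weight versions must be stashed.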

They integrated PipeDream with PyTorch using 3000 lines of Python code, hooking into PyTorch's communication library. The project is open source and accessible at http://github.com/msr-fiddle/pipedream.

Their evaluation results show that PipeDream can provide five times faster training than data-parallel training. The reason for the speedup is that PipeDream reduces communication among workers, and can achieve up to a magnitude less communication than that incurred by data parallelism.

There were some good questions from the audience following the talk.
  • "PipeDream is evaluated on models that are sequential. What about models that are branched, where multiple things need to complete before the next stage?" The presenter answered that the techniques/setup in PipeDream can generalize to handle them, but also added that most models are sequential.
  • "What about PipeDream's memory footprint?" The presenter said that they are looking to reduce this.
  • "As sparsity changes, it may be possible to make asynchronous training faster than synchronous training. Would it be possible to beat these results using asynchronous training rather than the synchronous training PipeDream performs?" I don't have notes for the answer, but I think the answer is that for DNN training, synchronous training is the method that works best.


A Generic Communication Scheduler for Distributed DNN Training Acceleration

This paper is by Yanghua Peng (The University of Hong Kong), Yibo Zhu (ByteDance Inc.), Yangrui Chen (The University of Hong Kong), Yixin Bao (The University of Hong Kong), Bairen Yi (ByteDance Inc.), Chang Lan (ByteDance Inc.), Chuan Wu (The University of Hong Kong), Chuanxiong Guo (ByteDance Inc.).

The paper presents a scheduler for deep learning training, so it has some relevance to the PipeDream paper as well. This paper focuses solely on data parallelism for scaling out training and shows how to optimize it.

The FIFO scheduling strategy does not overlap communication with computation well. Although there has been work to improve scheduling, such as P3 and TicTac, these were limited because they are coupled with specific framework implementations, MXNet and TensorFlow, respectively. In contrast, this work presents a generic tensor scheduling framework, ByteScheduler, which was implemented and evaluated for MXNet, PyTorch, and TensorFlow. The project is available as open source at https://github.com/bytedance/byteps/tree/bytescheduler/bytescheduler

The basic insights in this work are:
  • Communication of earlier layers of a neural network has higher priority and can preempt communication of later layers
  • It is beneficial to partition large layers and merge small layers
This work uses Bayesian optimization to auto-tune the partitioning and scheduling control. The presenter mentioned a technique called "dependency proxy" for getting the scheduling control right. The presenter also mentioned that ByteScheduler adapts to different bandwidths, and to different environments and tasks.
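The two insights above can be sketched as a priority queue over tensor chunks (my own simplification of the idea, not ByteScheduler's implementation; the chunk size constant is an assumption for illustration):

```python
import heapq

# Sketch of priority-based gradient-tensor scheduling in the spirit of
# ByteScheduler (my own simplification). Gradients of earlier layers get
# higher priority (lower layer index), because the next iteration's forward
# pass consumes them first; large tensors are partitioned into chunks so a
# big low-priority transfer cannot block a small urgent one.

PARTITION_BYTES = 4_000_000  # assumed chunk size, for illustration only

def schedule_tensors(tensors):
    """tensors: list of (layer_index, size_bytes).
    Returns the order in which (layer, chunk) pieces are sent."""
    heap = []
    for layer, size in tensors:
        chunks = -(-size // PARTITION_BYTES)  # ceiling division
        for c in range(chunks):
            # Heap entries sort by (layer, chunk): earlier layers first.
            heapq.heappush(heap, (layer, c, min(size, PARTITION_BYTES)))
            size -= PARTITION_BYTES
    order = []
    while heap:
        layer, chunk, _size = heapq.heappop(heap)
        order.append((layer, chunk))
    return order

# Layer 0 is large (3 chunks) but belongs to an earlier layer, so its
# chunks are sent before the small layer-2 tensor.
print(schedule_tensors([(2, 1_000_000), (0, 10_000_000)]))
# → [(0, 0), (0, 1), (0, 2), (2, 0)]
```

In a real system the queue would be drained concurrently with computation, and a newly produced layer-0 gradient would preempt queued chunks of later layers, which is the effect the "dependency proxy" mechanism enforces across frameworks.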


Parity Models: Erasure-Coded Resilience for Prediction Serving Systems

This work is by Jack Kosaian (Carnegie Mellon University), K. V. Rashmi (Carnegie Mellon University), Shivaram Venkataraman (University of Wisconsin-Madison).

Tail latency in inference serving is a problem, which results in missed deadlines. This work shows how to use erasure codes to reduce tail latency in ML inference. If one replica is slow, this work prescribes using the parity model and decoding it quickly to make the deadline.

In this setup the inference servers have multiple identical models, and a query needs to talk to k of them. The system jointly codes queries from different users and sends them to different inference servers and to a parity server, which holds the parity model. In case an answer is missing, the system decodes the results of the inference servers and the parity server to reconstruct/approximate the missing answer and still make the deadline. The reconstructed output only comes into play when the original predictions are slow or fail. To meet the deadline, fast encoding and decoding is needed, and the key to this is the design of the parity model.


The challenge here is that, while handcrafting erasure codes is straightforward for linear operations, it is difficult to do for neural networks, which are nonlinear and complex. To solve this problem, the authors apply a learning-based approach to achieve erasure-coded resilience for NNs. This approach reconstructs approximations, which is appropriate for machine learning inference.
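To see why the linear case is easy, here is a minimal sketch of erasure-coded inference with a linear model (my own illustration of the principle, not the paper's code):

```python
import numpy as np

# Minimal sketch of erasure-coded inference. For a *linear* model F, the
# parity query x1 + x2 satisfies F(x1 + x2) = F(x1) + F(x2), so a missing
# prediction can be recovered exactly by subtraction. For nonlinear neural
# networks this identity breaks, which is why the paper *learns* a parity
# model that makes the same reconstruction approximately correct.

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 5))          # a linear "model": F(x) = W @ x

def F(x):
    return W @ x

x1, x2 = rng.standard_normal(5), rng.standard_normal(5)
parity_query = x1 + x2                   # encoding: sum of the k=2 queries

y1 = F(x1)                               # this prediction arrives on time
y_parity = F(parity_query)               # the parity server's prediction
y2_reconstructed = y_parity - y1         # decode: subtract available output

# Exact only because F is linear; a learned parity model approximates this.
assert np.allclose(y2_reconstructed, F(x2))
```

Encoding and decoding here are a single addition and subtraction, which is what makes the deadline achievable; the hard part the paper solves is training a parity model so the subtraction still yields a useful output when F is a neural network.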

The code used for training and evaluating parity models is available at https://github.com/thesys-lab/parity-models. The paper showcases parity models in the presence of resource contention, and includes extensive evaluation.

One of the questions for the presenter was "For tasks that are mission critical, such as self-driving cars, would the model accuracy difference from the parity model be confusing?" The answer is yes, it could be. So this may be more suitable for ad-serving-like applications rather than mission-critical applications. I personally think the technique here is more impressive than the problem solved. The parity models idea brings the benefits of erasure codes to ML inference, and this technique should be applicable and customizable/specializable to a rich set of problems in the domain. I was wondering if this could be applicable to decomposing two images from a given stereoscopic image.

Jack, a third-year PhD student at CMU, presented this paper. He gave one of the best presentations of the conference, with confidence, great stage presence, and great communication skills. I later learned that this was his first presentation at any conference ever.

TASO: Optimizing Deep Learning Computation with Automated Generation of Graph Substitutions

This paper is by Zhihao Jia (Stanford University), Oded Padon (Stanford University), James Thomas (Stanford University), Todd Warszawski (Stanford University), Matei Zaharia (Stanford University), Alex Aiken (Stanford University).

Existing rule-based DNN optimizers rewrite/substitute a graph with a more efficient version. In TensorFlow this is implemented by 2000 rewrite rules in 53K lines of code. Unfortunately, with this approach, the addition of new operators and graph structures requires an escalation in the number of rules. Moreover, these heuristics do not apply to all DNN hardware. The rewrite rules miss subtle optimizations for specific DNNs and hardware. Different hardware needs different optimizations; one optimization does not work for all. (To motivate the hardware dependence of the optimizations, the presentation gave an example with TASO where the final graph is 30% faster on V100 but 10% slower on K80.)

To address these pain points in optimizing deep learning, this work proposes TASO, a tensor algebra super-optimizer. TASO replaces manually designed graph optimizations with automated generation and verification. TASO can then be used to feasibly produce optimizations tailored to the hardware backend, rather than generic general rules. While TensorFlow currently contains about 53,000 lines of manual optimization rules, the operator specifications needed by TASO are only 1,400 lines of code.


The challenge in developing TASO is two-fold: how to generate potential substitutions, and how to verify their correctness.

There are 66 million graphs with up to 4 operators. To make this number manageable, the graph substitution generator computes output fingerprints and considers pairs of graphs with identical output fingerprints. It finds 29K substitutions, out of which only 743 substitutions remain after applying pruning to eliminate redundant substitutions. These 743 substitutions are generated in five minutes, and verified against 43 operator properties in 10 minutes. This is done per hardware per operator, so supporting a new operator requires a few hours of an engineer providing the operator specifications.
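The fingerprinting step can be illustrated with a small sketch (my own simplification of the idea: run candidate graphs on the same random inputs and bucket them by a hash of their outputs; TASO's actual generator and verifier are far more involved):

```python
import numpy as np
from collections import defaultdict

# Sketch of output-fingerprinting for finding candidate graph substitutions,
# in the spirit of TASO's generator (my own toy version). Graphs whose
# outputs hash to the same fingerprint on random inputs are likely
# equivalent and become substitution candidates; they must still be
# formally verified against operator properties afterwards.

rng = np.random.default_rng(42)
A, B, C = (rng.standard_normal((4, 4)) for _ in range(3))

candidate_graphs = {
    "(A@B)@C": lambda: (A @ B) @ C,
    "A@(B@C)": lambda: A @ (B @ C),      # equivalent by associativity
    "A@B + A@C": lambda: A @ B + A @ C,
    "A@(B+C)": lambda: A @ (B + C),      # equivalent by distributivity
}

def fingerprint(output, decimals=6):
    # Round to tolerate floating-point noise, then hash the raw bytes.
    return hash(np.round(output, decimals).tobytes())

buckets = defaultdict(list)
for name, graph in candidate_graphs.items():
    buckets[fingerprint(graph())].append(name)

# Only graphs sharing a bucket are candidate substitution pairs.
for names in buckets.values():
    if len(names) > 1:
        print(names)
```

Matching fingerprints only suggest equivalence (hash collisions and floating-point edge cases are possible), which is exactly why TASO follows generation with formal verification.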

In sum, TASO provides less engineering effort, better performance, and formal verification for optimizing deep learning computation graphs. The code for TASO is available on GitHub at https://github.com/jiazhihao/TASO
