On Dataflow Systems: Naiad and TensorFlow
The definition below for dataflow programming is from Wikipedia (with some small edits):
"Traditionally, a plan is modeled equally a serial of operations happening inwards a specific order; this may live on referred to equally sequential, procedural, or imperative programming. The plan focuses on commands for programming, where information is usually /at rest/.
In contrast, dataflow programming emphasizes the movement of data; it models a plan equally a serial of connections alongside explicitly defined inputs in addition to outputs, in addition to connect operations. An functioning runs equally shortly equally all of its inputs top available. Thus, dataflow languages are inherently parallel in addition to tin piece of occupation good inwards large, decentralized systems."
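To make the contrast concrete, here is a tiny hand-rolled sketch of the dataflow idea (plain Python, not any particular framework): each operation fires as soon as all of its inputs are available, so independent operations could run in parallel.

```python
# A toy dataflow graph: node -> (function, list of input nodes).
graph = {
    "a":   (lambda: 2, []),                  # source: no inputs
    "b":   (lambda: 3, []),                  # independent source
    "sum": (lambda x, y: x + y, ["a", "b"]), # fires once both a and b are done
    "out": (lambda s: s * s, ["sum"]),       # fires once sum is done
}

values = {}
pending = set(graph)
while pending:
    # Nodes whose inputs are all available; a real engine would run
    # these ready nodes concurrently.
    ready = [n for n in pending if all(d in values for d in graph[n][1])]
    for n in ready:
        fn, deps = graph[n]
        values[n] = fn(*(values[d] for d in deps))
    pending -= set(ready)

print(values["out"])  # 25: data flowed source -> sum -> out
```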
Some examples of dataflow frameworks are map-reduce, Spark, Storm & Heron, GraphX, GraphLab, Naiad (now implemented in Rust as Timely-Dataflow), and TensorFlow.
Map-Reduce uses a very simple directed-acyclic-graph (DAG) with only two operations: map and reduce. Spark extends map-reduce with a couple dozen operations in addition to map and reduce, and implements data storage/processing in-memory. More specifically, in Spark, a computation is modeled as a directed acyclic graph (DAG), where each vertex denotes a Resilient Distributed Dataset (RDD) and each edge denotes an operation on an RDD. RDDs are collections of objects divided into logical partitions that are stored and processed in-memory, with shuffle/overflow to disk.
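As an illustration, here is a minimal PySpark word count sketch (assuming a local Spark installation and a hypothetical input.txt). Each transformation lazily extends the DAG; only the final action triggers execution:

```python
from pyspark import SparkContext

sc = SparkContext("local", "wordcount")

lines = sc.textFile("input.txt")              # RDD over the (hypothetical) input file
counts = (lines
          .flatMap(lambda line: line.split()) # transformation: lazily extends the DAG
          .map(lambda word: (word, 1))        # another lazy transformation
          .reduceByKey(lambda a, b: a + b))   # requires a shuffle across partitions

print(counts.collect())  # action: triggers execution of the whole DAG
```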
Twitter's Storm and Heron are dataflow stream-processing frameworks. In Storm (and its subsequent modular implementation Heron), bolts and spouts correspond to dataflow operations and streams.
I had written summaries about the GraphX and GraphLab approaches to dataflow as well.
In this post, I will focus more on Naiad (and timely-dataflow), and attempt to compare/contrast it with TensorFlow.
Naiad, timely-dataflow, and differential-dataflow
Naiad introduced a dataflow programming framework that allows cycles in the dataflow graph. Naiad's dataflow model supports structured loops allowing feedback in the dataflow, stateful dataflow vertices capable of consuming and producing records without global coordination, and notifications for vertices once they have received all records for a given round of input or loop iteration. Here is my summary of Naiad from 2014.

Frank McSherry has been singlehandedly (I don't know, maybe he uses both his hands) implementing Naiad in Rust for the last couple of years, and calls the project timely-dataflow. He also has another project that provides an implementation of Naiad's differential dataflow on top of timely-dataflow in Rust.
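To give a feel for what a structured loop with feedback computes, here is a toy sketch of iterative graph reachability (plain Python, not the Naiad or timely-dataflow API): each pass plays the role of a loop iteration, and an empty set of changes plays the role of the notification that the loop has converged.

```python
def reachability_step(dists, edges):
    """One loop iteration: relax every edge once, return only the changes."""
    changed = {}
    for (src, dst) in edges:
        if src in dists and dists[src] + 1 < dists.get(dst, float("inf")):
            changed[dst] = dists[src] + 1
    return changed

edges = [(0, 1), (1, 2), (2, 3)]
dists, iteration = {0: 0}, 0   # hop distances from vertex 0
while True:
    delta = reachability_step(dists, edges)  # records at (epoch=0, iteration)
    if not delta:                            # no feedback: the loop is done
        break
    dists.update(delta)
    iteration += 1

print(dists, "after", iteration, "iterations")  # {0: 0, 1: 1, 2: 2, 3: 3}
```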
Differential dataflow (i.e., incremental dataflow) means streaming iterative computation and doing incremental computation only in response to changing data. "If you change one input to your computation, you would prefer not to re-evaluate the entire computation. Rather, changes to the input produce some changes in the output of your first dataflow operator, which are then changes to the inputs of subsequent operators. Eventually these changes either evaporate or fall out the end of your dataflow as changed outputs. As changes flow around the dataflow, they happen at various logical times. These times reflect the epoch of streaming input data, and iteration counters for any loops they may be contained in. Rather than eagerly aggregate these changes, we leave them disaggregated, allowing future computation to use arbitrary subsets of the changes. Although we leave ourselves more work to do for each change, we end up with an orders-of-magnitude reduction in the numbers of changes."
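Here is a toy sketch of that delta-propagation idea (plain Python, not the differential-dataflow API): a stateful count operator receives an input change and emits only the output changes, which downstream operators would consume in turn.

```python
from collections import Counter

class IncrementalCount:
    """Stateful operator: word -> count, emitting (word, old, new) change records."""
    def __init__(self):
        self.counts = Counter()

    def apply(self, word, delta):  # delta is +1 (insertion) or -1 (retraction)
        old = self.counts[word]
        self.counts[word] += delta
        return [(word, old, self.counts[word])]  # a change record, not a full recount

counter = IncrementalCount()
for w in ["apple", "pear", "apple"]:
    counter.apply(w, +1)                  # initial epoch of input
print(counter.apply("pear", -1))          # one retraction => one small change
# [('pear', 1, 0)] -- every other count was never touched
```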
There is a technical distinction between differential and incremental. Incremental works with a "sequence of arbitrary updates" whereas differential works with a "partial order of arbitrary updates". The former has been around a while, has lots of prior art, etc. Differential is pretty new and, while more general, is pretty similar in most cases (e.g., streaming join is the same for both of them; streaming join in a loop is only possible in differential).
Naiad versus TensorFlow
TensorFlow is a particular instantiation/application of Naiad's timely-dataflow model of cyclic dataflow processing for the domain of machine learning, and specifically deep learning. It is well integrated with tools, Python, GPUs, etc. Here is my summary of TensorFlow if you need a refresher.

I think TensorFlow traded off the generality of Naiad and gained much more in return. This is what some people call "addition by subtraction".
Naiad aimed to satisfy real-time querying. TensorFlow is OK with batch training. TensorFlow can still output a result at the end of each minibatch, not a big deal. The real-time querying requirement in Naiad may be an unrealistic and unnecessary/impractical requirement to shoot for in practical systems. Who knows. Are there practical applications that require tight real-time querying with the most recent updates reflected in the view? (More on this, at the end, in the applications discussion.)
Naiad aimed at exactly correct output. If something changes slightly, Naiad will recompute the affected things (incrementally, yay) and give you the correct output. On the other hand, if something changes, TensorFlow will consider the changed things as a new minibatch to train with, as a result of which not much may change. Machine learning and deep learning tolerate uncertainty anyhow. Minibatches are how TensorFlow handles differential/incremental dataflow implicitly. No complicated machinery is required. If things change, that is another minibatch to train on, and most of the previously computed model parameters may remain unaffected.
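A minimal sketch of that minibatch view, assuming TensorFlow's Keras API: new or changed data is treated as just another batch, one gradient step nudges the parameters, and the rest of the model is left alone.

```python
import numpy as np
import tensorflow as tf

# A small toy regression model.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(8,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="sgd", loss="mse")

# The "changed" data arrives as just another minibatch (random here).
x_new = np.random.rand(32, 8).astype("float32")
y_new = np.random.rand(32, 1).astype("float32")

loss = model.train_on_batch(x_new, y_new)  # one incremental gradient step
print("loss after incorporating the new batch:", loss)
```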
(Naiad's differential/incremental processing imposes extra memory requirements. It is unclear to me how much state about the input graph needs to be held at the stateful operations so that incremental processing is still possible.)
Finally, Naiad has a join operation, a very strong/capable yet costly operation. TensorFlow does not have a join operation, but that makes TensorFlow operations easier to shard/partition across machines. Maybe TensorFlow doesn't have join operations because its inputs are assumed to be independent. Naiad assumes inputs can be dependent: like parts of a graph, or updates to an earlier graph, streaming into Naiad. So Naiad has a join operator but pays the price for that powerful operator.
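Here is a toy sketch of why a streaming join is both powerful and costly (plain Python, not Naiad's API): each side must buffer state per key indefinitely, and records must be routed by key so that matching inputs meet on the same shard.

```python
from collections import defaultdict

class StreamingJoin:
    def __init__(self):
        self.left = defaultdict(list)   # key -> buffered left records
        self.right = defaultdict(list)  # key -> buffered right records

    def push_left(self, key, value):
        self.left[key].append(value)    # state the operator must keep around
        return [(key, value, r) for r in self.right[key]]

    def push_right(self, key, value):
        self.right[key].append(value)
        return [(key, l, value) for l in self.left[key]]

join = StreamingJoin()
join.push_left("page1", "link-update")        # no match yet: buffered
print(join.push_right("page1", "rank=0.3"))   # [('page1', 'link-update', 'rank=0.3')]
```

The per-key buffers are exactly what makes join hard to shard cheaply: every record has to land wherever its key's state lives, which is the coordination cost TensorFlow avoids by not having a join at all.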
Naiad's generality should open it up for more potential applications. Differential dataflow is ideal for graph algorithms that update the graph or operate on a dynamic graph. This should be useful for search engines; when the web graph and links update, you shouldn't recompute the entire result in batch. I don't know what Google uses most recently, but Google has been using an incremental transactional system leveraging BigTable, called Percolator, to solve this problem, without needing the differential dataflow computing power of Naiad.
Other application areas for Naiad's computing power could be social network applications and recommendation systems. But it seems like social network applications do not need to be very precise; approximate answers are OK for them. And recommendation systems do not need very tight real-time constraints yet.