Paper Review. GraphLab: A New Framework For Parallel Machine Learning

This GraphLab paper is from 2010. It is written elegantly and it does a good job of explaining the GraphLab abstraction and foundations. A later VLDB 2012 paper presents how to extend this basic GraphLab abstraction to a distributed GraphLab implementation.

It seems like GraphLab has since taken off big time. There are workshops and conferences about GraphLab. And GraphLab is now developed by the company Dato (formerly GraphLab Inc.). Dato Inc. raised $6.75M from Madrona and New Enterprise Associates in its A round, and $18.5M in its B round from Vulcan Capital and Opus Capital, as well as Madrona and New Enterprise Associates.

Introduction

The motivation for GraphLab was to hit the sweet spot in the development of machine learning solutions. "(Back in 2010) Existing high-level parallel abstractions like MapReduce are insufficiently expressive, while low-level tools like MPI and Pthreads leave machine learning (ML) experts repeatedly solving the same design challenges."

GraphLab aimed to express asynchronous iterative algorithms with sparse computational dependencies while ensuring data consistency and achieving good parallel performance. This paper provides an efficient multicore parallel (but not distributed) shared-memory implementation of GraphLab. (The C++ reference implementation of GraphLab is at http://select.cs.cmu.edu/code)

Related work

MapReduce works well on embarrassingly data-parallel tasks, but is not efficient for graph processing and iterative algorithms. This line from the paper gave me a chuckle. "By coercing efficient sequential ML algorithms to satisfy the restrictions imposed by MapReduce, we often produce inefficient parallel algorithms that require many processors to be competitive with comparable sequential methods." (Frank's laptop also held its own against GraphLab.)

The DAG abstraction represents parallel computation as a directed acyclic graph with data flowing along edges between vertices that correspond to computations/functions. Dryad, Storm, and Heron fit here. The DAG abstraction does not naturally express iterative algorithms.

The dataflow abstraction in Naiad is related work. This being a 2010 paper, there is no comparison to Naiad in the paper. This is what Naiad (2013) has to say about GraphLab. "GraphLab and PowerGraph offer a different asynchronous programming model for graph computations, based on a shared memory abstraction. These asynchronous systems are not designed to execute dataflow graphs so the notion of completeness of an epoch or iteration is less important, but the lack of completeness notifications makes it difficult to compose asynchronous computations. Although GraphLab and PowerGraph provide a global synchronization mechanism that can be used to write a program that performs one computation after another, they do not achieve task- or pipeline-parallelism between stages of a computation. Naiad allows programs to introduce coordination only where it is required, to support hybrid asynchronous and synchronous computation."

Petuum is also closely related work. Petuum uses BSP, with SSP relaxation and nonuniform convergence. It looks like GraphLab allows more fine-grained scheduling and supports data-graph computations and dependency modeling better. On the other hand, Petuum was designed clean-slate to support machine learning algorithms better, and introduced the model-parallelism concept. GraphLab, as you will see in the summary below, is principally a graph processing framework, with good applicability to graph-based machine learning tasks. The Petuum paper says "Petuum ML implementations can run faster than other platforms (e.g., Spark, GraphLab), because Petuum can exploit model dependencies, uneven convergence and error tolerance". The paper provides performance comparison experiments with GraphLab.

GraphLab abstraction

The minimalism yet generality of the GraphLab framework reminds me of LISP. "The data graph G = (V, E) encodes both the problem-specific sparse computational structure and directly modifiable program state. The user can associate arbitrary blocks of data (or parameters) with each vertex and directed edge in G." (Also the set scheduler that enables users to compose custom update schedules is clever and LISPy.)


Figure 3 illustrates the overall GraphLab framework. A GraphLab program is composed of the following parts:
1. A data graph, which represents the data and computational dependencies.
2. Update functions, which describe local computation.
3. A sync mechanism for aggregating global state.
4. A data consistency model (i.e., fully consistent, edge consistent, or vertex consistent), which determines the extent to which computation can overlap.
5. Scheduling primitives, which express the order of computation and may depend dynamically on the data.
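The consistency models in part 4 can be pictured as progressively larger "scopes" of protected data around the vertex being updated. A minimal sketch of that idea, with my own function names and a simplified undirected-adjacency view (not GraphLab's actual API):

```python
def scope(graph, v, model):
    """Return (writable, readable) vertex sets for an update at v.

    graph: adjacency dict {vertex: set of neighbors}.
    model: "vertex" (only v itself), "edge" (v writable, neighbors readable),
           or "full" (v and its neighbors writable).
    """
    nbrs = graph[v]
    if model == "vertex":
        return {v}, {v}
    if model == "edge":
        return {v}, {v} | nbrs
    if model == "full":
        return {v} | nbrs, {v} | nbrs
    raise ValueError(f"unknown model: {model}")

def can_overlap(graph, u, v, model):
    """Two updates may run concurrently iff neither writes what the other touches."""
    w_u, r_u = scope(graph, u, model)
    w_v, r_v = scope(graph, v, model)
    return not (w_u & (w_v | r_v)) and not (w_v & (w_u | r_u))
```

On a path graph 1-2-3, vertex consistency lets updates at 1 and 2 overlap, edge consistency forbids that but still allows 1 and 3 to overlap, and full consistency forbids even 1 and 3 (they share neighbor 2): stronger consistency means less parallelism.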

Data Model

The GraphLab data model consists of two parts: a directed data graph and a shared data table. The data graph G = (V, E) encodes both the problem-specific sparse computational structure and directly modifiable program state. The user can associate arbitrary blocks of data (or parameters) with each vertex and directed edge in G. To support globally shared state, GraphLab provides a shared data table (SDT), which is an associative map, T: Key --> Value, between keys and arbitrary blocks of data. Here, a key can be a vertex, an edge, the result of a sync computation as explained below, or a global variable, say a model parameter.
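To make the two-part data model concrete, here is a toy rendering of my own (class and field names are mine, not GraphLab's C++ API): arbitrary data blocks hang off vertices and directed edges, and a plain map stands in for the shared data table.

```python
class DataGraph:
    """Toy directed data graph: arbitrary data on vertices and edges."""

    def __init__(self):
        self.vertex_data = {}   # vertex id -> arbitrary data block
        self.edge_data = {}     # (src, dst) -> arbitrary data block
        self.out_edges = {}     # vertex id -> list of destination vertices

    def add_vertex(self, v, data=None):
        self.vertex_data[v] = data
        self.out_edges.setdefault(v, [])

    def add_edge(self, src, dst, data=None):
        self.edge_data[(src, dst)] = data
        self.out_edges.setdefault(src, []).append(dst)

# Shared data table (SDT): Key -> Value map for globally shared state,
# e.g. a model parameter or the result of a sync computation.
shared_data_table = {}
```

For example, a weight vector per vertex plus a global step size would live as `g.add_vertex(1, {"w": 0.0})` and `shared_data_table["step_size"] = 0.1`.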

User defined computation

Computation in GraphLab can be performed either through an update function, which defines local computation, or through the sync mechanism, which defines global aggregation. The update function is analogous to the Map in MapReduce, and the sync mechanism is analogous to the Reduce operation. A GraphLab program may consist of multiple update functions, and it is up to the scheduling model to determine which update functions are applied to which vertices and in which parallel order.
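As a hedged sketch of the two kinds of computation, consider a PageRank-style update function (local: reads neighbor state, writes its own vertex) and a sync that aggregates over all vertices (global, Reduce-like). The algorithm choice and signatures here are my illustration, not code from the paper:

```python
DAMPING = 0.85  # standard PageRank damping factor

def pagerank_update(v, vertex_data, in_nbrs, out_degree):
    """Update function: local computation at vertex v.

    Reads the ranks of in-neighbors, writes the new rank of v.
    """
    rank = (1 - DAMPING) + DAMPING * sum(
        vertex_data[u] / out_degree[u] for u in in_nbrs
    )
    vertex_data[v] = rank
    return rank

def sync_total_rank(vertex_data):
    """Sync mechanism: Reduce-like aggregation over all vertex data,
    whose result would be published in the shared data table."""
    return sum(vertex_data.values())
```

The Map/Reduce analogy is visible in the shapes: the update function touches only a vertex and its neighborhood, while the sync folds over the whole graph.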

Scheduling

The GraphLab update schedule describes the order in which update functions are applied to vertices and is represented by a parallel data structure called the scheduler. The scheduler abstractly represents a dynamic list of tasks (vertex-function pairs) to be executed by the GraphLab engine. This is a minimalist yet flexible and general model. Since writing a scheduler is hard, GraphLab provides a default BSP-style scheduler and a round-robin scheduler.

Since many ML algorithms (e.g., Lasso, CoEM, Residual BP) require more control over the tasks that are created and the order in which they are executed, GraphLab also allows update functions to add and reorder tasks. For this, GraphLab provides 1) FIFO schedulers that only permit task creation but do not permit task reordering, and 2) prioritized schedulers that permit task reordering at the cost of increased overhead.
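The two scheduler families can be sketched as follows (class names are mine; a task is just a vertex-function pair, which an update function may itself enqueue). The FIFO variant only appends; the prioritized variant pays heap overhead for reordering:

```python
import heapq
from collections import deque

class FifoScheduler:
    """Permits task creation but no reordering: strict arrival order."""

    def __init__(self):
        self._q = deque()

    def add_task(self, vertex, update_fn):
        self._q.append((vertex, update_fn))

    def next_task(self):
        return self._q.popleft() if self._q else None

class PriorityScheduler:
    """Permits task reordering via priorities, at the cost of heap overhead."""

    def __init__(self):
        self._heap = []
        self._counter = 0  # tie-breaker so equal priorities stay FIFO

    def add_task(self, vertex, update_fn, priority=0.0):
        # Negate priority: heapq is a min-heap, we want highest priority first.
        heapq.heappush(self._heap, (-priority, self._counter, vertex, update_fn))
        self._counter += 1

    def next_task(self):
        if not self._heap:
            return None
        _, _, vertex, update_fn = heapq.heappop(self._heap)
        return vertex, update_fn
```

Residual BP is the canonical fit for the prioritized scheduler: the residual (change magnitude) of a vertex becomes its priority, so the engine repeatedly picks the vertex whose update matters most.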

Evaluation

The paper demonstrates the expressiveness of the GraphLab framework by designing and implementing parallel versions of belief propagation, Gibbs sampling, Co-EM, Lasso, and compressed sensing.

Links

GraphLab Wikipedia page

Paper Review. Petuum: A New Platform for Distributed Machine Learning on Big Data

PThreads
