We all know that Google, in particular, uses frameworks like MapReduce to produce the results of, say, PageRank from a very large number of clusters. In light of yesterday's entry on massive datasets and the need in our community to probably go for parallel coimputing for reconstruction purposes, Danny Bickson sent me the following:
I was very glad to see you posted a note about my other paper - a distributed machine learning framework called GraphLab. Will you be kind to also include a link to our open source implementation: http://select.cs.cmu.edu/code/
Our goal is to provide machine learning/ info theory / convex optimization people a useful framework that will enable to scale sparse iterative algorithms to very large scales, without worrying about distributed programming, caching, locking, data migration, partitioning, scheduling, load balancing etc. We further support asynchronous computation, unlike other frameworks.For the info theory community we provide two example implemented algorithms: Gaussian belief propagation and two variants of Compressive sensing implementing DISTRIBUTED LASSO: shooting algorithm and Boy'd l-1 interior point methods.
When should you prefer GraphLab over MapReduce? And vice-versa? was answered by another co-author of the paper, Joseph Gonzalez who responded with:
MapReduce is good for single-iteration and embarassingly parallel distributed tasks like feature processing, while GraphLab is good for iterative algorithms with computational dependencies or complex asynchronous schedules. For instance, we find that GraphLab is highly suitable for Gibbs sampling, EM-style algorithms and some classes of optimization algorithms. Programs which fit well on a systolic abstraction (such as PageRank on Pregel) will also work well with GraphLab. There are probably a lot more algorithms that will fit well in the GraphLab and we are still exploring the capabilities and implications of the abstraction (and whether further extensions will be needed).
We are in the process of releasing our (shared memory) multi-core implementation of GraphLab and are in the process of designing a (distributed-memory) cluster implementation. Currently we do not provide fault tolerance so if fault tolerance is a critical requirement, we recommend the use of Hadoop/MapReduce. The GraphLab abstraction does not preclude fault tolerance though due to the computational / state dependencies, distributed fault tolerance is slightly more difficult and is currently ongoing research.
The current implementation is a C++ API which can easily work with other tools like Hadoop and SQL (we use some of these for logging already) and does not require any additional language support. It is important to keep in mind that while we provide a reference implementation, we are describing an abstraction that could be integrated into other software tools like Hadoop.
The overview of the project reads:
Designing and implementing efficient and provably correct parallel machine learning (ML) algorithms can be very challenging. Existing high-level parallel abstractions like MapReduce are often insufficiently expressive while low-level tools like MPI and Pthreads leave ML experts repeatedly solving the same design challenges. By targeting common patterns in ML, we developed GraphLab, which improves upon abstractions like MapReduce by compactly expressing asynchronous iterative algorithms with sparse computational dependencies while ensuring data consistency and achieving a high degree of parallel performance.
The popular MapReduce abstraction, is defined in two parts, a Map stage which performs computation on indepedent problems which can be solved in isolation, and a Reduce stage which combines the results.
GraphLab provides a similar analog to the Map in the form of an Update Function. The Update Function however, is able to read and modify overlapping sets of data (program state) in a controlled fashion as defined by the user provided data graph. The user provided data graph represents the program state with arbitrary blocks of memory associated with each vertex and edges. In addition the update functions can be recursively triggered with one update function spawning the application of update functions to other vertices in the graph enabling dynamic iterative computation. GraphLab uses powerful scheduling primitives to control the order update functions are executed.
The GraphLab analog to Reduce is the Sync Operation. The Sync Operation also provides the ability to perform reductions in the background while other computation is running. Like the update function sync operations can look at multiple records simultaneously providing the ability to operate on larger dependent contexts.
For more details on the GraphLab abstraction see: