Spinning fast iterative data flows

Stephan Ewen, Kostas Tzoumas, Moritz Kaufmann, Volker Markl
2012 Proceedings of the VLDB Endowment  
Parallel dataflow systems are a central part of most analytic pipelines for big data. The iterative nature of many analysis and machine learning algorithms, however, is still a challenge for current systems. While certain types of bulk iterative algorithms are supported by novel dataflow frameworks, these systems cannot exploit the computational dependencies present in many algorithms, such as graph algorithms. As a result, these algorithms are executed inefficiently and have led to specialized systems based on other paradigms, such as message passing or shared memory. We propose a method to integrate incremental iterations, a form of workset iterations, with parallel dataflows. After showing how to integrate bulk iterations into a dataflow system and its optimizer, we present an extension to the programming model for incremental iterations. The extension compensates for the lack of mutable state in dataflows and allows exploiting the sparse computational dependencies inherent in many iterative algorithms. The evaluation of a prototypical implementation shows that exploiting these aspects yields speedups of up to two orders of magnitude in algorithm runtime. In our experiments, the improved dataflow system is highly competitive with specialized systems while maintaining a transparent and unified dataflow abstraction.

Because existing dataflow systems execute incremental iterations as if they were bulk iterative, they are drastically outperformed by specialized systems [28, 29]. Existing dataflow systems are therefore practically inefficient for many iterative algorithms. They are, however, still required for other typical analysis and transformation tasks. Hence, many data processing pipelines span multiple different systems, using workflow frameworks to orchestrate the various steps. Training a model over a large data corpus frequently requires a dataflow (like MapReduce) for preprocessing the data (e.g., for joining different sources and normalization), a specialized system for the training algorithm, followed by another dataflow for postprocessing (such as applying the model to assess its quality) [35]. We argue that the integration of iterations with dataflows, rather than the creation of specialized systems, is important for several reasons: First, an integrated approach enables many analytical pipelines to be expressed in a unified fashion, eliminating the need for an orchestration framework.
Second, dataflows have long been known to lend themselves well to optimization, not only in database systems, but also under more flexible programming models [7, 22]. Third, dataflows appear to be a well-adopted abstraction for distributed algorithms, as shown by their increasing popularity in the database and machine learning communities [5, 35].

The contributions of this paper are the following:

• We discuss how to integrate bulk iterations in a parallel dataflow system, as well as the consequences for the optimizer and execution engine (Section 4).
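To make the distinction concrete, the workset-iteration model can be sketched in plain Python using connected components as the driving example. This is a minimal illustrative sketch, not the paper's actual programming interface; all names (`solution`, `workset`, etc.) are our own. In contrast to a bulk iteration, which would recompute the state of every vertex in every round, the workset tracks only the vertices whose state changed, so later rounds touch a shrinking fraction of the graph:

```python
def connected_components(vertices, edges):
    # Solution set: the current (minimal known) component id per vertex.
    solution = {v: v for v in vertices}
    # Workset: vertices whose state changed and must propagate updates.
    workset = set(vertices)

    # Adjacency lists for the undirected input graph.
    neighbors = {v: [] for v in vertices}
    for a, b in edges:
        neighbors[a].append(b)
        neighbors[b].append(a)

    # Incremental iteration: each round consumes the workset and produces
    # the next one. Only neighbors of changed vertices are examined,
    # exploiting the sparse computational dependencies; stable vertices
    # are skipped entirely. The loop reaches a fixpoint when no vertex
    # changes, i.e., the workset becomes empty.
    while workset:
        next_workset = set()
        for v in workset:
            for n in neighbors[v]:
                if solution[v] < solution[n]:
                    solution[n] = solution[v]
                    next_workset.add(n)
        workset = next_workset
    return solution
```

A bulk-iterative execution of the same algorithm would rebuild the entire `solution` dictionary each round regardless of how few vertices actually changed, which is exactly the inefficiency the incremental model avoids.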
doi:10.14778/2350229.2350245