An object-oriented bulk synchronous parallel library for multicore programming

A.N. Yzelman, Rob H. Bisseling
2011, Concurrency and Computation: Practice and Experience
We show that the Bulk Synchronous Parallel (BSP) model, originally designed for distributed-memory systems, is also applicable to shared-memory multicore systems and, furthermore, that BSP libraries are useful in scientific computing on these systems. A proof-of-concept MulticoreBSP library has been implemented in Java, and is used to show that BSP algorithms can attain proper speedups on multicore architectures. This library is based on the BSPlib implementation, adapted to an object-oriented setting.
In comparison, the number of function primitives is reduced, while the overall design simplicity is improved. We detail the application of the BSP model and library to the sparse matrix-vector (SpMV) multiplication problem, and show through numerical experiments that the resulting BSP SpMV algorithm attains speedups, in one case reaching a speedup of 3.5 for 4 threads. Although not described in detail in this paper, algorithms for the fast Fourier transform and the dense LU decomposition are also investigated, in one case attaining a superlinear speedup of 5 for 4 threads. The predictability of BSP algorithms in the case of the sparse matrix-vector multiply is also investigated.

The choice of Java as our programming language underscores the portability goal of BSP. Since Java is an interpreted language, the library and applications built on it can be distributed to every machine for which a Java interpreter is available, without recompilation. Furthermore, using Java means that the BSP model is implemented in an object-oriented way, making the model easier to use while also reducing the number of BSP primitives from the already low number of 20. This is important since it enables easy learning, and helps keep the (idealised) parallel machine transparent. Restricting the number of primitives also restricts the programmer, but we consider this preferable to simplicity without restriction.

For example, with OpenMP [6], parallelism may be introduced into code by adding a parallel directive just before a for-loop. This is simple and non-restrictive, but relies on programmer expertise to decide on the scope of variables: variables defined before the directive is encountered are shared amongst threads, while those declared within the parallel block are local. The programmer must therefore be wary of race conditions, and must actively avoid the unintended sharing of variables. In contrast, the MulticoreBSP library adds robustness to algorithms by implicitly assuming all variables are local unless explicitly defined otherwise, while retaining programming simplicity. Also, by structuring computations into independent, sequential phases separated by synchronisation barriers, another common trap in parallel programming is avoided: deadlocks. Deadlocks usually occur only rarely in parallel programs, making them hard to detect; in BSP, a deadlock can only arise as a sequential error, not through misuse of communication primitives, as the sketch below illustrates.

In summary, we believe the BSP model has the following desirable properties:
• it is easy for programmers to learn and very transparent to them,
• it models both distributed-memory and shared-memory architectures,
• it predicts performance, and
• it is robust with respect to common parallel programming pitfalls.

This article has two aims: first, to introduce the MulticoreBSP communications library, and second, to demonstrate how to design shared-memory algorithms in BSP. In particular, we show that it is possible to exploit differences between distributed-memory and shared-memory architectures at the level of implementation. To this end, we introduce MulticoreBSP by implementing a shared-memory version of the sparse matrix-vector (SpMV) multiplication, and compare it to the original distributed-memory version from BSPedupack. Similar efforts have been made for the dense LU decomposition and the fast Fourier transform. Our algorithm listings are close to the actual Java code, to demonstrate briefly but properly how the library operates in practice.
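To make the superstep structure concrete, the following minimal sketch expresses a two-superstep SPMD computation using only the standard java.util.concurrent package. It illustrates the model rather than the MulticoreBSP API: the class, method, and variable names here are our own and do not appear in the library.

import java.util.concurrent.CyclicBarrier;

// Illustration of the BSP superstep structure using plain Java threads;
// this is NOT the MulticoreBSP API. Each thread runs the same program
// (SPMD) on its own data; phases are separated by a barrier.
public class SuperstepSketch {
    static final int P = 4;                             // number of threads (BSP parameter p)
    static final CyclicBarrier SYNC = new CyclicBarrier(P);
    static final double[] partialSums = new double[P];  // sharing is explicit

    public static void main(String[] args) throws InterruptedException {
        Thread[] threads = new Thread[P];
        for (int s = 0; s < P; s++) {
            final int pid = s;                          // BSP process identifier
            threads[s] = new Thread(() -> {
                try {
                    // Superstep 0: local computation only; the write to the
                    // shared array plays the role of a queued communication.
                    double local = (double) pid * pid;  // local by default
                    partialSums[pid] = local;
                    SYNC.await();                       // synchronisation barrier

                    // Superstep 1: data written before the barrier is now
                    // visible to all threads (await() establishes the
                    // required happens-before relation).
                    double total = 0.0;
                    for (double x : partialSums) total += x;
                    if (pid == 0) System.out.println("Sum of squares: " + total);
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            });
            threads[s].start();
        }
        for (Thread t : threads) t.join();
    }
}

Variables declared inside the thread body are local by default, mirroring the MulticoreBSP convention discussed above, and since all inter-thread data exchange is confined to barrier crossings, misuse of communication cannot cause a deadlock.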
Within our listings, all classes, member functions, and variables defined in the library are printed in italics, while class and function names not defined therein are printed in typewriter font. Function code in listings is printed in plain roman, with the only further exception of reserved words, such as for and return, which are printed in boldface.

The remainder of the article is structured as follows. The MulticoreBSP library is introduced in Section 2. Using this proof-of-concept Java library, Section 3 details the BSP SpMV algorithm for multicore systems, which is put to the test in Section 4. That section also reports on the performance of the dense LU and FFT algorithms. Conclusions and future work follow in Section 5.

1.1. BSP model.

The original BSP model targets distributed-memory systems and models them using four parameters: p, g, l, and r. The total number of parallel computing units (cores) of a BSP machine is given by p, while r measures the speed of each core, in flops per second. Each core executes the same given program, usually working on different data; that is, BSP is a single program, multiple data (SPMD) model. The parallel program is broken down into supersteps, which are separated by synchronisation barriers. During a superstep, a process can execute any instruction, but it cannot communicate with other processes; it can, however, queue communication requests. When a process encounters a synchronisation barrier, its execution halts until all processes have encountered this barrier, upon which the next superstep is started concurrently. At synchronisation, all communication requests queued during the previous superstep are processed; communication thus occurs only as part of synchronisation. The communication costs incurred this way are modelled in BSP by l, which models the time required to get past a synchronisation barrier.
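For reference, these four parameters induce the standard BSP cost expression found throughout the BSP literature (the formulation below is a summary, not a quotation from this paper). A superstep in which the busiest process performs w flops, and in which no process sends or receives more than h data words (an h-relation), costs, in units of flop time (i.e., time multiplied by r),

T_{\text{superstep}} = w + h\,g + l,

where g is the communication cost per data word. A program consisting of N supersteps then costs

T = \sum_{i=0}^{N-1} \left( w_i + h_i\,g + l \right).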
doi:10.1002/cpe.1843