The SARC Architecture

Alex Ramirez, Felipe Cabarcas, Ben Juurlink, Mauricio Alvarez Mesa, Friman Sanchez, Arnaldo Azevedo, Cor Meenderinck, Catalin Ciobanu, Sebastian Isaza, Georgi Gaydadjiev
IEEE Micro, 2010
Parallel computation shows great promise for scaling raw processing performance within a given power budget. However, chip multiprocessors (CMPs) often struggle with programmability and scalability issues such as cache coherency and off-chip memory bandwidth and latency. Programming a multiprocessor system not only requires the programmer to discover parallelism in the application, but also to map threads to processors, distribute data to optimize locality, schedule data transfers to hide latencies, and so on. These programmability issues make it difficult to generate sufficient computational work to keep all on-chip processing units busy, a problem attributable to inadequate parallel programming abstractions and the lack of runtime support to manage and exploit parallelism.

The SARC architecture is based on a heterogeneous set of processors managed at runtime in a master-worker mode. Runtime management software detects and exploits task-level parallelism across multiple workers, much as an out-of-order superscalar processor dynamically detects instruction-level parallelism (ILP) to exploit multiple functional units. The runtime's ability to schedule data transfers ahead of time allows applications to tolerate long memory latencies, so we focus the design on providing sufficient bandwidth to feed data to all workers. Performance evaluations using a set of applications from the multimedia, bioinformatics, and scientific domains (see the "Target Applications" sidebar for a description of these applications) demonstrate the SARC architecture's potential for a broad range of parallel computing scenarios, and its performance scalability to hundreds of on-chip processors.

Programming model

The SARC architecture targets a new class of task-based data-flow programming models that includes StarSs [1], Cilk [2], RapidMind [3], Sequoia [4], and OpenMP 3.0 [5]. These programming models let programmers write efficient parallel programs by identifying candidate functions to be off-loaded to worker processors. StarSs also allows annotating the task input and output operands, thereby enabling the runtime system to reason about inter-task data dependencies when scheduling tasks and data transfers. StarSs, the programming model used in this article, consists of a source-to-source compiler and a supporting runtime library. The compiler translates C code, with annotations of the task's input and output operands, into calls to the supporting runtime library.
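To make the annotation style concrete, here is a minimal sketch of a StarSs-flavored annotated C program. The #pragma css task/input/output/inout directives follow published StarSs/SMPSs examples, but the block size, function names, and the barrier usage shown are illustrative assumptions, not code from the article; unannotated, the file still compiles and runs as ordinary sequential C.

/* Sketch of a StarSs-style annotated C program (assumed example).
 * The pragmas mark functions as tasks and declare the direction of
 * each operand, so the runtime can build the inter-task dependency
 * graph and schedule data transfers ahead of time. */

#define BS 1024  /* elements per block; size is an assumption */

/* Each call becomes a task; 'input' operands are read-only,
 * 'output'/'inout' operands create true dependencies between tasks. */
#pragma css task input(a[BS], b[BS]) output(c[BS])
void vadd_block(const float *a, const float *b, float *c)
{
    for (int i = 0; i < BS; i++)
        c[i] = a[i] + b[i];
}

#pragma css task input(c[BS]) inout(acc[1])
void reduce_block(const float *c, float *acc)
{
    for (int i = 0; i < BS; i++)
        acc[0] += c[i];
}

void vsum(const float *a, const float *b, float *c, int n, float *acc)
{
    /* The master thread spawns one task per block (n is assumed to be
     * a multiple of BS). The runtime detects that each reduce_block
     * depends on the matching vadd_block (through c) and on the
     * previous reduce_block (through acc), much as an out-of-order
     * core tracks register dependencies. */
    for (int i = 0; i < n; i += BS) {
        vadd_block(&a[i], &b[i], &c[i]);
        reduce_block(&c[i], acc);
    }
    #pragma css barrier  /* wait for all outstanding tasks (assumed syntax) */
}

Because the source-to-source compiler, not the programmer, rewrites these calls into runtime spawns, the same annotated source can run unchanged across different numbers of workers.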
doi:10.1109/MM.2010.79