On the Efficiency of Register File versus Broadcast Interconnect for Collective Communications in Data-Parallel Hardware Accelerators

Ardavan Pedram, Andreas Gerstlauer, Robert A. van de Geijn
2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)
Reducing power consumption and increasing efficiency are key concerns for many applications. How to design highly efficient computing elements while maintaining enough flexibility within a domain of applications is a fundamental question. In this paper, we present how broadcast buses can eliminate the use of power-hungry multi-ported register files in the context of data-parallel hardware accelerators for linear algebra operations. We demonstrate an algorithm/architecture co-design for the mapping of different collective communication operations, which are crucial for achieving performance and efficiency in most linear algebra routines, such as GEMM, SYRK, and matrix transposition. We compare a broadcast-bus-based architecture with conventional SIMD, 2D-SIMD, and flat register file architectures for these operations in terms of area and energy efficiency. Results show that the fast broadcast data movement capabilities of a prototypical linear algebra core can achieve up to 75x better power efficiency and up to 10x better area efficiency compared to traditional SIMD architectures.

I. INTRODUCTION

Application-specific design of hardware accelerators can provide orders-of-magnitude improvements in power and area efficiency [14]. However, full-custom design is costly in many respects. As we enter the era of heterogeneous computing, a key question therefore becomes how to design specialized cores that maintain the efficiency of full-custom hardware while providing enough flexibility to execute whole classes of coarse-grain operations.

Data-parallel and streaming processors, such as GPUs, have received widespread attention as integral components of heterogeneous architectures. There, matrix operations, which are at the core of many high-performance computing problems, are often a prime target for acceleration. Matrix operations exhibit ample computational parallelism that can be exploited relatively easily. However, a crucial concern and often a limiting factor is the efficient realization of data movements on a communication architecture that can optimally exploit locality, minimize overhead, effectively overlap computation with communication, and hide communication latencies.

Linear algebra computations can be efficiently reduced to a canonical set of Basic Linear Algebra Subprograms (BLAS), such as matrix-matrix and matrix-vector operations [7]. In previous work [34], [35], we examined the design of a proposed Linear Algebra Core (LAC). The LAC is based ...
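To make the broadcast-based collective concrete, the following sketch simulates how a small mesh of processing elements could compute GEMM as a sequence of rank-1 updates: at each step, one operand is broadcast along every row bus and one along every column bus, so each PE performs a single multiply-accumulate into a local register rather than reading operands from a shared multi-ported register file. This is an illustrative software model of the technique, not the LAC hardware; the names gemm_by_broadcast, N, row_bus, and col_bus are assumptions introduced here for clarity.

# Illustrative sketch (assumed names, not the paper's design): an N x N
# mesh of PEs computes C = A * B via N rank-1 updates. At step p, A[i][p]
# is broadcast on row bus i and B[p][j] on column bus j.

N = 4  # mesh dimension (hypothetical; chosen only for this example)

def gemm_by_broadcast(A, B):
    # One local accumulator register per PE (i, j).
    C = [[0.0] * N for _ in range(N)]
    for p in range(N):  # one rank-1 update per step
        row_bus = [A[i][p] for i in range(N)]  # column p of A, one value per row bus
        col_bus = [B[p][j] for j in range(N)]  # row p of B, one value per column bus
        for i in range(N):
            for j in range(N):
                # Each PE consumes the two bus values in a single MAC.
                C[i][j] += row_bus[i] * col_bus[j]
    return C

if __name__ == "__main__":
    A = [[float(i * N + j) for j in range(N)] for i in range(N)]
    I = [[1.0 if i == j else 0.0 for j in range(N)] for i in range(N)]  # identity
    assert gemm_by_broadcast(A, I) == A  # C = A * I = A

Because every PE obtains both multiplicands from the buses and writes only its own accumulator, the per-cycle operand traffic through any shared register file drops to zero; this is the mechanism behind the efficiency comparison against SIMD, 2D-SIMD, and flat register file organizations summarized in the abstract.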
doi:10.1109/sbac-pad.2012.35 dblp:conf/sbac-pad/PedramGG12