Imagine: media processing with streams

B. Khailany, W.J. Dally, U.J. Kapasi, P. Mattson, J. Namkoong, J.D. Owens, B. Towles, A. Chang, S. Rixner
2001 IEEE Micro  
Media-processing applications, such as signal processing, 2D and 3D graphics rendering, and image and audio compression and decompression, are the dominant workloads in many systems today. The real-time constraints of media applications demand large amounts of absolute performance and high performance densities (performance per unit area and per unit power). Therefore, media-processing applications often use special-purpose (custom), fixed-function hardware. General-purpose solutions, such as programmable digital signal processors (DSPs), offer increased flexibility but achieve performance density levels two or three orders of magnitude worse than special-purpose systems.

One reason for this performance density gap is that conventional general-purpose architectures are poorly matched to the specific properties of media applications. These applications share three key characteristics. First, operations on one data element are largely independent of operations on other elements, resulting in a large amount of data parallelism and high latency tolerance. Second, there is little global data reuse. Finally, the applications are computationally intensive, often performing 100 to 200 arithmetic operations for each element read from off-chip memory.

Conventional general-purpose architectures don't efficiently exploit the available data parallelism in media applications. Their memory systems depend on caches optimized for reducing latency and data reuse. Finally, they don't scale to the numbers of arithmetic units or registers required to support a high ratio of computation to memory access. In contrast, special-purpose architectures take advantage of these characteristics because they effectively exploit data parallelism and computational intensity with a large number of arithmetic units. Also, special-purpose processors directly map the algorithm's dataflow graph into hardware rather than relying on memory systems to capture locality.

Another reason for the performance density gap is the constraints of modern technology. Modern VLSI computing systems are limited by communication bandwidth rather than arithmetic. For example, in a contemporary 0.15-micron CMOS technology, a 32-bit integer adder requires less than 0.05 mm² of chip area. Hundreds to thousands of these arithmetic units fit on an inexpensive 1-cm² chip. The challenge is supplying them with instructions and data.
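
The "hundreds to thousands" figure can be checked with a rough calculation. The sketch below is only a back-of-the-envelope illustration: the adder area and chip area come from the text above, while the function name `adders_that_fit` and the `arithmetic_fraction` parameter (the share of the die given to adders) are hypothetical and do not come from the paper.

```python
from fractions import Fraction

# Figures quoted in the abstract.
ADDER_AREA_MM2 = "0.05"  # upper bound for a 32-bit integer adder, 0.15-micron CMOS
CHIP_AREA_MM2 = "100"    # a 1-cm^2 chip is 100 mm^2


def adders_that_fit(chip_area_mm2, adder_area_mm2, arithmetic_fraction="1"):
    """Return how many adders fit in the given share of the chip area.

    Arguments are decimal strings so that Fraction keeps the arithmetic exact.
    arithmetic_fraction is an illustrative assumption, not a figure from the paper.
    """
    usable_area = Fraction(chip_area_mm2) * Fraction(arithmetic_fraction)
    return int(usable_area / Fraction(adder_area_mm2))


# Whole die devoted to adders, then a tenth of it.
full_die = adders_that_fit(CHIP_AREA_MM2, ADDER_AREA_MM2)
tenth_of_die = adders_that_fit(CHIP_AREA_MM2, ADDER_AREA_MM2, "0.1")
print(full_die, tenth_of_die)
```

Because the quoted adder area is an upper bound, these counts are lower bounds: devoting between a tenth and all of the die to adders already spans the range from hundreds to thousands of units.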
General-purpose processors that rely on global structures such as large multiported register files to provide
doi:10.1109/40.918001 fatcat:u2pddgyyjra2lgzsvsguvyqddi