The TigerSHARC DSP architecture

J. Fridman, Z. Greenfield
IEEE Micro, 2000
In the past two years, several multiple data path and pipelined digital signal processors have been introduced to the marketplace. This new generation of DSPs takes advantage of higher levels of integration than were available for their predecessors. It also incorporates multiple execution units on a single core as well as deep execution pipelines. For an introduction to recent trends in DSPs, see Eyre and Bier [1]; for comprehensive analysis of DSP chips, see the DSP buyer's guide [2] and Levy [3].
Here, we describe a new parallel DSP architecture called TigerSHARC [4,5]. We focus on the computational aspects of its core and on-chip memory architecture. To sustain the high computation rates of cores with multiple execution units, memory subsystems must scale proportionately. We based our solution to the high-bandwidth demands of this parallel DSP core on a memory architecture characterized by what we call short-vector processor techniques. These techniques are essentially small-width vector processor interfaces.

In addition to the architectural description, we also present an application example of a finite-length impulse response, or FIR, filter. We use this example to illustrate a technique for mapping this class of algorithms to a parallel, vector-oriented processor. The FIR filter is a representative member of a large class of DSP algorithms, namely any structure with delay lines, such as infinite-length impulse response, or IIR, structures, equalizers, and multirate filters, all of which share similar solutions. (Two-dimensional extensions of these algorithms, such as 2D filtering and convolution used in imaging, can also be solved using extensions to the techniques presented here.)

To efficiently map this class of algorithms to this parallel DSP, we must address two related problems: distributing the computation among several execution units, and providing adequate alignment between data and filter coefficients. To map the delay-line structure of the FIR, we apply an algorithmic transformation that exposes the algorithm's parallelism in a form suited to the target architecture. This transformation produces a high-efficiency implementation by relying only on aligned short-vector memory accesses. The example also shows that the conventional single-instruction, multiple-data (SIMD) dispatch mechanism, although very effective in simple linear algebra and matrix operations, may be overly restrictive when applied to this class of DSP algorithms; as a result, non-SIMD execution is required to achieve high efficiency.
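To make the FIR discussion concrete, the following minimal sketch in plain C (not TigerSHARC code; the function names, the 4-wide vector width VEC, and the data layout are assumptions for illustration, not the paper's implementation) contrasts a reference scalar FIR with a block-transformed variant that accumulates several outputs per pass over the coefficients. This block form is one common way to expose the kind of parallelism described above: the multiply-accumulates spread naturally across multiple execution units, and the input is consumed in contiguous, heavily reused windows.

#define VEC 4  /* assumed short-vector width: four single-word samples per load */

/* Reference scalar FIR: y[n] = sum_{k=0}^{taps-1} h[k] * x[n-k].
 * x must be preceded by at least taps-1 older samples (the delay line). */
void fir_scalar(const float *x, const float *h, float *y, int n_out, int taps)
{
    for (int n = 0; n < n_out; n++) {
        float acc = 0.0f;
        for (int k = 0; k < taps; k++)
            acc += h[k] * x[n - k];
        y[n] = acc;
    }
}

/* Block-transformed FIR: VEC outputs are accumulated side by side.
 * For each coefficient k, the inner loop reads the contiguous window
 * x[n-k .. n-k+VEC-1]; consecutive k iterations slide this window by one
 * sample, so data already loaded can be reused VEC times.
 * Assumes n_out is a multiple of VEC and the same delay-line precondition. */
void fir_block(const float *x, const float *h, float *y, int n_out, int taps)
{
    for (int n = 0; n < n_out; n += VEC) {
        float acc[VEC] = {0.0f};           /* VEC partial outputs */
        for (int k = 0; k < taps; k++) {
            float hk = h[k];               /* one coefficient shared by all lanes */
            for (int j = 0; j < VEC; j++)  /* VEC multiply-accumulates per coefficient */
                acc[j] += hk * x[n + j - k];
        }
        for (int j = 0; j < VEC; j++)
            y[n + j] = acc[j];
    }
}

In the block form each coefficient feeds VEC independent accumulators, and each input sample, once fetched, contributes to up to VEC products. On a short-vector machine, the sliding VEC-sample windows would be assembled in registers from aligned wide loads rather than refetched from memory; this portable C only hints at that register-level detail, which is where the alignment property emphasized in the abstract matters.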
doi:10.1109/40.820055