The Stanford Hydra CMP

L. Hammond, B.A. Hubbert, M. Siu, M.K. Prabhu, M. Chen, K. Olukotun
IEEE Micro, 2000
The Hydra chip multiprocessor (CMP) integrates four MIPS-based processors and their primary caches on a single chip together with a shared secondary cache. A standard CMP offers implementation and performance advantages compared to wide-issue superscalar designs. However, it must be programmed with a more complicated parallel programming model to obtain maximum performance. To simplify parallel programming, the Hydra CMP supports thread-level speculation and memory renaming, a paradigm that
allows performance similar to a uniprocessor of comparable die area on integer programs. This article motivates the design of a CMP, describes the architecture of the Hydra design with a focus on its speculative thread support, and describes our prototype implementation.

Why build a CMP?

As Moore's law allows increasing numbers of smaller and faster transistors to be integrated on a single chip, new processors are being designed to use these transistors effectively to improve performance. Today, most microprocessor designers use the increased transistor budgets to build larger and more complex uniprocessors. However, several problems are beginning to make this approach to microprocessor design difficult to continue. To address these problems, we have proposed that future processor design methodology shift from simply making progressively larger uniprocessors to implementing more than one processor on each chip [1]. The following discusses the key reasons why single-chip multiprocessors are a good idea.

Parallelism

Designers primarily use additional transistors to extract more parallelism from programs and thereby perform more work per clock cycle. While some transistors go into wider or more specialized datapath logic (to move from 32 to 64 bits, or to add special multimedia instructions, for example), most are used to build superscalar processors. These processors extract greater amounts of instruction-level parallelism, or ILP, by finding nondependent instructions that occur near each other in the original program code. Unfortunately, any particular sequence of instructions contains only a finite amount of ILP, because instructions from the same sequence are typically highly interdependent. As a result, processors that rely on this technique see diminishing returns as they attempt to execute more instructions per clock cycle, even as the logic required to process multiple instructions per cycle grows quadratically. (The first sketch following this section contrasts a serial dependence chain with independent operations.)

A CMP avoids this limitation by primarily exploiting a completely different type of parallelism: thread-level parallelism. We obtain TLP by running completely separate sequences of instructions on each of the separate processors simultaneously. (The second sketch following this section shows a simple TLP decomposition.) Of course, a CMP may also exploit small amounts of ILP within each of its individual processors, since ILP and TLP are orthogonal to each other.

Wire delay

As CMOS gates become faster and chips become physically larger, the delay caused by interconnects between gates is becoming more significant.
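To make the ILP limit concrete, here is a minimal C sketch (an editor's illustration, not code from the article): the three updates in independent() have no data dependences among them, so a wide-issue superscalar core can issue them together, while each statement in dependent() consumes the previous result and must execute serially.

    /* Editor's illustration, not from the article: ILP comes from
       nondependent instructions that happen to sit near each other. */
    void independent(int *a, int *b, int *c) {
        *a += 1;   /* no data dependences among these three updates, */
        *b += 2;   /* so a superscalar core can issue them together  */
        *c += 3;   /* in a single cycle                              */
    }

    int dependent(int x) {
        int t = x * 3;   /* each statement needs the previous result, */
        t = t + 7;       /* forming a serial chain with essentially   */
        t = t * t;       /* no ILP for the hardware to extract        */
        return t;
    }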
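Thread-level parallelism can likewise be illustrated with a short POSIX threads sketch (again hypothetical, assuming a pthreads environment rather than anything Hydra-specific): two threads sum disjoint halves of an array, and on a CMP each thread could be scheduled on its own on-chip processor.

    /* Editor's illustration of TLP: two independent instruction
       streams, each summing half of an array. Compile with -pthread. */
    #include <pthread.h>
    #include <stdio.h>

    #define N 1000000
    static long data[N];

    struct range { long *src; long lo, hi, sum; };

    static void *partial_sum(void *arg) {
        struct range *r = arg;
        r->sum = 0;
        for (long i = r->lo; i < r->hi; i++)
            r->sum += r->src[i];
        return NULL;
    }

    int main(void) {
        for (long i = 0; i < N; i++) data[i] = i;

        struct range halves[2] = {
            { data, 0,     N / 2, 0 },   /* thread 0: first half  */
            { data, N / 2, N,     0 },   /* thread 1: second half */
        };
        pthread_t tid[2];
        for (int t = 0; t < 2; t++)
            pthread_create(&tid[t], NULL, partial_sum, &halves[t]);
        for (int t = 0; t < 2; t++)
            pthread_join(tid[t], NULL);

        printf("sum = %ld\n", halves[0].sum + halves[1].sum);
        return 0;
    }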
doi:10.1109/40.848474