The effects of explicitly parallel mechanisms on the multi-ALU processor cluster pipeline
Proceedings International Conference on Computer Design. VLSI in Computers and Processors (Cat. No.98CB36273)
Continuing reductions in on-chip geometries yield increasing numbers of transistors per chip and fundamentally faster devices, but also result in effectively slower wires. This combination presents significant challenges for new microprocessor architectures. The disparity in performance between on-chip arithmetic units and memory creates effectively longer latencies. The changing balance between gate delay and wire delay penalizes global interactions. The MIT Multi-ALU Processor (MAP)
architecture incorporates three explicitly parallel mechanisms to address these challenges. Efficient intercluster interactions enable instruction scheduling across clustered arithmetic units. Deferred exceptions based on ERRVALs facilitate aggressive instruction reordering and speculation. Zero-cycle multithreading provides latency tolerance without sacrificing single-threaded performance. In this paper, we describe each of these mechanisms and quantify their impact on the area and routing of the cluster pipeline in the 5-million-transistor MAP chip. Zero-cycle multithreading accounts for over 44% of the total cluster area. Support for ERRVALs requires very little area (less than 4%). The intercluster interaction mechanisms require minimal cluster area and less than 5% of the available global routing resources, but enable fully general access across clusters and between all arithmetic units.

Description of MIT Multi-ALU Processor

The MAP maximizes on-chip performance by simultaneously supporting parallelism at all granularities. As shown in Figure 1, the MAP¹ contains three independent

¹ The original MIT MAP architecture included four symmetric processing clusters, a quad-banked memory system, the two switches, and a network interface/3D-mesh router.