Hardware-modulated parallelism in chip multiprocessors

Julia Chen, Philo Juang, Kevin Ko, Gilberto Contreras, David Penry, Ram Rangan, Adam Stoler, Li-Shiuan Peh, Margaret Martonosi
2005 SIGARCH Computer Architecture News  
Chip multi-processors (CMPs) already have widespread commercial availability, and technology roadmaps project enough on-chip transistors to replicate tens or hundreds of current processor cores. How will we express parallelism, partition applications, and schedule/place/migrate threads on these highlyparallel CMPs? This paper presents and evaluates a new approach to highlyparallel CMPs, advocating a new hardware-software contract. The software layer is encouraged to expose large amounts of
more » ... -granular, heterogeneous parallelism. The hardware, meanwhile, is designed to offer low-overhead, low-area support for orchestrating and modulating this parallelism on CMPs at runtime. Specifically, our proposed CMP architecture consists of architectural and ISA support targeting thread creation, scheduling and context-switching, designed to facilitate effective hardware run-time mapping of threads to cores at low overheads. Dynamic modulation of parallelism provides the ability to respond to run-time variability that arises from dataset changes, memory system effects and power spikes and lulls, to name a few. It also naturally provides a long-term CMP platform with performance portability and tolerance to frequency and reliability variations across multiple CMP generations. Our simulations of a range of applications possessing do-all, streaming and recursive parallellism show speedups of 4-11.5X and energy-delay-product savings of 3.8X, on average, on a 16core vs. a 1-core system. This is achieved with modest amounts of hardware support that allows for low overheads in thread creation, scheduling and context-switching. In particular, our simulations motivated the need for hardware support, showing that the large thread management overheads of current run-time software systems can lead to up to 6.5X slowdown. The difficulties faced in static scheduling were shown in our simulations with a static scheduling algorithm, fed with oracle profiled inputs suffering up to 107% slowdown compared to NDP's hardware scheduler, due to its inability to handle memory system variabilities. More broadly, we feel that the ideas presented here show promise for scaling to the systems expected in ten years, where the advantages of high transistor counts may be dampened by difficulties in circuit variations In Proceedings of Workshop on Design, Architecture and Simulation of Chip Multi-Processors Conference (dasCMP), held in conjunction with the 38th Annual International Symposium on Microarchitecture (MICRO), Barcelona, Spain, November 2005. and reliability. These issues will make dynamic scheduling and adaptation mandatory; our proposals represent a first step towards that direction.
doi:10.1145/1105734.1105742 fatcat:d5iferqj5fghrmndyw6b4ill6y