Missing the memory wall

Ashley Saulsbury, Fong Pong, Andreas Nowatzyk
1996 Proceedings of the 23rd annual international symposium on Computer architecture - ISCA '96  
Current high performance computer systems use complex, large superscalar CPUs that interface to the main memory through a hierarchy of caches and interconnect systems. These CPU-centric designs invest a lot of power and chip area to bridge the widening gap between CPU and main memory speeds. Yet, many large applications do not operate well on these systems and are limited by the memory subsystem performance. This paper argues for an integrated system approach that uses less-powerful CPUs that are tightly integrated with advanced memory technologies to build competitive systems with greatly reduced cost and complexity. Based on a design study using the next generation 0.25µm, 256Mbit dynamic random-access memory (DRAM) process and on the analysis of existing machines, we show that processor memory integration can be used to build competitive, scalable and cost-effective MP systems. We present results from execution driven uni- and multi-processor simulations showing that the benefits of lower latency and higher bandwidth can compensate for the restrictions on the size and complexity of the integrated processor. In this system, small direct mapped instruction caches with long lines are very effective, as are column buffer data caches augmented with a victim cache.

Background

The relatively good performance of Sun's Sparc-Station 5 workstation (SS-5), with respect to contemporary high-end models, provides evidence for the benefits of tighter memory-processor integration. Targeted at the "low-end" of the architecture spectrum, the SS-5 contains a single-scalar MicroSparc CPU with single-level, small, on-chip caches (16KByte instruction, 8KByte data). For machine simplicity the memory controller was integrated into the CPU, so the DRAM devices are driven directly by logic on the processor chip. A separate I/O-bus connects the CPU with peripheral devices, which can access memory only through the CPU chip. A comparable "high-end" machine of the same era is the Sparc-Station 10/61 (SS-10/61), containing a super-scalar SuperSparc CPU with two cache levels; separate 20KB instruction and 16KB data caches at level 1, and a shared 1MByte of cache at level 2.
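
The abstract mentions direct mapped caches with long lines augmented with a victim cache. The following is a minimal sketch of that general idea, not the authors' design: a direct mapped cache whose evicted lines fall into a small fully associative buffer, so that two addresses competing for the same slot do not thrash. The class name `DirectMappedWithVictim` and all sizes (`num_lines`, `line_bytes`, `victim_entries`) are hypothetical values chosen for illustration, and the victim buffer is assumed to hold whole lines with oldest-first replacement.

```python
from collections import OrderedDict


class DirectMappedWithVictim:
    """Toy model of a direct mapped cache backed by a small victim buffer.

    All sizes are illustrative assumptions, not figures from the paper.
    Only line addresses are tracked; no data is stored.
    """

    def __init__(self, num_lines=4, line_bytes=512, victim_entries=2):
        self.num_lines = num_lines
        self.line_bytes = line_bytes
        self.victim_entries = victim_entries
        self.lines = [None] * num_lines   # one resident line address per slot
        self.victim = OrderedDict()       # evicted line addresses, oldest first
        self.hits = 0
        self.victim_hits = 0
        self.misses = 0

    def access(self, addr):
        """Look up a byte address; return 'hit', 'victim_hit', or 'miss'."""
        line_addr = addr // self.line_bytes
        index = line_addr % self.num_lines
        resident = self.lines[index]

        if resident == line_addr:
            self.hits += 1
            return "hit"

        if line_addr in self.victim:
            # Swap: the requested line returns to its slot and the
            # displaced resident takes a place in the victim buffer.
            del self.victim[line_addr]
            self._evict_to_victim(resident)
            self.lines[index] = line_addr
            self.victim_hits += 1
            return "victim_hit"

        # Full miss: the line would be fetched from memory.
        self._evict_to_victim(resident)
        self.lines[index] = line_addr
        self.misses += 1
        return "miss"

    def _evict_to_victim(self, line_addr):
        """Place an evicted line in the victim buffer, dropping the oldest."""
        if line_addr is None or self.victim_entries == 0:
            return
        self.victim[line_addr] = True
        self.victim.move_to_end(line_addr)
        while len(self.victim) > self.victim_entries:
            self.victim.popitem(last=False)


if __name__ == "__main__":
    cache = DirectMappedWithVictim()
    # Addresses 0 and 2048 map to the same slot with the default sizes.
    for addr in (0, 2048, 0, 8):
        print(addr, cache.access(addr))
```

With the default sizes, byte addresses 0 and 2048 share slot 0, so alternating between them would miss every time in a plain direct mapped cache; here the third access is served from the victim buffer instead.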
doi:10.1145/232973.232984 dblp:conf/isca/SaulsburyPN96 fatcat:ut72ah2zxzh73onrac3vems5aq