Managing Wire Delay in Large Chip-Multiprocessor Caches

B.M. Beckmann, D.A. Wood
37th International Symposium on Microarchitecture (MICRO-37'04)  
In response to increasing (relative) wire delay, architects have proposed various technologies to manage the impact of slow wires on large uniprocessor L2 caches. Block migration (e.g., NuRapid [12]) reduces average hit latency by migrating frequently used blocks towards the lower-latency banks. Transmission Line Caches (TLC) [6] use on-chip transmission lines to provide low latency to all banks. Traditional stride-based hardware prefetching strives to tolerate, rather than reduce, ... Chip multiprocessors (CMPs) present additional challenges. First, CMPs often share the on-chip L2 cache, requiring multiple ports to provide sufficient bandwidth. Second, multiple threads mean multiple working sets, which compete for limited on-chip storage. Third, sharing code and data interferes with block migration, since one processor's low-latency bank is another processor's high-latency bank. In this paper, we develop L2 cache designs for CMPs that incorporate these three latency management techniques. We use detailed full-system simulation to analyze the performance trade-offs for both commercial and scientific workloads. First, we demonstrate that block migration is less effective for CMPs because 40-60% of L2 cache hits in commercial workloads are satisfied in the central banks, which are equally far from all processors. Second, we observe that although transmission lines provide low latency, contention for their restricted bandwidth limits their performance. Third, we show that stride-based prefetching between L1 and L2 caches alone improves performance by at least as much as the other two techniques. Finally, we present a hybrid design, combining all three techniques, that improves performance by an additional 2% to 19% over prefetching alone.
... CMPs poses two problems. One, blocks shared by multiple processors are pulled in multiple directions and tend to congregate in banks that are equally far from all processors. Two, due to the extra freedom of movement, the effectiveness of block migration in a shared CMP cache is more dependent on "smart searches" [27] than its uniprocessor counterpart, yet smart searches are harder to implement in a CMP environment. Finally, we consider using on-chip transmission lines [8] to provide fast access to all cache banks [6]. On-chip transmission lines use thick global wires to reduce communication latency by an order of magnitude versus long conventional wires.
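The first block-migration problem noted above (shared blocks being pulled in multiple directions and settling in banks equally far from all processors) can be illustrated with a toy model. The sketch below is not the paper's migration policy or cache topology: the 5x5 bank grid, the four processor positions, the one-step-per-access rule, and the round-robin access pattern are all assumptions chosen only to make the effect visible.

```python
def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def migrate(block, requester):
    """Move the block one bank toward the requester along each grid axis."""
    bx, by = block
    rx, ry = requester
    return (bx + sign(rx - bx), by + sign(ry - by))


def simulate(start, requesters):
    """Return the block's bank position after each access in `requesters`."""
    trace = []
    block = start
    for requester in requesters:
        block = migrate(block, requester)
        trace.append(block)
    return trace


def distance(a, b):
    """Manhattan distance between two bank positions."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


# Hypothetical layout: 5x5 grid of banks, one processor at the middle of each edge.
PROCESSORS = [(0, 2), (4, 2), (2, 0), (2, 4)]
CENTER = (2, 2)

# A block used by a single processor migrates all the way to that processor.
private = simulate((0, 2), [PROCESSORS[1]] * 8)

# A block accessed round-robin by all four processors hovers near the center.
shared = simulate((0, 2), PROCESSORS * 4)
shared_spread = max(distance(p, CENTER) for p in shared[4:])

print("private block ends at bank", private[-1])
print("shared block stays within", shared_spread, "hop(s) of the center")
```

In this toy setting the privately used block reaches its requester's closest bank, while the shared block never settles next to any one processor, which is the qualitative behavior the text describes for shared data.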
Transmission Line Caches (TLCs) provide fast, nearly uniform access latencies. However, the limited bandwidth of transmission lines, due to their large dimensions, may lead to a performance bottleneck in CMPs. This paper evaluates these three techniques, against a baseline NUCA design with L2 miss prefetching, using detailed full-system simulation and both commercial and scientific workloads. We make the following contributions:
doi:10.1109/micro.2004.21 dblp:conf/micro/BeckmannW04 fatcat:4diyn7mdgne4lj4j4dk6imytiy