Design issues and tradeoffs for write buffers
K. Skadron, D.W. Clark
Proceedings Third International Symposium on High-Performance Computer Architecture
Processors with write-through caches typically require a write buffer to hide the write latency to the next level of the memory hierarchy and to reduce write traffic. A write buffer can cause processor stalls when it is full, when it contends with a cache miss for access to the next level of the hierarchy, and when it contains the freshest copy of data needed by a load. This paper uses instruction-level simulation of SPEC92 benchmarks to investigate how different write buffer depths, retirement
... ies, and load-hazard policies affect these three types of write-buffer stalls. Deeper buffers with adequate headroom, lazier retirement policies, and the ability to read data directly from the write buffer combine to substantially reduce write-buffer-induced stalls.

Introduction

Processor speeds continue to increase much faster than memory speeds, threatening application performance with increasing stall time for both reads and writes. Current processors attempt to bridge the gap with a variety of old and new techniques: multiple levels of caches, non-blocking loads, prefetching, and write buffers are just a few examples. With a few exceptions, published work in this area focuses on improving the performance of read operations. Since poor write behavior can substantially penalize performance and writes manifestly differ from reads, work to improve memory hierarchy performance must include write-specific techniques.

We address some performance issues that arise in the design of processor write buffers. In a system with a write-through first-level cache, a write buffer has two essential functions: it absorbs processor writes (store instructions) at a rate faster than the next-level cache could, thereby preventing processor stalls; and it aggregates writes to the same cache block, thereby reducing traffic to the next-level cache. These design objectives are unfortunately in conflict. The first function is best fulfilled when the buffer is empty, but the second is best fulfilled when it is full of recently-written blocks. Good write buffer designs achieve a balance between these functions.

This paper considers write buffer designs for systems with at least two levels of cache. Many processors place the first-level (L1) cache on-chip to get the fastest possible hit times, so cycle time plays an important role in the L1 design. This means that

Copyright © 1997 IEEE.
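
The two write-buffer functions described in the introduction (absorbing stores and aggregating writes to the same cache block), together with the buffer-full and load-hazard conditions named in the abstract, can be illustrated with a toy model. This is a minimal sketch and not the simulator used in the paper: the class name `WriteBuffer`, the depth of 2, the 32-byte block size, the retire-oldest-on-full policy, and the block-granularity hazard check are all assumptions chosen for illustration.

```python
from collections import OrderedDict


class WriteBuffer:
    """Toy coalescing write buffer (all parameters are illustrative assumptions)."""

    def __init__(self, depth=4, block_bytes=32):
        self.depth = depth              # number of block-sized entries
        self.block_bytes = block_bytes  # bytes per cache block
        self.entries = OrderedDict()    # block number -> written offsets, oldest first
        self.l2_writes = 0              # entries retired to the next-level cache
        self.full_stalls = 0            # stores that found the buffer full

    def _block(self, addr):
        return addr // self.block_bytes

    def store(self, addr):
        """Buffer a store; report whether it merged, allocated, or stalled."""
        block = self._block(addr)
        offset = addr % self.block_bytes
        if block in self.entries:
            # Aggregation: a write to an already-buffered block adds no L2 traffic.
            self.entries[block].add(offset)
            return "merged"
        stalled = len(self.entries) >= self.depth
        if stalled:
            # Buffer-full stall: assumed policy is to retire the oldest entry first.
            self.full_stalls += 1
            self.retire()
        self.entries[block] = {offset}
        return "stalled" if stalled else "allocated"

    def retire(self):
        """Write the oldest entry to the next-level cache, if there is one."""
        if self.entries:
            self.entries.popitem(last=False)
            self.l2_writes += 1

    def load_hazard(self, addr):
        """True if the buffer may hold the freshest copy of the loaded block."""
        return self._block(addr) in self.entries


wb = WriteBuffer(depth=2, block_bytes=32)
results = [wb.store(a) for a in (0, 4, 8, 64, 128)]
print(results, wb.l2_writes, wb.full_stalls, wb.load_hazard(68))
```

In this sketch the three stores to addresses 0, 4, and 8 occupy a single entry, which shows the traffic reduction from aggregation, while the store to address 128 finds both entries occupied and has to wait for a retirement. A deeper buffer or a different retirement policy would change that balance, which is the tradeoff the paper studies.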