Luis Ceze, Karin Strauss, James Tuck, Josep Torrellas, Jose Renau
2006 ACM Transactions on Architecture and Code Optimization (TACO)  
Modern superscalar processors often suffer long stalls because of load misses in on-chip L2 caches. To address this problem, we propose hiding L2 misses with Checkpoint-Assisted VAlue prediction (CAVA). On an L2 cache miss, a predicted value is returned to the processor. When the missing load finally reaches the head of the ROB, the processor checkpoints its state, retires the load, and continues execution speculatively using the predicted value. When the value in memory arrives at the L2 cache, it is compared to the predicted value. If the prediction was correct, speculation has succeeded and execution continues; otherwise, execution is rolled back and restarted from the checkpoint. CAVA uses fast checkpointing, speculative buffering, and a modest-sized value prediction structure that has about 50% accuracy. Compared to an aggressive superscalar processor, CAVA speeds up execution by up to 1.45 for SPECint applications and 1.58 for SPECfp applications, with a geometric mean of 1.14 for SPECint and 1.34 for SPECfp applications. We also evaluate an implementation of Runahead execution, a previously proposed scheme that does not perform value prediction and discards all work done between checkpoint and data reception from memory. Runahead execution speeds up execution by a geometric mean of 1.07 for SPECint and 1.18 for SPECfp applications, compared to the same baseline.

[...] a confidence estimator to minimize wasted work on rollbacks caused by misspeculations. If the confidence in a new value prediction is low, the processor commits its current speculative state and then creates a new checkpoint before consuming the new prediction. In our evaluation, we perform an extensive characterization of the architectural behavior of CAVA, as well as a sensitivity analysis of different architectural parameters.

CAVA is related to Runahead execution [Mutlu et al. 2003] and the concurrently developed CLEAR scheme [Kirman et al. 2005]. Specifically, Runahead also uses checkpointing to allow processors to retire missing loads and continue execution. However, Runahead and CAVA differ in three major ways. First, in Runahead there is no prediction: the destination register of the missing load is marked with an invalid tag, which is propagated by dependent instructions. Second, in Runahead, when the data arrives from memory, execution is always rolled back; in CAVA, if the prediction is correct, execution is not rolled back.
Finally, while Runahead buffers (potentially incomplete) speculative state in a processor structure called Runahead cache, CAVA buffers the whole speculative state in L1. We evaluate Runahead without and with value prediction.

Compared to CLEAR, our implementation of CAVA offers a simpler design. Specifically, the value prediction engine is located close to the L2 cache, off the critical path, and is trained only with L2 misses. In CLEAR, prediction and validation mechanisms are located inside the processor core. Moreover, to simplify the design, CAVA explicitly chooses to support only a single outstanding checkpoint at a time, and terminates the current speculative section when a low-confidence prediction is found. CLEAR supports multiple concurrent checkpoints, which requires storing several register checkpoints at a time, and separately recording in the speculative buffer the memory state of each checkpoint. Finally, we show how to support CAVA in multiprocessors, an area not considered by CLEAR. A longer discussion on how CAVA and CLEAR compare is presented in Section 7.

Our simulations show that, relative to an aggressive conventional superscalar baseline, CAVA speeds up execution by up to 1.45 for SPECint applications and 1.58 for SPECfp applications, with a geometric mean of 1.14 for SPECint and 1.34 for SPECfp. Compared to the same baseline, Runahead obtains geometric mean speedups of 1.07 and 1.18 in SPECint and SPECfp applications, respectively.

This paper is organized as follows: Section 2 presents background information. Section 3 describes design issues in CAVA. Section 4 presents our microarchitectural implementation. Section 5 presents our evaluation methodology. Section 6 evaluates our implementation and variations. Finally, Section 7 discusses related work.

BACKGROUND

Miss Status Holding Registers (MSHRs)

Miss Status Holding Registers (MSHRs) [Kroft 1981] hold information about requests that miss in the cache.
Typically, an MSHR is allocated when a miss
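
Returning to the miss-handling flow described earlier (checkpoint, retire the load with a predicted value, validate when memory responds), the following toy model contrasts CAVA with Runahead on a single L2 load miss. It is only a sketch under strong assumptions: `ToyCore`, `add`, `cava_miss`, and `runahead_miss` are hypothetical names introduced here, the "core" is just a register dictionary with no timing, ROB, or speculative memory buffering, and the predicted value is passed in rather than produced by a real predictor.

```python
from copy import deepcopy
from dataclasses import dataclass, field
from typing import Optional

INV = None  # stands in for Runahead's "invalid" register tag (assumption)


@dataclass
class ToyCore:
    """Hypothetical, heavily simplified core: registers only, no timing."""
    regs: dict = field(default_factory=dict)
    checkpoint: Optional[dict] = None

    def take_checkpoint(self):
        self.checkpoint = deepcopy(self.regs)

    def rollback(self):
        self.regs = deepcopy(self.checkpoint)
        self.checkpoint = None

    def commit(self):
        self.checkpoint = None  # speculative state becomes architectural


def add(dst, a, b):
    """Build a dependent instruction dst = a + b that propagates INV."""
    def op(regs):
        x, y = regs.get(a), regs.get(b)
        regs[dst] = INV if x is INV or y is INV else x + y
    return op


def cava_miss(core, dest, predicted, actual, dependents):
    """One L2 load miss, handled as the text describes for CAVA."""
    core.take_checkpoint()        # checkpoint before retiring the load
    core.regs[dest] = predicted   # retire the load with the predicted value
    for op in dependents:         # keep executing speculatively
        op(core.regs)
    if predicted == actual:       # memory value arrives and is compared
        core.commit()             # correct prediction: no rollback
        return "commit"
    core.rollback()               # misprediction: restart from checkpoint
    core.regs[dest] = actual
    for op in dependents:
        op(core.regs)
    return "rollback"


def runahead_miss(core, dest, actual, dependents):
    """Same miss under Runahead: no prediction, always roll back."""
    core.take_checkpoint()
    core.regs[dest] = INV         # destination marked invalid
    for op in dependents:         # dependents propagate the invalid tag
        op(core.regs)
    core.rollback()               # work since the checkpoint is discarded
    core.regs[dest] = actual
    for op in dependents:
        op(core.regs)
    return "rollback"
```

In this sketch, both schemes end with the same architectural register values; the difference is that CAVA keeps the speculative work whenever the prediction matches, whereas Runahead always re-executes from the checkpoint.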
doi:10.1145/1138035.1138038 fatcat:b7gutp6sjndbxfjk4v4mckejme