A quantitative framework for automated pre-execution thread selection
A. Roth, G.S. Sohi
35th Annual IEEE/ACM International Symposium on Microarchitecture, 2002. (MICRO-35). Proceedings.
Pre-execution attacks cache misses for which conventional address-prediction-driven prefetching is ineffective. In pre-execution, copies of cache miss computations are isolated from the main program and launched as separate threads, called p-threads, whenever the processor anticipates an upcoming miss. P-thread selection is the task of deciding which computations should execute as p-threads and when they should be launched so that total execution time is minimized. P-thread selection is central to the success of pre-execution. We introduce a framework for automated static p-thread selection, a static p-thread being one whose dynamic instances are repeatedly launched during the course of program execution. Our approach is to formalize the problem quantitatively and then apply standard techniques to solve it analytically. The framework has two novel components. The slice tree is a new data structure that compactly represents the space of all possible static p-threads. Aggregate advantage is a formula that uses raw program statistics and computation structure to assign each candidate static p-thread a numeric score based on estimated latency tolerance and overhead, aggregated over its expected dynamic executions. Our framework finds the set of p-threads whose aggregate advantages sum to a maximum. The framework is simple and is parameterized intuitively to model the salient microarchitectural features. We apply our framework to the task of choosing p-threads that cover L2 cache misses. Using detailed simulation, we study the effectiveness of our framework, and of pre-execution in general, under different conditions. We measure the effects of constraining p-thread length, of adding localized optimization to p-threads, and of using various program samples as a statistical basis for p-thread selection, and show that our framework responds to these changes in an intuitive way. In the microarchitecture dimension, we measure the effect of varying memory latency and processor width and observe that our framework adapts well to these changes. Each experiment includes a validation component that checks that the formal model presented to our framework correctly represents actual execution.

[Footnote 1: Pre-execution has also been proposed as a way of dealing with problem (i.e., frequently mis-predicted) branches. While we do not explicitly discuss branch pre-execution here, all of our methods apply in that scenario.]

... one another, has many advantages.
P-thread execution and cache miss initiation are accelerated because p-threads are isolated from stalls and squashes that occur in the main thread. Overlapping is enhanced because, while a cache miss stalls the p-thread, the main thread continues fetching, executing, and retiring instructions from the main program. With hardware multithreading becoming prevalent, pre-execution is gaining popularity [3, 8, 11, 14, 20]. The benefits and limitations of pre-execution have been well documented.

Here, we attack the problem of p-thread selection, the task of deciding which p-threads to pre-execute and when to pre-execute them. P-thread selection is a crucial component of pre-execution. It is also a complex task that must balance many inter-related, often antagonistic concerns, including cache miss latency tolerance, p-thread resource consumption (important when p-threads share resources with the main thread), and prefetch coverage and accuracy. To date, p-thread selection has been approached both manually and automatically [2, 3, 5, 7, 11], with promising results. However, past approaches have generally been heuristic. We present a framework for attacking the problem in a formal, quantitative, and holistic fashion.

We focus on static p-threads, copies of which are launched repeatedly during program execution. The dynamic program intervals for which p-threads are chosen can be short, modeling on-the-fly p-thread generation, or a full run, modeling an off-line implementation. For each program sample, we select p-threads using what is effectively an analytical pre-execution limit study. First, we use an execution trace to enumerate all possible static p-threads. Then, we apply a simple model called aggregate advantage to calculate the performance benefit of each static p-thread aggregated over its dynamic invocations. Finally, we "solve" the selection problem by choosing the set of static p-threads that maximizes total performance benefit.
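The three steps above can be sketched in code. The aggregate-advantage expression used here (miss latency tolerated, weighted by coverage, minus execution overhead, summed over dynamic invocations) is a simplified stand-in for the paper's actual formula, and all names, fields, and numbers are illustrative assumptions:

```python
# Sketch of the enumerate/score/select loop described above. The scoring
# expression is an illustrative simplification, not the paper's formula.

def aggregate_advantage(pthread, miss_latency=200, issue_overhead=4):
    """Score one static p-thread aggregated over its dynamic instances."""
    # Latency tolerance: how much of each miss's latency the p-thread can
    # hide, capped at the full miss latency (hypothetical model).
    tolerated = min(pthread["lead_time"], miss_latency)
    # Overhead: execution slots the p-thread's instructions consume.
    overhead = issue_overhead * pthread["length"]
    per_instance = tolerated * pthread["coverage"] - overhead
    return pthread["dyn_count"] * per_instance

def select_pthreads(candidates):
    """Keep the candidates whose aggregate advantages are positive,
    ordered by score (a stand-in for the paper's exact procedure)."""
    scored = sorted(((aggregate_advantage(c), c) for c in candidates),
                    key=lambda s: s[0], reverse=True)
    return [c for adv, c in scored if adv > 0]

# Two hypothetical candidates: a short slice launched well ahead of its
# miss, and a long slice launched too late to hide much latency.
candidates = [
    {"name": "A", "lead_time": 150, "length": 6,  "coverage": 0.9, "dyn_count": 1000},
    {"name": "B", "lead_time": 30,  "length": 20, "coverage": 0.5, "dyn_count": 1000},
]
chosen = select_pthreads(candidates)   # only "A" has positive advantage
```

The point of the sketch is the shape of the computation: a per-candidate score aggregated over dynamic instances, followed by a selection that maximizes the total.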
Two novel components make this approach feasible. The first is aggregate advantage, which uses a few key abstractions to effectively model the microscopic interactions of a p-thread with the main thread using only a few intuitive high-level parameters. The second is the slice tree, a data structure that compactly represents the space of all possible static p-threads and the relationships between them. The slice tree allows us to accurately assess miss coverage and to ensure that pre-execution work is not replicated. The framework also includes facilities for optimizing p-threads. Constructed from first principles, the framework is simple and, via a few intuitive parameters, applicable to a wide range of pre-execution implementations and processor configurations. In this work, we assume a simultaneous multithreading (SMT) processor, where resources are shared among all threads. The framework, however, is easily adapted to other multithreaded models.

At first glance, the use of exhaustive analysis on dynamic execution traces seems impractical: the trace-driven approach meshes well with dynamic optimization, while exhaustive search seems a better fit for off-line implementations. However, representative execution samples can be obtained for off-line analysis or reconstructed from profiles, and the structure of the problem allows us to perform our exhaustive search using a simple iterative procedure that converges quickly. Independently, the framework has intrinsic value in that the p-threads it finds are optimal insofar as aggregate advantage accurately models pre-execution. This conditional optimality derives from the standard iterative techniques we use to solve the problem. To remove the condition, we use correlation and cross-validation methodologies to measure the fidelity of aggregate advantage. Our results show that, although simple, this formula is quite accurate under many conditions.
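As a rough illustration of the slice-tree idea, one can picture a trie keyed on the backward slice of a miss: the root is the miss instruction, each child extends the slice with an older instruction, and dynamic slices that share a common tail share tree nodes, which is what makes the representation compact. The node layout below is an assumption for illustration, not the paper's implementation:

```python
class SliceTreeNode:
    """One node per instruction in a backward slice; a path from the root
    (the cache-missing load) to any node is one candidate static p-thread.
    Illustrative sketch only, not the paper's data structure."""

    def __init__(self, insn):
        self.insn = insn        # instruction (here, a PC label) this node adds
        self.children = {}      # insn label -> SliceTreeNode
        self.dyn_count = 0      # dynamic slices passing through this node

    def insert(self, slice_insns):
        """Insert one dynamic backward slice, listed miss-first."""
        node = self
        node.dyn_count += 1
        for insn in slice_insns:
            node = node.children.setdefault(insn, SliceTreeNode(insn))
            node.dyn_count += 1
        return node

# Two dynamic slices of the same miss share a two-instruction prefix
# (nearest the miss) and therefore share the corresponding tree nodes;
# their dyn_counts double as the statistics that feed candidate scoring.
root = SliceTreeNode("load@0x40")
root.insert(["add@0x3c", "load@0x30", "mov@0x28"])
root.insert(["add@0x3c", "load@0x30", "shl@0x20"])
```

Because overlapping candidates map to ancestor/descendant paths in the same tree, a selector can see at a glance which candidates would replicate each other's work.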
While perhaps not always optimal in reality, the p-threads produced by our framework are often close to it. Thus, our framework provides a robust analytical foundation for future p-thread selection algorithms. In addition, it allows us to characterize p-threads and evaluate the performance potential of pre-execution under different processor and pre-execution configurations and conditions. In this paper, we do exactly that in the context of L2 misses. Our experiments confirm an intuitive result: maximum pre-execution effec-