Runahead execution: an alternative to very large instruction windows for out-of-order processors

O. Mutlu, J. Stark, C. Wilkerson, Y.N. Patt
The Ninth International Symposium on High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings.  
This paper proposes runahead execution as an effective way to increase memory latency tolerance in an out-of-order processor, without requiring an unreasonably large instruction window.  ...  Today's high performance processors tolerate long latency operations by means of out-of-order execution.  ...  We also thank the other members of the HPS and Intel Labs research groups for the fertile environments they help create. This work was supported by an internship provided by Intel.  ... 
doi:10.1109/hpca.2003.1183532 dblp:conf/hpca/MutluSWP03 fatcat:f5xg74tvifda5exk2bxbhrlnv4

Runahead execution: an effective alternative to large instruction windows

O. Mutlu, J. Stark, C. Wilkerson, Y.N. Patt
2003 IEEE Micro  
Unfortunately, main memory latencies are so long that out-of-order processors require large instruction windows to tolerate them.  ...  High-performance processors execute instructions out of program order to tolerate long latencies and extract instruction-level parallelism.  ...  Stark has a BS in electrical engineering and an MS and a PhD in computer engineering, all from the University of Michigan. He is a member of the IEEE.  ... 
doi:10.1109/mm.2003.1261383 fatcat:ygjuu3cywzgttoxftmvj3oq764

Techniques for Efficient Processing in Runahead Execution Engines

Onur Mutlu, Hyesoon Kim, Yale N. Patt
2005 SIGARCH Computer Architecture News  
A runahead processor executes significantly more instructions than a traditional out-of-order processor, sometimes without providing any performance benefit, which makes it inefficient.  ...  The techniques we propose reduce the increase in the number of instructions executed due to runahead execution from 26.5% to 6.2%, on average, without significantly affecting the performance improvement  ...  Acknowledgments We thank Mike Butler, Nhon Quach, Jared Stark, Santhosh Srinath, and other members of the HPS research group for their helpful comments on drafts of this paper.  ... 
doi:10.1145/1080695.1070000 fatcat:4k6bk5qrcvbltlctqcvexkiwdq

Runahead Threads to improve SMT performance

Tanausu Ramirez, Alex Pajuelo, Oliverio J. Santana, Mateo Valero
2008 High-Performance Computer Architecture  
In this paper, we propose Runahead Threads (RaT) as a valuable solution for both reducing resource contention and exploiting memory-level parallelism in Simultaneous Multithreaded (SMT) processors.  ...  We compare an SMT architecture using RaT to both state-of-the-art static fetch policies and dynamic resource control policies.  ...  We also like to thank Rick Strong, Ramon Canal, and Manoj Gupta for their help in preparing the final version of this manuscript.  ... 
doi:10.1109/hpca.2008.4658635 dblp:conf/hpca/RamirezPSV08 fatcat:ovcvq3z4h5a6xndteuguds7ma4

Freeway: Maximizing MLP for Slice-Out-of-Order Execution

Rakesh Kumar, Mehdi Alipour, David Black-Schaffer
2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)  
This work revisits slice-out-of-order (sOoO) cores as an energy efficient alternative to OoO cores for MLP exploitation.  ...  These cores construct slices of MLP generating instructions and execute them out-of-order with respect to the rest of instructions.  ...  To enable these instructions to execute out-of-order as regards to the rest of the instructions, LSC adds an additional inorder instruction queue, called the bypass queue (B-IQ).  ... 
doi:10.1109/hpca.2019.00009 dblp:conf/hpca/KumarAB19 fatcat:atg4v7g6nnajvojfyspinqhfyq

Optimizing Dual-Core Execution for Power Efficiency and Transient-Fault Recovery

Yi Ma, Hongliang Gao, Martin Dimitrov, Huiyang Zhou
2007 IEEE Transactions on Parallel and Distributed Systems  
Such reexecution is the key to eliminating the centralized structures that are normally associated with very large instruction windows.  ...  Dual-core execution (DCE) is an execution paradigm proposed to utilize chip multiprocessors to improve the performance of single-threaded applications.  ...  ACKNOWLEDGMENTS The authors thank the anonymous reviewers for their insightful and valuable comments.  ... 
doi:10.1109/tpds.2007.4288106 fatcat:rww3jiwbgnebpizagla44fedya

CG-OoO: Energy-Efficient Coarse-Grain Out-of-Order Execution [article]

Milad Mohammadi, Tor M. Aamodt, William J. Dally
2016 arXiv   pre-print
We introduce the Coarse-Grain Out-of-Order (CG-OoO) general purpose processor designed to achieve close to In-Order processor energy while maintaining Out-of-Order (OoO) performance.  ...  Through the energy efficiency techniques applied to the compiler and processor pipeline stages, CG-OoO closes 64% of the average energy gap between the In-Order and Out-of-Order baseline processors at  ...  INTRODUCTION This paper revisits the Out-of-Order (OoO) execution model and devises an alternative model that achieves the performance of the OoO at over 50% lower energy cost. Czechowski et al.  ... 
arXiv:1606.01607v1 fatcat:rzqeu325szbpzezg4oinpq7szu

Register File Size Reduction through Instruction Pre-Execution Incorporating Value Prediction

Yusuke TANAKA, Hideki ANDO
2010 IEICE transactions on information and systems  
Both are due to data dependencies among the pre-executed instructions. This paper proposes the use of value prediction to solve these problems.  ...  Ideally, TSD allows exploitation of MLP under an unlimited number of physical registers, and consequently only a small register file is needed for MLP.  ...  Ichihara for their help to collect evaluation data.  ... 
doi:10.1587/transinf.e93.d.3294 fatcat:wne3on3jlrgv3kz66vgvl3whey

Microarchitecture Optimizations for Exploiting Memory-Level Parallelism

Yuan Chou, Brian Fahs, Santosh Abraham
2004 SIGARCH Computer Architecture News  
instructions is needed to attain the full benefits of large out-of-order instruction windows.  ...  Simulation results show that a moderately aggressive out-of-order issue processor improves MLP over an in-order issue processor by 12-30%, and that aggressive handling of loads, branches and serializing  ...  We would also like to thank Craig Anderson, Sorin Iacobovici, Gurindar Sohi, Rabin Sugumar, Marc Tremblay and Stevan Vlaovic for reviewing early drafts of this paper.  ... 
doi:10.1145/1028176.1006708 fatcat:oqnkkj5w3zdcnbmf5jgjz66ti4

Improving single-thread performance with fine-grain state maintenance

Peng Zhou, Soner Önder
2008 Proceedings of the 2008 conference on Computing frontiers - CF '08  
We evaluate an SMT-like fine grain state processor and show that it obtains an average of 38.9% and up to 160.0% better performance than coarse-grain baseline processors on the SPEC CFP2000 benchmark suite  ...  state to an independent thread.  ...  Unlike CFP which utilizes very large hierarchical load and store queues to buffer all in-flight load and store instructions, FSG-RA needs small load/store queues and forks a second thread to verify and  ... 
doi:10.1145/1366230.1366274 dblp:conf/cf/ZhouO08 fatcat:aeptbndairdmfd7akk6nht5fx4

Techniques for Efficient Processing in Runahead Execution Engines

O. Mutlu, Hyesoon Kim, Y.N. Patt
32nd International Symposium on Computer Architecture (ISCA'05)  
A runahead processor executes significantly more instructions than a traditional out-of-order processor, sometimes without providing any performance benefit, which makes it inefficient.  ...  The techniques we propose reduce the increase in the number of instructions executed due to runahead execution from 26.5% to 6.2%, on average, without significantly affecting the performance improvement  ...  Acknowledgments We thank Mike Butler, Nhon Quach, Jared Stark, Santhosh Srinath, and other members of the HPS research group for their helpful comments on drafts of this paper.  ... 
doi:10.1109/isca.2005.49 dblp:conf/isca/MutluKP05 fatcat:o3o3356h5jfvvnwdfcwgur5fey

Data prefetching and address pre-calculation through instruction pre-execution with two-step physical register deallocation

Akihiro Yamamoto, Yusuke Tanaka, Hideki Ando, Toshio Shimada
2007 Proceedings of the 2007 workshop on MEmory performance DEaling with Applications, systems and architecture - MEDEA '07  
This paper proposes an instruction pre-execution scheme that reduces latency and early scheduling of loads for a high performance processor.  ...  Instructions wait for the final deallocation as a second step in the instruction window. While waiting, the scheme allows pre-execution of instructions.  ...  This work was partially supported by The Ministry of Education, Culture, Sports, Science and Technology Grant-in-Aid for Scientific Research (C)(No. 19500041).  ... 
doi:10.1145/1327171.1327175 fatcat:mcw42awyqrcgxg2ggofnk2mpwi

MLP-Aware Dynamic Instruction Window Resizing in Superscalar Processors for Adaptively Exploiting Available Parallelism

Yuya KORA, Kyohei YAMAGUCHI, Hideki ANDO
2014 IEICE transactions on information and systems  
A promising method to overcome this memory wall is aggressive out-of-order execution by extensively enlarging the instruction window resources to exploit memory-level parallelism (MLP).  ...  One of the reasons for this is that there is a speed gap between the processor and main memory, known as the memory wall.  ...  Acknowledgments The authors wish to thank T. Inagaki for his contributions to our study.  ... 
doi:10.1587/transinf.2014edp7177 fatcat:yotax4kpdfav5ku26gmfgtmbte

Address-Value Delta (AVD) Prediction: A Hardware Technique for Efficiently Parallelizing Dependent Cache Misses

O. Mutlu, Hyesoon Kim, Y.N. Patt
2006 IEEE transactions on computers  
We analyze why and for what kind of loads AVD prediction works and describe the design of an implementable AVD predictor.  ...  An AVD predictor keeps track of the address (pointer) load instructions for which the arithmetic difference (i.e., delta) between the effective address and the data value is stable.  ...  This paper is an extended and revised version of [23].  ... 
doi:10.1109/tc.2006.191 fatcat:fz2lpgdfzbh2thua3dhl5uolte

Efficient execution of memory access phases using dataflow specialization

Chen-Han Ho, Sung Jin Kim, Karthikeyan Sankaralingam
2015 SIGARCH Computer Architecture News  
We observe such code requires an OOO core's dataflow and dynamism to run fast and does not execute well on an in-order processor.  ...  These are dynamic regions of programs where most of the instructions are devoted to memory access or address computation.  ...  Support for this research was provided by NSF under the following grants CCF-1162215, CNS-1228782, CNS-1218432.  ... 
doi:10.1145/2872887.2750390 fatcat:c2malodvtfcxjn2uxmowb5heci