95 Hits in 6.1 sec

The runahead network-on-chip

Zimo Li, Joshua San Miguel, Natalie Enright Jerger
2016 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA)  
The Runahead Network-On-Chip With more cores per chip multiprocessor and higher memory demands from applications, it is imperative that networks-on-chip (NoCs) provide low-latency, power-efficient communication  ...  We propose the Runahead NoC, a lightweight, lossy network that provides single-cycle hops.  ...  Prioritization Schemes on Network on Chip: Traditional networks-on-chip employ simple arbitration strategies for packets, such as round-robin or age-based arbitration.  ... 
doi:10.1109/hpca.2016.7446076 dblp:conf/hpca/LiMJ16 fatcat:ix2ppbfvwfadtgp6mtsyemw3xa

On-Chip Mechanisms to Reduce Effective Memory Access Latency [article]

Milad Hashemi
2016 arXiv pre-print
Independent cache misses have all of the source data that is required to generate the address of the memory access available on-chip, while dependent cache misses depend on data that is located off-chip  ...  Independent cache misses are accelerated using a new mode for runahead execution that only executes filtered dependence chains.  ...  This distinction is made on the basis of whether all source data for the cache miss is available on-chip or off-chip.  ... 
arXiv:1609.00306v1 fatcat:hh2lxatnhfdz5mekvmim2p5a24

Parallelizing Bisection Root-Finding: A Case for Accelerating Serial Algorithms in Multicore Substrates [article]

Mohammad Bakhshalipour, Hamid Sarbazi-Azad
2018 arXiv pre-print
In this paper, we propose Runahead Computing, a technique which uses idle threads in a multi-threaded architecture to accelerate the execution of serial algorithms.  ...  Even though the number of cores and threads is quite high and continues to grow, inherently serial algorithms do not benefit from the abundance of cores and threads.  ...  Some proposals resolve this problem with asymmetric architectures [24] . (2) Increasing the core count forces the replacement of non-scalable crossbars with on-chip networks that use scalable topologies (e.g  ... 
arXiv:1805.07269v1 fatcat:bgixjmnmljanjk4f3o26o5tfza

Approaching a parallelized XML parser optimized for multi-core processors

Michael R. Head, Madhusudhan Govindaraju
2007 Proceedings of the 2007 workshop on Service-oriented computing performance: aspects, issues, and approaches - SOCP '07  
We take a well-known high-performance parser, Piccolo, and apply two different strategies, Runahead and Piped, examining the file read time and hence the overall time to parse large scientific  ...  Thus far, applications using Web services (in the grid community, for example) have largely focused on XML protocol standardization and tool-building efforts, and not on addressing the performance bottlenecks  ...  We use a 683MB XML file representing a protein sequence database [14] located on the local hard drive to eliminate network traffic complications.  ... 
doi:10.1145/1272457.1272460 dblp:conf/hpdc/HeadG07 fatcat:4ooecfbfkfgxljfuihsuzymyyy

Selected Papers from the International Conference on Reconfigurable Computing and FPGAs (ReConFig'10)

Claudia Feregrino, Miguel Arias, Kris Gaj, Viktor K. Prasanna, Marco D. Santambrogio, Ron Sass
2012 International Journal of Reconfigurable Computing  
We would like to thank all the reviewers for their valuable time and effort in the review process and for providing constructive feedback to the authors.  ...  We thank all the authors who contributed to this Special Issue for submitting their manuscripts and sharing their latest research results.  ...  Two papers are within the area of multiprocessor systems and networks on chip. In "Redsharc: A Programming Model and On-Chip Network for Multi-Core Systems on a Programmable Chip", W. V.  ... 
doi:10.1155/2012/319827 fatcat:konar3542ndydjcdtgdjk3vmx4

iCFP: Tolerating all-level cache misses in in-order processors

Andrew Hilton, Santosh Nagarakatte, Amir Roth
2009 2009 IEEE 15th International Symposium on High Performance Computer Architecture  
As a result, they have difficulty overlapping independent misses with one another. Previously proposed techniques like Runahead execution and Multipass pipelining have attacked this problem.  ...  Cycle-level simulations show that iCFP outperforms Runahead, Multipass, and SLTP, another non-blocking in-order pipeline design.  ...  Acknowledgments We thank the reviewers for their comments on this submission. This work was supported by NSF grant CCF-0541292 and by a grant from the Intel Research Council.  ... 
doi:10.1109/hpca.2009.4798281 dblp:conf/hpca/HiltonNR09 fatcat:wh3xh44vgvhl3o624e5rxbqqzq

Accelerating asynchronous programs through event sneak peek

Gaurav Chadha, Scott Mahlke, Satish Narayanasamy
2015 Proceedings of the 42nd Annual International Symposium on Computer Architecture - ISCA '15  
We observe that these programs perform poorly on conventional processor architectures that are heavily optimized for the characteristics of synchronous programs.  ...  ESP exploits the fact that events are posted to an event queue before they get executed. By exposing this event queue to the processor, ESP gains knowledge of the future events.  ...  Acknowledgements We would like to thank the anonymous reviewers for their valuable comments and feedback.  ... 
doi:10.1145/2749469.2750373 dblp:conf/isca/ChadhaMN15 fatcat:xlbr7bgwpjghfaphnyjiuym7fe

iCFP: Tolerating All-Level Cache Misses in In-Order Processors

Andrew Hilton, Santosh Nagarakatte, Amir Roth
2010 IEEE Micro  
As a result, they have difficulty overlapping independent misses with one another. Previously proposed techniques like Runahead execution and Multipass pipelining have attacked this problem.  ...  Cycle-level simulations show that iCFP outperforms Runahead, Multipass, and SLTP, another non-blocking in-order pipeline design.  ...  Acknowledgments We thank the reviewers for their comments on this submission. This work was supported by NSF grant CCF-0541292 and by a grant from the Intel Research Council.  ... 
doi:10.1109/mm.2010.20 fatcat:nlv5v7gapnbwdnivb4qjorjwjy

Accelerating asynchronous programs through event sneak peek

Gaurav Chadha, Scott Mahlke, Satish Narayanasamy
2015 SIGARCH Computer Architecture News  
We observe that these programs perform poorly on conventional processor architectures that are heavily optimized for the characteristics of synchronous programs.  ...  ESP exploits the fact that events are posted to an event queue before they get executed. By exposing this event queue to the processor, ESP gains knowledge of the future events.  ...  Acknowledgements We would like to thank the anonymous reviewers for their valuable comments and feedback.  ... 
doi:10.1145/2872887.2750373 fatcat:57dkkolefvhzlhrr5jngtaeyqm

A simple latency tolerant processor

Satyanarayana Nekkalapu, Haitham Akkary, Komal Jothi, Renjith Retnamma, Xiaoyu Song
2008 2008 IEEE International Conference on Computer Design  
With relatively constant die sizes, limited on-chip cache, and scarce pin bandwidth, more cores on chip reduce the amount of available cache and bus bandwidth per core, therefore exacerbating the memory  ...  The non-blocking property of this architecture provides tolerance to hundreds of cycles of cache-miss latency on a simple in-order issue core, thus allowing many more such cores to be integrated on the  ...  INTRODUCTION Increased integration on a single chip has led to the current generation of multi-core processors having a few cores per chip.  ... 
doi:10.1109/iccd.2008.4751889 dblp:conf/iccd/NekkalapuAJRS08 fatcat:z6z3vdg4nreipjtu3xzgvexp5m

A Flexible Heterogeneous Multi-Core Architecture

Miquel Pericas, Adrian Cristal, Francisco J. Cazorla, Ruben Gonzalez, Daniel A. Jimenez, Mateo Valero
2007 Parallel Architecture and Compilation Techniques (PACT), Proceedings of the International Conference on  
Single-threaded applications can use the entire network of cores while multi-threaded applications can efficiently share the resources.  ...  In single-threaded mode this processor is able to outperform previous state-of-the-art high-performance processor research by 12% on SpecFP.  ...  Acknowledgements This work has been supported by the Ministerio de Educación y Ciencia of Spain under contract TIN-2004-07739-C02-01 and the HiPEAC European Network of Excellence (Framework Programme IST  ... 
doi:10.1109/pact.2007.4336196 fatcat:uuhhokn2bzcypkrk4ptz5izno4

Freeway: Maximizing MLP for Slice-Out-of-Order Execution

Rakesh Kumar, Mehdi Alipour, David Black-Schaffer
2019 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)  
While out-of-order (OoO) cores, and techniques building on them, are effective at exploiting MLP, they deliver poor energy efficiency due to their complex hardware and the resulting energy overheads.  ...  To boost MLP generation in sOoO cores, we introduce Freeway, a sOoO core based on a new dependence-aware slice execution policy that tracks dependent slices and keeps them out of the way of MLP extraction  ...  Sniper works by extending Intel's PIN tool [12] with models for the core, memory hierarchy, and on-chip networks.  ... 
doi:10.1109/hpca.2019.00009 dblp:conf/hpca/KumarAB19 fatcat:atg4v7g6nnajvojfyspinqhfyq

Decoupling Data Supply from Computation for Latency-Tolerant Communication in Heterogeneous Architectures

Tae Jun Ham, Juan L. Aragón, Margaret Martonosi
2017 ACM Transactions on Architecture and Code Optimization (TACO)  
In adopting increased compute specialization, however, the relative amount of time spent on communication increases.  ...  System and software optimizations for communication often come at the cost of increased complexity and reduced portability.  ...  This work was supported in part by C-FAR, one of six centers of STARnet, a Semiconductor Research Corporation program sponsored by MARCO and DARPA.  ... 
doi:10.1145/3075620 fatcat:4bhdk7qaevfgjcxuqindioupku

The load slice core microarchitecture

Trevor E. Carlson, Wim Heirman, Osman Allam, Stefanos Kaxiras, Lieven Eeckhout
2015 Proceedings of the 42nd Annual International Symposium on Computer Architecture - ISCA '15  
Today, however, the growing off-chip memory wall and complex cache hierarchies of many-core processors make cache and memory accesses ever more costly.  ...  The Load Slice Core extends the efficient in-order, stall-on-use core with a second in-order pipeline that enables memory accesses and address-generating instructions to bypass stalled instructions in  ...  This work is supported by the European Research Council under the European Community's Seventh Framework Programme (FP7/2007-2013) / ERC Grant agreement no. 259295.  ... 
doi:10.1145/2749469.2750407 dblp:conf/isca/CarlsonHAKE15 fatcat:nptcpxrvxvh7xalyuy3a5nz4qm

A lifetime optimal algorithm for speculative PRE

Jingling Xue, Qiong Cai
2006 ACM Transactions on Architecture and Code Optimization (TACO)  
The key in achieving lifetime optimality lies not only in finding a unique minimum cut on a transformed graph of a given CFG, but also in performing a data-flow analysis directly on the CFG to avoid making  ...  A lifetime optimal algorithm, called MC-PRE, is presented for the first time that performs speculative PRE based on edge profiles.  ...  ACKNOWLEDGMENTS We wish to thank the reviewers and editors for their helpful comments and suggestions. This work is partially supported by an ARC grant DP0452623.  ... 
doi:10.1145/1138035.1138036 fatcat:6jxnqgxw6vbzpoefpqlt56pacm
Showing results 1 — 15 out of 95 results