The runahead network-on-chip
2016
2016 IEEE International Symposium on High Performance Computer Architecture (HPCA)
With more cores per chip multiprocessor and higher memory demands from applications, it is imperative that networks-on-chip (NoCs) provide low-latency, power-efficient communication ...
We propose the Runahead NoC, a lightweight, lossy network that provides single-cycle hops. ...
Prioritization Schemes on Network-on-Chip: Traditional NoCs employ simple arbitration strategies for packets, such as round-robin or age-based arbitration. ...
doi:10.1109/hpca.2016.7446076
dblp:conf/hpca/LiMJ16
fatcat:ix2ppbfvwfadtgp6mtsyemw3xa
On-Chip Mechanisms to Reduce Effective Memory Access Latency
[article]
2016
arXiv
pre-print
Independent cache misses have all of the source data that is required to generate the address of the memory access available on-chip, while dependent cache misses depend on data that is located off-chip ...
Independent cache misses are accelerated using a new mode for runahead execution that only executes filtered dependence chains. ...
This distinction is made on the basis of whether all source data for the cache miss is available on-chip or off-chip. ...
arXiv:1609.00306v1
fatcat:hh2lxatnhfdz5mekvmim2p5a24
Parallelizing Bisection Root-Finding: A Case for Accelerating Serial Algorithms in Multicore Substrates
[article]
2018
arXiv
pre-print
In this paper, we propose Runahead Computing, a technique which uses idle threads in a multi-threaded architecture for accelerating the execution time of serial algorithms. ...
Even though the number of cores and threads is already high and continues to grow, inherently serial algorithms do not benefit from this abundance of cores and threads. ...
Some proposals resolve this problem with asymmetric architectures [24]. (2) Increasing the core count forces the replacement of non-scalable crossbars with on-chip networks that use scalable topologies (e.g ...
arXiv:1805.07269v1
fatcat:bgixjmnmljanjk4f3o26o5tfza
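The parallelized bisection idea described in the entry above can be illustrated with a generic k-section sketch, in which the interior probe points of each iteration are evaluated by a pool of otherwise idle threads. This is only an illustrative assumption about the approach; the function name `parallel_bisect` and its structure are hypothetical, not the paper's actual Runahead Computing implementation.

```python
# Hypothetical sketch: serial bisection generalized to k-section, so that
# the k-1 interior function evaluations of each iteration can be handed
# to idle worker threads in parallel.
from concurrent.futures import ThreadPoolExecutor


def parallel_bisect(f, a, b, k=4, tol=1e-10, max_iter=200):
    """Find a root of f in [a, b], assuming f(a) and f(b) differ in sign."""
    fa, fb = f(a), f(b)
    assert fa * fb <= 0, "f must change sign on [a, b]"
    with ThreadPoolExecutor(max_workers=k) as pool:
        for _ in range(max_iter):
            if b - a < tol:
                break
            # Evaluate the k-1 interior points concurrently.
            xs = [a + (b - a) * i / k for i in range(1, k)]
            fxs = list(pool.map(f, xs))
            # Scan the k subintervals for the one containing the sign change.
            pts = [a] + xs + [b]
            vals = [fa] + fxs + [fb]
            for i in range(k):
                if vals[i] * vals[i + 1] <= 0:
                    a, fa = pts[i], vals[i]
                    b, fb = pts[i + 1], vals[i + 1]
                    break
    return 0.5 * (a + b)
```

Each iteration shrinks the interval by a factor of k instead of 2, so with k workers the number of sequential iterations drops from log2 to logk of the initial interval over the tolerance.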
Approaching a parallelized XML parser optimized for multi-core processors
2007
Proceedings of the 2007 workshop on Service-oriented computing performance: aspects, issues, and approaches - SOCP '07
We take a well-known high-performance parser, Piccolo, apply two different strategies, Runahead and Piped, and examine the file read time and hence the overall time to parse large scientific ...
Thus far, applications using Web services (in the grid community, for example) have largely focused on XML protocol standardization and tool building efforts, and not on addressing the performance bottlenecks ...
We use a 683MB XML file representing a protein sequence database [14] located on the local hard drive to eliminate network traffic complications. ...
doi:10.1145/1272457.1272460
dblp:conf/hpdc/HeadG07
fatcat:4ooecfbfkfgxljfuihsuzymyyy
Selected Papers from the International Conference on Reconfigurable Computing and FPGAs (ReConFig'10)
2012
International Journal of Reconfigurable Computing
We would like to thank all the reviewers for their valuable time and effort in the review process and for providing constructive feedback to the authors. ...
We thank all the authors who contributed to this Special Issue for submitting their manuscript and sharing their latest research results. ...
Two papers are within the area of multiprocessor systems and networks on chip. In "Redsharc: A Programming Model and On-Chip Network for Multi-Core Systems on a Programmable Chip", W. V. ...
doi:10.1155/2012/319827
fatcat:konar3542ndydjcdtgdjk3vmx4
iCFP: Tolerating all-level cache misses in in-order processors
2009
2009 IEEE 15th International Symposium on High Performance Computer Architecture
As a result, they have difficulties overlapping independent misses with one another. Previously proposed techniques like Runahead execution and Multipass pipelining have attacked this problem. ...
Cycle-level simulations show that iCFP outperforms Runahead, Multipass, and SLTP, another non-blocking in-order pipeline design. ...
Acknowledgments We thank the reviewers for their comments on this submission. This work was supported by NSF grant CCF-0541292 and by a grant from the Intel Research Council. ...
doi:10.1109/hpca.2009.4798281
dblp:conf/hpca/HiltonNR09
fatcat:wh3xh44vgvhl3o624e5rxbqqzq
Accelerating asynchronous programs through event sneak peek
2015
Proceedings of the 42nd Annual International Symposium on Computer Architecture - ISCA '15
We observe that these programs perform poorly on conventional processor architectures that are heavily optimized for the characteristics of synchronous programs. ...
ESP exploits the fact that events are posted to an event queue before they get executed. By exposing this event queue to the processor, ESP gains knowledge of the future events. ...
Acknowledgements We would like to thank the anonymous reviewers for their valuable comments and feedback. ...
doi:10.1145/2749469.2750373
dblp:conf/isca/ChadhaMN15
fatcat:xlbr7bgwpjghfaphnyjiuym7fe
iCFP: Tolerating All-Level Cache Misses in In-Order Processors
2010
IEEE Micro
As a result, they have difficulties overlapping independent misses with one another. Previously proposed techniques like Runahead execution and Multipass pipelining have attacked this problem. ...
Cycle-level simulations show that iCFP outperforms Runahead, Multipass, and SLTP, another non-blocking in-order pipeline design. ...
Acknowledgments We thank the reviewers for their comments on this submission. This work was supported by NSF grant CCF-0541292 and by a grant from the Intel Research Council. ...
doi:10.1109/mm.2010.20
fatcat:nlv5v7gapnbwdnivb4qjorjwjy
Accelerating asynchronous programs through event sneak peek
2015
SIGARCH Computer Architecture News
We observe that these programs perform poorly on conventional processor architectures that are heavily optimized for the characteristics of synchronous programs. ...
ESP exploits the fact that events are posted to an event queue before they get executed. By exposing this event queue to the processor, ESP gains knowledge of the future events. ...
Acknowledgements We would like to thank the anonymous reviewers for their valuable comments and feedback. ...
doi:10.1145/2872887.2750373
fatcat:57dkkolefvhzlhrr5jngtaeyqm
A simple latency tolerant processor
2008
2008 IEEE International Conference on Computer Design
With relatively constant die sizes, limited on-chip cache, and scarce pin bandwidth, more cores on a chip reduce the amount of available cache and bus bandwidth per core, thereby exacerbating the memory ...
The non-blocking property of this architecture provides tolerance to hundreds of cycles of cache miss latency on a simple in-order issue core, thus allowing many more such cores to be integrated on the ...
Introduction: Increased integration on a single chip has led to the current generation of multi-core processors having a few cores per chip. ...
doi:10.1109/iccd.2008.4751889
dblp:conf/iccd/NekkalapuAJRS08
fatcat:z6z3vdg4nreipjtu3xzgvexp5m
A Flexible Heterogeneous Multi-Core Architecture
2007
Parallel Architecture and Compilation Techniques (PACT), Proceedings of the International Conference on
Single-threaded applications can use the entire network of cores while multi-threaded applications can efficiently share the resources. ...
In single-threaded mode this processor is able to outperform previous state-of-the-art high-performance processor research by 12% on SpecFP. ...
Acknowledgements This work has been supported by the Ministerio de Educación y Ciencia of Spain under contract TIN-2004-07739-C02-01 and the HiPEAC European Network of Excellence (Framework Programme IST ...
doi:10.1109/pact.2007.4336196
fatcat:uuhhokn2bzcypkrk4ptz5izno4
Freeway: Maximizing MLP for Slice-Out-of-Order Execution
2019
2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)
While out-of-order (OoO) cores, and techniques building on them, are effective at exploiting MLP, they deliver poor energy efficiency due to their complex hardware and the resulting energy overheads. ...
To boost MLP generation in sOoO cores, we introduce Freeway, a sOoO core based on a new dependence-aware slice execution policy that tracks dependent slices and keeps them out of the way of MLP extraction ...
Sniper works by extending Intel's PIN tool [12] with models for the core, memory hierarchy, and on-chip networks. ...
doi:10.1109/hpca.2019.00009
dblp:conf/hpca/KumarAB19
fatcat:atg4v7g6nnajvojfyspinqhfyq
Decoupling Data Supply from Computation for Latency-Tolerant Communication in Heterogeneous Architectures
2017
ACM Transactions on Architecture and Code Optimization (TACO)
In adopting increased compute specialization, however, the relative amount of time spent on communication increases. ...
System and software optimizations for communication often come at the costs of increased complexity and reduced portability. ...
This work was supported in part by C-FAR, one of six centers of STARnet, a Semiconductor Research Corporation program sponsored by MARCO and DARPA. ...
doi:10.1145/3075620
fatcat:4bhdk7qaevfgjcxuqindioupku
The load slice core microarchitecture
2015
Proceedings of the 42nd Annual International Symposium on Computer Architecture - ISCA '15
Today, however, the growing off-chip memory wall and complex cache hierarchies of many-core processors make cache and memory accesses ever more costly. ...
The Load Slice Core extends the efficient in-order, stall-on-use core with a second in-order pipeline that enables memory accesses and address-generating instructions to bypass stalled instructions in ...
This work is supported by the European Research Council under the European Community's Seventh Framework Programme (FP7/2007-2013) / ERC Grant agreement no. 259295. ...
doi:10.1145/2749469.2750407
dblp:conf/isca/CarlsonHAKE15
fatcat:nptcpxrvxvh7xalyuy3a5nz4qm
A lifetime optimal algorithm for speculative PRE
2006
ACM Transactions on Architecture and Code Optimization (TACO)
The key in achieving lifetime optimality lies not only in finding a unique minimum cut on a transformed graph of a given CFG, but also in performing a data-flow analysis directly on the CFG to avoid making ...
A lifetime optimal algorithm, called MC-PRE, is presented for the first time that performs speculative PRE based on edge profiles. ...
ACKNOWLEDGMENTS We wish to thank the reviewers and editors for their helpful comments and suggestions. This work is partially supported by an ARC grant DP0452623. ...
doi:10.1145/1138035.1138036
fatcat:6jxnqgxw6vbzpoefpqlt56pacm
Showing results 1 — 15 out of 95 results