2,386 Hits in 4.7 sec

Comparing and combining read miss clustering and software prefetching

V.S. Pai, S.V. Adve
Proceedings 2001 International Conference on Parallel Architectures and Compilation Techniques  
A recent latency tolerance technique, read miss clustering, restructures code to send demand miss references in parallel to the underlying memory system.  ...  An alternate, widely-used latency tolerance technique is software prefetching, which initiates data fetches ahead of expected demand miss references by a certain distance.  ...  Acknowledgments We thank Vikram Adve, Keith Cooper, Chen Ding, Ken Kennedy, John Mellor-Crummey, Partha Ranganathan, and Willy Zwaenepoel for valuable comments on this work.  ... 
doi:10.1109/pact.2001.953310 dblp:conf/IEEEpact/PaiA01 fatcat:5sztnupbrnecnnxsvedm7etvri

Effectiveness of Dynamic Prefetching in Multiple-Writer Distributed Virtual Shared-Memory Systems

Magnus Karlsson, Per Stenström
1997 Journal of Parallel and Distributed Computing  
To tolerate the access latencies, we propose a novel prefetch approach and show how it can be integrated into the software-based coherence layer of a multiple-writer protocol.  ...  Based on detailed architectural simulations and seven scientific applications we find that our prefetch algorithm can remove a vast majority of the remote operations which improves the performance of all  ...  Acknowledgments The authors are indebted to Mats Brorsson of Lund University and Jan Jonsson of Chalmers University of Technology for their suggestions and comments and to Robert Fowler of Rice University  ... 
doi:10.1006/jpdc.1997.1333 fatcat:e736a5vfczdilhzjbbnjlrva3a

Cluster miss prediction for instruction caches in embedded networking applications

Ken Batcher, Robert Walker
2004 Proceedings of the 14th ACM Great Lakes symposium on VLSI - GLSVLSI '04  
By identifying the start of a cluster miss sequence and preparing an instruction buffer for the upcoming cache misses, the miss penalty can be reduced if a miss does occur.  ...  A sample industrial networking example is used to illustrate the effectiveness of this technique compared with other prefetch methods.  ...  With single-cycle latency to compare the miss address against the history table and a cycle of overhead to read the trigger  ... 
doi:10.1145/988952.989039 dblp:conf/glvlsi/BatcherW04 fatcat:mtxjgn2m4nae7ikfplpsf77lhy

Scalable and Efficient Virtual Memory Sharing in Heterogeneous SoCs with TLB Prefetching and MMU-Aware DMA Engine [article]

Andreas Kurth, Pirmin Vogel, Andrea Marongiu, Luca Benini
2018 arXiv   pre-print
Compared to the state of the art, our work improves accelerator performance for memory-intensive kernels by up to 4x and by up to 60% for irregular and regular memory access patterns, respectively.  ...  In this work, we present our SVM solution that avoids the majority of TLB misses with prefetching, supports parallel burst DMA transfers without additional buffers, and can be scaled with the workload  ...  Weinbuch for his work on multi-threaded TLB miss handling during his Master's Thesis.  ... 
arXiv:1808.09751v1 fatcat:qqseltfgabftteu6aj626ezs5e

Scalable and Efficient Virtual Memory Sharing in Heterogeneous SoCs with TLB Prefetching and MMU-Aware DMA Engine

Andreas Kurth, Pirmin Vogel, Andrea Marongiu, Luca Benini
2018 IEEE 36th International Conference on Computer Design (ICCD)  
Compared to the state of the art, our work improves accelerator performance for memory-intensive kernels by up to 4× and by up to 60 % for irregular and regular memory access patterns, respectively.  ...  In this work, we present our SVM solution that avoids the majority of TLB misses with prefetching, supports parallel burst DMA transfers without additional buffers, and can be scaled with the workload  ...  Weinbuch for his work on multi-threaded TLB miss handling during his Master's Thesis.  ... 
doi:10.1109/iccd.2018.00052 dblp:conf/iccd/KurthVMB18 fatcat:or22l5iaqfajhhgsci43lo67py

A Hybrid Instruction Prefetching Mechanism for Ultra Low-Power Multicore Clusters

Maryam Payami, Erfan Azarkhish, Igor Loi, Luca Benini
2017 IEEE Embedded Systems Letters  
In addition, we designed our prefetcher and integrated it in a 4-core cluster in 28 nm FDSOI technology.  ...  In this paper, we propose a low-cost and energy-efficient hybrid instruction-prefetching mechanism to be integrated with an Ultra-Low-Power (ULP) multi-core cluster.  ...  But for Group-2, software assistance is required to achieve high hit rates. Therefore, a combination of NLP+SWP is proposed, with software prefetch requests explicitly inserted in the code.  ... 
doi:10.1109/les.2017.2707978 fatcat:uxor5dqojre5ll7fgv7h5ixl5q

Design of a scalable multiprocessor architecture and its simulation

Der-Lin Pean, Chao-Chin Wu, Huey-Ting Chua, Cheng Chen
2001 Journal of Systems and Software  
and software mechanisms.  ...  So far, we have evaluated several vital issues of cluster-based multiprocessors on SEECMA, including effective prefetching and replacement policies, and optimization of migratory sharing using both hardware  ...  This inter-clustering prefetching technique causes fewer local bus requests, because it issues fewer read miss requests, and this may improve the clustering system.  ... 
doi:10.1016/s0164-1212(01)00034-6 fatcat:ix6rj2i3grfbfgk3ha23bd4udq

The interaction of software prefetching with ILP processors in shared-memory systems

Parthasarathy Ranganathan, Vijay S. Pai, Hazim Abdel-Shafi, Sarita V. Adve
1997 Proceedings of the 24th annual international symposium on Computer architecture - ISCA '97  
Current microprocessors aggressively exploit instruction-level parallelism (ILP) through techniques such as multiple issue, dynamic scheduling, and non-blocking reads.  ...  In particular, we seek to determine whether software prefetching can equalize the performance of sequential consistency (SC) and release consistency (RC).  ...  Our results also show that optimizations to cluster read misses for the ILP system (described in Section 3.4) can reduce the effectiveness of software prefetching.  ... 
doi:10.1145/264107.264158 dblp:conf/isca/RanganathanPAA97 fatcat:y5gv372vqffczeh3zosoqh7pji

The interaction of software prefetching with ILP processors in shared-memory systems

Parthasarathy Ranganathan, Vijay S. Pai, Hazim Abdel-Shafi, Sarita V. Adve
1997 SIGARCH Computer Architecture News  
Current microprocessors aggressively exploit instruction-level parallelism (ILP) through techniques such as multiple issue, dynamic scheduling, and non-blocking reads.  ...  In particular, we seek to determine whether software prefetching can equalize the performance of sequential consistency (SC) and release consistency (RC).  ...  Our results also show that optimizations to cluster read misses for the ILP system (described in Section 3.4) can reduce the effectiveness of software prefetching.  ... 
doi:10.1145/384286.264158 fatcat:wvaxpdgn6vgxbilvi66lyk7kwu

Cache streamization for high performance stream processor

Nan Wu, Mei Wen, Ju Ren, Yi He, ChangQing Xun, Wei Wu, Chunyuan Zhang
2009 International Conference on High Performance Computing (HiPC)  
For this problem, this paper developed a streamization cache whose performance is comparable to streaming memory but which is easier to use.  ...  Due to the high bandwidth demand that stream applications place on the memory system, most stream processors use software-managed streaming memory.  ...  On average, APD prefetch reduces compulsory miss rates by 52% relative to the base cache without prefetch, and by 37% relative to the enhanced cache with a traditional prefetch policy.  ... 
doi:10.1109/hipc.2009.5433214 dblp:conf/hipc/WuWRHXWZ09 fatcat:gv6bs4lsljdwpkpe2u4dcbmwtu

Resource-Aware Compiler Prefetching for Fine-Grained Many-Cores

George C. Caragea, Alexandros Tzannes, Fuat Keceli, Rajeev Barua, Uzi Vishkin
2011 International journal of parallel programming  
Moreover, we compare the RAP algorithm with a simple hardware prefetching mechanism, and show run-time improvements of up to 24.61%.  ...  Support for hardware and software prefetch increases MLP pressure since these techniques overlap multiple memory requests with existing computation.  ...  In Fig. 7 we compared the performance of the software-only RAP algorithm with configurations in which both hardware and software prefetching were enabled.  ... 
doi:10.1007/s10766-011-0163-8 fatcat:amlpvqswhrevddi2c7lqyus7qy

Reducing memory latency via non-blocking and prefetching caches

Tien-Fu Chen, Jean-Loup Baer
1992 Proceedings of the fifth international conference on Architectural support for programming languages and operating systems - ASPLOS-V  
A nonblocking cache allows execution to proceed concurrently with cache misses as long as dependency constraints are observed, thus exploiting post-miss operations. A prefetching cache generates prefetch  ...  improved substantially by compiler optimizations such as instruction scheduling and register renaming.  ...  This work was supported by NSF Grants CCR-9101541 and CCR-8904190, and by Apple Computer.  ... 
doi:10.1145/143365.143486 dblp:conf/asplos/ChenB92 fatcat:fn6ch3rkpbacdadrkrfnc2tmua

Comprehensive hardware and software support for operating systems to exploit MP memory hierarchies

Chun Xia, J. Torrellas
1999 IEEE transactions on computers  
, and software data prefetching.  ...  We show that they have a largely complementary impact and that, when combined, speed up the operating system by an average of 40 percent.  ...  We also thank Tom Murphy, Perry Emrath, and Liuxi Yang for their help with the hardware and operating system, and Intel and IBM for their generous support. This work was supported in part by the U.S.  ... 
doi:10.1109/12.769432 fatcat:4hbtey6n6jfpbpqef73xu6rame

A Task-centric Memory Model for Scalable Accelerator Architectures

John Kelm, Daniel Johnson, Steven S. Lumetta, Matthew Frank, Sanjay Patel
2010 IEEE Micro  
We further show that, while software management may constrain speculative hardware prefetching into local caches, a common optimization, it does not constrain the more relevant use case of off-chip prefetching  ...  We evaluate coherence management policies related to the task-centric memory model and show that the overhead of maintaining a coherent view of memory in software can be minimal.  ...  Johnson, Aqeel Mahesri, and the anonymous referees for their input and feedback. John Kelm was partially supported by a fellowship from ATI/AMD.  ... 
doi:10.1109/mm.2010.1 fatcat:a7zgs53fmbdv5iz5n7dzsythv4

Reducing memory latency via non-blocking and prefetching caches

Tien-Fu Chen, Jean-Loup Baer
1992 SIGPLAN notices  
A nonblocking cache allows execution to proceed concurrently with cache misses as long as dependency constraints are observed, thus exploiting post-miss operations. A prefetching cache generates prefetch  ...  improved substantially by compiler optimizations such as instruction scheduling and register renaming.  ...  This work was supported by NSF Grants CCR-9101541 and CCR-8904190, and by Apple Computer.  ... 
doi:10.1145/143371.143486 fatcat:pc7zrg4yg5ehxda4fkgqndigaq
Showing results 1 — 15 out of 2,386 results