4,744 Hits in 4.6 sec

Exploiting a Computation Reuse Cache to Reduce Energy in Network Processors [chapter]

Bengu Li, Ganesh Venkatesh, Brad Calder, Rajiv Gupta
2005 Lecture Notes in Computer Science  
Caches, on the other hand, are meant to help latency, not throughput, in a traditional processor, and provide no additional throughput for a balanced network processor design.  ...  This is why most high-end routers do not use caches for their data plane algorithms. In this paper we examine how to use a cache for a balanced high-bandwidth network processor.  ...  This work was funded in part by NSF grant CNS-0509546, and grants from Microsoft and Intel Corporation to the University of California, San Diego, and NSF grant CCF-0208756, and grants from Intel Corp.,  ...
doi:10.1007/11587514_17 fatcat:awyd3uttwrdqjdocxpg3msj3ma
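
As a rough software illustration of the computation-reuse idea (not the paper's hardware design), a reuse cache memoizes the result of an expensive data-plane computation keyed by flow. A minimal sketch in C++, where `classify` and the 32-bit flow key are hypothetical stand-ins:

```cpp
#include <cstdint>
#include <iostream>
#include <unordered_map>

// Hypothetical expensive data-plane computation, e.g. a route lookup.
static uint32_t classify(uint32_t flow_key) {
    return (flow_key * 2654435761u) >> 16;  // placeholder work
}

class ReuseCache {
public:
    uint32_t lookup(uint32_t key) {
        auto it = table_.find(key);
        if (it != table_.end()) {
            ++hits_;                  // reuse hit: skip the computation
            return it->second;
        }
        uint32_t result = classify(key);
        table_[key] = result;         // install for later packets of this flow
        return result;
    }
    uint64_t hits() const { return hits_; }
private:
    std::unordered_map<uint32_t, uint32_t> table_;
    uint64_t hits_ = 0;
};

int main() {
    ReuseCache cache;
    const uint32_t flows[] = {7, 7, 42, 7, 42};  // repeated flows hit the cache
    for (uint32_t f : flows) cache.lookup(f);
    std::cout << "reuse hits: " << cache.hits() << "\n";  // prints 3
}
```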

On the effectiveness of prefetching and reuse in reducing L1 data cache traffic

G. Surendra, Subhasis Banerjee, S. K. Nandy
2004 Proceedings of the 3rd workshop on Memory performance issues in conjunction with the 31st international symposium on computer architecture - WMPI '04  
and (ii) load Instruction Reuse (IR), in reducing data cache traffic.  ...  matching engine found in many network processors.  ...  In this paper, we compare two techniques, prefetching and Instruction Reuse [19], in terms of their ability to reduce L1 data cache traffic in a popular network IDS called Snort [16].  ...
doi:10.1145/1054943.1054955 dblp:conf/wmpi/SurendraBN04 fatcat:7526wjzttng7hmtpllpfktwxtm
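
As an illustration of load instruction reuse, a buffer keyed by (PC, address) can return a previously loaded value without touching the L1 data cache. The organization, hashing, and invalidation policy below are assumptions for the sketch, not the paper's exact design:

```cpp
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <optional>
#include <unordered_map>

struct ReuseKey {
    uint64_t pc, addr;
    bool operator==(const ReuseKey& o) const { return pc == o.pc && addr == o.addr; }
};
struct ReuseKeyHash {
    std::size_t operator()(const ReuseKey& k) const {
        return std::hash<uint64_t>()(k.pc ^ (k.addr * 0x9e3779b97f4a7c15ull));
    }
};

class LoadReuseBuffer {
public:
    // Returns the remembered value if this (pc, addr) pair is still valid.
    std::optional<uint64_t> probe(uint64_t pc, uint64_t addr) const {
        auto it = buf_.find({pc, addr});
        if (it == buf_.end()) return std::nullopt;
        return it->second;
    }
    void fill(uint64_t pc, uint64_t addr, uint64_t value) { buf_[{pc, addr}] = value; }
    // Any store to `addr` must invalidate reused loads of that address.
    void invalidate(uint64_t addr) {
        for (auto it = buf_.begin(); it != buf_.end();)
            it = (it->first.addr == addr) ? buf_.erase(it) : std::next(it);
    }
private:
    std::unordered_map<ReuseKey, uint64_t, ReuseKeyHash> buf_;
};
```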

On Improving Efficiency and Utilization of Last Level Cache in Multicore Systems

Yumna Zahid, Hina Khurshid, Zulfiqar Ali Memon
2018 Information Technology and Control  
Maintaining an energy-efficient system is a crucial challenge for multicore processors.  ...  With the increasing need for computational power, the trend towards multicore processors is ubiquitous.  ...  LLCs spend a large fraction of their energy in the form of leakage energy and hence need techniques that work by turning off part of the cache to reduce leakage energy consumption.  ...
doi:10.5755/j01.itc.47.3.18433 fatcat:pgrmyliv3ra5vjlkqqv3vhuudu
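
The leakage-saving techniques the survey covers typically power-gate part of the cache (e.g., whole ways). A back-of-envelope model with made-up numbers, assuming leakage scales with the powered-on fraction of the array:

```cpp
#include <iostream>

int main() {
    const int total_ways = 16, active_ways = 10;  // six ways power-gated
    const double leakage_w = 1.2;                 // hypothetical LLC leakage, watts
    // First-order assumption: leakage is proportional to powered-on capacity.
    double gated = leakage_w * active_ways / total_ways;
    std::cout << "leakage: " << leakage_w << " W -> " << gated << " W\n";
}
```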

An Energy-Efficient Processor Architecture for Embedded Systems

J. Balfour, W.J. Dally, D. Black-Schaffer, V. Parikh, JongSoo Park
2008 IEEE computer architecture letters  
The data register organization captures reuse and locality in different levels of the hierarchy to reduce the cost of delivering data.  ...  The processor architecture uses instruction registers to reduce the cost of delivering instructions, and a hierarchical and distributed data register organization to deliver data.  ...  Data Supply: The distributed and hierarchical data register organization exploits reuse and locality in computations to satisfy most references from the operand register files (ORFs) located at the inputs  ...
doi:10.1109/l-ca.2008.1 fatcat:efpogiee7nhu7jlg22awd6c6hm
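
A toy model of the two-level data register hierarchy described here: a tiny operand register file (ORF) in front of a larger backing register file, with per-access energies that are illustrative assumptions rather than the paper's measurements:

```cpp
#include <array>
#include <cstddef>
#include <iostream>

int main() {
    constexpr double ORF_PJ = 0.5, RF_PJ = 4.0;    // assumed per-access energies
    std::array<int, 4> orf_tags{-1, -1, -1, -1};   // 4-entry direct-mapped ORF

    double energy = 0.0;
    const int trace[] = {3, 3, 3, 7, 7, 3, 12, 3}; // operand reuse in a loop body
    for (int r : trace) {
        std::size_t slot = r % orf_tags.size();
        if (orf_tags[slot] == r) energy += ORF_PJ;              // ORF hit
        else { orf_tags[slot] = r; energy += RF_PJ + ORF_PJ; }  // fill from RF
    }
    std::cout << "operand delivery energy: " << energy << " pJ\n";
}
```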

Energy Efficiency Effects of Vectorization in Data Reuse Transformations for Many-Core Processors—A Case Study †

Abdullah Al Hasib, Lasse Natvig, Per Kjeldsberg, Juan Cebrián
2017 Journal of Low Power Electronics and Applications  
Data reuse exploration aims at reducing the pressure on the memory subsystem by exploiting the temporal locality in data accesses.  ...  In this paper, we investigate the effects on performance and energy from a data reuse methodology combined with parallelization and vectorization in multi- and many-core processors.  ...  Author Contributions: All authors contributed extensively to the work presented in this paper.  ...
doi:10.3390/jlpea7010005 fatcat:grbddqazojasvgscajioyyrtsq
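
A data reuse transformation of the kind the paper combines with vectorization is loop tiling, which keeps a reused slice of data cache-resident. In the sketch below the tile size is an illustrative choice, and the inner loop is left in a shape a compiler can auto-vectorize:

```cpp
#include <algorithm>
#include <vector>

// Tiled matrix-vector-style kernel: the x[jj..jend) slice is reused across
// all rows while it is still cache-resident. Caller zero-initializes y.
void mv_tiled(const std::vector<float>& A, const std::vector<float>& x,
              std::vector<float>& y, int n) {
    const int T = 64;  // tile size: keeps a slice of x in L1
    for (int jj = 0; jj < n; jj += T) {
        const int jend = std::min(jj + T, n);
        for (int i = 0; i < n; ++i) {
            float acc = 0.0f;
            // Contiguous, branch-free inner loop: amenable to compiler
            // auto-vectorization (e.g. -O3 -march=native).
            for (int j = jj; j < jend; ++j)
                acc += A[i * n + j] * x[j];
            y[i] += acc;
        }
    }
}
```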

DianNao

Tianshi Chen, Zidong Du, Ninghui Sun, Jia Wang, Chengyong Wu, Yunji Chen, Olivier Temam
2014 Proceedings of the 19th international conference on Architectural support for programming languages and operating systems - ASPLOS '14  
a small footprint of 3.02 mm² and 485 mW; compared to a 128-bit 2 GHz SIMD processor, the accelerator is 117.87x faster, and it can reduce the total energy by 21.08x.  ...  Machine-learning tasks are becoming pervasive in a broad range of domains, and in a broad range of systems (from embedded systems to data centers).  ...  While a cache is an excellent storage structure for a general-purpose processor, it is a sub-optimal way to exploit reuse because of the cache access overhead (tag check, associativity, line size, speculative  ...
doi:10.1145/2541940.2541967 dblp:conf/asplos/ChenDSWWCT14 fatcat:ersjbr5ovrbybifa3fzj322pbi
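
The snippet's contrast between caches and explicit reuse can be illustrated in software with scratchpad-style staging: a tile of weights is copied into a local buffer once and then reused without per-access tag checks. The buffer size and layer shape are arbitrary stand-ins for DianNao's on-chip SRAMs:

```cpp
#include <cassert>
#include <cstring>

// Scratchpad-style staging: explicitly copy one neuron's weights into a
// local buffer (modeling an on-chip SRAM), then reuse it with no cache
// lookup overhead. Assumes n_in <= 256.
void layer_tile(const float* weights, const float* in, float* out,
                int n_out, int n_in) {
    assert(n_in <= 256);
    float local[256];                        // models the explicit buffer
    for (int o = 0; o < n_out; ++o) {
        std::memcpy(local, weights + o * n_in, n_in * sizeof(float));
        float acc = 0.0f;
        for (int i = 0; i < n_in; ++i)       // every access hits the buffer
            acc += local[i] * in[i];
        out[o] = acc;
    }
}
```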

Low Power Coarse-Grained Reconfigurable Instruction Set Processor [chapter]

Francisco Barat, Murali Jayapala, Tom Vander Aa, Rudy Lauwereins, Geert Deconinck, Henk Corporaal
2003 Lecture Notes in Computer Science  
Preliminary results show that the presented coarse-grained processor can achieve on average 2.5x the performance of a RISC processor at an 18% overhead in energy consumption.  ...  In this paper, we present a novel coarse-grained reconfigurable processor and study its power consumption.  ...  Acknowledgements This work is in part supported by MESA under MEDEA+.  ... 
doi:10.1007/978-3-540-45234-8_23 fatcat:4usoc63ulra2df3jx6n6yunlxy

Neural Cache: Bit-Serial In-Cache Acceleration of Deep Neural Networks [article]

Charles Eckert, Xiaowei Wang, Jingcheng Wang, Arun Subramaniyan, Ravi Iyer, Dennis Sylvester, David Blaauw, Reetuparna Das
2018 arXiv   pre-print
This paper presents the Neural Cache architecture, which re-purposes cache structures to transform them into massively parallel compute units capable of running inferences for Deep Neural Networks.  ...  Techniques for in-situ arithmetic in SRAM arrays, efficient data mapping, and reduced data movement are proposed.  ...  This work was supported in part by the NSF CAREER-1652294 award, and an Intel gift award.  ...
arXiv:1805.03718v1 fatcat:d72fse5przg43h5ojhqydsl64i
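
The bit-serial flavor of in-SRAM arithmetic can be mimicked in software: every column of the array holds one operand pair, and each step processes one bit position of all columns with a per-column carry. Word width and column count are illustrative:

```cpp
#include <cstdint>
#include <iostream>

int main() {
    const int COLS = 4, BITS = 8;
    uint8_t a[COLS] = {3, 200, 77, 15}, b[COLS] = {10, 55, 100, 240};
    uint8_t sum[COLS] = {0}, carry[COLS] = {0};

    for (int bit = 0; bit < BITS; ++bit) {     // one bit-plane per step
        for (int c = 0; c < COLS; ++c) {       // all columns at once in hardware
            uint8_t ab = (a[c] >> bit) & 1, bb = (b[c] >> bit) & 1;
            uint8_t s = ab ^ bb ^ carry[c];
            carry[c] = (ab & bb) | (carry[c] & (ab ^ bb));
            sum[c] |= uint8_t(s << bit);
        }
    }
    for (int c = 0; c < COLS; ++c)
        std::cout << int(sum[c]) << " ";       // 13 255 177 255
    std::cout << "\n";
}
```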

Non-uniform power access in large caches with low-swing wires

Aniruddha N. Udipi, Naveen Muralimanohar, Rajeev Balasubramonian
2009 2009 International Conference on High Performance Computing (HiPC)  
The proposed mechanisms reduce cache bank energy by 42% while incurring a minor 1% drop in performance.  ...  While there have been a number of proposals to minimize energy consumption in the inter-bank network, very little attention has been paid to the optimization of intra-bank network power that contributes  ...  All of the above schemes do little to reduce energy in the H-tree, a major contributor to cache energy.  ...
doi:10.1109/hipc.2009.5433222 dblp:conf/hipc/UdipiMB09 fatcat:e4qmsg74wjcbzctvrr6dbgqzly

Toward application-specific memory reconfiguration for energy efficiency

Pietro Cicotti, Laura Carrington, Andrew Chien
2013 Proceedings of the 1st International Workshop on Energy Efficient Supercomputing - E2SC '13  
The end of Dennard scaling has made energy efficiency a critical challenge in the continued increase of computing performance.  ...  Finally, as a first step towards automatic reconfiguration, we explore application characterization via reuse distance as a guide to select the best memory hierarchy configuration; we show that reuse distance  ...  This work was supported in part by the DOE Office of Science through the Advanced Scientific Computing Research (ASCR) award titled "Thrifty: An Exascale Architecture for Energy-Proportional Computing"  ...
doi:10.1145/2536430.2536434 dblp:conf/sc/CicottiCC13 fatcat:ssw2vucenzdm7fk2452j5p4z3i
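
For reference, the reuse distance the paper uses for characterization is the number of distinct addresses touched between consecutive accesses to the same address (infinite on a first touch). A straightforward sketch:

```cpp
#include <cstdint>
#include <iostream>
#include <iterator>
#include <map>
#include <unordered_map>
#include <vector>

int main() {
    const std::vector<uint64_t> trace = {1, 2, 3, 1, 2, 1};
    std::unordered_map<uint64_t, uint64_t> last_time;  // addr -> last access index
    std::map<uint64_t, uint64_t> active;               // time -> addr, one entry per addr

    for (uint64_t t = 0; t < trace.size(); ++t) {
        const uint64_t addr = trace[t];
        auto it = last_time.find(addr);
        if (it == last_time.end()) {
            std::cout << addr << ": inf\n";            // first touch
        } else {
            // Count distinct addresses accessed after the previous touch.
            auto lo = active.upper_bound(it->second);
            std::cout << addr << ": " << std::distance(lo, active.end()) << "\n";
            active.erase(it->second);                  // keep one entry per addr
        }
        active[t] = addr;
        last_time[addr] = t;
    }
}
```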

Unified performance and power modeling of scientific workloads

Shuaiwen Leon Song, Kevin Barker, Darren Kerbyson
2013 Proceedings of the 1st International Workshop on Energy Efficient Supercomputing - E2SC '13  
doi:10.1145/2536430.2536435 dblp:conf/sc/SongBK13 fatcat:al4dkkcccrettiv3cmaacktety

Exploiting temporal loads for low latency and high bandwidth memory

S. Kim, N. Vijaykrishnan, M. Kandemir, M.J. Irwin
2005 IEE Proceedings - Computers and digital Techniques  
The paper proposes a novel technique, called the 'temporal load cache architecture', to reduce load latencies and provide higher memory bandwidths.  ...  When a load is predicted to be temporal, the data predicted to be accessed by it are read early in the pipeline from a small temporal load cache that stores the temporal data.  ...  This is mainly due to the reduced activity in the clock network and instruction window (note that they are dominant consumers of dynamic energy in current high-performance processors [26]).  ...
doi:10.1049/ip-cdt:20045124 fatcat:gspxg53qa5cpboqc5vl5f2vnrq
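
The temporal load cache described in the snippet can be sketched as a predictor-guarded side structure: loads predicted temporal probe a small, fast cache early in the pipeline. The sizes, counter policy, and direct-mapped organization below are assumptions:

```cpp
#include <cstdint>
#include <optional>

struct TemporalPredictor {
    uint8_t ctr[1024] = {};                       // 2-bit saturating counters, PC-indexed
    bool predict(uint64_t pc) const { return ctr[pc % 1024] >= 2; }
    void train(uint64_t pc, bool was_temporal) {
        uint8_t& c = ctr[pc % 1024];
        if (was_temporal) { if (c < 3) ++c; }
        else if (c > 0) --c;
    }
};

struct TemporalLoadCache {
    struct Line { uint64_t tag = ~0ull; uint64_t data = 0; };
    Line lines[64];                               // tiny direct-mapped side cache

    std::optional<uint64_t> read(uint64_t addr) { // probed early in the pipeline
        const Line& l = lines[(addr >> 3) % 64];
        if (l.tag == addr) return l.data;
        return std::nullopt;
    }
    void fill(uint64_t addr, uint64_t data) {     // on a committed temporal load
        Line& l = lines[(addr >> 3) % 64];
        l.tag = addr;
        l.data = data;
    }
};
```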

Runtime-Aware Architectures: A First Approach

2014 Supercomputing Frontiers and Innovations  
instruction-level parallelism (ILP) in superscalar processors.  ...  In this paper, we introduce a first approach towards a Runtime-Aware Architecture (RAA), a massively parallel architecture designed from the runtime's perspective.  ...  This work has been partially supported by the Spanish Ministry of Science and Innovation under grant TIN2012-34557, the HiPEAC Network of Excellence, and by the European Research Council under the European  ...
doi:10.14529/jsfi140102 fatcat:4bh33566cfbz7iylsf2ufppsfa

Location-aware cache management for many-core processors with deep cache hierarchy

Jongsoo Park, Richard M. Yoo, Daya S. Khudia, Christopher J. Hughes, Daehyun Kim
2013 Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis on - SC '13  
Our instructions provide a 1.07× speedup and a 1.24× energy efficiency boost, on average, according to simulations on a 64-core system with private L1 and L2 caches.  ...  With a large shared L3 cache added, the benefits increase, providing 1.33× energy reduction on average.  ...  Acknowledgements The authors would like to thank Samantika Subramaniam and Rob F. Van der Wijngaart for discussion during the initial stage of our project.  ... 
doi:10.1145/2503210.2503224 dblp:conf/sc/ParkYKHK13 fatcat:yvtqvwtg3rbnbcfgdbamqq5dy4
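
The paper's proposal is new ISA support for placing data at a chosen level of a deep cache hierarchy. As an existing analogue, not the paper's extension, x86 software prefetch hints already let code target specific levels:

```cpp
#include <xmmintrin.h>  // _mm_prefetch, x86 only

// Streaming reduction over data that will not be reused: the non-temporal
// hint asks the hardware to minimize pollution of outer cache levels.
float stream_sum(const float* a, int n) {
    float acc = 0.0f;
    for (int i = 0; i < n; ++i) {
        if (i + 16 < n)
            _mm_prefetch(reinterpret_cast<const char*>(a + i + 16), _MM_HINT_NTA);
        acc += a[i];
    }
    return acc;
}
```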

Load Miss Prediction - Exploiting Power Performance Trade-offs

Konrad Malkowski, Greg Link, Padma Raghavan, Mary Jane Irwin
2007 2007 IEEE International Parallel and Distributed Processing Symposium  
However, cache hierarchies do not necessarily benefit sparse scientific computing codes, which tend to have limited data locality and reuse.  ...  We therefore propose a new memory architecture with a Load Miss Predictor (LMP), which includes a data bypass cache and a predictor table, to reduce access latencies by determining whether a load should  ...  This allows better efficiency to be maintained upon scaling to multiple processors where network latencies can dominate.  ... 
doi:10.1109/ipdps.2007.370536 dblp:conf/ipps/MalkowskiLRI07 fatcat:prn3i7s4yvgh5nmhnob5cdfmgu
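
The predictor-table half of the Load Miss Predictor can be sketched as a PC-indexed table of saturating counters; the indexing width and update policy here are assumptions, and the data bypass cache is omitted:

```cpp
#include <cstddef>
#include <cstdint>

class LoadMissPredictor {
public:
    // Predicted-miss loads can be routed toward memory / the bypass cache early.
    bool predict_miss(uint64_t pc) const { return table_[index(pc)] >= 2; }
    void update(uint64_t pc, bool missed) {       // train on the actual outcome
        uint8_t& c = table_[index(pc)];
        if (missed) { if (c < 3) ++c; }
        else if (c > 0) --c;
    }
private:
    static std::size_t index(uint64_t pc) { return (pc >> 2) & 2047; }
    uint8_t table_[2048] = {};                    // 2-bit saturating counters
};
```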
Showing results 1 — 15 out of 4,744 results