
Practical off-chip meta-data for temporal memory streaming

Thomas F. Wenisch, Michael Ferdman, Anastasia Ailamaki, Babak Falsafi, Andreas Moshovos
2009 2009 IEEE 15th International Symposium on High Performance Computer Architecture  
Unfortunately, current solutions for off-chip meta-data increase memory traffic by over a factor of three.  ...  Using these techniques, we develop Sampled Temporal Memory Streaming (STMS), a practical address-correlating prefetcher that keeps predictor meta-data in main memory while achieving 90% of the performance  ...  Acknowledgements The authors would like to thank Brian Gold and the anonymous reviewers for their feedback.  ... 
doi:10.1109/hpca.2009.4798239 dblp:conf/hpca/WenischFAFM09 fatcat:qzies3ngwjaetpsnel7mbbclkq
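The STMS entry rests on address correlation. The prefetcher logs the sequence of misses and remembers where each address last appeared. When an address misses again, it replays the addresses that followed it last time. Below is a minimal Python sketch of that idea, built on several assumptions. `TemporalStreamer`, `stream_depth`, and `sample_every` are hypothetical names. Both tables live in ordinary Python structures rather than in main memory, and no memory traffic is modeled. `sample_every` is a deterministic stand-in for the sampled index updates the STMS name refers to, not the paper's actual policy.

```python
class TemporalStreamer:
    """Toy address-correlating prefetcher (hypothetical, for illustration only).

    history: circular log of miss addresses, in miss order.
    index:   maps an address to the absolute position of its last logged miss.
    """

    def __init__(self, history_size=1024, stream_depth=4, sample_every=1):
        self.history = [None] * history_size
        self.head = 0  # absolute position of the next history write
        self.index = {}
        self.stream_depth = stream_depth
        # Assumption: only every Nth miss updates the index (a stand-in for sampling).
        self.sample_every = sample_every

    def on_miss(self, addr):
        """Return predicted prefetch addresses, then log this miss."""
        size = len(self.history)
        prefetches = []
        pos = self.index.get(addr)
        # Follow the pointer only if that history slot has not been overwritten.
        if pos is not None and self.head - pos <= size:
            # Replay up to stream_depth addresses that followed the earlier miss.
            for p in range(pos + 1, min(pos + 1 + self.stream_depth, self.head)):
                prefetches.append(self.history[p % size])
        # Append this miss to the circular history.
        self.history[self.head % size] = addr
        if self.head % self.sample_every == 0:
            self.index[addr] = self.head
        self.head += 1
        return prefetches
```

Replaying the miss trace 10, 20, 30, 40, 10 through `TemporalStreamer(history_size=16, stream_depth=3)` returns empty lists for the first four misses and `[20, 30, 40]` for the repeated miss to 10. Keeping the history as a single append-only log and the index as pointers into it is what allows both structures to sit off-chip in the real designs. Only the small index lookup must happen before the stream can be read.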

SCP: Synergistic cache compression and prefetching

Bhargavraj Patel, Nikos Hardavellas, Gokhan Memik
2015 2015 33rd IEEE International Conference on Computer Design (ICCD)  
track in main memory (e.g., Spatio-Temporal Memory Streaming, STEMS) and oversubscribe the already limited off-chip bandwidth.  ...  Utilizing the cache compression hardware to compress the storage arrays for a STEMS streaming engine, in addition to the data cache, allows the streaming engine to operate entirely on-chip using space  ...  SPATIO-TEMPORAL MEMORY STREAMING (STEMS) STeMS exploits both spatial and temporal locality to provide highly accurate data streaming [5] .  ... 
doi:10.1109/iccd.2015.7357098 dblp:conf/iccd/PatelHM15 fatcat:km72ggsgmnfb7opioskryvhsei

Temporal streams in commercial server applications

Thomas F. Wenisch, Michael Ferdman, Anastasia Ailamaki, Babak Falsafi, Andreas Moshovos
2008 2008 IEEE International Symposium on Workload Characterization  
In this paper, we perform an information-theoretic analysis of miss traces from single-chip and multi-chip multiprocessors to identify recurring temporal streams in web serving, online transaction processing  ...  To improve memory system performance despite these challenging access patterns, researchers have proposed prefetchers that exploit temporal streams: recurring sequences of memory accesses.  ...  Acknowledgements The authors would like to thank the anonymous reviewers for their feedback on drafts of this paper.  ... 
doi:10.1109/iiswc.2008.4636095 dblp:conf/iiswc/WenischFAFM08 fatcat:lrbs7eel3naphk2o2vtae5etbq
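The study in this entry uses an information-theoretic analysis. A much cruder way to gauge temporal-stream repetition in a miss trace is to check each miss address against its previous occurrence. The question is how often the next miss matches the address that followed that earlier occurrence. The function below is a hypothetical proxy written for illustration. It is not the paper's metric, and `successor_repeat_fraction` is an invented name.

```python
def successor_repeat_fraction(trace):
    """Fraction of consecutive miss pairs (a, b) where b also followed the
    previous occurrence of a in the trace. Hypothetical proxy metric."""
    if len(trace) < 2:
        return 0.0
    last_successor = {}  # miss address -> address that followed it last time
    repeats = 0
    for cur, nxt in zip(trace, trace[1:]):
        if cur in last_successor and last_successor[cur] == nxt:
            repeats += 1  # this transition repeats an earlier one
        last_successor[cur] = nxt
    return repeats / (len(trace) - 1)
```

For the trace `[1, 2, 3, 1, 2, 3]`, two of the five transitions repeat an earlier one, so the function returns 0.4. A trace with no repeated addresses returns 0.0.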

B-Fetch: Branch Prediction Directed Prefetching for Chip-Multiprocessors

David Kadjo, Jinchun Kim, Prabal Sharma, Reena Panda, Paul Gratz, Daniel Jimenez
2014 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture  
For decades, the primary tools in alleviating the "Memory Wall" have been large cache hierarchies and data prefetchers.  ...  Both approaches become more challenging in modern chip-multiprocessor (CMP) designs.  ...  large amount of off-chip memory for meta-data storage (and the associated, energy consuming shuttling of large meta-data information on and off chip).  ... 
doi:10.1109/micro.2014.29 dblp:conf/micro/KadjoKSPGJ14 fatcat:rbk4bf4dfrfjnaxawb2f2yeqtu

Temporal instruction fetch streaming

Michael Ferdman, Thomas F. Wenisch, Anastasia Ailamaki, Babak Falsafi, Andreas Moshovos
2008 2008 41st IEEE/ACM International Symposium on Microarchitecture  
We propose Temporal Instruction Fetch Streaming (TIFS), a mechanism for prefetching temporally-correlated instruction streams from lower-level caches.  ...  Then, we describe a practical mechanism to record these recurring sequences in the L2 cache and leverage them for instruction-cache prefetching.  ...  Gold, Nikolaos Hardavellas, Stephen Somogyi, and the anonymous reviewers for their feedback on drafts of this paper. This work was  ... 
doi:10.1109/micro.2008.4771774 dblp:conf/micro/FerdmanWAFM08 fatcat:ffdk7ljp6jbi5hqj2qrtfhrljm

An end-to-end testbed for scalable video streaming to mobile devices over HTTP

Yu-Sian Li, Chien-Chang Chen, Ting-An Lin, Cheng-Hsin Hsu, Yichuan Wang, Xin Liu
2013 2013 IEEE International Conference on Multimedia and Expo (ICME)  
For example, for 960x544 videos, our decoder achieves up to 20.72 FPS (Frame-Per-Second), and for 480x272 videos, it achieves up to 42.03 FPS.  ...  The decoder employs multiple decoder threads to leverage the multi-core CPUs, and the streaming server/client support adaptive HTTP video streaming.  ...  The description file contains the meta data, including the details of NAL units and their offsets in the files.  ... 
doi:10.1109/icme.2013.6607484 dblp:conf/icmcs/LiCLHWL13 fatcat:jd5h37epjvennj5gn3ju7nihnu

Design framework for an energy-efficient binary convolutional neural network accelerator based on nonvolatile logic

Daisuke Suzuki, Takahiro Oka, Akira Tamakoshi, Yasuhiro Takako, Takahiro Hanyu
2021 Nonlinear Theory and Its Applications IEICE  
Since all the data are stored in the nonvolatile devices, a power-gating technique can be fully utilized and standby power consumption of idle function blocks is eliminated.  ...  In Section 4, an EDA tool flow for the BCNN accelerator for the evaluation is described.  ...  We also thank Silicon Artist Technology Co. for the technical assistance.  ... 
doi:10.1587/nolta.12.695 fatcat:7ihnfaccqfdevoq4446hvupwva

Model-Based Passive Testing of Safety-Critical Components [chapter]

Stefan Gruner, Bruce Watson
2011 Model-Based Testing for Embedded Systems  
For some types of systems, for example dynamic or adaptive distributed systems which are able to re-configure themselves at runtime in response to changes in their environments, exhaustive active testing before deployment is either theoretically impossible or practically not feasible.  ...  The additional testing components are themselves delay-insensitive and may be floorplanned anywhere, including off-chip for cost-savings.  ... 
doi:10.1201/b11321-17 fatcat:jipuis6gczdjtk4qbmjkt6v36e

Chip multi-processor generator

Alex Solomatnikov, Amin Firoozshahian, Wajahat Qadeer, Ofer Shacham, Kyle Kelley, Zain Asgar, Megan Wachs, Rehan Hameed, Mark Horowitz
2007 Proceedings - Design Automation Conference  
The amount of resources in a programmable platform (e.g., compute engines, instruction and data caches, processor width, memory bandwidth, etc.) is never optimal for any particular application.  ...  Similarly, our previous study, Stanford Smart Memories (SSM), showed that it is possible to build a reconfigurable chip multiprocessor memory system that can be customized for specific application needs  ...  Different memory models were created by changing the meaning of the meta-data bits, along with the protocols for actions taken when specific conditions occur.  ... 
doi:10.1145/1278480.1278544 dblp:conf/dac/SolomatnikovFQSKAWHH07 fatcat:r5cfnoxqarg5lghtnmqidi7wxy

Enhancing effective throughput for transmission line-based bus

Aaron Carpenter, Jianyun Hu, Ovunc Kocabas, Michael Huang, Hui Wu
2012 SIGARCH Computer Architecture News  
Mainstream general-purpose microprocessors require a collection of high-performance interconnects to supply the necessary data movement.  ...  However, shared-medium designs are perceived as only a niche solution for small- to medium-scale chips.  ...  Under this definition, the baseline bus has an average of 51% for the meta bus and 58% for the data bus.  ... 
doi:10.1145/2366231.2337178 fatcat:yhxf6dtu2zfw7p44lwjw4u2a6i

Enhancing effective throughput for transmission line-based bus

Aaron Carpenter, Jianyun Hu, Ovunc Kocabas, Michael Huang, Hui Wu
2012 2012 39th Annual International Symposium on Computer Architecture (ISCA)  
Mainstream general-purpose microprocessors require a collection of high-performance interconnects to supply the necessary data movement.  ...  However, shared-medium designs are perceived as only a niche solution for small- to medium-scale chips.  ...  Under this definition, the baseline bus has an average of 51% for the meta bus and 58% for the data bus.  ... 
doi:10.1109/isca.2012.6237015 dblp:conf/isca/CarpenterHKHW12 fatcat:u3v7nq6qf5c7hi7b4brvi6aidi

Revisiting sorting for GPGPU stream architectures

Duane G. Merrill, Andrew S. Grimshaw
2010 Proceedings of the 19th international conference on Parallel architectures and compilation techniques - PACT '10  
This report presents efficient strategies for sorting large sequences of fixed-length keys (and values) using GPGPU stream processors.  ...  We require 38% fewer bytes to be moved through the global memory subsystem and a 64% reduction in the number of thread-cycles needed for computation.  ...  running on a particular thread multiprocessor, and a large off-chip global device memory that is accessible to all threads.  ... 
doi:10.1145/1854273.1854344 dblp:conf/IEEEpact/MerrillG10 fatcat:dqoxj7ijbvfqhgb3spha5rzt5y
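This entry concerns sorting fixed-length keys and values on GPUs, work generally associated with radix sorting. A useful way to read it is through the three passes each radix digit needs: a per-digit histogram, an exclusive prefix sum over that histogram, and a stable scatter. Below is a sequential Python sketch of least-significant-digit radix sort on key/value pairs that shows only that pattern. It models none of the GPU kernel organization or the memory-traffic savings the abstract reports. `radix_sort_pairs`, `key_bits`, and `radix_bits` are hypothetical names.

```python
def radix_sort_pairs(keys, values, key_bits=32, radix_bits=4):
    """LSD radix sort of (key, value) pairs; keys are ints in [0, 2**key_bits)."""
    radix = 1 << radix_bits
    mask = radix - 1
    keys, values = list(keys), list(values)
    for shift in range(0, key_bits, radix_bits):
        # 1. Histogram: count how many keys have each digit value.
        counts = [0] * radix
        for k in keys:
            counts[(k >> shift) & mask] += 1
        # 2. Exclusive prefix sum: first output slot for each digit value.
        offsets = [0] * radix
        running = 0
        for d in range(radix):
            offsets[d] = running
            running += counts[d]
        # 3. Stable scatter: equal digits keep their relative input order.
        out_k = [0] * len(keys)
        out_v = [None] * len(values)
        for k, v in zip(keys, values):
            d = (k >> shift) & mask
            out_k[offsets[d]] = k
            out_v[offsets[d]] = v
            offsets[d] += 1
        keys, values = out_k, out_v
    return keys, values
```

Stability in step 3 is what makes the digit-by-digit passes compose into a full sort. The histogram and prefix-sum steps are the parts that parallelize naturally across threads.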

On-Chip Mechanisms to Reduce Effective Memory Access Latency [article]

Milad Hashemi
2016 arXiv pre-print
Independent cache misses have all of the source data that is required to generate the address of the memory access available on-chip, while dependent cache misses depend on data that is located off-chip  ...  This dissertation proposes that dependent cache misses are accelerated by migrating the dependence chain that generates the address of the memory access to the memory controller for execution.  ...  Some prefetching proposals use large off-chip storage to reduce the need for on-chip storage [27, 73] . These proposals incur the additional cost of transmitting meta-data over the memory bus.  ... 
arXiv:1609.00306v1 fatcat:hh2lxatnhfdz5mekvmim2p5a24

Minimalist open-page

Dimitris Kaseridis, Jeffrey Stuecheli, Lizy Kurian John
2011 Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture - MICRO-44 '11  
We use a fair memory hashing scheme to control the maximum number of page mode hits, and direct the memory scheduler with processor-generated prefetch meta-data.  ...  DRAM's use is further complicated in many-core systems where the memory interface is shared among multiple cores/threads competing for memory bandwidth.  ...  The authors acknowledge the use of the Archer infrastructure for their simulations.  ... 
doi:10.1145/2155620.2155624 dblp:conf/micro/KaseridisSJ11 fatcat:isqtxw73zzga5gprid2gasrnfe

De-layered grid storage server

H. Shrikumar
2005 ACM SIGBED Review  
thus, for the first time, enabling transcontinental real-time and temporal grid computation and database applications.  ...  Instead, multiple layers of a protocol stack are compiled into a hardware engine that processes all layers concurrently on-chip.  ...  Thus, while the meta-data is managed by the CPU-based meta-data servers, the datablocks themselves flow directly from the IP attached disks into the wide-area Grid fabric; scalably bypassing von Neumann  ... 
doi:10.1145/1121802.1121806 fatcat:2pbfxn3xxvdgvmrhgp4dtefdiu
Showing results 1 to 15 of 1,896.