1,190 Hits in 3.3 sec

Flexible cache error protection using an ECC FIFO

Doe Hyun Yoon, Mattan Erez
2009 Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis - SC '09  
Goal: reduce on-chip ECC overhead via two-tiered error protection. T1EC is a light-weight on-chip error code; T2EC is a strong error-correcting code whose storage overhead is off-loaded to a FIFO in DRAM.  ...  On a dirty-line eviction from the last-level cache, the tag and T2EC are held in a coalesce buffer before being written to the ECC FIFO in DRAM.  ... 
doi:10.1145/1654059.1654109 dblp:conf/sc/YoonE09 fatcat:4eovims2azfevb6kqsdpwgr5x4
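
The two-tier scheme in the snippet above can be sketched in a few lines. This is a toy model, not the paper's implementation: the parity/checksum stand-ins for T1EC/T2EC and the buffer depth are assumptions chosen only to show the data flow (on-chip T1EC per line, T2EC coalesced and pushed to a DRAM-resident FIFO).

```python
from collections import deque

LINE_BYTES = 64
COALESCE_ENTRIES = 4  # assumed coalesce-buffer depth


def t1ec(data: bytes) -> int:
    """Light-weight tier-1 code: byte-wise XOR parity (stand-in)."""
    p = 0
    for b in data:
        p ^= b
    return p


def t2ec(data: bytes) -> int:
    """Strong tier-2 code: modeled here as a simple checksum stand-in."""
    return sum(data) & 0xFFFFFFFF


class EccFifo:
    def __init__(self):
        self.on_chip_t1 = {}      # tag -> T1EC, kept with the cache line
        self.coalesce = []        # on-chip coalesce buffer of (tag, T2EC)
        self.dram_fifo = deque()  # FIFO region allocated in DRAM

    def evict_dirty(self, tag, data: bytes):
        # T1EC stays on chip; T2EC is buffered and off-loaded in bursts.
        self.on_chip_t1[tag] = t1ec(data)
        self.coalesce.append((tag, t2ec(data)))
        if len(self.coalesce) == COALESCE_ENTRIES:
            self.dram_fifo.append(tuple(self.coalesce))  # one burst write
            self.coalesce.clear()
```

Coalescing amortizes the DRAM traffic: four tag/T2EC pairs leave the chip as a single FIFO write rather than four narrow ones.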

LOT-ECC: Localized and Tiered Reliability Mechanisms for Commodity Memory Systems

Aniruddha N. Udipi, Naveen Muralimanohar, Rajeev Balasubramonian, Al Davis, Norman P. Jouppi
2012 SIGARCH Computer Architecture News  
Data and codes are localized to the same DRAM row to improve access efficiency.  ...  We use system firmware to store correction codes in DRAM data memory and modify the memory controller to handle data mapping.  ...  This trick is similar to Loh and Hill's optimization to store tag and data for a cache line in the same row in a large DRAM cache [23] .  ... 
doi:10.1145/2366231.2337192 fatcat:vnapfdejxvbarabq2t4qym4yoy
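
The same-row placement the snippet describes can be made concrete with a small address-layout sketch. The row size, line size, and per-line ECC budget below are assumptions for illustration, not LOT-ECC's actual parameters; the point is only that data and its correction code fall in one DRAM row, so a single activation serves both.

```python
ROW_BYTES = 8192
LINE = 64
ECC_PER_LINE = 8  # assumed: 8 B of correction code per 64 B line

# Each row holds LINES_PER_ROW data lines plus their ECC in the row's tail.
LINES_PER_ROW = ROW_BYTES // (LINE + ECC_PER_LINE)


def layout(line_idx: int):
    """Map a cache-line index to (row, data_offset, ecc_offset).

    The ECC for a line lives in the tail of the same row as the line's
    data, so fetching data + code needs only one row activation.
    """
    row, slot = divmod(line_idx, LINES_PER_ROW)
    data_off = slot * LINE
    ecc_off = LINES_PER_ROW * LINE + slot * ECC_PER_LINE
    return row, data_off, ecc_off
```

With these numbers a row holds 113 lines (113 x 72 B = 8136 B <= 8192 B), and line 113 wraps to the start of the next row.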

Phase change memory architecture and the quest for scalability

Benjamin C. Lee, Engin Ipek, Onur Mutlu, Doug Burger
2010 Communications of the ACM  
Buffer reorganizations reduce this delay and energy gap to 1.2× and 1.0×, using narrow rows to mitigate write energy as well as multiple rows to improve locality and write coalescing.  ...  We propose architectural enhancements that address these limitations and make PCM competitive with DRAM. A baseline PCM system is 1.6× slower and requires 2.2× more energy than a DRAM system.  ...  and write only modified cache lines or words to the PCM array.  ... 
doi:10.1145/1785414.1785441 fatcat:yh27hmiuebhptckqztfutgyahu

TLB Coalescing for Multi-grained Page Migration in Hybrid Memory Systems

Xiaoyuan Wang, Haikun Liu, Xiaofei Liao, Hai Jin, Yu Zhang
2020 IEEE Access  
We manage large-capacity NVM using superpages, and use a relatively small DRAM to cache hot base pages within the superpages.  ...  In response, we bind contiguous hot pages together and migrate them to DRAM. We also propose multi-grained TLBs to coalesce multiple page address translations into a single TLB entry.  ...  To address this problem, Tamp uses the clflush instruction to manage the cache lines related to migrating pages.  ... 
doi:10.1109/access.2020.2983065 fatcat:hohb7tsf2vap3nxkg6iqhex7m4
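
The translation-coalescing idea in this entry is easy to sketch: runs of mappings that are contiguous in both virtual and physical page number collapse into one entry carrying a base pair and a count. This is a minimal illustration of the general technique, not the paper's multi-grained TLB design.

```python
def coalesce_tlb(mappings):
    """Merge (vpn, pfn) pairs into (vpn, pfn, count) entries.

    A run is merged only while both the virtual and the physical page
    numbers stay contiguous, which is what lets one TLB entry translate
    the whole run with a single base + offset.
    """
    entries = []
    for vpn, pfn in sorted(mappings):
        if entries:
            v0, p0, n = entries[-1]
            if vpn == v0 + n and pfn == p0 + n:
                entries[-1] = (v0, p0, n + 1)
                continue
        entries.append((vpn, pfn, 1))
    return entries
```

Three contiguous mappings plus one stray thus need two entries instead of four.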

Selective GPU caches to eliminate CPU-GPU HW cache coherence

Neha Agarwal, David Nellans, Eiman Ebrahimi, Thomas F. Wenisch, John Danskin, Stephen W. Keckler
2016 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA)  
We propose several architectural improvements to offset the performance penalty of selective caching: aggressive request coalescing, CPU-side coherent caching for GPU-uncacheable requests, and a CPU-GPU  ...  Moreover, current GPU workloads access many read-only memory pages; we exploit this property to allow promiscuous GPU caching of these pages, relying on page-level protection rather than hardware cache  ...  Indeed, the minimum transfer size supported by DRAM is usually a cache line.  ... 
doi:10.1109/hpca.2016.7446089 dblp:conf/hpca/AgarwalNEWDK16 fatcat:yhbiq2c35vbfnmw6b34tm62doe

Exploring Modern GPU Memory System Design Challenges through Accurate Modeling [article]

Mahmoud Khairy, Jain Akshay, Tor Aamodt, Timothy G. Rogers
2018 arXiv   pre-print
that the simpler model discounts the importance of advanced memory system designs such as out-of-order memory access scheduling, while overstating the impact of more heavily researched areas like L1 cache  ...  To determine if the cache has a true 32B line size or if the line size is still 128B (the cache line size in GPGPU-Sim's modeled Fermi coalescer is 128B), with 32B sectors [30], [31], we created an  ...  [table fragment: Fermi coalescer (32-thread) vs. Volta coalescer (8-thread) with fair memory issue; shared memory: programmer-specified up to 96 KB vs. adaptive up to 96 KB; L1 cache: 32 KB, 128 B lines, 4-way, write-evict]  ... 
arXiv:1810.07269v1 fatcat:7af3t5apu5fkxlrpgusxhudbhy
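
The 128 B line / 32 B sector distinction this entry probes can be sketched with a tiny coalescer model: per-thread byte addresses are grouped into 128 B line requests, each tagged with a mask of the 32 B sectors actually touched. The sizes match those quoted in the snippet; everything else is an illustrative assumption.

```python
LINE = 128   # cache line size, as in the modeled Fermi coalescer
SECTOR = 32  # sector size within a line


def coalesce(addrs):
    """Group per-thread byte addresses into {line_addr: sector_bitmask}.

    A sectored cache fetches only the 32 B sectors whose bits are set,
    rather than the full 128 B line.
    """
    reqs = {}
    for a in addrs:
        line = a // LINE * LINE
        sector = (a % LINE) // SECTOR
        reqs[line] = reqs.get(line, 0) | (1 << sector)
    return reqs
```

A unit-stride warp of 32 four-byte accesses coalesces to one line with all four sectors set, while fully divergent accesses each fetch a single sector of a separate line.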

Architecting phase change memory as a scalable dram alternative

Benjamin C. Lee, Engin Ipek, Onur Mutlu, Doug Burger
2009 Proceedings of the 36th annual international symposium on Computer architecture - ISCA '09  
Buffer reorganizations reduce this delay and energy gap to 1.2x and 1.0x, using narrow rows to mitigate write energy and multiple rows to improve locality and write coalescing.  ...  A baseline PCM system is 1.6x slower and requires 2.2x more energy than a DRAM system.  ...  Overheads are 0.2 percent and 3.1 percent of each cache line when tracking dirty lines and words, respectively.  ... 
doi:10.1145/1555754.1555758 dblp:conf/isca/LeeIMB09 fatcat:xljlyjauazhbrjdpg2qj4fsb2m
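
The "write only modified words" point in the snippet above amounts to a partial write-back from the row buffer. A minimal sketch under assumed sizes (4 B words, caller-supplied dirty-word set); the paper's tracking hardware is not modeled, only the resulting write filtering that saves PCM write energy.

```python
WORD = 4  # assumed word size in bytes


def writeback(row_buffer: bytearray, pcm_row: bytearray, dirty_words):
    """Copy only dirty words from the row buffer into the PCM array.

    Returns the number of bytes actually written, i.e. the write traffic
    after partial-write filtering.
    """
    written = 0
    for w in dirty_words:
        off = w * WORD
        pcm_row[off:off + WORD] = row_buffer[off:off + WORD]
        written += WORD
    return written
```

If only one word of a 64 B line is dirty, 4 B reach the array instead of 64 B, which is the energy saving the entry quantifies.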

Architecting phase change memory as a scalable dram alternative

Benjamin C. Lee, Engin Ipek, Onur Mutlu, Doug Burger
2009 SIGARCH Computer Architecture News  
Buffer reorganizations reduce this delay and energy gap to 1.2x and 1.0x, using narrow rows to mitigate write energy and multiple rows to improve locality and write coalescing.  ...  A baseline PCM system is 1.6x slower and requires 2.2x more energy than a DRAM system.  ...  Overheads are 0.2 percent and 3.1 percent of each cache line when tracking dirty lines and words, respectively.  ... 
doi:10.1145/1555815.1555758 fatcat:7skpy5t75benrlddbifpkc4ms4

Locality-Based Cache Management and Warp Scheduling for Reducing Cache Contention in GPU

Juan Fang, Zelin Wei, Huijing Yang
2021 Micromachines  
According to the locality of the load instruction, LWS applies cache bypassing to streaming-locality requests to improve cache utilization, and extends inter-warp memory request coalescing to make  ...  The L1 data caches have little capacity, and multiple warps share one small cache, so the cache suffers heavy contention and pipeline stalls.  ...  When the inter-warp coalescing queue receives a request, its cache line is matched against the cache lines already in the queue, coalescing requests to the same cache line.  ... 
doi:10.3390/mi12101262 pmid:34683312 fatcat:gaklfqahxjfd3bsdldss6236ni
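
The matching step the snippet describes can be modeled as a pending-request table keyed by cache-line address: a new request that hits a pending line merges instead of issuing. The class and method names are assumptions for this sketch, not identifiers from the paper.

```python
class InterWarpCoalescer:
    """Pending-request queue that merges requests to the same cache line
    across warps (a sketch of the inter-warp coalescing idea)."""

    LINE = 128  # assumed cache-line size

    def __init__(self):
        self.pending = {}  # line address -> set of requesting warp ids

    def request(self, warp_id: int, addr: int) -> bool:
        """Record a request; return True if it coalesced with one in flight."""
        line = addr // self.LINE * self.LINE
        merged = line in self.pending
        self.pending.setdefault(line, set()).add(warp_id)
        return merged
```

Two warps touching the same line thus generate one memory request instead of two, which is exactly the contention relief the entry argues for.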

Tackling Diversity and Heterogeneity by Vertical Memory Management [article]

Lei Liu
2017 arXiv   pre-print
The accuracy of our approach is verified by off-line profiling.  ...  Using the data mining approach, we find the LLCH and LLCM applications should be coalesced together to share the cache quota, while LLCT and CCF applications should be coalesced respectively to share a  ... 
arXiv:1704.01198v1 fatcat:ocmw3c4qkjdvrdt473reshwo2u

i-MIRROR: A Software Managed Die-Stacked DRAM-Based Memory Subsystem

Jee Ho Ryoo, Karthik Ganesan, Yao-Min Chen, Lizy Kurian John
2015 2015 27th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)  
Simultaneously reducing cache-tag area, reducing transfer bandwidth, and improving hit latency while using die-stacked DRAM as a hardware cache is extremely challenging.  ...  Our evaluations show that the proposed hardware-assisted, software-managed i-MIRROR scheme achieves an IPC improvement of 13% while consuming 6% less energy than prior state-of-the-art die-stacked caching  ...  For writing Data A to PFN 65,536, a separate request is generated from the system controller, and is sent to the off-chip DRAM as denoted by a dotted line.  ... 
doi:10.1109/sbac-pad.2015.34 dblp:conf/sbac-pad/RyooGCJ15 fatcat:i7gs7g2fqzgmrpnvvpkvhmeqbq

Going vertical in memory management

Lei Liu, Yong Li, Zehan Cui, Yungang Bao, Mingyu Chen, Chengyong Wu
2014 SIGARCH Computer Architecture News  
To handle diverse and dynamically changing memory and cache allocation needs, we augment existing "horizontal" cache/DRAM bank partitioning with vertical partitioning and explore the resulting multi-policy  ...  Based on this correlation we derive several practical memory allocation rules that we integrate into a unified multi-policy framework to guide resources partitioning and coalescing for dynamic and diverse  ...  Conventionally, there are two page-coloring based partitioning techniques, namely the cache partitioning and DRAM bank partitioning.  ... 
doi:10.1145/2678373.2665698 fatcat:l34247cxjjcbhdg6p54pz6hf54

Going vertical in memory management: Handling multiplicity by multi-policy

Lei Liu, Yong Li, Zehan Cui, Yungang Bao, Mingyu Chen, Chengyong Wu
2014 2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA)  
To handle diverse and dynamically changing memory and cache allocation needs, we augment existing "horizontal" cache/DRAM bank partitioning with vertical partitioning and explore the resulting multi-policy  ...  Based on this correlation we derive several practical memory allocation rules that we integrate into a unified multi-policy framework to guide resources partitioning and coalescing for dynamic and diverse  ...  Conventionally, there are two page-coloring based partitioning techniques, namely the cache partitioning and DRAM bank partitioning.  ... 
doi:10.1109/isca.2014.6853214 dblp:conf/isca/LiuLCBCW14 fatcat:4glliuis4nfc7lx2tp3gkbtmea
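
Both "Going vertical" entries build on page-coloring: bits of the physical page number select the LLC set group and the DRAM bank, so the OS can steer an application's pages by allocating only certain colors. A minimal sketch; the bit positions and color counts below are assumptions for illustration, not the paper's machine parameters.

```python
PAGE_SHIFT = 12        # 4 KB pages
CACHE_COLOR_BITS = 2   # assumed: 4 cache colors
BANK_COLOR_BITS = 2    # assumed: 4 bank colors


def cache_color(paddr: int) -> int:
    """Cache color: low bits of the physical page number."""
    return (paddr >> PAGE_SHIFT) & ((1 << CACHE_COLOR_BITS) - 1)


def bank_color(paddr: int) -> int:
    """Bank color: the next page-number bits up."""
    return (paddr >> (PAGE_SHIFT + CACHE_COLOR_BITS)) & ((1 << BANK_COLOR_BITS) - 1)


def vertical_color(paddr: int):
    """Combined (cache, bank) color used for vertical partitioning:
    restricting an application to one pair isolates it in both the
    cache and the DRAM banks at once."""
    return cache_color(paddr), bank_color(paddr)
```

Horizontal partitioning fixes only one coordinate of this pair; the vertical scheme the papers propose constrains both, trading flexibility for isolation per the multi-policy rules.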

Delegated persist ordering

Aasheesh Kolli, Jeff Rosen, Stephan Diestelhorst, Ali Saidi, Steven Pelley, Sihang Liu, Peter M. Chen, Thomas F. Wenisch
2016 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)  
Instead, we propose delegated ordering, wherein ordering requirements are communicated explicitly to the PM controller, fully decoupling PM write ordering from volatile execution and cache management.  ...  We briefly describe the most relevant of these new instructions: • clwb: requests writeback of a modified cache line to memory; a clean copy of the cache line may be retained. • pcommit: ensures that stores that  ...  These idioms rely on clflush to explicitly write back dirty lines, requiring hundreds of cycles to execute [28] and invalidating the cache line, incurring a compulsory miss upon the next access.  ... 
doi:10.1109/micro.2016.7783761 dblp:conf/micro/KolliRDSPLCW16 fatcat:qudsk3fy3zh4fljgldiyw5laei
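
The delegation idea can be modeled abstractly: stores accumulate in epochs, an explicit ordering message closes an epoch, and the PM controller drains epochs strictly in order while stores inside an epoch remain free to reorder. This is a toy model of the concept only; the class and method names are assumptions, and the real proposal involves hardware ordering messages, not software queues.

```python
from collections import deque


class PMController:
    """Toy model of delegated persist ordering.

    Stores within an epoch may persist in any order; epoch boundaries,
    communicated explicitly (cf. the paper's ordering messages), are
    never reordered by the controller.
    """

    def __init__(self):
        self.epochs = deque([[]])  # open epochs, oldest first
        self.pm = []               # persisted stores, in drain order

    def store(self, addr, val):
        self.epochs[-1].append((addr, val))

    def ordering_point(self):
        """Explicit ordering requirement: close the current epoch."""
        self.epochs.append([])

    def drain(self):
        """Persist all buffered epochs, oldest epoch first."""
        while self.epochs:
            self.pm.extend(self.epochs.popleft())
        self.epochs.append([])
```

The volatile side never stalls on a flush: it only tags the stream with ordering points and lets the controller enforce them, which is the decoupling the abstract claims.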

Modeling Emerging Memory-Divergent GPU Applications

Lu Wang, Magnus Jahre, Almutaz Adileh, Zhiying Wang, Lieven Eeckhout
2019 IEEE computer architecture letters  
The key issue is that these GPU applications are memory-intensive and have poor spatial locality, which implies that the loads of different threads commonly access different cache blocks.  ...  Such memory-divergent applications quickly exhaust the number of misses the L1 cache can process concurrently, and thereby cripple the GPU's ability to use Memory-Level Parallelism (MLP) and Thread-Level  ...  These per-thread requests are aggregated to cache requests by the coalescer. On a cache hit, the cache line is read by the SM.  ... 
doi:10.1109/lca.2019.2923618 fatcat:dsmgt2n7v5e63jyxulxx6ql4oe