Flexible cache error protection using an ECC FIFO
2009
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis - SC '09
Goal: reduce on-chip ECC overhead with two-tiered error protection. T1EC is a light-weight on-chip error code; T2EC is a strong error-correcting code whose storage overhead is off-loaded to a FIFO in DRAM. ...
[Slide diagrams: dirty-line evictions from the last-level cache pass through a T2EC encoder; tag/T2EC pairs are buffered in a coalesce buffer and written to the ECC FIFO in DRAM, alongside the rest of the cache hierarchy.]
doi:10.1145/1654059.1654109
dblp:conf/sc/YoonE09
fatcat:4eovims2azfevb6kqsdpwgr5x4
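The snippets describe a split design: a cheap tier-1 code stays on chip for detection, while the strong tier-2 correction code is pushed, with the line's tag, into a circular FIFO kept in DRAM. Below is a minimal C sketch of that control flow under stated assumptions (64-byte lines, a 4096-entry FIFO, toy stand-in encoders); none of the constants, names, or code constructions are the paper's actual parameters.

```c
#include <stdint.h>
#include <string.h>

#define LINE_BYTES   64
#define FIFO_ENTRIES 4096          /* assumed size of the T2EC FIFO in DRAM */

/* One FIFO entry: the cache line's tag plus its strong correction code. */
typedef struct {
    uint64_t tag;                  /* identifies which line the T2EC covers */
    uint8_t  t2ec[8];              /* placeholder for a strong ECC          */
} fifo_entry;

static fifo_entry ecc_fifo[FIFO_ENTRIES]; /* resides in DRAM in the paper  */
static unsigned   fifo_head;              /* newest entry overwrites oldest */

/* Tier-1 code: cheap on-chip detection (here just byte parity). */
static uint8_t t1ec_encode(const uint8_t *line)
{
    uint8_t p = 0;
    for (int i = 0; i < LINE_BYTES; i++)
        p ^= line[i];
    return p;
}

/* Tier-2 code: toy mixing only; a real design uses a strong ECC. */
static void t2ec_encode(const uint8_t *line, uint8_t out[8])
{
    memset(out, 0, 8);
    for (int i = 0; i < LINE_BYTES; i++)
        out[i % 8] ^= (uint8_t)(line[i] + i);
}

/* Dirty write into the protected cache: keep T1EC on chip, push the */
/* tag + T2EC pair into the FIFO.                                    */
void on_dirty_write(uint64_t tag, const uint8_t *line, uint8_t *t1ec_out)
{
    *t1ec_out = t1ec_encode(line);
    ecc_fifo[fifo_head].tag = tag;
    t2ec_encode(line, ecc_fifo[fifo_head].t2ec);
    fifo_head = (fifo_head + 1) % FIFO_ENTRIES;
}

/* T1EC detected an error: search newest-first for the line's most   */
/* recent T2EC, which is then used to correct the line.              */
const uint8_t *find_t2ec(uint64_t tag)
{
    for (unsigned i = 0; i < FIFO_ENTRIES; i++) {
        unsigned idx = (fifo_head + FIFO_ENTRIES - 1 - i) % FIFO_ENTRIES;
        if (ecc_fifo[idx].tag == tag)
            return ecc_fifo[idx].t2ec;
    }
    return 0;  /* entry aged out: the FIFO must be sized so this is safe */
}
```

The FIFO must be deep enough that a line's T2EC is not overwritten before the line is cleaned or written back; that window is what the coalesce buffer and FIFO sizing trade off.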
LOT-ECC: Localized and tiered reliability mechanisms for commodity memory systems
2012
SIGARCH Computer Architecture News
Data and codes are localized to the same DRAM row to improve access efficiency. ...
We use system firmware to store correction codes in DRAM data memory and modify the memory controller to handle data mapping. ...
This trick is similar to Loh and Hill's optimization to store tag and data for a cache line in the same row in a large DRAM cache [23] . ...
doi:10.1145/2366231.2337192
fatcat:vnapfdejxvbarabq2t4qym4yoy
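To make "data and codes localized to the same DRAM row" concrete, here is a hedged C sketch of one possible mapping: each row reserves a tail region for the codes of the lines it holds, so fetching a line and its code costs a single row activation. The 8 KB row, the one-eighth ECC fraction, and both functions are illustrative assumptions, not LOT-ECC's actual layout.

```c
#include <stdint.h>

#define ROW_BYTES    8192          /* assumed DRAM row (page) size       */
#define ECC_FRACTION 8             /* 1/8 of each row reserved for codes */
#define DATA_BYTES   (ROW_BYTES - ROW_BYTES / ECC_FRACTION)

typedef struct { uint64_t row; uint32_t col; } dram_loc;

/* Map a linear data address into the data region of a row; consecutive */
/* data addresses skip over each row's reserved ECC tail.               */
dram_loc map_data(uint64_t data_addr)
{
    dram_loc l;
    l.row = data_addr / DATA_BYTES;
    l.col = (uint32_t)(data_addr % DATA_BYTES);
    return l;
}

/* The code for a cache line lives at the tail of the SAME row, so data */
/* and code are covered by one activation. Leftover tail bytes go unused.*/
dram_loc map_ecc(uint64_t data_addr, uint32_t line_bytes)
{
    dram_loc d = map_data(data_addr);
    uint32_t lines_per_row = DATA_BYTES / line_bytes;
    uint32_t ecc_per_line  = (ROW_BYTES / ECC_FRACTION) / lines_per_row;
    dram_loc e;
    e.row = d.row;                           /* same row as the data */
    e.col = DATA_BYTES + (d.col / line_bytes) * ecc_per_line;
    return e;
}
```

This kind of mapping is what the snippet's firmware reservation and modified memory controller have to implement: since data no longer fills whole rows, the controller applies map_data on every access.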
Phase change memory architecture and the quest for scalability
2010
Communications of the ACM
Buffer reorganizations reduce this delay and energy gap to 1.2× and 1.0×, using narrow rows to mitigate write energy as well as multiple rows to improve locality and write coalescing. ...
We propose architectural enhancements that address these limitations and make PCM competitive with DRAM. A baseline PCM system is 1.6× slower and requires 2.2× more energy than a DRAM system. ...
and write only modified cache lines or words to the PCM array. ...
doi:10.1145/1785414.1785441
fatcat:yh27hmiuebhptckqztfutgyahu
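A minimal C sketch of the "write only modified ... words" point, assuming 64-byte lines of 4-byte words with one dirty bit per word; pcm_program_word is a hypothetical stub standing in for the device's write path.

```c
#include <stdint.h>

#define LINE_BYTES 64
#define WORD_BYTES 4
#define WORDS      (LINE_BYTES / WORD_BYTES)   /* 16 words per line */

typedef struct {
    uint32_t word[WORDS];
    uint16_t dirty;                /* one dirty bit per 4-byte word */
} buffered_line;

/* Stores into the buffered line set the matching dirty bit. */
void buffer_store(buffered_line *l, unsigned word_idx, uint32_t val)
{
    l->word[word_idx] = val;
    l->dirty |= (uint16_t)(1u << word_idx);
}

/* Hypothetical device hook: program one word into the PCM array. */
static void pcm_program_word(uint64_t addr, uint32_t val)
{
    (void)addr; (void)val;         /* stand-in for the slow, wearing write */
}

/* Eviction: only modified words are re-programmed, saving PCM write */
/* energy and wear; clean words are skipped entirely.                */
unsigned writeback_dirty_words(uint64_t line_addr, buffered_line *l)
{
    unsigned programmed = 0;
    for (unsigned i = 0; i < WORDS; i++) {
        if (l->dirty & (1u << i)) {
            pcm_program_word(line_addr + i * WORD_BYTES, l->word[i]);
            programmed++;
        }
    }
    l->dirty = 0;
    return programmed;
}
```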
TLB Coalescing for Multi-grained Page Migration in Hybrid Memory Systems
2020
IEEE Access
We manage large-capacity NVM using superpages, and use a relatively small size of DRAM to cache hot base pages within the superpages. ...
In response, we bind those contiguous hot pages together and migrate them to DRAM. We also propose multi-grained TLBs to coalesce multiple page address translations into a single TLB entry. ...
To address this problem, Tamp utilizes the clflush instruction to manage the cache lines related to migrating pages. ...
doi:10.1109/access.2020.2983065
fatcat:hohb7tsf2vap3nxkg6iqhex7m4
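The multi-grained TLB idea, sketched in C under simplifying assumptions (a small fully associative array, linear search, naive eviction): an entry stores one base translation plus a run length, so a single entry answers lookups for a whole run of contiguous hot pages. All names and sizes here are illustrative.

```c
#include <stdint.h>
#include <stdbool.h>

#define TLB_ENTRIES 64

/* A coalesced entry maps a run of contiguous virtual pages to an      */
/* equally contiguous run of physical pages with one translation.      */
typedef struct {
    uint64_t vpn_base;   /* first virtual page of the run  */
    uint64_t pfn_base;   /* first physical page of the run */
    uint32_t run_len;    /* number of base pages covered   */
    bool     valid;
} coalesced_tlb_entry;

static coalesced_tlb_entry tlb[TLB_ENTRIES];

/* Hit if the VPN falls anywhere inside an entry's run; the PFN is the */
/* base plus the offset within the run.                                */
bool tlb_lookup(uint64_t vpn, uint64_t *pfn_out)
{
    for (int i = 0; i < TLB_ENTRIES; i++) {
        if (tlb[i].valid && vpn >= tlb[i].vpn_base &&
            vpn < tlb[i].vpn_base + tlb[i].run_len) {
            *pfn_out = tlb[i].pfn_base + (vpn - tlb[i].vpn_base);
            return true;
        }
    }
    return false;
}

/* Fill: extend an entry when the new mapping continues its run both   */
/* virtually and physically; otherwise take a free slot (naive evict). */
void tlb_fill(uint64_t vpn, uint64_t pfn)
{
    for (int i = 0; i < TLB_ENTRIES; i++) {
        if (tlb[i].valid && vpn == tlb[i].vpn_base + tlb[i].run_len &&
            pfn == tlb[i].pfn_base + tlb[i].run_len) {
            tlb[i].run_len++;         /* coalesce into the existing entry */
            return;
        }
    }
    for (int i = 0; i < TLB_ENTRIES; i++) {
        if (!tlb[i].valid) {
            tlb[i] = (coalesced_tlb_entry){ vpn, pfn, 1, true };
            return;
        }
    }
    tlb[0] = (coalesced_tlb_entry){ vpn, pfn, 1, true };
}
```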
Selective GPU caches to eliminate CPU-GPU HW cache coherence
2016
2016 IEEE International Symposium on High Performance Computer Architecture (HPCA)
We propose several architectural improvements to offset the performance penalty of selective caching: aggressive request coalescing, CPU-side coherent caching for GPU-uncacheable requests, and a CPU-GPU ...
Moreover, current GPU workloads access many read-only memory pages; we exploit this property to allow promiscuous GPU caching of these pages, relying on page-level protection, rather than hardware cache ...
Indeed, the minimum transfer size supported by DRAM is usually a cache line. ...
doi:10.1109/hpca.2016.7446089
dblp:conf/hpca/AgarwalNEWDK16
fatcat:yhbiq2c35vbfnmw6b34tm62doe
Exploring Modern GPU Memory System Design Challenges through Accurate Modeling
[article]
2018
arXiv
pre-print
that the simpler model discounts the importance of advanced memory system designs such as out-of-order memory access scheduling, while overstating the impact of more heavily researched areas like L1 cache ...
To determine if the cache has a true 32B line size or if the line size is still 128B (the cache line size in GPGPU-Sim's modeled Fermi coalescer is 128B), with 32B sectors [30] , [31] , we created an ...
[Table excerpt, simulator configurations:]
Coalescer: Fermi coalescer (32-thread) vs. Volta coalescer (8-thread) + fair memory issue
Shared memory: programmer-specified (up to 96 KB) vs. adaptive (up to 96 KB)
L1 cache: 32 KB, 128B line, 4 ways, write-evict ...
arXiv:1810.07269v1
fatcat:7af3t5apu5fkxlrpgusxhudbhy
Architecting phase change memory as a scalable DRAM alternative
2009
Proceedings of the 36th annual international symposium on Computer architecture - ISCA '09
Buffer reorganizations reduce this delay and energy gap to 1.2x and 1.0x, using narrow rows to mitigate write energy and multiple rows to improve locality and write coalescing. ...
A baseline PCM system is 1.6x slower and requires 2.2x more energy than a DRAM system. ...
Overheads are 0.2 percent and 3.1 percent of each cache line when tracking dirty lines and words, respectively. ...
doi:10.1145/1555754.1555758
dblp:conf/isca/LeeIMB09
fatcat:xljlyjauazhbrjdpg2qj4fsb2m
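The 0.2 and 3.1 percent figures are consistent with one tracking bit per unit on a 64-byte (512-bit) cache line, assuming 4-byte words: one dirty bit per line is 1/512 ≈ 0.2 percent of the line, while one dirty bit per word is 16 bits per line, 16/512 = 3.125 ≈ 3.1 percent.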
Architecting phase change memory as a scalable DRAM alternative
2009
SIGARCH Computer Architecture News
Buffer reorganizations reduce this delay and energy gap to 1.2x and 1.0x, using narrow rows to mitigate write energy and multiple rows to improve locality and write coalescing. ...
A baseline PCM system is 1.6x slower and requires 2.2x more energy than a DRAM system. ...
Overheads are 0.2 percent and 3.1 percent of each cache line when tracking dirty lines and words, respectively. ...
doi:10.1145/1555815.1555758
fatcat:7skpy5t75benrlddbifpkc4ms4
Locality-Based Cache Management and Warp Scheduling for Reducing Cache Contention in GPU
2021
Micromachines
According to the locality of the load instruction, LWS applies cache bypassing to streaming-locality requests to improve cache utilization, and extends inter-warp memory request coalescing to make ...
The L1 data caches have small capacity, and multiple warps share one small cache, so the cache suffers heavy contention and pipeline stalls. ...
When the inter-warp coalescing queue receives a request, its cache line is matched against the cache lines already in the queue, coalescing requests to the same cache line. ...
doi:10.3390/mi12101262
pmid:34683312
fatcat:gaklfqahxjfd3bsdldss6236ni
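A hedged C sketch of the inter-warp coalescing queue the last snippet describes, assuming a 128-byte line and at most 64 warps: a request whose cache line matches a queued entry is merged into it instead of issuing a second memory request. The entry layout and the full-queue policy are illustrative.

```c
#include <stdint.h>
#include <stdbool.h>

#define LINE_BYTES 128         /* assumed GPU cache-line size */
#define QUEUE_LEN  32

/* One entry aggregates all pending requests to one cache line;        */
/* warp_mask records which warps (up to 64) wait on it.                */
typedef struct {
    uint64_t line_addr;
    uint64_t warp_mask;
    bool     valid;
} coalesce_entry;

static coalesce_entry queue[QUEUE_LEN];

/* Returns true if the request was merged with an in-flight one, so no */
/* new request needs to reach the cache.                               */
bool enqueue_request(uint64_t byte_addr, unsigned warp_id)
{
    uint64_t line = byte_addr / LINE_BYTES;
    for (int i = 0; i < QUEUE_LEN; i++) {
        if (queue[i].valid && queue[i].line_addr == line) {
            queue[i].warp_mask |= 1ull << warp_id;    /* coalesced */
            return true;
        }
    }
    for (int i = 0; i < QUEUE_LEN; i++) {
        if (!queue[i].valid) {
            queue[i] = (coalesce_entry){ line, 1ull << warp_id, true };
            return false;    /* new line: one request goes to the cache */
        }
    }
    return false;            /* queue full: stall or bypass, policy-dependent */
}
```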
Tackling Diversity and Heterogeneity by Vertical Memory Management
[article]
2017
arXiv
pre-print
The accuracy of our approach is verified by off-line profiling. ...
Using the data-mining approach, we find that LLCH and LLCM applications should be coalesced together to share the cache quota, while LLCT and CCF applications should each be coalesced separately to share a ...
arXiv:1704.01198v1
fatcat:ocmw3c4qkjdvrdt473reshwo2u
i-MIRROR: A Software Managed Die-Stacked DRAM-Based Memory Subsystem
2015
2015 27th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)
Jointly reducing cache tag area and transfer bandwidth while improving hit latency when using die-stacked DRAM as a hardware cache is extremely challenging. ...
Our evaluations show that the proposed hardware-assisted, software-managed i-MIRROR scheme achieves an IPC improvement of 13% while consuming 6% less energy than prior state-of-the-art die-stacked caching ...
For writing Data A to the PFN 65,536, a separate request is generated from the system controller, and is sent to the off-chip DRAM as denoted by a dotted line. ...
doi:10.1109/sbac-pad.2015.34
dblp:conf/sbac-pad/RyooGCJ15
fatcat:i7gs7g2fqzgmrpnvvpkvhmeqbq
Going vertical in memory management
2014
SIGARCH Computer Architecture News
To handle diverse and dynamically changing memory and cache allocation needs, we augment existing "horizontal" cache/DRAM bank partitioning with vertical partitioning and explore the resulting multi-policy ...
Based on this correlation we derive several practical memory allocation rules that we integrate into a unified multi-policy framework to guide resources partitioning and coalescing for dynamic and diverse ...
Conventionally, there are two page-coloring based partitioning techniques, namely the cache partitioning and DRAM bank partitioning. ...
doi:10.1145/2678373.2665698
fatcat:l34247cxjjcbhdg6p54pz6hf54
Going vertical in memory management: Handling multiplicity by multi-policy
2014
2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA)
To handle diverse and dynamically changing memory and cache allocation needs, we augment existing "horizontal" cache/DRAM bank partitioning with vertical partitioning and explore the resulting multi-policy ...
Based on this correlation we derive several practical memory allocation rules that we integrate into a unified multi-policy framework to guide resources partitioning and coalescing for dynamic and diverse ...
Conventionally, there are two page-coloring based partitioning techniques, namely the cache partitioning and DRAM bank partitioning. ...
doi:10.1109/isca.2014.6853214
dblp:conf/isca/LiuLCBCW14
fatcat:4glliuis4nfc7lx2tp3gkbtmea
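A small C sketch of the page-coloring mechanism both versions of this paper build on, with assumed color-bit positions: the OS only hands an application physical pages whose PFN bits fall in its assigned last-level-cache slice and DRAM-bank slice, and the "vertical" part is constraining both dimensions at once rather than a single "horizontal" one.

```c
#include <stdint.h>
#include <stdbool.h>

/* Assumed bit positions; real positions depend on the LLC geometry    */
/* and the memory controller's DRAM address mapping.                   */
#define CACHE_COLOR_SHIFT 0     /* low PFN bits index LLC sets */
#define CACHE_COLOR_BITS  4     /* 16 cache colors             */
#define BANK_COLOR_SHIFT  4     /* next PFN bits pick the bank */
#define BANK_COLOR_BITS   3     /* 8 bank colors               */

static unsigned cache_color(uint64_t pfn)
{
    return (pfn >> CACHE_COLOR_SHIFT) & ((1u << CACHE_COLOR_BITS) - 1);
}

static unsigned bank_color(uint64_t pfn)
{
    return (pfn >> BANK_COLOR_SHIFT) & ((1u << BANK_COLOR_BITS) - 1);
}

/* Vertical partitioning: a free page is eligible for an application   */
/* only if it lands in the app's cache slice AND its bank slice.       */
bool page_fits_partition(uint64_t pfn,
                         unsigned cache_mask,   /* allowed cache colors */
                         unsigned bank_mask)    /* allowed bank colors  */
{
    return ((cache_mask >> cache_color(pfn)) & 1u) &&
           ((bank_mask  >> bank_color(pfn))  & 1u);
}
```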
Delegated persist ordering
2016
2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)
Instead, we propose delegated ordering, wherein ordering requirements are communicated explicitly to the PM controller, fully decoupling PM write ordering from volatile execution and cache management. ...
We briefly describe the most relevant of these new instructions:
• clwb: Requests writeback of a modified cache line to memory; a clean copy of the cache line may be retained.
• pcommit: Ensures that stores that ...
These idioms rely on clflush to explicitly write back dirty lines, requiring hundreds of cycles to execute [28] and invalidating the cache line, incurring a compulsory miss upon the next access. ...
doi:10.1109/micro.2016.7783761
dblp:conf/micro/KolliRDSPLCW16
fatcat:qudsk3fy3zh4fljgldiyw5laei
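For context on what delegated ordering replaces, here is a hedged C sketch of the synchronous flush-and-fence idiom built on the clwb instruction the snippet lists (compile with CLWB support, e.g. -mclwb on GCC/Clang); the undo-log structure is illustrative. pcommit was subsequently dropped from Intel's specification, so on current parts the fence after clwb is the ordering point.

```c
#include <immintrin.h>   /* _mm_clwb, _mm_sfence */
#include <stddef.h>
#include <stdint.h>

/* Write back every cache line covering [p, p+len), then fence so      */
/* later stores are ordered after the writebacks.                      */
static void persist_range(const void *p, size_t len)
{
    const uintptr_t line = 64;
    uintptr_t a = (uintptr_t)p & ~(line - 1);
    for (; a < (uintptr_t)p + len; a += line)
        _mm_clwb((const void *)a);  /* writeback; clean copy may remain */
    _mm_sfence();
}

/* Undo-log style update: the log entry must be durable before the     */
/* in-place write. Each ordering point costs a flush + fence on the    */
/* critical path; this is the overhead delegated ordering moves off    */
/* the core by communicating the dependence to the PM controller.      */
void logged_update(uint64_t *log_slot, uint64_t *data, uint64_t new_val)
{
    *log_slot = *data;                          /* 1. record old value */
    persist_range(log_slot, sizeof *log_slot);  /* 2. persist the log  */
    *data = new_val;                            /* 3. in-place update  */
    persist_range(data, sizeof *data);          /* 4. persist the data */
}
```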
Modeling Emerging Memory-Divergent GPU Applications
2019
IEEE computer architecture letters
The key issue is that these GPU applications are memory-intensive and have poor spatial locality, which implies that the loads of different threads commonly access different cache blocks. ...
Such memory-divergent applications quickly exhaust the number of misses the L1 cache can process concurrently, and thereby cripple the GPU's ability to use Memory-Level Parallelism (MLP) and Thread-Level ...
These per-thread requests are aggregated to cache requests by the coalescer. On a cache hit, the cache line is read by the SM. ...
doi:10.1109/lca.2019.2923618
fatcat:dsmgt2n7v5e63jyxulxx6ql4oe
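A minimal C model of the per-warp coalescer the last snippet mentions, assuming 32 threads and a 128-byte coalescing granularity: unique line addresses become cache requests, so a memory-divergent load fans out into up to 32 requests per instruction and quickly exhausts the L1's miss-handling resources.

```c
#include <stdint.h>

#define WARP_SIZE  32
#define LINE_BYTES 128     /* assumed coalescing granularity */

/* Collapse a warp's per-thread addresses into unique cache-line       */
/* requests: 1 request when fully coalesced, up to WARP_SIZE when      */
/* divergent.                                                          */
unsigned coalesce_warp(const uint64_t addr[WARP_SIZE],
                       uint64_t lines_out[WARP_SIZE])
{
    unsigned n = 0;
    for (int t = 0; t < WARP_SIZE; t++) {
        uint64_t line = addr[t] / LINE_BYTES;
        int seen = 0;
        for (unsigned i = 0; i < n; i++)
            if (lines_out[i] == line) { seen = 1; break; }
        if (!seen)
            lines_out[n++] = line;
    }
    return n;      /* cache requests issued for this load instruction */
}
```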