Compression architecture for bit-write reduction in non-volatile memory technologies

David B. Dgien, Poovaiah M. Palangappa, Nathan A. Hunter, Jiayin Li, Kartik Mohanram
2014 2014 IEEE/ACM International Symposium on Nanoscale Architectures (NANOARCH)  
We examine two different compression methods for compressing each word in our architecture.  ...  In this thesis we explore a novel method for improving the performance and lifetime of non-volatile memory technologies.  ...  With STT-RAM, a candidate replacement for SRAM cache or embedded DRAM,  ...  The major difficulty with implementing frequent value compression is determining exactly what the frequent values for the  ... 
doi:10.1109/nanoarch.2014.6880482 dblp:conf/nanoarch/DgienPHLM14 fatcat:txj4fmtbfjazhpwnskfkwfl2du
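
The snippet above mentions frequent value compression. As a rough illustration of the idea only (not the paper's hardware encoding, and with a made-up value table), a word that matches a small table of frequent values can be stored as a short index instead of a full 32-bit write:

```python
# Toy frequent-value compression: words found in a small table of
# frequent values are replaced by a short index; all other words are
# stored verbatim with an "uncompressed" tag. The table below is an
# illustrative assumption, not the paper's profiled values.
FREQUENT = [0x00000000, 0xFFFFFFFF, 0x00000001, 0x80000000]

def fvc_encode(words):
    out = []
    for w in words:
        if w in FREQUENT:
            out.append(("idx", FREQUENT.index(w)))  # a few bits instead of 32
        else:
            out.append(("raw", w))                  # stored uncompressed
    return out

def fvc_decode(encoded):
    return [FREQUENT[v] if tag == "idx" else v for tag, v in encoded]

data = [0, 0xFFFFFFFF, 0xDEADBEEF, 1]
assert fvc_decode(fvc_encode(data)) == data
```

As the snippet notes, the hard part in practice is choosing the frequent-value table, since it must be fixed (or updated carefully) in hardware.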

A case for core-assisted bottleneck acceleration in GPUs

Nandita Vijaykumar, Gennady Pekhimenko, Adwait Jog, Abhishek Bhowmick, Rachata Ausavarungnirun, Chita Das, Mahmut Kandemir, Todd C. Mowry, Onur Mutlu
2015 Proceedings of the 42nd Annual International Symposium on Computer Architecture - ISCA '15  
For example, when a GPU is bottlenecked by the available off-chip memory bandwidth, its computational resources are often overwhelmingly idle, waiting for data from memory to arrive.  ...  We provide a comprehensive design and evaluation of CABA to perform effective and flexible data compression in the GPU memory hierarchy to alleviate the memory bandwidth bottleneck.  ...  Acknowledgments We thank the reviewers for their valuable suggestions. We thank the members of the SAFARI group for their feedback and the stimulating research environment they provide.  ... 
doi:10.1145/2749469.2750399 dblp:conf/isca/VijaykumarPJ0AD15 fatcat:vow55cmt3zhlxmg5o3x2lxx6ri

A case for core-assisted bottleneck acceleration in GPUs

Nandita Vijaykumar, Gennady Pekhimenko, Adwait Jog, Abhishek Bhowmick, Rachata Ausavarungnirun, Chita Das, Mahmut Kandemir, Todd C. Mowry, Onur Mutlu
2015 SIGARCH Computer Architecture News  
For example, when a GPU is bottlenecked by the available off-chip memory bandwidth, its computational resources are often overwhelmingly idle, waiting for data from memory to arrive.  ...  We provide a comprehensive design and evaluation of CABA to perform effective and flexible data compression in the GPU memory hierarchy to alleviate the memory bandwidth bottleneck.  ...  Acknowledgments We thank the reviewers for their valuable suggestions. We thank the members of the SAFARI group for their feedback and the stimulating research environment they provide.  ... 
doi:10.1145/2872887.2750399 fatcat:mdd25bfj25frrnazvn5aj2cfxm

Frugal ECC

Jungrae Kim, Michael Sullivan, Seong-Lyong Gong, Mattan Erez
2015 Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis on - SC '15  
FECC compresses main memory at cache-block granularity, using any leftover space to store ECC information.  ...  FECC relies on a new Coverage-oriented-Compression that we developed specifically for the modest compression needs of ECC and for floating-point data.  ...  ACKNOWLEDGMENTS The authors acknowledge the Texas Advanced Computing Center for providing HPC resources and the support of the Department of Energy under Award #B609478 and the National Science Foundation  ... 
doi:10.1145/2807591.2807659 dblp:conf/sc/KimSGE15 fatcat:zvfgkt2nw5catnm5yrz6hiwxpy
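
The FECC snippet describes storing ECC inside the space freed by compressing each cache block. A minimal sketch of that placement decision, with zlib standing in for the paper's Coverage-oriented Compression and an assumed 8-byte ECC footprint:

```python
import zlib

BLOCK = 64   # cache-block size in bytes
ECC = 8      # assumed ECC bytes to tuck into the freed space

def pack_block(block: bytes):
    """Return (inline, payload): ECC can ride inside the block when
    compression frees at least ECC bytes; otherwise the block is stored
    raw and the ECC must live out-of-band (the fallback path)."""
    assert len(block) == BLOCK
    comp = zlib.compress(block, 9)   # stand-in for the paper's compressor
    if len(comp) + ECC <= BLOCK:
        return True, comp
    return False, block

inline, payload = pack_block(bytes(BLOCK))   # a highly compressible block
assert inline and zlib.decompress(payload) == bytes(BLOCK)
```

The interesting property the paper exploits is that ECC needs only modest compression: freeing a handful of bytes per 64-byte block is enough, which is a much easier target than halving the block.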

A Framework for Accelerating Bottlenecks in GPU Execution with Assist Warps [article]

Nandita Vijaykumar, Gennady Pekhimenko, Adwait Jog, Saugata Ghose, Abhishek Bhowmick, Rachata Ausavarangnirun, Chita Das, Mahmut Kandemir, Todd C. Mowry, Onur Mutlu
2016 arXiv   pre-print
For example, when a GPU is bottlenecked by the available off-chip memory bandwidth, its computational resources are often overwhelmingly idle, waiting for data from memory to arrive.  ...  We provide a comprehensive design and evaluation of CABA to perform effective and flexible data compression in the GPU memory hierarchy to alleviate the memory bandwidth bottleneck.  ...  Acknowledgments We thank the reviewers for their valuable suggestions. We thank the members of the SAFARI group for their feedback and the stimulating research environment they provide.  ... 
arXiv:1602.01348v1 fatcat:qbzuknzcyncrticap55x4i5dhi
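
The CABA entries above describe using idle GPU compute resources to run data compression in the memory hierarchy. As a toy example of the kind of lightweight algorithm such a framework can execute (a base+delta style compressor, simplified here; the actual CABA routines differ), a block of similar values can be stored as one base plus narrow deltas:

```python
def bd_compress(words, delta_bytes=1):
    """Base+delta (toy): store the first word as the base and each word
    as a small signed delta, if every delta fits in delta_bytes."""
    base = words[0]
    limit = 1 << (8 * delta_bytes - 1)
    deltas = [w - base for w in words]
    if all(-limit <= d < limit for d in deltas):
        return base, deltas      # compressed: one word + len(words) small deltas
    return None                  # incompressible at this delta width

def bd_decompress(base, deltas):
    return [base + d for d in deltas]

block = [0x1000, 0x1004, 0x1008, 0x100C]   # pointer-like values compress well
enc = bd_compress(block)
assert enc is not None and bd_decompress(*enc) == block
```

Such compressors are attractive here precisely because they are simple enough for helper threads to run without starving the main computation.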

Enhancing Programmability, Portability, and Performance with Rich Cross-Layer Abstractions [article]

Nandita Vijaykumar
2019 arXiv   pre-print
In doing so, they enable a rich space of hardware-software cooperative mechanisms to optimize for performance.  ...  This thesis makes the case for rich low-overhead cross-layer abstractions as a highly effective means to address the above challenges.  ...  First, for NEARBY sharing, the prefetcher is directed to simply prefetch the next cache line.  ... 
arXiv:1911.05660v1 fatcat:w5f3g4isqbcphm2jjfzjtvrjnq

Neural Cache: Bit-Serial In-Cache Acceleration of Deep Neural Networks [article]

Charles Eckert, Xiaowei Wang, Jingcheng Wang, Arun Subramaniyan, Ravi Iyer, Dennis Sylvester, David Blaauw, Reetuparna Das
2018 arXiv   pre-print
This paper presents the Neural Cache architecture, which re-purposes cache structures to transform them into massively parallel compute units capable of running inferences for Deep Neural Networks.  ...  Techniques to perform in-situ arithmetic in SRAM arrays, create efficient data mappings, and reduce data movement are proposed.  ...  ACKNOWLEDGEMENTS We thank members of M-Bits research group for their feedback. This work was supported in part by the NSF CAREER-1652294 award, and Intel gift award.  ... 
arXiv:1805.03718v1 fatcat:d72fse5przg43h5ojhqydsl64i
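
In-SRAM arithmetic of the kind Neural Cache proposes is typically bit-serial: operands are stored transposed so that one bit position of many elements can be processed per cycle. A software model of bit-serial element-wise addition (illustrative only; the hardware operates on whole bit-lines at once):

```python
def bitserial_add(a, b, nbits=8):
    """Add two vectors element-wise one bit-plane at a time, the way
    bit-serial in-SRAM arithmetic sweeps bit positions. Results wrap
    modulo 2**nbits, like fixed-width hardware."""
    n = len(a)
    carry = [0] * n
    out = [0] * n
    for bit in range(nbits):                 # one step per bit position
        for i in range(n):                   # hardware does this row in parallel
            ai = (a[i] >> bit) & 1
            bi = (b[i] >> bit) & 1
            s = ai ^ bi ^ carry[i]           # full-adder sum
            carry[i] = (ai & bi) | (ai & carry[i]) | (bi & carry[i])
            out[i] |= s << bit
    return out

assert bitserial_add([3, 100, 7], [4, 27, 200]) == [7, 127, 207]
```

The latency is proportional to the bit width rather than to the vector length, which is why the approach pays off when thousands of elements sit in the same array.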

Redesigning LSMs for Nonvolatile Memory with NoveLSM

Sudarsun Kannan, Nitish Bhat, Ada Gavrilovska, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau
2018 USENIX Annual Technical Conference  
We utilize three key techniques - a byte-addressable skip list, direct mutability of persistent state, and opportunistic read parallelism - to deliver high performance across a range of workload scenarios  ...  Acknowledgements We thank the anonymous reviewers and Michio Honda (our shepherd) for their insightful comments.  ...  The key size (for all key-values) is set to 16 bytes and only the value size is varied. We turn off database compression to avoid any undue impact on the results, as done previously [30] .  ... 
dblp:conf/usenix/KannanBGAA18 fatcat:57lur3cmybe5bphre7w2dsboyu
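
The first of NoveLSM's techniques is a byte-addressable skip list, the ordered index it keeps directly in NVM. A minimal in-memory skip list sketch (persistence and NVM-specific details omitted):

```python
import random

class Node:
    def __init__(self, key, value, level):
        self.key, self.value = key, value
        self.forward = [None] * level        # one successor pointer per level

class SkipList:
    """Minimal skip list: probabilistic tower heights give O(log n)
    expected search/insert without rebalancing."""
    MAX_LEVEL = 8

    def __init__(self):
        self.head = Node(None, None, self.MAX_LEVEL)

    def _random_level(self):
        lvl = 1
        while lvl < self.MAX_LEVEL and random.random() < 0.5:
            lvl += 1
        return lvl

    def insert(self, key, value):
        update, node = [None] * self.MAX_LEVEL, self.head
        for i in reversed(range(self.MAX_LEVEL)):
            while node.forward[i] and node.forward[i].key < key:
                node = node.forward[i]
            update[i] = node                 # last node before key on level i
        nxt = node.forward[0]
        if nxt and nxt.key == key:
            nxt.value = value                # in-place update of existing key
            return
        new = Node(key, value, self._random_level())
        for i in range(len(new.forward)):
            new.forward[i] = update[i].forward[i]
            update[i].forward[i] = new

    def get(self, key):
        node = self.head
        for i in reversed(range(self.MAX_LEVEL)):
            while node.forward[i] and node.forward[i].key < key:
                node = node.forward[i]
        node = node.forward[0]
        return node.value if node and node.key == key else None
```

In NoveLSM the appeal of this structure is that, unlike an SSTable, it can be mutated in place in persistent memory, avoiding serialization on the write path.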

AVid: Annotation driven video decoding for hybrid memories

Liviu Codrut Stancu, Luis Angel D. Bathen, Nikil Dutt, Alex Nicolau
2012 2012 IEEE 10th Symposium on Embedded Systems for Real-time Multimedia  
This paper presents AVid, an annotation driven video decoding technique for hybrid memory subsystems.  ...  However, in order to take advantage of the many benefits in NVMs, software must account for their high write overheads.  ...  The data flow is controlled by a Direct Memory Access (DMA) unit. The main memory consists of DRAM and NVM.  ... 
doi:10.1109/estimedia.2012.6507022 dblp:conf/estimedia/StancuBDN12 fatcat:aonhl7fsgvao3mlirdrz2wtwne

SmartSAGE: Training Large-scale Graph Neural Networks using In-Storage Processing Architectures [article]

Yunjae Lee, Jinha Chung, Minsoo Rhu
2022 arXiv   pre-print
Given the large performance gap between DRAM and SSD, however, blindly utilizing SSDs as a direct substitute for DRAM leads to significant performance loss.  ...  Unfortunately, state-of-the-art ML frameworks employ an in-memory processing model which significantly hampers the productivity of ML practitioners as it mandates the overall working set to fit within DRAM  ...  These data structures are mapped to user-space memory addresses via memory-mapped (mmap) file I/O, which allows the most recently accessed pages to be buffered inside the OS-managed page cache (i.e., stored  ... 
arXiv:2205.04711v1 fatcat:nvgvsja7r5c4zclfx6dvzx526q
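
The snippet's mmap-based access pattern is easy to demonstrate: a file is mapped into user-space memory and read by plain indexing, with the OS page cache buffering recently touched pages (file name and contents here are made up for illustration):

```python
import mmap
import os
import tempfile

# Write a sample file, then map it into user-space memory. Reads go
# through the OS page cache rather than explicit read() calls.
path = os.path.join(tempfile.mkdtemp(), "features.bin")
with open(path, "wb") as f:
    f.write(bytes(range(256)) * 16)              # 4 KiB of sample data

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    first, wrapped = mm[0], mm[257]              # random access into the mapping
    mm.close()

assert first == 0 and wrapped == 1
```

This is what makes mmap attractive for out-of-core graph data: hot pages behave like DRAM-resident data, while cold pages fall through to the SSD, which is exactly the gap the paper then attacks with in-storage processing.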

Near-Memory Address Translation

Javier Picorel, Djordje Jevdjic, Babak Falsafi
2017 2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT)  
One of the reasons is that MonetDB uses dictionary compression for strings, which compresses better for larger scale factors [60] .  ...  For a direct-mapped configuration, the DRAM mapping interleaving policy is the same as the widely-used page-based policy, in which pages are split across different banks [158] .  ...  Hence, it achieves the performance of an ideal MMU with zero overhead for page walks.  ... 
doi:10.1109/pact.2017.56 dblp:conf/IEEEpact/PicorelJF17 fatcat:zgsfj7v4pjazdcfb5hcyemndea
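
Dictionary compression for strings, mentioned in the first snippet, stores each distinct string once and replaces the column with small integer codes; it compresses better at larger scale factors because the ratio of repeated to distinct values grows. A minimal encoder:

```python
def dict_encode(column):
    """Dictionary-encode a string column: each distinct string is kept
    once in the dictionary; the column itself becomes integer codes."""
    dictionary, codes, index = [], [], {}
    for s in column:
        if s not in index:
            index[s] = len(dictionary)
            dictionary.append(s)
        codes.append(index[s])
    return dictionary, codes

col = ["DE", "FR", "DE", "US", "FR", "DE"]
d, codes = dict_encode(col)
assert d == ["DE", "FR", "US"]
assert [d[c] for c in codes] == col          # codes decode back to the column
```

Beyond space savings, fixed-width codes let the engine scan and compare the column without touching the variable-length strings at all.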

The Dirty-Block Index

Vivek Seshadri, Abhishek Bhowmick, Onur Mutlu, Phillip B. Gibbons, Michael A. Kozuch, Todd C. Mowry
2014 2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA)  
) Heterogeneous ECC for clean/dirty blocks.  ...  We demonstrate the benefits of DBI by using it to simultaneously and efficiently implement three optimizations proposed by prior work: 1) Aggressive DRAM-aware writeback, 2) Bypassing cache lookups, and 3  ...  ACKNOWLEDGMENTS We thank the anonymous reviewers for their valuable comments.  ... 
doi:10.1109/isca.2014.6853204 dblp:conf/isca/SeshadriBMGKM14 fatcat:kdpg4dcs4nblrk4vetpjnevev4
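
The core idea of the Dirty-Block Index is to pull dirty bits out of the cache tag store and organize them by DRAM row, so all dirty blocks headed for the same row can be found in one lookup. A toy model of that bookkeeping (the row geometry and sizes are illustrative assumptions):

```python
class DirtyBlockIndex:
    """Toy DBI: dirty bits grouped by DRAM row instead of scattered
    across cache tags, so a row's dirty blocks can be enumerated at
    once (enabling, e.g., aggressive DRAM-aware writeback)."""
    def __init__(self, blocks_per_row=8):
        self.blocks_per_row = blocks_per_row
        self.rows = {}                       # row id -> dirty bitvector

    def mark_dirty(self, block_addr):
        row, off = divmod(block_addr, self.blocks_per_row)
        self.rows[row] = self.rows.get(row, 0) | (1 << off)

    def dirty_blocks_in_row(self, row):
        bits = self.rows.get(row, 0)
        return [row * self.blocks_per_row + i
                for i in range(self.blocks_per_row) if bits >> i & 1]

dbi = DirtyBlockIndex()
for addr in (3, 5, 21):
    dbi.mark_dirty(addr)
assert dbi.dirty_blocks_in_row(0) == [3, 5]
```

A query that would require scanning every cache tag ("which dirty blocks map to this open row?") becomes a single bitvector read, which is what makes the three listed optimizations cheap to combine.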

The dirty-block index

Vivek Seshadri, Abhishek Bhowmick, Onur Mutlu, Phillip B. Gibbons, Michael A. Kozuch, Todd C. Mowry
2014 SIGARCH Computer Architecture News  
) Heterogeneous ECC for clean/dirty blocks.  ...  We demonstrate the benefits of DBI by using it to simultaneously and efficiently implement three optimizations proposed by prior work: 1) Aggressive DRAM-aware writeback, 2) Bypassing cache lookups, and 3  ...  ACKNOWLEDGMENTS We thank the anonymous reviewers for their valuable comments.  ... 
doi:10.1145/2678373.2665697 fatcat:dagrpk4mkvc7lkxbycwv7fnf54

Buddy Compression: Enabling Larger Memory for Deep Learning and HPC Workloads on GPUs [article]

Esha Choukse, Michael Sullivan, Mike O'Connor, Mattan Erez, Jeff Pool, David Nellans, Steve Keckler
2019 arXiv   pre-print
Increasing the effective GPU memory capacity enables us to run larger-memory-footprint HPC workloads and larger batch-sizes or models for DL workloads than current memory capacities would allow.  ...  Buddy Compression compresses GPU memory, splitting each compressed memory entry between high-speed device memory and a slower-but-larger disaggregated memory pool (or system memory).  ...  BPC has been shown to have high compression ratios for GPU benchmarks when applied for DRAM bandwidth compression. Compression Granularity.  ... 
arXiv:1903.02596v2 fatcat:f66tmngn3nalxc77nloqzfwi4e
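
The Buddy Compression snippet describes splitting each compressed entry between fast device memory and a slower "buddy" pool. The placement rule can be sketched as follows (slot and entry sizes are illustrative, not the paper's configuration):

```python
# Toy buddy-compression placement: each logical entry gets a fixed-size
# slot in fast device memory; if the compressed entry fits the slot it
# is served from device memory alone, otherwise the overflow spills to
# the larger, slower buddy pool.
SLOT = 64   # bytes of fast device memory reserved per entry (assumed)

def place(compressed: bytes):
    """Split a compressed entry into its fast-memory part and the
    overflow that must go to the buddy pool (empty when it fits)."""
    fast = compressed[:SLOT]
    slow = compressed[SLOT:]
    return fast, slow

fast, slow = place(b"\x00" * 40)      # fits: no buddy-pool access needed
assert slow == b""
fast, slow = place(b"\x07" * 100)     # doesn't fit: 36 bytes spill over
assert len(fast) == SLOT and len(slow) == 36
```

The scheme's performance hinges on how often entries fit their slot, since every spill turns one fast access into an additional slow interconnect access.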

Short-Circuiting Memory Traffic in Handheld Platforms

Praveen Yedlapalli, Nachiappan Chidambaram Nachiappan, Niranjan Soundararajan, Anand Sivasubramaniam, Mahmut T. Kandemir, Chita R. Das
2014 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture  
The shared cache is implemented as a direct-mapped structure, with multiple read and write ports, and multiple banks (with a bank size of 4MB), and the read/write/lookup latencies are modeled using CACTI  ...  For DRAM, we varied the memory throughput by varying the LPDDR configurations.  ... 
doi:10.1109/micro.2014.60 dblp:conf/micro/YedlapalliNSSKD14 fatcat:v2jdz3ts3fdszad2ycf2gntlr4