Processing Data Where It Makes Sense: Enabling In-Memory Computation [article]

Onur Mutlu, Saugata Ghose, Juan Gómez-Luna, Rachata Ausavarungnirun
2019 arXiv   pre-print
The emergence of 3D-stacked memory plus logic as well as the adoption of error correcting codes inside DRAM chips, and the necessity for designing new solutions to serious reliability and security issues  ...  changes, (2) exploiting the logic layer in 3D-stacked memory technology to accelerate important data-intensive applications.  ...  They deploy thousands of in-order, SIMT (Single Instruction Multiple Thread) cores that run lightweight threads.  ... 
arXiv:1903.03988v1 fatcat:l2sl2wqwmrejvfbi3sxrpwasby

A Modern Primer on Processing in Memory [article]

Onur Mutlu, Saugata Ghose, Juan Gómez-Luna, Rachata Ausavarungnirun
2022 arXiv   pre-print
The emergence of 3D-stacked memory plus logic, the adoption of error correcting codes inside the latest DRAM chips, proliferation of different main memory standards and chips, specialized for different  ...  PIM places computation mechanisms in or near where the data is stored (i.e., inside the memory chips, in the logic layer of 3D-stacked memory, or in the memory controllers), so that data movement between  ...  Recent versions of the talk were delivered as a distinguished lecture at George Washington University in February 2019 [468], as an Invited Talk at ISSCC Special Forum on "Intelligence at the Edge: How  ... 
arXiv:2012.03112v2 fatcat:z7rsgn54mjbmncv5ezq4u3aqqm

Moving Processing to Data: On the Influence of Processing in Memory on Data Management [article]

Tobias Vincon, Andreas Koch, Ilia Petrov
2019 arXiv   pre-print
Processing-in-Memory is a sub-class of Near-Data Processing that targets data processing directly within memory (DRAM) chips.  ...  The authors propose a heterogeneous co-design of hardware and software on the basis of ARM cores and 3D-stacked memory to schedule various NN training operations.  ...  (b) Fig. 3: Architecture of a 3D-stacked DRAM (based on [39, 92]). Fig. 4: PIM architecture with 3D-stacked DRAM. Fig. 5: JAFAR's architecture diagram (from [118]).  ... 
arXiv:1905.04767v1 fatcat:xksczeu5jjfxhd4bzvaqpuivna

Beyond programmable shading

Aaron Lefohn, Mike Houston, Chas Boyd, Kayvon Fatahalian, Tom Forsyth, David Luebke, John Owens
2008 ACM SIGGRAPH 2008 classes on - SIGGRAPH '08  
Designing a graphics system for future 3D entertainment applications is particularly tricky because at a technical level the goals are ill defined.  ...  ., for explosions), and artificial intelligence for game play. It also allows these computations to be tightly integrated with the rendering computation.  ...  programmable graphics, the new era of cooperatively using the CPU, GPU, and complex, dynamic data structures to efficiently synthesize images, requires new programming models, tools, and rendering systems that are designed  ... 
doi:10.1145/1401132.1401145 dblp:conf/siggraph/LefohnHBFFLO08 fatcat:jrt5e5373zairmf4fvn2tehaxa

Enhancing Programmability, Portability, and Performance with Rich Cross-Layer Abstractions [article]

Nandita Vijaykumar
2019 arXiv   pre-print
In doing so, they enable a rich space of hardware-software cooperative mechanisms to optimize for performance.  ...  This thesis makes the case for rich low-overhead cross-layer abstractions as a highly effective means to address the above challenges.  ...  ., the PC and the SIMT stack) is saved in memory.  ... 
arXiv:1911.05660v1 fatcat:w5f3g4isqbcphm2jjfzjtvrjnq

Enable Deep Learning on Mobile Devices: Methods, Systems, and Applications

Han Cai, Ji Lin, Yujun Lin, Zhijian Liu, Haotian Tang, Hanrui Wang, Ligeng Zhu, Song Han
2022 ACM Transactions on Design Automation of Electronic Systems  
To reduce the large design cost of these manual solutions, we discuss the AutoML framework for each of them, such as neural architecture search (NAS) and automated pruning and quantization.  ...  Deep neural networks (DNNs) have achieved unprecedented success in the field of artificial intelligence (AI), including computer vision, natural language processing, and speech recognition.  ...  is vectorization (SIMD) and multi-threading (SIMT).  ... 
doi:10.1145/3486618 fatcat:h6xwv2slo5eklift2fl24usine

2018 Index IEEE Transactions on Very Large Scale Integration (VLSI) Systems Vol. 26

2018 IEEE Transactions on Very Large Scale Integration (VLSI) Systems  
., see 2723-2736 , VLSI Design of an ML-Based Power-Efficient Motion Estimation Controller for Intelligent Mobile Systems; TVLSI Feb. 2018 262-271 Hsieh, Y., see Tsai, Y., TVLSI May 2018 945-957  ...  Hsu, K., Chen, Y., Lee, Y., and Chang, S., Contactless Testing for Prebond Interposers; TVLSI June 2018 1005-1014 Hsu, Y., see Liu, Z., 1565-1574 Hu, J., see Wang, Y., TVLSI May 2018 805-817 Hu, J  ...  ., +, TVLSI July 2018 1290-1300 DRAM chips Boosting NVDIMM Performance With a Lightweight Caching Algorithm.  ... 
doi:10.1109/tvlsi.2019.2892312 fatcat:rxiz5duc6jhdzjo4ybcxdajtbq

GPU power modeling and architectural enhancements for GPU energy efficiency [article]

Jan Lucas, Technische Universität Berlin, Ben Juurlink
2019
Initially designed for 3D graphics, they evolved into general purpose accelerators, able to outperform CPUs on many tasks. The architecture of GPUs is optimized for massively parallel applications.  ...  We continue with enhancements to improve the energy efficiency of the GPU cores.  ...  Recently some GPUs have used 3D-stacked memory such as HBM and HBM2 [63], [64], [212].  ... 
doi:10.14279/depositonce-7874 fatcat:wbmij23r2ngtfaskosnrsxt5gu

A Survey on the Optimization of Neural Network Accelerators for Micro-AI On-Device Inference

Arnab Neelim Mazumder, Jian Meng, Hasib-Al Rashid, Utteja Kallakuri, Xin Zhang, Jae-sun Seo, Tinoosh Mohsenin
2021 IEEE Journal on Emerging and Selected Topics in Circuits and Systems  
Deep neural networks (DNNs) are being prototyped for a variety of artificial intelligence (AI) tasks including computer vision, data analytics, robotics, etc.  ...  techniques, and the realization of the micro-AI models on resource-constrained hardware and different design considerations associated with it.  ...  in place to make them lightweight and suitable for micro-AI implementation.  ... 
doi:10.1109/jetcas.2021.3129415 fatcat:nknpy4eernaeljz2hpqafe7sja

Ray tracing from a data movement perspective

Daniel Kopta
2016
Ray tracing is becoming more widely adopted in offline rendering systems due to its natural support for high quality lighting.  ...  Since power consumption is one of the primary factors limiting the increase of processor performance, it must be addressed as a foremost concern in any future ray tracing system designs.  ...  BVH traversal is handled with a special set of stack registers designated for stack nodes.  ... 
doi:10.26053/0h-mghs-0t00 fatcat:bx3lwpkbarc4ji75n2gvby3x3q

GPU computing architecture for irregular parallelism

Wilson Wai Lun Fung
2015
Second, Kilo TM is a cost effective, energy efficient solution for supporting transactional memory (TM) on GPUs.  ...  Our evaluations show that TBC provides an average speedup of 22% over a baseline per-warp, stack-based reconvergence mechanism on a set of GPU applications that suffer significantly from branch divergence  ...  points as the per-warp SIMT stack in the baseline SIMT core.  ... 
doi:10.14288/1.0167110 fatcat:lk6u3fzl5fgcpnt2finxbdqfre
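Several entries in this list mention the per-warp SIMT stack. These include the cross-layer abstractions thesis, the irregular-parallelism thesis, and the GPU synchronization thesis. As a rough illustration only, the following Python sketch models a generic stack-based reconvergence mechanism of the kind the snippet above describes as its baseline. It assumes that diverged threads reconverge at the branch's immediate post-dominator. The names `SimtStack`, `StackEntry`, `branch`, and `advance` are hypothetical, and program counters are plain integers. This is not code from any of the listed works.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StackEntry:
    pc: int    # next PC this entry's threads will execute
    rpc: int   # reconvergence PC (immediate post-dominator); -1 means "never pop"
    mask: int  # active mask: bit i set means thread i runs this path


class SimtStack:
    """Hypothetical per-warp SIMT reconvergence stack (sketch, not a real API)."""

    def __init__(self, warp_size: int, start_pc: int = 0) -> None:
        full_mask = (1 << warp_size) - 1
        # The base entry covers the whole warp and is never popped.
        self.entries: List[StackEntry] = [StackEntry(start_pc, -1, full_mask)]

    def top(self) -> StackEntry:
        return self.entries[-1]

    def branch(self, taken_mask: int, target_pc: int,
               fallthrough_pc: int, ipdom_pc: int) -> None:
        tos = self.top()
        taken = tos.mask & taken_mask
        not_taken = tos.mask & ~taken_mask
        if not_taken == 0:          # uniform branch: every active thread jumps
            tos.pc = target_pc
            return
        if taken == 0:              # uniform branch: no active thread jumps
            tos.pc = fallthrough_pc
            return
        # Divergence: the current entry waits at the reconvergence point,
        # and one entry per path is pushed (taken path ends up on top).
        tos.pc = ipdom_pc
        self.entries.append(StackEntry(fallthrough_pc, ipdom_pc, not_taken))
        self.entries.append(StackEntry(target_pc, ipdom_pc, taken))

    def advance(self, next_pc: int) -> None:
        """Move the top entry to next_pc and pop entries that reached their rpc."""
        self.top().pc = next_pc
        while len(self.entries) > 1 and self.top().pc == self.top().rpc:
            self.entries.pop()


# Usage: a 4-thread warp diverges at PC 0 and reconverges at PC 5.
warp = SimtStack(warp_size=4)
warp.branch(taken_mask=0b0011, target_pc=3, fallthrough_pc=1, ipdom_pc=5)
print(f"pc={warp.top().pc} mask={warp.top().mask:04b}")  # pc=3 mask=0011
warp.advance(4)
warp.advance(5)   # taken path reaches PC 5 and is popped
print(f"pc={warp.top().pc} mask={warp.top().mask:04b}")  # pc=1 mask=1100
warp.advance(2)
warp.advance(5)   # fall-through path reaches PC 5 and is popped
print(f"pc={warp.top().pc} mask={warp.top().mask:04b}")  # pc=5 mask=1111
```

In the usage example, threads 0 and 1 branch to PC 3 while threads 2 and 3 fall through to PC 1. The two paths run one after the other. When each path reaches PC 5, the stack pops its entry, leaving a single full-mask entry for the reconverged warp.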

Efficient synchronization mechanisms for scalable GPU architectures

Xiaowei Ren
2020
Third, we design HMG, a hierarchical cache coherence protocol for multi-GPU systems.  ...  The Graphics Processing Unit (GPU) has become a mainstream computing platform for a wide range of applications.  ...  [Figure: GPU architecture diagram showing SM cores (register file, SIMT stacks, thread blocks, L1 data cache, texture cache, constant cache) and memory partitions (last level cache, DRAM controller)]  ... 
doi:10.14288/1.0394805 fatcat:aoyfyhwdyjbefp6yjyqk2p4tti

On Efficient GPGPU Computing for Integrated Heterogeneous CPU-GPU Microprocessors [article]

Daniel Gerzhoy
2021
By intelligently scheduling the execution of the producer and consumer in a software pipeline, evictions can be avoided, saving DRAM accesses, power, and performance.  ...  These features allow for the optimization of codes that heretofore would be suitable only for multi-core CPUs or discrete GPUs to be run on a heterogeneous CPU-GPU microprocessor efficiently and in some  ...  these SIMT cores for different vendors.  ... 
doi:10.13016/kwbs-up51 fatcat:vyh7wangjngulhpupg3pk6rog4

Towards Closing the Programmability-Efficiency Gap using Software-Defined Hardware [article]

Subhankar Pal, University, My
2021
The solution consists of a tiled hardware architecture, co-designed with the outer product algorithm for Sparse Matrix-Matrix multiplication (SpMM), that uses on-chip memory reconfiguration to accelerate  ...  This system is designed to deliver near-accelerator-level efficiency across a broad set of applications, while retaining CPU-like programmability.  ...  [222] introduce a 3D-stacked logic-in-memory system by placing logic layers between Dynamic Random Access Memory (DRAM) dies to accelerate a 3D-DRAM system for sparse data access and build a custom  ... 
doi:10.7302/2904 fatcat:zraktiwmczc7bkmqqxrvuxdiue

Using complexity to protect elections

Piotr Faliszewski, Edith Hemaspaandra, Lane A. Hemaspaandra
2010 Communications of the ACM  
Our petabyte would be a stack of 2×10^5 stone DVDs. A lot can happen to a stack that big in 100 years.  ...  More recent multicore CPUs (such as the Intel Core2 Duo and Core i7) reflect a trend toward somewhat less-aggressive designs that expect a modest amount of parallelism.  ...  You may assume for this purpose that you, your enemy, and the students are all slim enough to be considered points (viewed from above), rather than solid figures in 3D space.  ... 
doi:10.1145/1839676.1839696 fatcat:hbqpm5boabe3jcpa4jcs7czf6y
Showing results 1–15 out of 19 results