Processing Data Where It Makes Sense: Enabling In-Memory Computation
[article]
2019
arXiv
pre-print
The emergence of 3D-stacked memory plus logic as well as the adoption of error correcting codes inside DRAM chips, and the necessity for designing new solutions to serious reliability and security issues ...
changes, (2) exploiting the logic layer in 3D-stacked memory technology to accelerate important data-intensive applications. ...
They deploy thousands of in-order, SIMT (Single Instruction Multiple Thread) cores that run lightweight threads. ...
arXiv:1903.03988v1
fatcat:l2sl2wqwmrejvfbi3sxrpwasby
A Modern Primer on Processing in Memory
[article]
2022
arXiv
pre-print
The emergence of 3D-stacked memory plus logic, the adoption of error correcting codes inside the latest DRAM chips, proliferation of different main memory standards and chips, specialized for different ...
PIM places computation mechanisms in or near where the data is stored (i.e., inside the memory chips, in the logic layer of 3D-stacked memory, or in the memory controllers), so that data movement between ...
Recent versions of the talk were delivered as a distinguished lecture at George Washington University in February 2019 [468], as an Invited Talk at ISSCC Special Forum on "Intelligence at the Edge: How ...
arXiv:2012.03112v2
fatcat:z7rsgn54mjbmncv5ezq4u3aqqm
Moving Processing to Data: On the Influence of Processing in Memory on Data Management
[article]
2019
arXiv
pre-print
Processing-in-Memory is a sub-class of Near-Data processing that targets data processing directly within memory (DRAM) chips. ...
The authors propose a heterogeneous co-design of hardware and software on the basis of ARM cores and 3D-stacked memory to schedule various NN training operations. ...
Fig. 3: Architecture of a 3D-stacked DRAM (based on [39, 92]).
Fig. 4: PIM Architecture with 3D-stacked DRAM.
Fig. 5: JAFAR's Architecture Diagram (from [118]). ...
arXiv:1905.04767v1
fatcat:xksczeu5jjfxhd4bzvaqpuivna
Beyond programmable shading
2008
ACM SIGGRAPH 2008 classes on - SIGGRAPH '08
Designing a graphics system for future 3D entertainment applications is particularly tricky because at a technical level the goals are ill defined. ...
., for explosions), and artificial intelligence for game play. It also allows these computations to be tightly integrated with the rendering computation. ...
programmable graphics (the new era of cooperatively using the CPU, GPU, and complex, dynamic data structures to efficiently synthesize images) requires new programming models, tools, and rendering systems that are designed ...
doi:10.1145/1401132.1401145
dblp:conf/siggraph/LefohnHBFFLO08
fatcat:jrt5e5373zairmf4fvn2tehaxa
Enhancing Programmability, Portability, and Performance with Rich Cross-Layer Abstractions
[article]
2019
arXiv
pre-print
In doing so, they enable a rich space of hardware-software cooperative mechanisms to optimize for performance. ...
This thesis makes the case for rich low-overhead cross-layer abstractions as a highly effective means to address the above challenges. ...
., the PC and the SIMT stack) is saved in memory. ...
arXiv:1911.05660v1
fatcat:w5f3g4isqbcphm2jjfzjtvrjnq
Enable Deep Learning on Mobile Devices: Methods, Systems, and Applications
2022
ACM Transactions on Design Automation of Electronic Systems
To reduce the large design cost of these manual solutions, we discuss the AutoML framework for each of them, such as neural architecture search (NAS) and automated pruning and quantization. ...
Deep neural networks (DNNs) have achieved unprecedented success in the field of artificial intelligence (AI), including computer vision, natural language processing, and speech recognition. ...
is vectorization (SIMD) and multi-threading (SIMT). ...
doi:10.1145/3486618
fatcat:h6xwv2slo5eklift2fl24usine
2018 Index IEEE Transactions on Very Large Scale Integration (VLSI) Systems Vol. 26
2018
IEEE Transactions on Very Large Scale Integration (VLSI) Systems
., see 2723-2736
, VLSI Design of an ML-Based Power-Efficient Motion Estimation Controller for Intelligent Mobile Systems; TVLSI Feb. 2018 262-271 Hsieh, Y., see Tsai, Y., TVLSI May 2018 945-957 ...
Hsu, K., Chen, Y., Lee, Y., and Chang, S., Contactless Testing for Prebond Interposers; TVLSI June 2018 1005-1014 Hsu, Y., see Liu, Z., 1565-1574 Hu, J., see Wang, Y., TVLSI May 2018 805-817 Hu, J ...
., +, TVLSI July 2018 1290-1300
DRAM chips
Boosting NVDIMM Performance With a Lightweight Caching Algorithm. ...
doi:10.1109/tvlsi.2019.2892312
fatcat:rxiz5duc6jhdzjo4ybcxdajtbq
GPU power modeling and architectural enhancements for GPU energy efficiency
[article]
2019
Initially designed for 3D graphics, they evolved into general purpose accelerators, able to outperform CPUs on many tasks. The architecture of GPUs is optimized for massively parallel applications. ...
We continue with enhancements to improve the energy efficiency of the GPU cores. ...
Recently some GPUs have used 3D-stacked memory such as HBM and HBM2 [63], [64], [212]. ...
doi:10.14279/depositonce-7874
fatcat:wbmij23r2ngtfaskosnrsxt5gu
A Survey on the Optimization of Neural Network Accelerators for Micro-AI On-Device Inference
2021
IEEE Journal on Emerging and Selected Topics in Circuits and Systems
Deep neural networks (DNNs) are being prototyped for a variety of artificial intelligence (AI) tasks including computer vision, data analytics, robotics, etc. ...
techniques, and the realization of the micro-AI models on resource-constrained hardware and different design considerations associated with it. ...
in place to make them lightweight and suitable for micro-AI implementation. ...
doi:10.1109/jetcas.2021.3129415
fatcat:nknpy4eernaeljz2hpqafe7sja
Ray tracing from a data movement perspective
2016
Ray tracing is becoming more widely adopted in offline rendering systems due to its natural support for high quality lighting. ...
Since power consumption is one of the primary factors limiting the increase of processor performance, it must be addressed as a foremost concern in any future ray tracing system designs. ...
BVH traversal is handled with a special set of stack registers designated for stack nodes. ...
doi:10.26053/0h-mghs-0t00
fatcat:bx3lwpkbarc4ji75n2gvby3x3q
GPU computing architecture for irregular parallelism
2015
Second, Kilo TM is a cost effective, energy efficient solution for supporting transactional memory (TM) on GPUs. ...
Our evaluations show that TBC provides an average speedup of 22% over a baseline per-warp, stack-based reconvergence mechanism on a set of GPU applications that suffer significantly from branch divergence ...
points as the per-warp SIMT stack in the baseline SIMT core. ...
doi:10.14288/1.0167110
fatcat:lk6u3fzl5fgcpnt2finxbdqfre
Efficient synchronization mechanisms for scalable GPU architectures
2020
Third, we design HMG, a hierarchical cache coherence protocol for multi-GPU systems. ...
The Graphics Processing Unit (GPU) has become a mainstream computing platform for a wide range of applications. ...
[Figure: GPU architecture diagram. Labels: Last Level Cache, DRAM Controller, Memory Partition, Register File, SM Core, SIMT Stacks, Thread Block, L1 Data Cache, Texture Cache, Constant Cache] ...
doi:10.14288/1.0394805
fatcat:aoyfyhwdyjbefp6yjyqk2p4tti
On Efficient GPGPU Computing for Integrated Heterogeneous CPU-GPU Microprocessors
[article]
2021
By intelligently scheduling the execution of the producer and consumer in a software pipeline, evictions can be avoided, saving DRAM accesses, power, and performance. ...
These features allow for the optimization of codes that heretofore would be suitable only for multi-core CPUs or discrete GPUs to be run on a heterogeneous CPU-GPU microprocessor efficiently and in some ...
these SIMT cores for different vendors. ...
doi:10.13016/kwbs-up51
fatcat:vyh7wangjngulhpupg3pk6rog4
Towards Closing the Programmability-Efficiency Gap using Software-Defined Hardware
[article]
2021
The solution consists of a tiled hardware architecture, co-designed with the outer product algorithm for Sparse Matrix-Matrix multiplication (SpMM), that uses on-chip memory reconfiguration to accelerate ...
This system is designed to deliver near-accelerator-level efficiency across a broad set of applications, while retaining CPU-like programmability. ...
[222] introduce a 3D-stacked logic-in-memory system by placing logic layers between Dynamic Random Access Memory (DRAM) dies to accelerate a 3D-DRAM system for sparse data access and build a custom ...
doi:10.7302/2904
fatcat:zraktiwmczc7bkmqqxrvuxdiue
Using complexity to protect elections
2010
Communications of the ACM
Our petabyte would be a stack of 2×10^5 stone DVDs. A lot can happen to a stack that big in 100 years. ...
More recent multicore CPUs (such as the Intel Core2 Duo and Core i7) reflect a trend toward somewhat less-aggressive designs that expect a modest amount of parallelism. ...
You may assume for this purpose that you, your enemy, and the students are all slim enough to be considered points (viewed from above), rather than solid figures in 3D space. ...
doi:10.1145/1839676.1839696
fatcat:hbqpm5boabe3jcpa4jcs7czf6y