A comparative analysis of microarchitecture effects on CPU and GPU memory system behavior
2014
2014 IEEE International Symposium on Workload Characterization (IISWC)
This paper presents a detailed comparison of memory access behavior for parallel applications executing on each core type in tightly-controlled heterogeneous CPU-GPU processor simulation. ...
CPU and GPU cores. ...
In this paper, we presented the first detailed analysis of memory system behavior and effects for applications mapped to both CPU and GPU cores. ...
doi:10.1109/iiswc.2014.6983054
dblp:conf/iiswc/HestnessKW14
fatcat:k76obdosvfhi5aftguivbbyhbe
On latency in GPU throughput microarchitectures
2015
2015 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)
Our results show that GPUs are not as effective in latency hiding as commonly thought and based on that, we argue that latency should also be a GPU design consideration besides throughput. ...
In fact, little is publicly known about the values, behavior, and performance impact of microarchitecture latency components in modern GPUs. ...
In the dynamic latency analysis, we used a GPU performance simulator and an exemplary workload to determine two key contributors to dynamic memory load latency, queueing and arbitration. ...
doi:10.1109/ispass.2015.7095801
dblp:conf/ispass/AnderschLAJ15
fatcat:bpu2rwqtmfbazfdduimty7zbay
Virtual Platform to Analyze the Security of a System on Chip at Microarchitectural Level
2021
2021 IEEE European Symposium on Security and Privacy Workshops (EuroS&PW)
The main objective is to create a virtual and open platform that simulates the behavior of microarchitectural features and their interactions with the peripherals, like accelerators and memories in emerging ...
One typical example is the exploitation of cache memory which keeps track of the program execution and paves the way to side-channel (SCA) analysis and transient execution attacks like Meltdown and Spectre ...
Acknowledgements: The work presented in this paper was realized in the framework of the ARCHI-SEC project number ANR-19-CE39-0008-03 supported by the French "Agence Nationale de la Recherche". ...
doi:10.1109/eurospw54576.2021.00017
fatcat:ljhuwgh3ebb47ksi3bocapspmy
Enabling GPGPU Low-Level Hardware Explorations with MIAOW
2015
ACM Transactions on Architecture and Code Optimization (TACO)
While useful for modeling first-order effects, these tools do not provide a detailed view of GPU microarchitecture and physical design. ...
Today's tools for GPU analysis include simulators like GPGPU-Sim, Multi2Sim, and Barra. ...
In concrete terms, MIAOW focuses on microarchitecture of compute units (CUs) and implements them in synthesizable Verilog RTL and leaves the memory hierarchy and memory controllers as behavioral (emulated ...
doi:10.1145/2764908
fatcat:utj6prgm2zcctlb36ikgejny2e
Query Co-Processing on Commodity Hardware
2006
22nd International Conference on Data Engineering (ICDE'06)
Furthermore, due to the increasing gap between the processor and memory speeds, analysis of memory and processor behaviors has become important. ...
The inherent parallelism and the high memory bandwidth available in the GPUs can be used to accelerate many of the traditional algorithms by an order of magnitude as compared to CPU-based implementations ...
doi:10.1109/icde.2006.122
dblp:conf/icde/AilamakiGM06
fatcat:x3rdgytg3fcwlgld6ctgc32vxu
Tensor Casting: Co-Designing Algorithm-Architecture for Personalized Recommendation Training
[article]
2020
arXiv
pre-print
When prototyped on a real CPU-GPU system, Tensor Casting provides 1.9-21x improvements in training throughput compared to state-of-the-art approaches. ...
In this paper, we first perform a detailed workload characterization study on training recommendations, root-causing sparse embedding layer training as one of the most significant performance bottlenecks ...
For our memory-centric system, we utilize a pair of V100s to model the NMP-GPU system, where one of the GPUs emulates the behavior of our NMP-augmented disaggregated memory node. ...
arXiv:2010.13100v1
fatcat:kt7vrmg7ezhijgdsvoqjywwkye
GARDENIA: A Domain-specific Benchmark Suite for Next-generation Accelerators
[article]
2018
arXiv
pre-print
Our characterization shows that GARDENIA exhibits irregular microarchitectural behavior which is quite different from structured workloads and straightforward-implemented graph benchmarks. ...
do not apply state-of-the-art algorithms and/or optimization techniques. ...
In fact, due to different features of MIC and GPU, irregular workloads running on them have significantly different microarchitecture behaviors. ...
arXiv:1708.04567v4
fatcat:qlem3aokhvg5bd22bonzceaazq
NeuMMU: Architectural Support for Efficient Address Translations in Neural Processing Units
[article]
2019
arXiv
pre-print
Through a careful data-driven application characterization study, we root-cause several limitations of prior GPU-centric address translation schemes and propose a memory management unit (MMU) that is tailored ...
To satisfy the compute and memory demands of deep neural networks, neural processing units (NPUs) are widely being utilized for accelerating deep learning algorithms. ...
As GPUs evolved into having a proper memory management unit (MMU) [1], [2], [3], programmers are now given the illusion of a unified CPU-GPU memory address [4], [5] allowing CPU and GPU to share ...
arXiv:1911.06859v1
fatcat:pyzkc6lh55gslf3kzzgseddt5q
Dark Silicon and the End of Multicore Scaling
2012
IEEE Micro
The multicore designs we study include single-threaded CPU-like and massively threaded GPU-like multicore chip organizations with symmetric, asymmetric, dynamic, and composed topologies. ...
Even at 22 nm (just one year from now), 21% of a fixed-size chip must be powered off, and at 8 nm, this number grows to more than 50%. ...
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of NSF. ...
doi:10.1109/mm.2012.17
fatcat:ycpm5ytkarbvrfslewcz4eau4e
Dark silicon and the end of multicore scaling
2011
SIGARCH Computer Architecture News
The multicore designs we study include single-threaded CPU-like and massively threaded GPU-like multicore chip organizations with symmetric, asymmetric, dynamic, and composed topologies. ...
Even at 22 nm (just one year from now), 21% of a fixed-size chip must be powered off, and at 8 nm, this number grows to more than 50%. ...
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of NSF. ...
doi:10.1145/2024723.2000108
fatcat:xsb4bh3wmvhwxmbkfucblpc3c4
Dark silicon and the end of multicore scaling
2011
Proceedings of the 38th Annual International Symposium on Computer Architecture - ISCA '11
The multicore designs we study include single-threaded CPU-like and massively threaded GPU-like multicore chip organizations with symmetric, asymmetric, dynamic, and composed topologies. ...
Even at 22 nm (just one year from now), 21% of a fixed-size chip must be powered off, and at 8 nm, this number grows to more than 50%. ...
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of NSF. ...
doi:10.1145/2000064.2000108
dblp:conf/isca/EsmaeilzadehBASB11
fatcat:jjxyd4yq2rdszbjskujed3xbxa
Power Limitations and Dark Silicon Challenge the Future of Multicore
2012
ACM Transactions on Computer Systems
The multicore designs we study include single-threaded CPU-like and massively threaded GPU-like multicore chip organizations with symmetric, asymmetric, dynamic, and composed topologies. ...
Even at 22 nm (just one year from now), 21% of a fixed-size chip must be powered off, and at 8 nm, this number grows to more than 50%. ...
We believe this study makes the case for innovation's urgency and its potential for high impact while providing a model that can be adopted as a tool by researchers and engineers to study limits of their ...
doi:10.1145/2324876.2324879
fatcat:ydudmzl3mbhtjjrzxesodcvlpq
GPUWattch
2013
Proceedings of the 40th Annual International Symposium on Computer Architecture - ISCA '13
To achieve configurability, we use a bottom-up methodology and abstract parameters from the microarchitectural components as the model's inputs. ...
We developed a rigorous suite of 80 microbenchmarks that we use to bound any modeling uncertainties and inaccuracies. ...
We thank Steve Keckler and John Edmondson for helpful discussions on the challenges of power modeling of GPUs. ...
doi:10.1145/2485922.2485964
dblp:conf/isca/LengHEGKAR13
fatcat:bkfi476bf5ed5lalls522mmd64
GPU voltage noise: Characterization and hierarchical smoothing of spatial and temporal voltage noise interference in GPU architectures
2015
2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA)
The GPU's manycore nature complicates the voltage noise phenomenon, and its distinctive architecture features from the CPU necessitate a GPU-specific voltage noise analysis. ...
Third, on the basis of our categorization and characterization, we propose a hierarchical voltage smoothing mechanism that mitigates each type of voltage droop. ...
The views expressed in this paper are those of the authors only and do not reflect the official policy or position of the NSF or the U.S. Government. ...
doi:10.1109/hpca.2015.7056030
dblp:conf/hpca/LengZR15
fatcat:zuuvu4mbmvccbkrkhpnwuqolmq
An Empirical-cum-Statistical Approach to Power-Performance Characterization of Concurrent GPU Kernels
[article]
2020
arXiv
pre-print
Growing deployment of power and energy efficient throughput accelerators (GPU) in data centers demands enhancement of power-performance co-optimization capabilities of GPUs. ...
On average, our analysis reveals that spatial and temporal concurrency within kernel execution in throughput architectures saves energy consumption by 32%, 26% and 33% in GTX470, Tesla M2050 and Tesla ...
Power Efficiency and Occupancy Analysis: Comparative analysis of occupancy (% core utilization) in Figure 21 reveals that, on average, M2050 and K20 achieve 91% and 83% more occupancy compared to GTX470 ...
arXiv:2011.02368v2
fatcat:xgce6gvcjjcilfwem452yd3hsi