942 Hits in 6.2 sec

A comparative analysis of microarchitecture effects on CPU and GPU memory system behavior

Joel Hestness, Stephen W. Keckler, David A. Wood
2014 IEEE International Symposium on Workload Characterization (IISWC)
This paper presents a detailed comparison of memory access behavior for parallel applications executing on each core type in tightly-controlled heterogeneous CPU-GPU processor simulation.  ...  CPU and GPU cores.  ...  In this paper, we presented the first detailed analysis of memory system behavior and effects for applications mapped to both CPU and GPU cores.  ... 
doi:10.1109/iiswc.2014.6983054 dblp:conf/iiswc/HestnessKW14 fatcat:k76obdosvfhi5aftguivbbyhbe

On latency in GPU throughput microarchitectures

Michael Andersch, Jan Lucas, Mauricio Álvarez-Mesa, Ben Juurlink
2015 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)
Our results show that GPUs are not as effective in latency hiding as commonly thought and based on that, we argue that latency should also be a GPU design consideration besides throughput.  ...  In fact, little is publicly known about the values, behavior, and performance impact of microarchitecture latency components in modern GPUs.  ...  In the dynamic latency analysis, we used a GPU performance simulator and an exemplary workload to determine two key contributors to dynamic memory load latency, queueing and arbitration.  ... 
doi:10.1109/ispass.2015.7095801 dblp:conf/ispass/AnderschLAJ15 fatcat:bpu2rwqtmfbazfdduimty7zbay

Virtual Platform to Analyze the Security of a System on Chip at Microarchitectural Level

Quentin Forcioli, Jean-Luc Danger, Clémentine Maurice, Lilian Bossuet, Florent Bruguier, Maria Mushtaq, David Novo, Loïc France, Pascal Benoit, Sylvain Guilley, Thomas Perianin
2021 IEEE European Symposium on Security and Privacy Workshops (EuroS&PW)
The main objective is to create a virtual and open platform that simulates the behavior of microarchitectural features and their interactions with the peripherals, like accelerators and memories in emerging  ...  One typical example is the exploitation of cache memory which keeps track of the program execution and paves the way to side-channel (SCA) analysis and transient execution attacks like Meltdown and Spectre  ...  ACKNOWLEDGEMENTS The work presented in this paper was realized in the framework of the ARCHI-SEC project number ANR-19-CE39-0008-03 supported by the French "Agence Nationale de la Recherche".  ... 
doi:10.1109/eurospw54576.2021.00017 fatcat:ljhuwgh3ebb47ksi3bocapspmy

Enabling GPGPU Low-Level Hardware Explorations with MIAOW

Raghuraman Balasubramanian, Pradip Valathol, Karthikeyan Sankaralingam, Vinay Gangadhar, Ziliang Guo, Chen-Han Ho, Cherin Joseph, Jaikrishnan Menon, Mario Paulo Drumond, Robin Paul, Sharath Prasad
2015 ACM Transactions on Architecture and Code Optimization (TACO)  
While useful for modeling first-order effects, these tools do not provide a detailed view of GPU microarchitecture and physical design.  ...  Today's tools for GPU analysis include simulators like GPGPU-Sim, Multi2Sim, and Barra.  ...  In concrete terms, MIAOW focuses on microarchitecture of compute units (CUs) and implements them in synthesizable Verilog RTL and leaves the memory hierarchy and memory controllers as behavioral (emulated  ... 
doi:10.1145/2764908 fatcat:utj6prgm2zcctlb36ikgejny2e

Query Co-Processing on Commodity Hardware

A. Ailamaki, N.K. Govindaraju, D. Manocha
2006 22nd International Conference on Data Engineering (ICDE'06)  
Furthermore, due to the increasing gap between the processor and memory speeds, analysis of memory and processor behaviors has become important.  ...  The inherent parallelism and the high memory bandwidth available in the GPUs can be used to accelerate many of the traditional algorithms by an order of magnitude as compared to CPU-based implementations  ... 
doi:10.1109/icde.2006.122 dblp:conf/icde/AilamakiGM06 fatcat:x3rdgytg3fcwlgld6ctgc32vxu

Tensor Casting: Co-Designing Algorithm-Architecture for Personalized Recommendation Training [article]

Youngeun Kwon, Yunjae Lee, Minsoo Rhu
2020 arXiv   pre-print
When prototyped on a real CPU-GPU system, Tensor Casting provides 1.9-21x improvements in training throughput compared to state-of-the-art approaches.  ...  In this paper, we first perform a detailed workload characterization study on training recommendations, root-causing sparse embedding layer training as one of the most significant performance bottlenecks  ...  For our memory-centric system, we utilize a pair of V100s to model the NMP-GPU system, where one of the GPUs emulates the behavior of our NMP-augmented disaggregated memory node.  ... 
arXiv:2010.13100v1 fatcat:kt7vrmg7ezhijgdsvoqjywwkye

GARDENIA: A Domain-specific Benchmark Suite for Next-generation Accelerators [article]

Zhen Xu, Xuhao Chen, Jie Shen, Yang Zhang, Cheng Chen, Canqun Yang
2018 arXiv   pre-print
Our characterization shows that GARDENIA exhibits irregular microarchitectural behavior which is quite different from structured workloads and straightforward-implemented graph benchmarks.  ...  do not apply state-of-the-art algorithms and/or optimization techniques.  ...  In fact, due to different features of MIC and GPU, irregular workloads running on them have significantly different microarchitecture behaviors.  ... 
arXiv:1708.04567v4 fatcat:qlem3aokhvg5bd22bonzceaazq

NeuMMU: Architectural Support for Efficient Address Translations in Neural Processing Units [article]

Bongjoon Hyun, Youngeun Kwon, Yujeong Choi, John Kim, Minsoo Rhu
2019 arXiv   pre-print
Through a careful data-driven application characterization study, we root-cause several limitations of prior GPU-centric address translation schemes and propose a memory management unit (MMU) that is tailored  ...  To satisfy the compute and memory demands of deep neural networks, neural processing units (NPUs) are widely being utilized for accelerating deep learning algorithms.  ...  As GPUs evolved into having a proper memory management unit (MMU) [1] , [2] , [3] , programmers are now given the illusion of a unified CPU-GPU memory address [4] , [5] allowing CPU and GPU to share  ... 
arXiv:1911.06859v1 fatcat:pyzkc6lh55gslf3kzzgseddt5q

Dark Silicon and the End of Multicore Scaling

H. Esmaeilzadeh, E. Blem, R. St. Amant, K. Sankaralingam, D. Burger
2012 IEEE Micro  
The multicore designs we study include single-threaded CPU-like and massively threaded GPU-like multicore chip organizations with symmetric, asymmetric, dynamic, and composed topologies.  ...  Even at 22 nm (just one year from now), 21% of a fixed-size chip must be powered off, and at 8 nm, this number grows to more than 50%.  ...  Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of NSF.  ... 
doi:10.1109/mm.2012.17 fatcat:ycpm5ytkarbvrfslewcz4eau4e

Dark silicon and the end of multicore scaling

Hadi Esmaeilzadeh, Emily Blem, Renee St. Amant, Karthikeyan Sankaralingam, Doug Burger
2011 SIGARCH Computer Architecture News  
The multicore designs we study include single-threaded CPU-like and massively threaded GPU-like multicore chip organizations with symmetric, asymmetric, dynamic, and composed topologies.  ...  Even at 22 nm (just one year from now), 21% of a fixed-size chip must be powered off, and at 8 nm, this number grows to more than 50%.  ...  Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of NSF.  ... 
doi:10.1145/2024723.2000108 fatcat:xsb4bh3wmvhwxmbkfucblpc3c4

Dark silicon and the end of multicore scaling

Hadi Esmaeilzadeh, Emily Blem, Renee St. Amant, Karthikeyan Sankaralingam, Doug Burger
2011 Proceedings of the 38th Annual International Symposium on Computer Architecture - ISCA '11
The multicore designs we study include single-threaded CPU-like and massively threaded GPU-like multicore chip organizations with symmetric, asymmetric, dynamic, and composed topologies.  ...  Even at 22 nm (just one year from now), 21% of a fixed-size chip must be powered off, and at 8 nm, this number grows to more than 50%.  ...  Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of NSF.  ... 
doi:10.1145/2000064.2000108 dblp:conf/isca/EsmaeilzadehBASB11 fatcat:jjxyd4yq2rdszbjskujed3xbxa

Power Limitations and Dark Silicon Challenge the Future of Multicore

Hadi Esmaeilzadeh, Emily Blem, Renée St. Amant, Karthikeyan Sankaralingam, Doug Burger
2012 ACM Transactions on Computer Systems  
The multicore designs we study include single-threaded CPU-like and massively threaded GPU-like multicore chip organizations with symmetric, asymmetric, dynamic, and composed topologies.  ...  Even at 22 nm (just one year from now), 21% of a fixed-size chip must be powered off, and at 8 nm, this number grows to more than 50%.  ...  We believe this study makes the case for innovation's urgency and its potential for high impact while providing a model that can be adopted as a tool by researchers and engineers to study limits of their  ... 
doi:10.1145/2324876.2324879 fatcat:ydudmzl3mbhtjjrzxesodcvlpq

GPUWattch

Jingwen Leng, Tayler Hetherington, Ahmed ElTantawy, Syed Gilani, Nam Sung Kim, Tor M. Aamodt, Vijay Janapa Reddi
2013 Proceedings of the 40th Annual International Symposium on Computer Architecture - ISCA '13  
To achieve configurability, we use a bottom-up methodology and abstract parameters from the microarchitectural components as the model's inputs.  ...  We developed a rigorous suite of 80 microbenchmarks that we use to bound any modeling uncertainties and inaccuracies.  ...  We thank Steve Keckler and John Edmondson for helpful discussions on the challenges of power modeling of GPUs.  ... 
doi:10.1145/2485922.2485964 dblp:conf/isca/LengHEGKAR13 fatcat:bkfi476bf5ed5lalls522mmd64

GPU voltage noise: Characterization and hierarchical smoothing of spatial and temporal voltage noise interference in GPU architectures

Jingwen Leng, Yazhou Zu, Vijay Janapa Reddi
2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA)
The GPU's manycore nature complicates the voltage noise phenomenon, and its distinctive architecture features from the CPU necessitate a GPU-specific voltage noise analysis.  ...  Third, on the basis of our categorization and characterization, we propose a hierarchical voltage smoothing mechanism that mitigates each type of voltage droop.  ...  The views expressed in this paper are those of the authors only and do not reflect the official policy or position of the NSF or the U.S. Government.  ... 
doi:10.1109/hpca.2015.7056030 dblp:conf/hpca/LengZR15 fatcat:zuuvu4mbmvccbkrkhpnwuqolmq

An Empirical-cum-Statistical Approach to Power-Performance Characterization of Concurrent GPU Kernels [article]

Nilanjan Goswami, Amer Qouneh, Chao Li, Tao Li
2020 arXiv   pre-print
Growing deployment of power and energy efficient throughput accelerators (GPU) in data centers demands enhancement of power-performance co-optimization capabilities of GPUs.  ...  On average, our analysis reveals that spatial and temporal concurrency within kernel execution in throughput architectures saves energy consumption by 32%, 26% and 33% in GTX470, Tesla M2050 and Tesla  ...  Comparative analysis of occupancy (% core utilization) in Figure 21 reveals that, on average, M2050 and K20 achieve 91% and 83% more occupancy compared to GTX470  ... 
arXiv:2011.02368v2 fatcat:xgce6gvcjjcilfwem452yd3hsi
Showing results 1 — 15 out of 942 results