A scalable multi-path microarchitecture for efficient GPU control flow

Ahmed ElTantawy, Jessica Wenjie Ma, Mike O'Connor, Tor M. Aamodt
2014 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)  
These GPUs implement the SIMT execution model by serializing the execution of different control flow paths encountered by a warp.  ...  It also enables reconvergence before the immediate postdominator which is important for efficient execution of unstructured control flow.  ...  Rogers, Ayub Gubran, Hadi Jooybar and the reviewers for their insightful feedback.  ... 
doi:10.1109/hpca.2014.6835936 dblp:conf/hpca/ElTantawyMOA14 fatcat:topj6q33q5frznp463tf6usms4

An Empirical-cum-Statistical Approach to Power-Performance Characterization of Concurrent GPU Kernels [article]

Nilanjan Goswami, Amer Qouneh, Chao Li, Tao Li
2020 arXiv   pre-print
Growing deployment of power and energy efficient throughput accelerators (GPU) in data centers demands enhancement of power-performance co-optimization capabilities of GPUs.  ...  Also, we demonstrate a multi-kernel throughput benchmark suite based on the framework that encapsulates symmetric, asymmetric and co-existing (often appears together) kernel based workloads.  ...  Figure 3 depicts the flow of operations for multi-kernel workload generation. Following subsections explain the process in detail.  ... 
arXiv:2011.02368v2 fatcat:xgce6gvcjjcilfwem452yd3hsi

FastLanes: An FPGA accelerated GPU microarchitecture simulator

Kuan Fang, Yufei Ni, Jiayuan He, Zonghui Li, Shuai Mu, Yangdong Deng
2013 2013 IEEE 31st International Conference on Computer Design (ICCD)  
In this paper, we propose FastLanes, an FPGA based simulator for a generic GPU microarchitecture, to enable hardware-accelerated simulation.  ...  Currently, GPU architecture research resorts to time-consuming software simulations to evaluate microarchitecture innovations.  ...  The second step is for the control flow, especially branches, in CUDA programs.  ... 
doi:10.1109/iccd.2013.6657049 dblp:conf/iccd/FangNHLMD13 fatcat:li4p6f2ebrfmlcx7m7u3sb7dc4

Exploiting New Interconnect Technologies in On-Chip Communication

John Kim, Kiyoung Choi, Gabriel Loh
2012 IEEE Journal on Emerging and Selected Topics in Circuits and Systems  
The conventional metal interconnect is limited, especially for global communication, and can not scale efficiently.  ...  The communication challenge is not only within a single chip but providing high bandwidth to the increasing number of cores from off-chip memory is also a challenge.  ...  For short-channels such as a 2D mesh topology, such flow control can be done efficiently but it is likely not appropriate for a global interconnect.  ... 
doi:10.1109/jetcas.2012.2201031 fatcat:3arzyh25zrcybaqc3sqlocus2q

Power Optimization Techniques for NOC

Abhinav Bijapur, BMS College of Engineering, Bengaluru
2020 International Journal of Engineering Research and  
The on-chip network has become a significant solution for the communication limitation of SoC (System-on-chip).  ...  This paper presents a detailed structure and verification of the router module and various power optimization techniques for NOC by restructuring the architecture.  ...  Path diversity: Many ways to get from one node to another. II. ROUTER MICROARCHITECTURE & FLOW CONTROL A.  ... 
doi:10.17577/ijertv9is070088 fatcat:fu4ezauspbhmloye3d7jkqrkim

Exploring GPGPU workloads: Characterization methodology, analysis and microarchitecture evaluation implications

Nilanjan Goswami, Ramkumar Shankar, Madhura Joshi, Tao Li
2010 IEEE International Symposium on Workload Characterization (IISWC'10)  
With growing number of GPGPU workloads, this approach of analysis provides meaningful, accurate and thorough simulation for a proposed GPU architecture design choice.  ...  Architects also benefit by choosing a set of workloads to stress their intended functional block of the GPU microarchitecture.  ...  The authors acknowledge the UF HPC Center for providing computational resources. We also acknowledge anonymous reviewers for their valuable suggestions.  ... 
doi:10.1109/iiswc.2010.5649549 dblp:conf/iiswc/GoswamiSJL10 fatcat:jw66a5zr6bbvxo32mmezqi5u64

Pangaea

Henry Wong, Hong Wang, Anne Bracy, Ethan Schuchman, Tor M. Aamodt, Jamison D. Collins, Perry H. Wang, Gautham Chinya, Ankur Khandelwal Groen, Hong Jiang
2008 Proceedings of the 17th international conference on Parallel architectures and compilation techniques - PACT '08  
Pangaea introduces (1) a resource repartitioning of the GPU, where the hardware budget dedicated for 3D-specific graphics processing is used to build more general-purpose GPU cores, and (2) a 3-instruction  ...  Pangaea is a heterogeneous CMP design for non-rendering workloads that integrates IA32 CPU cores with non-IA32 GPU-class multicores, extending the current state-of-the-art CPU-GPU integration that physically  ...  Acknowledgments We would like to thank Prasoonkumar Surti, Chris Zou, Lisa Pearce, Xintian Wu, and Ed Grochowski for the productive collaboration throughout the Pangaea project.  ... 
doi:10.1145/1454115.1454125 dblp:conf/IEEEpact/WongBSACWCGJW08 fatcat:p37zbpaobza7pngzkxogk37fyy

Vortex: OpenCL Compatible RISC-V GPGPU [article]

Fares Elsabbagh, Blaise Tine, Priyadarshini Roshan, Ethan Lyons, Euna Kim, Da Eun Shim, Lingjun Zhu, Sung Kyu Lim, Hyesoon Kim
2020 arXiv   pre-print
In this work, we present Vortex, a RISC-V General-Purpose GPU that supports OpenCL.  ...  In addition, OpenCL is currently the most widely adopted programming framework for heterogeneous platforms available on mainstream CPUs, GPUs, as well as FPGAs and custom DSP.  ...  Instructions Description wspawn %numW, %PC Spawn W new warps at PC tmc %numT Change the thread mask to activate threads split %pred Control flow divergence join Control flow reconvergence bar  ... 
arXiv:2002.12151v1 fatcat:uvuhcu7hbfbkneh3iph5v7cpvm

Energy Efficient Computing Systems: Architectures, Abstractions and Modeling to Techniques and Standards [article]

Rajeev Muralidhar and Renata Borovica-Gajic and Rajkumar Buyya
2020 arXiv   pre-print
This survey aims to bring these domains together and is composed of a systematic categorization of key aspects of building energy efficient systems - (a) specification - the ability to precisely specify  ...  Many research surveys have covered different aspects of techniques in hardware and microarchitecture across devices, servers, HPC, data center systems along with software, algorithms, frameworks for energy  ...  The dynamic coordination is implemented as a hierarchical control system for scalable communication and decentralized control.  ... 
arXiv:2007.09976v2 fatcat:enrfj2qgerhyteapwykxcb5pni

A Configurable Cloud-Scale DNN Processor for Real-Time AI

Jeremy Fowers, Kalin Ovtcharov, Michael Papamichael, Todd Massengill, Ming Liu, Daniel Lo, Shlomi Alkalay, Michael Haselman, Logan Adams, Mahdi Ghandi, Stephen Heil, Prerak Patel (+8 others)
2018 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA)  
This paper describes the NPU architecture for Project Brainwave, a production-scale system for real-time AI.  ...  The NPU attains this performance using a single-threaded SIMD ISA paired with a distributed microarchitecture capable of dispatching over 7M operations from a single instruction.  ...  The scalar core provides the BW NPU's control flow, including dynamic input-dependent control flow, a critical requirement for certain models such as single-batch RNNs with variable-length timesteps.  ... 
doi:10.1109/isca.2018.00012 dblp:conf/isca/FowersOPMLLAHAG18 fatcat:qalwazqx7jcctkndjrqmhszccq

Simultaneous branch and warp interweaving for sustained GPU performance

Nicolas Brunie, Sylvain Collange, Gregory Diamos
2012 SIGARCH Computer Architecture News  
As individual threads take divergent execution paths, their processing takes place sequentially, defeating part of the efficiency advantage of SIMD execution.  ...  Single-Instruction Multiple-Thread (SIMT) microarchitectures implemented in Graphics Processing Units (GPUs) run fine-grained threads in lockstep by grouping them into units, referred to as warps, to amortize  ...  Acknowledgements We thank Michael Shebanow and Andy Glew for early discussions on the original idea, and Mourad Gouicem and Pierre Fortin for their help with the TMD  ... 
doi:10.1145/2366231.2337166 fatcat:54bjhptf3nainlp37egc627cru

Simultaneous branch and warp interweaving for sustained GPU performance

Nicolas Brunie, Sylvain Collange, Gregory Diamos
2012 2012 39th Annual International Symposium on Computer Architecture (ISCA)  
As individual threads take divergent execution paths, their processing takes place sequentially, defeating part of the efficiency advantage of SIMD execution.  ...  Single-Instruction Multiple-Thread (SIMT) microarchitectures implemented in Graphics Processing Units (GPUs) run fine-grained threads in lockstep by grouping them into units, referred to as warps, to amortize  ...  Acknowledgements We thank Michael Shebanow and Andy Glew for early discussions on the original idea, and Mourad Gouicem and Pierre Fortin for their help with the TMD  ... 
doi:10.1109/isca.2012.6237005 dblp:conf/isca/BrunieCD12 fatcat:kzqn3yyrxrb4lbmtzjbrcwlvxy

Programmable and Scalable Architecture for Graphics Processing Units [chapter]

Carlos S. de La Lama, Pekka Jääskeläinen, Jarmo Takala
2009 Lecture Notes in Computer Science  
In this paper we evaluate the suitability of Transport Triggered Architectures (TTA) as a basis for implementing GPUs.  ...  TTA improves scalability over the traditional VLIW-style architectures making it interesting for computationally intensive applications.  ...  This research was partially funded by the Academy of Finland, the Nokia Foundation and Finnish Center for International Mobility (CIMO).  ... 
doi:10.1007/978-3-642-03138-0_2 fatcat:74m7dlpfkbbuhj3ab5oxm5lkmu

I(Re)2-WiNoC: Exploring scalable wireless on-chip micronetworks for heterogeneous embedded many-core SoCs

Dan Zhao, Yi Wang, Hongyi Wu, Takamaro Kikkawa
2015 Digital Communications and Networks  
A region-aided routing scheme is further designed and implemented to realize loop-free, minimum path cost and high scalability for irregular WiNoC infrastructure.  ...  Modern embedded SoC design uses a rapidly increasing number of processing units for ubiquitous computing, forming the so-called embedded many-core SoCs (McSoC).  ...  The flow control logic implements a distributed prioritized flow control strategy to deal with the data traffic on a channel and inside a RF node.  ... 
doi:10.1016/j.dcan.2015.01.003 fatcat:3zzdh5lreralvhjx2n4jbd7gkq

Comparing the performance of different x86 SIMD instruction sets for a medical imaging application on modern multi- and manycore chips

Johannes Hofmann, Jan Treibig, Georg Hager, Gerhard Wellein
2014 Proceedings of the 2014 Workshop on Programming models for SIMD/Vector processing - WPMVP '14  
Finally we discuss why GPU implementations perform much better for this specific algorithm.  ...  A special emphasis is put on the vector gather implementation on Intel Haswell and Knights Corner microarchitectures.  ...  Acknowledgments We thank IBM Research for giving Jan Treibig the opportunity for a scientific visit at the T.J.Watson Research Center, which was the starting point for this work.  ... 
doi:10.1145/2568058.2568068 dblp:conf/ppopp/HofmannTHW14 fatcat:w66hhteubrep3ekonywb4d6c2q