
Analyzing Behavior Specialized Acceleration

Tony Nowatzki, Karthikeyan Sankaralingam
2016 ACM SIGOPS Operating Systems Review  
For example, a 2-wide OOO processor with three BSAs matches the performance of a conventional 6-wide OOO core, has 40% lower area, and is 2.6× more energy efficient.  ...  Of significant interest are Behavioral Specialized Accelerators (BSAs), which are designed to efficiently execute code with only certain properties, but remain largely configurable or programmable.  ...  Acknowledgments We thank Venkatraman Govindaraju for his help in creating the initial TDG models and validation for DySER and SIMD.  ... 
doi:10.1145/2954680.2872412 fatcat:66uy7l3ggbh6ze2mp33wtgmbtm

Automatic compilation to a coarse-grained reconfigurable system-on-chip

Girish Venkataramani, Walid Najjar, Fadi Kurdahi, Nader Bagherzadeh, Wim Bohm, Jeff Hammes
2003 ACM Transactions on Embedded Computing Systems  
However, one of the obstacles to the wider acceptance of this technology is its programmability.  ...  We have compiled some important image-processing kernels, and the generated schedules reflect an average speed-up in execution times of up to 6x compared to the execution on 800 MHz Pentium III machines  ...  The compiler aims at exploiting fine-grained parallelism in applications by scheduling frequently executed instruction sequences (the trace-scheduling technique from VLIW compilers) for execution on the  ... 
doi:10.1145/950162.950167 fatcat:atgwub4vmnfmtekpsaxiot77ju

Software transparent dynamic binary translation for coarse-grain reconfigurable architectures

Matthew A. Watkins, Tony Nowatzki, Anthony Carno
2016 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA)  
Coarse-grained reconfigurable architectures (CGRAs) are a class of architectures that provide a configurable grouping of functional units that aim to bridge the gap between the power and performance of  ...  The end of Dennard Scaling has forced architects to focus on designing for execution efficiency.  ...  ACKNOWLEDGMENTS We thank Karu Sankaralingam for his feedback as this work developed and for comments on draft versions of the paper.  ... 
doi:10.1109/hpca.2016.7446060 dblp:conf/hpca/WatkinsNC16 fatcat:ssmt2kzalba2xoozatcp6imlxq

The Good Block: Hardware/Software Design for Composable, Block-Atomic Processors

Bertrand A. Maher, Katherine E. Coons, Kathryn S. McKinley, Doug Burger
2011 2011 15th Workshop on Interaction between Compilers and Computer Architectures  
Although the architecture for a single size is simpler, the additions for variable sizes are modest and ease hardware configuration.  ...  Policies vary based on (1) the amount of parallelism inherent in the application, e.g., for integer and numerical applications, and (2) the available parallel resources.  ...  Any opinions, findings and conclusions expressed herein are the authors' and do not necessarily reflect those of the sponsors.  ... 
doi:10.1109/interact.2011.17 dblp:conf/IEEEinteract/MaherCMB11 fatcat:g2xvoea435gldesvdpkhyfm6rq

Serving Recurrent Neural Networks Efficiently with a Spatial Accelerator [article]

Tian Zhao, Yaqi Zhang, Kunle Olukotun
2019 arXiv   pre-print
We evaluate our optimization strategy on such abstraction with DeepBench using a configurable spatial accelerator.  ...  Such abstraction level enables a design space search that can lead to efficient usage of on-chip resources on a spatial architecture across a range of problem sizes.  ...  We thank Matthew Feldman for compiler support and his constructive suggestions on the manuscript of this paper, and Raghu Prabhakar for providing insights and feedback on the architecture section of this  ... 
arXiv:1909.13654v1 fatcat:6w2ccglyanfmrohqler55k2pzu

A Framework for Compiler Driven Design Space Exploration for Embedded System Customization [chapter]

Krishna V. Palem, Lakshmi N. Chakrapani, Sudhakar Yalamanchili
2004 Lecture Notes in Computer Science  
Designing custom solutions has been central to meeting a range of stringent and specialized needs of embedded computing, along such dimensions as physical size, power consumption, and performance that  ...  For this trend to continue, we must find ways to overcome the twin hurdles of rising non-recurring engineering (NRE) costs and decreasing time-to-market windows by providing major improvements in designer  ...  The Space of Target Architectures The spectrum of target architectures for embedded systems range from fine grained field programmable gate arrays (FPGAs) [32] at one end, through coarse grained "sea  ... 
doi:10.1007/978-3-540-30502-6_29 fatcat:aeceqkskxngnjjlbp4n247mcyi

The Deep Learning Compiler: A Comprehensive Survey [article]

Mingzhen Li, Yi Liu, Xiaoyan Liu, Qingxiao Sun, Xin You, Hailong Yang, Zhongzhi Luan, Lin Gan, Guangwen Yang, Depei Qian
2020 arXiv   pre-print
This is the first survey paper focusing on the design architecture of DL compilers, which we hope can pave the road for future research towards DL compiler.  ...  In this paper, we perform a comprehensive survey of existing DL compilers by dissecting the commonly adopted design in details, with emphasis on the DL oriented multi-level IRs, and frontend/backend optimizations  ...  ACKNOWLEDGEMENTS The authors would like to thank Jun Yang from Alibaba, Yu Xing from Xilinx, and Byung Hoon Ahn from UCSD for their valuable comments and suggestions.  ... 
arXiv:2002.03794v4 fatcat:owj6qygxhrhxjag5ifam65vhja

Modeling, analysis and exploration of layers: A 3D computing architecture

Zoltan Endre Rakossy
2014 2014 22nd International Conference on Very Large Scale Integration (VLSI-SoC)  
Acknowledgements This thesis is the result of my work as research assistant at the Institute for Communication Technologies and Embedded Systems (ICE), Multiprocessor System-on-Chip Architectures (MPSoC  ...  ) group, at the RWTH Aachen University.  ...  Coarse-grained Reconfigurable Architectures like ADRES [109], RaPID [56], MorphoSys [151], RAW [156], Montium [144], IMEC coarse-grained accelerator [35] employ arrays of data word level reconfigurable  ... 
doi:10.1109/vlsi-soc.2014.7004167 dblp:conf/vlsi/Rakossy14 fatcat:6h3w3hfgdjeytitl7wnwrard34

DynaMOS: Dynamic Schedule Migration for Heterogeneous Cores

Shruti Padmanabha, Andrew Lukefahr, Reetuparna Das, Scott Mahlke
2015 Proceedings of the 48th International Symposium on Microarchitecture - MICRO-48  
On a system capable of switching between big and little cores rapidly with low overheads, DynaMOS schedules 38% of execution on the little on average, increasing utilization of the energy-efficient core  ...  In the context of a fine-grained heterogeneous multicore system composed of a big (OoO) core and a little (InO) core, we could offload recurring issue schedules from the big to the little core, to achieve  ...  Figure 14 illustrates the effects on little's utilization and performance as the STC size varies.  ... 
doi:10.1145/2830772.2830791 dblp:conf/micro/PadmanabhaLDM15 fatcat:ar52zdu34bcdbemkdesl46m6lm

Taurus: An Intelligent Data Plane [article]

Tushar Swamy, Alexander Rucker, Muhammad Shahbaz, Kunle Olukotun
2020 arXiv   pre-print
On the long road to self-driving networks, Taurus is the equivalent of adaptive cruise control: deterministic rules steer flows, while machine learning tunes performance and heightens security.  ...  Taurus adds custom hardware based on a map-reduce abstraction to programmable network devices, such as switches and NICs; this new hardware uses pipelined and SIMD parallelism for fast inference.  ...  This also allows coarse-grained pipelining, where CUs perform operations and MUs act as pipeline registers.  ... 
arXiv:2002.08987v1 fatcat:6hxsnoqxxnglvewm56zl7uzine

Tackling Exascale Software Challenges in Molecular Dynamics Simulations with GROMACS [chapter]

Szilárd Páll, Mark James Abraham, Carsten Kutzner, Berk Hess, Erik Lindahl
2015 Lecture Notes in Computer Science  
Here, we describe some of the ways we have been able to realize this through the use of parallelization on all levels, combined with a constant focus on absolute performance.  ...  we see for exascale simulation - in particular a very fine-grained task parallelism.  ...  Computational resources were provided by the Swedish National Infrastructure for computing (grants SNIC 025/12-32 & 2013-26/24) and the Leibniz Supercomputing Center.  ... 
doi:10.1007/978-3-319-15976-8_1 fatcat:rnekhgugmfe6zgpgwsybw4ywey

On Instruction-Level Method for Reducing Cache Penalties in Embedded VLIW Processors

Samir Ammenouche, Sid-Ahmed-Ali Touati, William Jalby
2009 2009 11th IEEE International Conference on High Performance Computing and Communications  
This article presents a back-end code optimisation for tolerating non-blocking cache effects at the instruction level (not at the loop level).  ...  Consequently, the code optimisations methods must be simple and take care of code size.  ...  Acknowledgements This research result has been supported by the ANR MOPUCE project (number 05-JCJC-0039) and the French Ministry of Industry.  ... 
doi:10.1109/hpcc.2009.32 dblp:conf/hpcc/AmmenoucheTJ09 fatcat:5swqekbrajdoffnny5fb75anke

Power consumption models for the use of dynamic and partial reconfiguration

R. Bonamy, S. Bilavarn, D. Chillet, O. Sentieys
2014 Microprocessors and microsystems  
Dynamic and Partial Reconfiguration (DPR) opens up promising prospects with the ability to reduce jointly performance and area of compute-intensive functions.  ...  Additionally, we illustrate the exploitation of these models to improve the analysis of DPR energy benefits in a realistic application example.  ...  Coarse Grained DPR Model Exploration results using the coarse grained model for the first execution of a hyper-period of the H264 decoder highlighted a best energy and performance DPR solution whose details  ... 
doi:10.1016/j.micpro.2014.01.002 fatcat:vibovaughrhrhow7a553gvtnfe

Measurement and analysis of GPU-accelerated applications with HPCToolkit

Keren Zhou, Laksono Adhianto, Jonathon Anderson, Aaron Cherian, Dejan Grubisic, Mark Krentel, Yumeng Liu, Xiaozhu Meng, John Mellor-Crummey
2021 Parallel Computing  
To address the challenge of performance analysis on the US DOE's forthcoming exascale supercomputers, Rice University has been extending its HPCToolkit performance tools to support measurement and analysis  ...  We illustrate HPCToolkit's new capabilities for analyzing GPU-accelerated applications with several codes developed as part of the Exascale Computing Project.  ...  NVIDIA's CUPTI [35] supports both coarse-grained and fine-grained measurements for CUDA programs. AMD's ROCTracer [36] only supports coarse-grained measurements for HIP programs.  ... 
doi:10.1016/j.parco.2021.102837 fatcat:2lvlyngsmraofjkrfahwvl4cja

D9.2.2: Final Software Evaluation Report

Jose Carlos, Guillaume Colin de Verdière, Matthieu Hautreux, Giannis Koutsou
2012 Zenodo  
This deliverable reports on the latest software developments in high performance computing, as identified by the PRACE-1IP, WP9 members.  ...  With a view towards Exascale computing, we will present our results and findings for each of these topics, based on which we will conclude with a set of recommendations.  ...  A coarse-grained approach would involve taskifying an application's routines, and a finer-grained approach might be based on taskifying loops.  ... 
doi:10.5281/zenodo.6553027 fatcat:6vbrtqizm5eutmmskf44eltoqq