Loop Tiling in Large-Scale Stencil Codes at Run-Time with OPS

Istvan Z. Reguly, Gihan R. Mudalige, Michael B. Giles
2018 IEEE Transactions on Parallel and Distributed Systems  
The approach is generally applicable to any stencil DSL that provides per loop data access information.  ...  We evaluate our approach on a number of applications, observing speedups of 2× on the Cloverleaf 2D/3D proxy application, which contain 83/141 loops respectively, 3.5× on the linear solver TeaLeaf, and  ...  We acknowledge PRACE for awarding us access to resource Marconi based in Italy at Cineca.  ... 
doi:10.1109/tpds.2017.2778161 fatcat:cjoehtigwra6nm2qk4sznctq7i

Source-to-Source Automatic Differentiation of OpenMP Parallel Loops [article]

Jan Hückelheim, Laurent Hascoët
2021 arXiv   pre-print
The computational cost to compute gradients is a common bottleneck in practice.  ...  For applications that are parallelized for multicore CPUs or GPUs using OpenMP, one also wishes to compute the gradients in parallel.  ...  Stencil kernels are characterized by a loop that in each iteration updates indices in an output array based on neighboring indices in an input array, with a simple relationship between loop counter, input  ... 
arXiv:2111.01861v1 fatcat:m2xqlqcx4zblfobijnyoo4vmge

Loop coarsening in C-based High-Level Synthesis

Moritz Schmid, Oliver Reiche, Frank Hannig, Jurgen Teich
2015 2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP)  
Conversely, loop coarsening allows multiple pixels to be processed in parallel, whereby only the kernel operator is replicated within a single accelerator.  ...  In addition to well-known loop tiling techniques, we propose loop coarsening, which delivers superior performance and scalability.  ...  On the RT level, our architecture for loop coarsening can be compared to Schmidt's VHDL templates for stencil computations [21].  ... 
doi:10.1109/asap.2015.7245730 dblp:conf/asap/SchmidRHT15 fatcat:3qglqzjkczh6phu76qonr2mbxm

A scalable design approach for stencil computation on reconfigurable clusters

Xinyu Niu, Jose G. F. Coutinho, Wayne Luk
2013 2013 23rd International Conference on Field programmable Logic and Applications  
This paper proposes a scalable communication model to schedule communication operations based on available resources and algorithm properties.  ...  Stencil-based algorithms are known to be computationally intensive and used in many scientific applications.  ...  The proposed approach presented in this paper focuses on the following challenges: how to utilise available resources in each FPGA, and how to ensure linear scalability when multiple FPGAs are involved  ... 
doi:10.1109/fpl.2013.6645551 dblp:conf/fpl/NiuCL13 fatcat:p74k2r72g5duzdjwqknnqnq6ue

Automatic loop kernel analysis and performance modeling with Kerncraft

Julian Hammer, Georg Hager, Jan Eitzinger, Gerhard Wellein
2015 Proceedings of the 6th International Workshop on Performance Modeling, Benchmarking, and Simulation of High Performance Computing Systems - PMBS '15  
We present the "Kerncraft" tool, which eases the construction of analytic performance models for streaming kernels and stencil loop nests.  ...  We describe the operating principles of Kerncraft with its capabilities and limitations, and we show how it may be used to quickly gain insights by accelerated analytic modeling.  ...  Several microbenchmarks are available to provide a "closest match" to the actual loop code: e.g., if one read stream, one write stream, and one write-allocate stream hit a certain memory level, the measured  ... 
doi:10.1145/2832087.2832092 dblp:conf/sc/HammerHEW15 fatcat:vsyvvnthjne27di4tzt4qkcb6m

Reducing the burden of parallel loop schedulers for many‐core processors

Mahwish Arif, Hans Vandierendonck
2021 Concurrency and Computation  
This relates to the loop iteration count as well as the amount of work performed per iteration.  ...  As core counts in processors increase, it becomes harder to schedule and distribute work in a timely and scalable manner.  ...  The OpenMP language [10] provides a directive-based approach to parallel programming. We focus on parallel loops with reductions.  ... 
doi:10.1002/cpe.6241 fatcat:4rluruunxjb4dehant4kjl354e

Evaluating reconfigurable dataflow computing using the Himeno benchmark

Yukinori Sato, Yasushi Inoguchi, Wayne Luk, Tadao Nakamura
2012 2012 International Conference on Reconfigurable Computing and FPGAs  
Heterogeneous computing using FPGA accelerators is a promising approach to boost the performance of application programs within a given power consumption.  ...  This paper focuses on optimizations targeting an FPGA-based reconfigurable dataflow computing platform, and shows how they benefit an application.  ...  Sano et al. implement a 2D stencil computation on FPGA arrays [10]. While this can achieve 260 GFLOPS with their scalable streaming array, they use 9 FPGA boards to implement it.  ... 
doi:10.1109/reconfig.2012.6416746 dblp:conf/reconfig/SatoILN12 fatcat:3faktxw77jehbja6gwhoordz64

Enabling OpenMP Task Parallelism on Multi-FPGAs [article]

R. Nepomuceno, R. Sterle, G. Valarini, M. Pereira, H. Yviquel, G. Araujo
2021 arXiv   pre-print
Experimental results for a set of OpenMP stencil applications running on a Multi-FPGA platform consisting of 6 Xilinx VC709 boards interconnected through fiber-optic links have shown close to linear speedups  ...  FPGA-based hardware accelerators have received increasing attention mainly due to their ability to accelerate deep pipelined applications, thus resulting in higher computational performance and energy  ...  Section IV shows how this architecture can be used to design a scalable stencil pipeline application. Section V describes the experimental setup and analyzes their results.  ... 
arXiv:2103.10573v2 fatcat:qjfa6bwkszfphmdy3mghbozjcy

One size does not fit all: Implementation trade-offs for iterative stencil computations on FPGAs

Gael Deest, Tomofumi Yuki, Sanjay Rajopadhye, Steven Derrien
2017 2017 27th International Conference on Field Programmable Logic and Applications (FPL)  
We generate a family of FPGA stencil accelerators targeting emerging System on Chip platforms (e.g., Xilinx Zynq or Intel SoC). Our designs come with design knobs to explore trade-offs.  ...  One size does not fit all: Implementation trade-offs for iterative stencil computations on FPGAs.  ...  TABLE I: Qualitative analysis of existing approaches to accelerating stencils on FPGAs.  ... 
doi:10.23919/fpl.2017.8056781 dblp:conf/fpl/DeestYRD17 fatcat:m3qyfahhrvakvnsimtk7f6xley

Compiler-based code generation and autotuning for geometric multigrid on GPU-accelerated supercomputers

Protonu Basu, Samuel Williams, Brian Van Straalen, Leonid Oliker, Phillip Colella, Mary Hall
2017 Parallel Computing  
In this paper, we explore the use of a compiler-based autotuning framework based on CUDA-CHiLL to deliver not only portability, but also performance portability across CPU- and GPU-accelerated platforms  ...  GPU-based architectures as well as for multiple stencil discretizations and smoothers.  ...  Compiler Optimizations, DSLs and Programming Models for stencils There have been many domain-specific approaches to optimizing stencils on GPU accelerators.  ... 
doi:10.1016/j.parco.2017.04.002 fatcat:cmhgr2cobzdzhkbb4h5ylk22hm

Patus for convenient high-performance stencils: Evaluation in earthquake simulations

Matthias Christen, Olaf Schenk, Yifeng Cui
2012 2012 International Conference for High Performance Computing, Networking, Storage and Analysis  
We evaluate the performance by focusing on a scalable discretization of the wave equation and testing complex simulation types of the AWP-ODC code to aim at excellent parallel efficiency, preparing for  ...  Its stencil specification language allows the programmer to express the computation in a concise way independently of hardware architecture-specific details.  ...  ACKNOWLEDGMENT We acknowledge the Swiss National Supercomputing Center and Intel's Parallel and Distributed Solutions Division for giving us access to their hardware and for their support.  ... 
doi:10.1109/sc.2012.95 dblp:conf/sc/ChristenSC12 fatcat:hw3iorrp5nbdjfducse6r6r5mm

Beyond 16GB: Out-of-Core Stencil Computations [article]

Istvan Z Reguly, Gihan R Mudalige, Michael B Giles
2017 arXiv   pre-print
Stencil computations are a key class of applications, widely used in the scientific computing community, and a class that has particularly benefited from performance improvements on architectures with  ...  Evaluating our work on Intel's Knights Landing Platform as well as NVIDIA P100 GPUs, we demonstrate that it is possible to solve 3 times larger problems than the on-chip memory size with at most 15% loss  ...  ACKNOWLEDGMENTS The authors would like to thank IBM (József Surányi, Michal Iwanski) for access to a Minsky system, as well as Nikolay Sakharnykh at NVIDIA for the help with unified memory performance. This  ... 
arXiv:1709.02125v2 fatcat:tsdxnsxznfai5ooi5we5ujfbcu

Beyond 16GB

István Z. Reguly, Gihan R. Mudalige, Michael B. Giles
2017 Proceedings of the Workshop on Memory Centric Programming for HPC - MCHPC'17  
Stencil computations are a key class of applications, widely used in the scientific computing community, and a class that has particularly benefited from performance improvements on architectures with high  ...  Evaluating our work on Intel's Knights Landing Platform as well as NVIDIA P100 GPUs, we demonstrate that it is possible to solve 3 times larger problems than the on-chip memory size with at most 15% loss  ...  ACKNOWLEDGMENTS The authors would like to thank IBM (József Surányi, Michal Iwanski) for access to a Minsky system, as well as Nikolay Sakharnykh at NVIDIA for the help with unified memory performance. This  ... 
doi:10.1145/3145617.3145619 dblp:conf/sc/RegulyMG17 fatcat:vi24kzcc5zdhxdrwvpi6bcc6r4

High-Level FPGA Accelerator Design for Structured-Mesh-Based Explicit Numerical Solvers

Kamalavasan Kamalakkannan, Gihan R. Mudalige, Istvan Z. Reguly, Suhaib A. Fahmy
2021 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS)  
This paper presents a workflow for synthesizing near-optimal FPGA implementations of structured-mesh based stencil applications for explicit solvers.  ...  We discuss determinants for a given stencil code to be amenable to FPGA implementation, providing insights into the feasibility and profitability of a design and its resulting performance.  ...  In contrast [20] develop a Scalable Streaming Array to implement stencil computations on multiple FPGAs, using a DSL, achieving reduced development time and near-peak performance.  ... 
doi:10.1109/ipdps49936.2021.00117 fatcat:l72imzolbresvdhtamaxaxhniu

Transformations of High-Level Synthesis Codes for High-Performance Computing [article]

Johannes de Fine Licht, Maciej Besta, Simon Meierhans, Torsten Hoefler
2020 arXiv   pre-print
We show how these can be used to efficiently exploit pipelining, on-chip distributed fast memory, and on-chip dataflow, allowing for massively parallel architectures.  ...  To alleviate this, we present a collection of optimizing transformations for HLS, targeting scalable and efficient architectures for high-performance computing (HPC) applications.  ...  To see how streaming can be an important tool to express scalable hardware, we apply it in conjunction with vertical unrolling (Sec. 3.2) to implement an iterative version of the stencil example from Lst  ... 
arXiv:1805.08288v6 fatcat:rklumgxixbg2dfglgwcfrxd3se