An initial investigation of the performance of GPU-based swept time-space decomposition [article]

Daniel Magee, Kyle E Niemeyer
2017 arXiv   pre-print
The GPU implementation of swept time-space decomposition presented here mitigates this dilemma by using private (shared) memory, avoiding internode communication, and overwriting unnecessary values.  ...  It shows significant improvement in the execution time of the PDE solvers in one dimension achieving speedups of 6-2x for large and small problem sizes respectively compared to naive GPU versions and 7  ...  We also gratefully acknowledge the support of NVIDIA Corporation with the donation of the Tesla K40 GPU used for this research.  ... 
arXiv:1612.02495v2 fatcat:szemkhqylrgqxivgt2hudj7age

Applying the swept rule for solving explicit partial differential equations on heterogeneous computing systems [article]

Daniel J. Magee, Anthony S. Walker, Kyle E. Niemeyer
2020 arXiv   pre-print
The swept time-space decomposition rule is a communication-avoiding technique for time-stepping stencil update formulas that attempts to reduce latency costs.  ...  We compare our approach to a naive decomposition scheme with two test equations using an MPI+CUDA pattern on 40 processes over two nodes containing one GPU.  ...  Availability of material: The software package hSweep v2.0 used to perform this study is available openly [29]; the most recent version can be found at its GitHub repository shared under an MIT License  ... 
arXiv:1811.08282v2 fatcat:pcdxjrayjbfvtbfqtae7f3yany

Accelerating solutions of one-dimensional unsteady PDEs with GPU-based swept time–space decomposition

Daniel J. Magee, Kyle E. Niemeyer
2018 Journal of Computational Physics  
The swept time-space decomposition rule reduces communication between sub-domains by exhausting the domain of influence before communicating boundary values.  ...  Here we present a GPU implementation of the swept rule, which modifies the algorithm for improved performance on this processing architecture by prioritizing use of private (shared) memory, avoiding interblock  ...  We also gratefully acknowledge the support of NVIDIA Corporation with the donation of the Tesla K40 GPU used for this research.  ... 
doi:10.1016/ fatcat:h2mj5hz3kfgmhghf6tqvrrnyqa

The swept rule for breaking the latency barrier in time advancing PDEs

Maitham Alhubail, Qiqi Wang
2016 Journal of Computational Physics  
This article investigates the swept rule of space-time domain decomposition, an idea to break the latency barrier via communicating less often when explicitly solving time-dependent PDEs.  ...  The swept rule decomposes space and time among computing nodes in ways that exploit the domains of influence and the domain of dependency, making it possible to communicate once per many timesteps without  ...  Acknowledgment We acknowledge the Advanced Research Center at Saudi Aramco for spon-  ... 
doi:10.1016/ fatcat:nrncfg5kl5b63kjcq4nejgugdi

Applying the Swept Rule for Solving Two-Dimensional Partial Differential Equations on Heterogeneous Architectures

Anthony S. Walker, Kyle E. Niemeyer
2021 Mathematical and Computational Applications  
The partial differential equations describing compressible fluid flows can be notoriously difficult to resolve on a pragmatic scale and often require the use of high-performance computing systems and/or  ...  The swept rule is a technique designed to minimize these costs by obtaining a solution to unsteady equations at as many possible spatial locations and times prior to communicating.  ...  Acknowledgments: We gratefully acknowledge the support of NVIDIA Corporation, who donated a Tesla K40c GPU used in developing this research.  ... 
doi:10.3390/mca26030052 fatcat:jg6uezmgvzekvf6tzne3au7u5u

Performance evaluation of CUDA programming for 5-axis machining multi-scale simulation

Felix Abecassis, Sylvain Lavernhe, Christophe Tournier, Pierre-Alain Boucard
2015 Computers in industry (Print)  
Several strategies for parallel computing are investigated and compared to single-threaded and multi-threaded CPU, relative to the complexity of the simulation.  ...  Thus, the aim of this paper is to evaluate Nvidia CUDA architecture to speed up Z-buffer or N-buffer machining simulations.  ...  They did not investigate the traditional approach based on Voxelmap or Z-buffer but a novel approach based on swept volume, discretization of the tool flutes and a polygonal approximation of the workpiece  ... 
doi:10.1016/j.compind.2015.02.007 fatcat:sbdvs73opre7plzysny6hgwj2a

Hybrid Parallelism for Volume Rendering on Large-, Multi-, and Many-Core Systems

M. Howison, E. W. Bethel, H. Childs
2012 IEEE Transactions on Visualization and Computer Graphics  
among the cores within each node ensures that each node performs its portion of the larger calculation as efficiently as possible.  ...  We show that reducing the number of participants with a hybrid approach significantly improves performance. Index Terms: Volume visualization, parallel processing.  ...  The authors acknowledge the Texas Advanced Computing Center (TACC) at The University of Texas at Austin for providing the GPU cluster Longhorn that contributed to the research results reported within this  ... 
doi:10.1109/tvcg.2011.24 pmid:21282855 fatcat:iosltusqizf6zjrrkxirthttsm

Many-Core Acceleration of a Discrete Ordinates Transport Mini-App at Extreme Scale [chapter]

Tom Deakin, Simon McIntosh-Smith, Wayne Gaudin
2016 Lecture Notes in Computer Science  
We validate our results against an improved performance model which predicts the runtime of the main 'sweep' routine when running on different hardware, including CPUs or GPUs.  ...  In this paper we extend our work to large problems and demonstrate the scalability of our solution on two Petascale GPU-based supercomputers: Titan at Oak Ridge and Piz Daint at CSCS.  ...  We used the KBA algorithm for spatial decomposition and saturated the GPU devices with work by solving all angles and energy groups for all cells in the local wavefront.  ... 
doi:10.1007/978-3-319-41321-1_22 fatcat:4tyord5dg5gjxinflnekgfxuxi

A parallel scheme for accelerating parameter sweep applications on a GPU

Fumihiko Ino, Kentaro Shigeoka, Tomohiro Okuyama, Masaya Motokubota, Kenichi Hagihara
2013 Concurrency and Computation  
To the best of our knowledge, our study is the first that tackles the issue of irregular memory references by an appropriate organization of computational tasks.  ...  In several experiments, we applied our scheme to practical applications, and found that our scheme can perform up to 8.5 times faster than a naive scheme that processes a single parameter at a time.  ...  Acknowledgements: This study was partially supported by JSPS KAKENHI Grants 23300007 and 23700057 and by the JST CREST program "An Evolutionary Approach  ... 
doi:10.1002/cpe.3016 fatcat:tylepckvmnhublg5dhtr5ae4zi

Graphics-Processor-Unit-Based Parallelization of Optimized Baseline Wander Filtering Algorithms for Long-Term Electrocardiography

Thomas Niederhauser, Thomas Wyss-Balmer, Andreas Haeberlin, Thanks Marisa, Reto A. Wildhaber, Josef Goette, Marcel Jacomet, Rolf Vogel
2015 IEEE Transactions on Biomedical Engineering  
However, the parallelized wavelet filter is processed 500 and four times faster than these two algorithms on the GPU, respectively, and offers superior baseline wander suppression in low SBR situations  ...  Here, we present a graphics processor unit (GPU)-based parallelization method to speed up offline baseline wander filter algorithms, namely the wavelet, finite, and infinite impulse response, moving mean  ...  Acknowledgment: The authors would like to thank all ARTORG coworkers involved in the development of the ECG software.  ... 
doi:10.1109/tbme.2015.2395456 pmid:25675449 fatcat:apedrbt7m5ftvouahleapbbefi

High accuracy NC milling simulation using composite adaptively sampled distance fields

Alan Sullivan, Huseyin Erdim, Ronald N. Perry, Sarah F. Frisken
2012 Computer-Aided Design  
The computation of distance field of the swept volume of a milling tool is handled by an inverted trajectory approach where the problem is solved in tool coordinate frame instead of a world coordinate  ...  the original workpiece volume and distance fields representing the volumes of the milling tool swept along the prescribed milling path.  ...  As a result, these decompositions consume a considerable amount of memory and time.  ... 
doi:10.1016/j.cad.2012.02.002 fatcat:mqahu6b2afgd7ongovaihce3oa

Cheetah: Optimizing and Accelerating Homomorphic Encryption for Private Inference [article]

Brandon Reagen, Wooseok Choi, Yeongil Ko, Vincent Lee, Gu-Yeon Wei, Hsien-Hsin S. Lee, David Brooks
2020 arXiv   pre-print
To bridge the remaining performance gap, Cheetah further proposes an accelerator architecture that, when combined with the algorithmic optimizations, approaches plaintext DNN inference speeds.  ...  As the application of deep learning continues to grow, so does the amount of data used to make predictions.  ...  A deeper investigation reveals that most of the time in "Other" functions is construction and destruction overhead.  ... 
arXiv:2006.00505v2 fatcat:egwg4lrzdjbdtkr73znlhocn3e

A GPU-Accelerated Barycentric Lagrange Treecode [article]

Nathan Vaughn, Leighton Wilson, Robert Krasny
2020 arXiv   pre-print
We present an MPI + OpenACC implementation of the kernel-independent barycentric Lagrange treecode (BLTC) for fast summation of particle interactions on GPUs.  ...  of the barycentric particle-cluster approximation provides an inner level of parallelism.  ...  also explore GPU acceleration of barycentric cluster-particle and cluster-cluster treecodes [30]-[32].  ... 
arXiv:2003.01836v2 fatcat:4i4jqow3bjay5jri2gewp24qdq

AbacusSummit: A Massive Set of High-Accuracy, High-Resolution N-Body Simulations

Nina A Maksimova, Lehman H Garrison, Daniel J Eisenstein, Boryana Hadzhiyska, Sownak Bose, Thomas P Satterthwaite
2021 Monthly notices of the Royal Astronomical Society  
We present the public data release of the AbacusSummit cosmological N-body simulation suite, produced with the Abacus N-body code on the Summit supercomputer of the Oak Ridge Leadership Computing Facility  ...  second per node at late times.  ...  for their highly responsive and expert assistance, both scientific and administrative, during the course of this project.  ... 
doi:10.1093/mnras/stab2484 fatcat:wzwwordkgverfptbyrhq7k56zm

Accelerating Numerical Dense Linear Algebra Calculations with GPUs [chapter]

Jack Dongarra, Mark Gates, Azzam Haidar, Jakub Kurzak, Piotr Luszczek, Stanimire Tomov, Ichitaro Yamazaki
2014 Numerical Computations with GPUs  
The implementations are available through the MAGMA library, a redesign for GPUs of the popular LAPACK.  ...  To generate the extreme level of parallelism needed for the efficient use of GPUs, algorithms of interest are redesigned and then split into well-chosen computational tasks.  ...  Introduction: Enabling large scale use of GPU-based architectures for high performance computational science depends on the successful development of fundamental numerical libraries for GPUs.  ... 
doi:10.1007/978-3-319-06548-9_1 fatcat:43umukcpobgbthrpwiww662f7m