676 Hits in 5.0 sec

GPU-Accelerated Asynchronous Error Correction for Mixed Precision Iterative Refinement [chapter]

Hartwig Anzt, Piotr Luszczek, Jack Dongarra, Vincent Heuveline
2012 Lecture Notes in Computer Science  
block-asynchronous iteration and an iterative refinement method using double precision for the error correction solver.  ...  Therefore, we implement a mixed precision iterative refinement algorithm using a block-asynchronous iteration as an error correction solver, and compare its performance with a pure implementation of a  ...  iteration in double precision, the iterative refinement in double precision and the mixed precision iterative refinement, whereas the latter ones use the block-asynchronous iteration as an error correction  ... 
doi:10.1007/978-3-642-32820-6_89 fatcat:qnghq4ayrba3vocjjmltplctzu
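A minimal sketch of the idea in the Anzt et al. snippet, assuming NumPy: the residual and the solution update stay in float64, while the error-correction solve runs in float32. Real block-asynchronous iteration lets GPU thread blocks update concurrently using whatever neighbour values happen to be available. Here that is only simulated sequentially, by relaxing blocks in a random order with the latest values. The function names, block count, sweep count, and tolerance are illustrative assumptions, not taken from the paper.

```python
import numpy as np


def block_async_relax_f32(A32, b32, n_blocks, sweeps, rng):
    """Approximately solve A32 @ c = b32 entirely in float32.

    Sequential stand-in for block-asynchronous relaxation (an assumption):
    in each sweep the blocks are visited in a random order, and each block
    performs a Jacobi update of its own rows using the most recent values
    of all other blocks.
    """
    n = b32.shape[0]
    c = np.zeros(n, dtype=np.float32)
    blocks = np.array_split(np.arange(n), n_blocks)
    diag = np.diag(A32)
    for _ in range(sweeps):
        for k in rng.permutation(n_blocks):
            idx = blocks[k]
            local_res = b32[idx] - A32[idx] @ c
            c[idx] += local_res / diag[idx]
    return c


def mixed_precision_ir(A, b, tol=1e-12, max_outer=50, sweeps=5, n_blocks=4, seed=0):
    """Iterative refinement: float64 residuals, float32 error correction."""
    rng = np.random.default_rng(seed)
    A32 = A.astype(np.float32)
    x = np.zeros_like(b, dtype=np.float64)
    b_norm = np.linalg.norm(b)
    for it in range(max_outer):
        r = b - A @ x                      # residual in double precision
        if np.linalg.norm(r) <= tol * b_norm:
            return x, it
        c = block_async_relax_f32(A32, r.astype(np.float32), n_blocks, sweeps, rng)
        x += c.astype(np.float64)          # correction applied in double precision
    return x, max_outer
```

The sketch assumes A is strictly diagonally dominant, a standard sufficient condition for Jacobi-type (including asynchronous) relaxation to converge. Because the residual is recomputed in float64, the final accuracy is set by the double-precision residual, not by the float32 inner solver.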

Multi GPU Performance of Conjugate Gradient Solver with Staggered Fermions in Mixed Precision [article]

Yong-Chull Jang, Hyung-Jin Kim, Weonjong Lee
2011 arXiv   pre-print
We have implemented a mixed precision algorithm in our multi-GPU conjugate gradient solver.  ...  The overall performance of our CUDA code for CG is 145 gigaflops per GPU (GTX480), which does not include the InfiniBand network communication.  ...  To improve this performance of the CG program, the mixed precision method has been used. Mixed precision is implemented by an iterative refinement algorithm [2].  ... 
arXiv:1111.0125v1 fatcat:ggjar3twx5e23c7n26d6k25mue
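The Jang et al. snippet says mixed precision is implemented by wrapping CG in iterative refinement. Below is a minimal single-node sketch of that structure, assuming NumPy and a dense symmetric positive definite matrix in place of the lattice operator. Multi-GPU distribution and the staggered-fermion operator are not modeled. The outer loop computes residuals in float64, and each correction comes from a float32 CG run to a loose tolerance. The inner tolerance of 1e-4 and the function names are hypothetical choices.

```python
import numpy as np


def cg_f32(A32, b32, tol=1e-4, max_iter=500):
    """Conjugate gradient carried out in float32, stopped at a loose tolerance."""
    x = np.zeros_like(b32)
    r = b32.copy()
    p = r.copy()
    rr = r @ r
    stop = (tol * np.linalg.norm(b32)) ** 2
    for _ in range(max_iter):
        if rr <= stop:
            break
        Ap = A32 @ p
        alpha = rr / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rr_new = r @ r
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x


def mixed_precision_cg(A, b, tol=1e-12, max_outer=30):
    """Outer iterative refinement in float64 around an inner float32 CG."""
    A32 = A.astype(np.float32)
    x = np.zeros_like(b, dtype=np.float64)
    b_norm = np.linalg.norm(b)
    for k in range(max_outer):
        r = b - A @ x                       # double-precision residual
        if np.linalg.norm(r) <= tol * b_norm:
            return x, k
        x += cg_f32(A32, r.astype(np.float32)).astype(np.float64)
    return x, max_outer
```

Each outer step shrinks the residual by roughly the inner tolerance. The many cheap float32 matrix-vector products therefore do most of the work, and only a handful of float64 residuals are needed.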


Namjae Choi, Hansol Park, Han Gyu Lee, Seungug Jae, Sori Jeon, Han Gyu Joo, M. Margulis, P. Blaise
2021 EPJ Web of Conferences  
Several innovative cross section treatment methods were developed, a new axial transport solver was introduced for stabilizing the 2D/1D scheme, and substantial computational enhancements were achieved  ...  Mixed precision in CMFD power iteration is achieved by iterative refinement technique, as described in Algorithm 2, where the subscripts indicate the floating point precision in bytes.  ...  We also employ mixed precision techniques to exploit the gaming GPUs while preserving the accuracy.  ... 
doi:10.1051/epjconf/202124706033 fatcat:vlp5pyjsujeonasyac2zh3qe5u

Co-processor acceleration of an unmodified parallel solid mechanics code with FEASTGPU

Dominik Goddeke, Hilmar Wobker, Robert Strzodka, Jamaludin Mohd Yusof, Patrick McCormick, Stefan Turek
2009 International Journal of Computational Science and Engineering (IJCSE)  
We present accuracy experiments, a scalability test and acceleration results for different elastic objects under load.  ...  In particular, we demonstrate in detail that the single precision execution of the co-processor does not affect the final accuracy.  ...  Also thanks to NVIDIA and AMD for donating hardware that was used in developing the serial version of the GPU backend.  ... 
doi:10.1504/ijcse.2009.029162 fatcat:apmjkuwxc5bufjipmr2jwnpja4

Using GPUs to improve multigrid solver performance on a cluster

Dominik Goddeke, Robert Strzodka, Jamaludin Mohd Yusof, Patrick McCormick, Hilmar Wobker, Christian Becker, Stefan Turek
2008 International Journal of Computational Science and Engineering (IJCSE)  
We address the issue of limited precision on GPUs by applying a mixed precision, iterative refinement technique. Other restrictions are also handled by a close interplay between the GPU and CPU.  ...  From a software perspective, we integrate the GPU solvers into the existing MPI-based Finite Element package by implementing the same interfaces as the CPU solvers, so that for the application programmer  ...  The resulting technique of mixed precision iterative refinement was introduced as early as the 1960s [45].  ... 
doi:10.1504/ijcse.2008.021111 fatcat:dneg67kvhnc5hjjeocpaji3oxi

Tinker-HP : Accelerating Molecular Dynamics Simulations of Large Complex Systems with Advanced Point Dipole Polarizable Force Fields using GPUs and Multi-GPUs systems [article]

Olivier Adjoua, Louis Lagardère, Luc-Henri Jolly, Arnaud Durocher, Thibaut Very, Isabelle Dupays, Zhi Wang, Théo Jaffrelot Inizan, Frédéric Célerse, Pengyu Ren, Jay W. Ponder, Jean-Philip Piquemal
2021 arXiv   pre-print
..., 2018, 9, 956-972) to the use of Graphics Processing Unit (GPU) cards to accelerate molecular dynamics simulations using polarizable many-body force fields.  ...  The new high-performance module allows for an efficient use of single- and multi-GPU architectures ranging from research laboratories to modern supercomputer centers.  ...  This project was initiated in 2019 with a "Contrat de Progrès" grant from GENCI (France) in collaboration with HPE and NVIDIA to port Tinker-HP on the Jean Zay HPE SGI 8600 GPUs system (IDRIS supercomputer  ... 
arXiv:2011.01207v4 fatcat:be7wgtzaxbasjagguniha3l3hm

Linear algebra software for large-scale accelerated multicore computing

A. Abdelfattah, H. Anzt, J. Dongarra, M. Gates, A. Haidar, J. Kurzak, P. Luszczek, S. Tomov, I. Yamazaki, A. YarKhan
2016 Acta Numerica  
Finally, we emphasize the development of innovative linear algebra algorithms using three technologies – mixed precision arithmetic, batched operations, and asynchronous iterations – that are currently  ...  of high interest for accelerated multicore systems.  ...  Figure 9.1. Mixed precision, iterative refinement method from MAGMA 1.6.2 for the solution of dense linear systems on NVIDIA's TITAN X GPU.  ... 
doi:10.1017/s0962492916000015 fatcat:cwsstweghjaj7ff6fu62lmn6ce
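The MAGMA figure caption refers to mixed precision iterative refinement for dense linear systems. The classic scheme, which LAPACK also provides as the DSGESV driver, factors A once in single precision, which is the O(n^3) part. It then recovers double-precision accuracy with cheap O(n^2) residual and correction steps. Below is a minimal NumPy sketch of that scheme. The hand-written LU with partial pivoting keeps the factorization explicitly in float32. The helper names are hypothetical and are not MAGMA or LAPACK routines.

```python
import numpy as np


def lu_factor_f32(A):
    """LU factorization with partial pivoting (P A = L U), computed in float32."""
    LU = A.astype(np.float32)          # astype returns a float32 copy
    n = LU.shape[0]
    piv = np.arange(n)
    for k in range(n - 1):
        p = k + int(np.argmax(np.abs(LU[k:, k])))
        if p != k:                     # swap rows, including stored multipliers
            LU[[k, p]] = LU[[p, k]]
            piv[[k, p]] = piv[[p, k]]
        LU[k + 1:, k] /= LU[k, k]
        LU[k + 1:, k + 1:] -= np.outer(LU[k + 1:, k], LU[k, k + 1:])
    return LU, piv


def lu_solve_f32(LU, piv, b):
    """Forward and back substitution with the float32 factors."""
    y = b[piv].astype(np.float32)
    n = LU.shape[0]
    for i in range(1, n):              # unit lower-triangular L
        y[i] -= LU[i, :i] @ y[:i]
    for i in range(n - 1, -1, -1):     # upper-triangular U
        y[i] = (y[i] - LU[i, i + 1:] @ y[i + 1:]) / LU[i, i]
    return y


def mixed_precision_solve(A, b, tol=1e-12, max_iter=30):
    """Factor once in float32, then refine with float64 residuals."""
    LU, piv = lu_factor_f32(A)
    x = lu_solve_f32(LU, piv, b).astype(np.float64)
    b_norm = np.linalg.norm(b)
    for k in range(max_iter):
        r = b - A @ x                  # O(n^2) residual in float64
        if np.linalg.norm(r) <= tol * b_norm:
            return x, k
        x += lu_solve_f32(LU, piv, r).astype(np.float64)
    return x, max_iter
```

The refinement converges when the condition number of A times float32 unit roundoff (about 6e-8) is well below 1. LAPACK's DSGESV falls back to a full double-precision factorization when refinement fails. This sketch simply stops after max_iter.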

Block-Relaxation Methods for 3D Constant-Coefficient Stencils on GPUs and Multicore CPUs [article]

Manuel Birke, Bobby Philip, Zhen Wang, Mark Berrill
2019 arXiv   pre-print
Block iterative methods are extremely important as smoothers for multigrid methods, as preconditioners for Krylov methods, and as solvers for diagonally dominant linear systems.  ...  Experimental results for NVIDIA Fermi GPUs and AMD multicore systems are presented.  ...  The authors would like to thank Rebecca Hartman-Baker from iVEC for the very useful additions and corrections to this paper, James Schwarzmeier from CRAY Inc. for providing access to the CRAY XE6 system  ... 
arXiv:1208.1975v3 fatcat:4irc6pc7xjhlridg5j7vslahrm
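Block relaxation methods generalize point relaxation. As a simpler point-wise relative, and not the block method of Birke et al., the following NumPy sketch applies weighted Jacobi sweeps to the 3D 7-point constant-coefficient stencil of the Poisson equation. It builds the update from array slices instead of an assembled matrix. The grid size, the smoothing weight omega = 2/3, and the fixed (Dirichlet) boundary layer are assumptions of the sketch.

```python
import numpy as np


def neighbour_sum(u):
    """Sum of the six face neighbours of every interior grid point."""
    return (u[:-2, 1:-1, 1:-1] + u[2:, 1:-1, 1:-1] +
            u[1:-1, :-2, 1:-1] + u[1:-1, 2:, 1:-1] +
            u[1:-1, 1:-1, :-2] + u[1:-1, 1:-1, 2:])


def residual_7pt(u, f, h):
    """r = f - A u with A u = (6 u_c - sum of neighbours) / h^2 on interior points."""
    r = np.zeros_like(u)
    r[1:-1, 1:-1, 1:-1] = f[1:-1, 1:-1, 1:-1] - (
        6.0 * u[1:-1, 1:-1, 1:-1] - neighbour_sum(u)) / (h * h)
    return r


def weighted_jacobi_7pt(u, f, h, omega=2.0 / 3.0):
    """One weighted-Jacobi sweep; the boundary layer of u is left unchanged."""
    u_new = u.copy()
    jacobi = (neighbour_sum(u) + h * h * f[1:-1, 1:-1, 1:-1]) / 6.0
    u_new[1:-1, 1:-1, 1:-1] = (1.0 - omega) * u[1:-1, 1:-1, 1:-1] + omega * jacobi
    return u_new
```

Because the coefficients are constant, nothing but the grid itself has to be stored, and each sweep is a pure streaming operation over shifted views. This is the property that makes such stencils a good fit for GPUs and multicore CPUs.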

JACC: An OpenACC Runtime Framework with Kernel-Level and Multi-GPU Parallelization [article]

Kazuaki Matsumura, Simon Garcia De Gonzalo, Antonio J. Peña
2021 arXiv   pre-print
Efforts on such models involve the least engineering cost for enabling computational acceleration on multiple architectures while programmers are only required to add meta information upon sequential code  ...  While adaptively using multi-GPUs, the resulting performance improvements amortize the latency of GPU-to-GPU communications.  ...  We would like to acknowledge the NVIDIA AI Technology Center (NVAITC) Europe for their valuable help.  ... 
arXiv:2110.14340v1 fatcat:acfa6g7xm5dyfajen7fqkn4yri

CanvoX: High-resolution VR Painting in Large Volumetric Canvas [article]

Yeojin Kim, Byungmoon Kim, Jiyang Kim, Young J. Kim
2017 arXiv   pre-print
On the CPU side, we design an efficient iterative algorithm to refine or coarsen the octree, as a result of volumetric painting strokes, at highly interactive rates, and update the corresponding GPU textures.  ...  Technically, our canvas is represented as an array of deep octrees of depth 24 or higher, built on CPU for volume painting and on GPU for volume rendering using accurate ray casting.  ...  When an artist paints over and over to correct or mix colors, there will be no color mixing effect and this only increases the stroke count which steadily degrades system performance.  ... 
arXiv:1704.02724v1 fatcat:nw5trw4b2beirfrslplwoegbhq

Parallel ptychographic reconstruction

Youssef S. G. Nashed, David J. Vine, Tom Peterka, Junjing Deng, Rob Ross, Chris Jacobsen
2014 Optics Express  
Here we present a parallel method for real-time ptychographic phase retrieval.  ...  Ptychography is an imaging method whereby a coherent beam is scanned across an object, and an image is obtained by iterative phasing of the set of diffraction patterns.  ...  We also would like to thank our reviewers for their constructive comments. The submitted manuscript has been created by UChicago Argonne, LLC, Operator of Argonne National Laboratory ("Argonne").  ... 
doi:10.1364/oe.22.032082 pmid:25607174 pmcid:PMC4317139 fatcat:nkm4znhipjhvnpqrxetkjle6yy

Strength Check of Aircraft Parts Based on Multi-GPU Clusters for Fast Calculation of Sparse Linear Equations

Yuhua Zhang, Binxing Hu
2020 IEEE Access  
For the problem of iterative solution of matrix preprocessing, two preprocessing strategies of matrix bandwidth reduction parallelization and incomplete Cholesky decomposition are proposed, and asynchronous  ...  Get the correction factor $\beta_j = \langle r_{j+1}, r_{j+1} \rangle / \langle r_j, r_j \rangle$. 5. Get the search direction vector for the next iteration, $p_{j+1} = z_{j+1} + \beta_j p_j$. 6. Obtain the correction factor $\alpha_j = \langle r_j, z_j \rangle / \langle p_j, A p_j \rangle$. 7.  ... 
doi:10.1109/access.2020.2991099 fatcat:3vps5yghm5hojeqrrjwbdefxnu
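The steps quoted in the Zhang and Hu snippet (correction factors α and β, search direction p, preconditioned residual z) belong to the preconditioned conjugate gradient method. Below is a minimal NumPy sketch of that method. It swaps in a simple Jacobi (diagonal) preconditioner instead of the incomplete Cholesky factorization the paper proposes. It also uses the textbook β_j = ⟨r_{j+1}, z_{j+1}⟩ / ⟨r_j, z_j⟩. The ⟨r, r⟩ ratio printed in the snippet coincides with this form only when the preconditioner is the identity. The function name and tolerances are hypothetical.

```python
import numpy as np


def pcg_jacobi(A, b, tol=1e-10, max_iter=500):
    """Preconditioned CG for SPD A with a Jacobi preconditioner z = D^{-1} r."""
    m_inv = 1.0 / np.diag(A)
    x = np.zeros_like(b, dtype=np.float64)
    r = b - A @ x
    z = m_inv * r
    p = z.copy()
    rz = r @ z
    b_norm = np.linalg.norm(b)
    for j in range(max_iter):
        if np.linalg.norm(r) <= tol * b_norm:
            return x, j
        Ap = A @ p
        alpha = rz / (p @ Ap)          # alpha_j = <r_j, z_j> / <p_j, A p_j>
        x += alpha * p
        r -= alpha * Ap
        z = m_inv * r                  # apply the preconditioner
        rz_new = r @ z
        beta = rz_new / rz             # beta_j = <r_{j+1}, z_{j+1}> / <r_j, z_j>
        p = z + beta * p               # p_{j+1} = z_{j+1} + beta_j p_j
        rz = rz_new
    return x, max_iter
```

An incomplete Cholesky preconditioner would replace the line computing z with two sparse triangular solves. Those solves are inherently sequential, which is one reason GPU implementations look for more parallel alternatives.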

Fine-tuning giant neural networks on commodity hardware with automatic pipeline model parallelism

Saar Eliad, Ido Hakimi, Alon De Jagger, Mark Silberstein, Assaf Schuster
2021 USENIX Annual Technical Conference  
Moreover, and perhaps surprisingly, when applied to asynchronous training, Mixed-pipe has negligible or no effect on the end-to-end accuracy of fine-tuning tasks despite the addition of pipeline stages  ...  We present FTPipe, a system that explores a new dimension of pipeline model parallelism, making multi-GPU execution of fine-tuning tasks for giant neural networks readily accessible on commodity hardware  ...  Acknowledgements We thank our shepherd Tim Harris for the insightful comments that helped us improve the paper.  ... 
dblp:conf/usenix/EliadHJSS21 fatcat:g4yxcepuwzcrtb3r45qngmaf5e

Parallelizing the QUDA Library for Multi-GPU Calculations in Lattice Quantum Chromodynamics

Ronald Babich, Michael A. Clark, Balint Joó
2010 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis  
The QUDA library provides a package of mixed precision sparse matrix linear solvers for LQCD applications, supporting single GPUs based on NVIDIA's Compute Unified Device Architecture (CUDA).  ...  We report on both weak and strong scaling for up to 32 GPUs interconnected by InfiniBand, on which we sustain in excess of 4 Tflops.  ...  ACKNOWLEDGMENT The authors would like to thank Chip Watson for funding an extremely productive week of coding, and for dedicated access to the Jefferson Lab 9g cluster.  ... 
doi:10.1109/sc.2010.40 dblp:conf/sc/BabichCJ10 fatcat:vov3qo4fevd6pg42pya5mqugoi

Resolution-matched shadow maps

Aaron E. Lefohn, Shubhabrata Sengupta, John D. Owens
2007 ACM Transactions on Graphics  
a correct solution.  ...  Our main contribution is the observation that it is more efficient to forgo the iterative refinement analysis in favor of generating all shadow texels requested by the pixels in the eye-space image.  ...  Note the refinement error in the lower-left corner of the iterative ASM refinement image (missed hair shadows highlighted by the orange box)(c).  ... 
doi:10.1145/1289603.1289611 fatcat:ryf233nu7ffbvh2rmw3e24sdta
Showing results 1-15 out of 676