
Proposed Consistent Exception Handling for the BLAS and LAPACK [article]

James Demmel, Jack Dongarra, Mark Gates, Greg Henry, Julien Langou, Xiaoye Li, Piotr Luszczek, Weslley Pereira, Jason Riedy, Cindy Rubio-González
2022 arXiv   pre-print
In this paper we explore the design space of consistent exception handling for the widely used BLAS and LAPACK linear algebra libraries, pointing out a variety of instances of inconsistent exception handling  ...  in the current versions, and propose a new design that balances consistency, complexity, ease of use, and performance.  ...  Acknowledgements This work was supported in part by the National Science Foundation under the project Basic ALgebra LIbraries for Sustainable Technology with Interdisciplinary Collaboration (BALLISTIC)  ... 
arXiv:2207.09281v1 fatcat:wvhjakxdtne2fpbnumhajxj724
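One concrete instance of the inconsistency this paper studies: the reference BLAS quick-returns when alpha = 0, so an exceptional value (NaN, Inf) in A never reaches the output; whether an optimized BLAS behaves the same way is implementation-dependent. A minimal sketch via SciPy's BLAS bindings:

```python
import numpy as np
from scipy.linalg.blas import dgemm

A = np.full((2, 2), np.nan)   # input carrying an "exception"
B = np.eye(2)
C = np.ones((2, 2))

# With alpha = 0 the reference dgemm never reads A, computing only
# C := beta*C, so the NaN is silently dropped from the result.
out = dgemm(0.0, A, B, beta=1.0, c=C)
print(out)   # typically all ones: the NaN never propagates
```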

Evaluation of two topology-aware heuristics on level- 3 BLAS library for multi-GPU platforms

Thierry Gautier, Joao V. F. Lima
2021 2021 SC Workshops Supplementary Proceedings (SCWS)  
The second is an optimistic heuristic that favors communication between devices. These have been implemented in the XKBLAS BLAS-3 library.  ...  Several software libraries have been developed for exploiting the performance of systems with accelerators, but the real performance may be far from the platform peak performance with multiple GPUs.  ...  Experiments on DGX-1 were carried out using the Grid'5000 testbed, supported by a scientific interest group hosted by Inria and including CNRS, RENATER and several Universities as well as other organizations  ... 
doi:10.1109/scws55283.2021.00013 fatcat:m7tfjpcrdzbujmyhjcd656wlne
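A hypothetical sketch of what a topology-aware placement heuristic looks like: given pairwise link bandwidths between GPUs and the devices currently holding a task's inputs, pick the device that minimizes estimated transfer time. The bandwidth table and function names are illustrative, not XKBLAS's API:

```python
# GB/s between device pairs on a 4-GPU node (e.g. NVLink vs. PCIe);
# illustrative numbers only.
BW = {(0, 1): 50.0, (0, 2): 12.0, (0, 3): 12.0,
      (1, 2): 12.0, (1, 3): 50.0, (2, 3): 50.0}

def link_bw(a, b):
    return float("inf") if a == b else BW[tuple(sorted((a, b)))]

def best_device(input_sizes_gb, input_homes, devices=range(4)):
    """Pick the device with the lowest estimated time to gather inputs."""
    def cost(dev):
        return sum(size / link_bw(home, dev)
                   for size, home in zip(input_sizes_gb, input_homes))
    return min(devices, key=cost)

# One 1 GB tile on GPU 0 and one on GPU 1: an NVLink-connected device wins.
print(best_device([1.0, 1.0], [0, 1]))
```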

Case studies on the development of ScaLAPACK and the NAG Numerical PVM Library [chapter]

J. J. Dongarra, S. Hammarling, A. Petitet
1997 IFIP Advances in Information and Communication Technology  
In this paper we look at the development of ScaLAPACK, a software library for dense and banded numerical linear algebra, and the NAG Numerical PVM Library, which includes software for dense and sparse  ...  The paper concentrates on the underlying design and the testing of the libraries.  ...  ACKNOWLEDGMENTS We wish to thank our ScaLAPACK colleagues at the University of Tennessee at Knoxville and the University of California at Berkeley, as well as colleagues at NAG involved with the development  ... 
doi:10.1007/978-1-5041-2940-4_18 fatcat:2wwpbtisdbgolkawxxwnbnpswq

BLASFEO: basic linear algebra subroutines for embedded optimization [article]

Gianluca Frison, Dimitris Kouzoupis, Tommaso Sartor, Andrea Zanelli, Moritz Diehl
2018 arXiv   pre-print
level 3 BLAS routines and 2-3 times faster than the corresponding LAPACK routines.  ...  and embeddability and optimized for very small matrices, and a wrapper to standard BLAS and LAPACK providing high performance on large matrices.  ...  Sections 4.6 and 4.7 apply the proposed implementation scheme to other level 3 BLAS and LAPACK routines with focus on small-scale performance.  ... 
arXiv:1704.02457v3 fatcat:yb3pfkvanvatdh3phhaq6pgfpy
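A key ingredient of BLASFEO's small-matrix performance is its panel-major storage: the matrix is cut into horizontal panels of a few rows, each stored contiguously column by column. A rough NumPy sketch of the packing; the real library does this in C with alignment and zero-padding, and ps = 4 is just an illustrative panel height:

```python
import numpy as np

def to_panel_major(A, ps=4):
    """Pack A into panel-major order: horizontal panels of `ps` rows,
    column-major inside each panel (a sketch of the BLASFEO layout)."""
    m, n = A.shape
    mp = -(-m // ps) * ps          # round rows up to a whole panel
    Ap = np.zeros((mp, n), dtype=A.dtype)
    Ap[:m] = A
    # Buffer order: panel 0 col 0, panel 0 col 1, ..., panel 1 col 0, ...
    return Ap.reshape(mp // ps, ps, n).swapaxes(1, 2).ravel()
```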

Achieving numerical accuracy and high performance using recursive tile LU factorization with partial pivoting

Jack Dongarra, Mathieu Faverge, Hatem Ltaief, Piotr Luszczek
2013 Concurrency and Computation  
In particular, it is up to 40% faster when compared to the equivalent Intel MKL routine and up to 3-fold faster than LAPACK with multithreaded Intel MKL BLAS.  ...  This paper proposes a novel approach for computing the LU factorization in parallel on multicore architectures, which not only improves the overall performance, but also sustains the numerical quality  ...  They implemented recursive versions of the main LAPACK and BLAS kernels involved in the factorization i.e., xGETRF and xGEMM, xTRSM, respectively.  ... 
doi:10.1002/cpe.3110 fatcat:pfmm2bmt5vcx5d6h35mtp4h3uu
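The xGETRF recursion the snippet mentions fits in a few lines: factor the left half of the columns recursively, carry the pivots across, solve a unit lower triangular system for U12 (xTRSM), update the trailing block (xGEMM), and recurse. A NumPy/SciPy sketch of that scheme, not the paper's tiled, runtime-scheduled implementation:

```python
import numpy as np
from scipy.linalg import solve_triangular

def rgetrf(A):
    """Recursive LU with partial pivoting, in place.
    Returns ipiv with ipiv[i] = row swapped with row i (0-based)."""
    m, n = A.shape
    if n == 1:
        p = int(np.argmax(np.abs(A[:, 0])))
        A[[0, p], 0] = A[[p, 0], 0]
        if A[0, 0] != 0.0:
            A[1:, 0] /= A[0, 0]
        return np.array([p])
    n1 = n // 2
    p1 = rgetrf(A[:, :n1])                    # factor left panel
    _swap_rows(A[:, n1:], p1)                 # carry pivots across
    A[:n1, n1:] = solve_triangular(           # U12 = L11^{-1} A12  (xTRSM)
        A[:n1, :n1], A[:n1, n1:], lower=True, unit_diagonal=True)
    A[n1:, n1:] -= A[n1:, :n1] @ A[:n1, n1:]  # Schur update        (xGEMM)
    p2 = rgetrf(A[n1:, n1:])                  # factor trailing block
    _swap_rows(A[n1:, :n1], p2)               # pivot L21 to match
    return np.concatenate([p1, p2 + n1])

def _swap_rows(B, piv):
    for i, p in enumerate(piv):
        if p != i:
            B[[i, p]] = B[[p, i]]

# Check: applying the row interchanges to A reproduces L @ U.
A = np.random.rand(6, 6)
LU = A.copy()
ipiv = rgetrf(LU)
L = np.tril(LU, -1) + np.eye(6)
U = np.triu(LU)
PA = A.copy()
_swap_rows(PA, ipiv)
assert np.allclose(PA, L @ U)
```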

OpenMP Issues Arising in the Development of Parallel BLAS and LAPACK Libraries

C. Addison, Y. Ren, M. van Waveren
2003 Scientific Programming  
Dense linear algebra libraries need to cope efficiently with a range of input problem sizes and shapes.  ...  The inherent flexible nature of shared memory paradigms such as OpenMP poses other difficulties when it becomes necessary to optimise performance across successive parallel library calls.  ...  Consistent with this pattern, Fujitsu recently released its first SMP version of the parallel BLAS and LAPACK libraries for its PRIMEPOWER series.  ... 
doi:10.1155/2003/278167 fatcat:lzv24cfln5gixmmxsrflbaskiq
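One practical face of the issue: thread fork/join and binding costs across successive parallel library calls can dominate for small inputs. A sketch using the third-party threadpoolctl package (an assumption on our part; not something the paper uses) to cap the BLAS thread pool over a sequence of small calls:

```python
import numpy as np
from threadpoolctl import threadpool_limits  # assumes threadpoolctl is installed

A = np.random.rand(64, 64)
B = np.random.rand(64, 64)

# For matrices this small, parallel overhead in the underlying BLAS can
# exceed the useful work, so run the whole call sequence single-threaded.
with threadpool_limits(limits=1, user_api="blas"):
    for _ in range(1000):
        A = A @ B
        A /= np.abs(A).max()   # rescale to keep values bounded
```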

The BLAS API of BLASFEO: optimizing performance for small matrices [article]

Gianluca Frison, Tommaso Sartor, Andrea Zanelli, Moritz Diehl
2020 arXiv   pre-print
This paper investigates the addition of a standard BLAS API to the BLASFEO framework, and proposes an implementation switching between two or more algorithms optimized for different matrix sizes.  ...  This BLAS API has lower performance than the BLASFEO API, but it nonetheless outperforms optimized BLAS and especially LAPACK libraries for matrices fitting in cache.  ...  Therefore, the proposed approach shows a good and consistent performance also in the case of rectangular matrices, and no inherent performance drawback is found.  ... 
arXiv:1902.08115v4 fatcat:vllv5smaqnhedaeqdjj7utmmaa
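The switching idea reduces to a few lines of dispatch logic: estimate the operands' memory footprint and route cache-resident problems to a small-matrix kernel, everything else to the standard BLAS. The threshold and kernel below are placeholders, not BLASFEO's actual values:

```python
import numpy as np
from scipy.linalg.blas import dgemm

L2_BYTES = 256 * 1024   # assumed cache budget; illustrative only

def gemm_dispatch(A, B):
    """Route by working-set size, as in the paper's algorithm switch."""
    footprint = 8 * (A.size + B.size + A.shape[0] * B.shape[1])
    if footprint <= L2_BYTES:
        return small_gemm(A, B)   # hypothetical cache-resident kernel
    return dgemm(1.0, A, B)       # vendor/standard BLAS path

def small_gemm(A, B):
    # Stand-in for a packed-format kernel; plain NumPy here.
    return A @ B
```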

Automatic translation of Fortran to JVM bytecode

Keith Seymour, Jack Dongarra
2001 Proceedings of the 2001 joint ACM-ISCOPE conference on Java Grande - JGI '01  
The goal of the translator is to generate Java implementations of legacy Fortran numerical codes in a consistent and reliable fashion. The benefits of directly generating bytecode are twofold.  ...  First, compared with generating Java source code, it provides a much more straightforward and efficient mechanism for translating Fortran GOTO statements.  ...  Correctness To date, the BLAS and LAPACK libraries have been the main testbed for f2j.  ... 
doi:10.1145/376656.376833 fatcat:exedzoxzzfhh7nqnikylysfmma

Compiler blockability of dense matrix factorizations

Steve Carr, R. B. Lehoucq
1997 ACM Transactions on Mathematical Software  
The goal of the LAPACK project is to provide efficient and portable software for dense numerical linear algebra computations.  ...  We believe that it is better for the programmer to express algorithms in a machine-independent form and allow the compiler to handle the machine-dependent details.  ...  We also thank Per Ling of the University of Umeå and Ken Stanley of the University of California Berkeley for their help with the benchmarks and discussions.  ... 
doi:10.1145/275323.275325 fatcat:7nvbukfapjeelbrg5dei6n63be
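The transformation in question is loop blocking (tiling). As a reference point, this is the blocked loop structure a compiler would have to derive from the naive triple loop of matrix multiply; the block size is illustrative:

```python
import numpy as np

def blocked_matmul(A, B, bs=64):
    """Cache-blocked C = A @ B: each bs-by-bs tile of A and B is reused
    while it is hot in cache (a sketch of the compiler's target form)."""
    m, k = A.shape
    k2, n = B.shape
    assert k == k2
    C = np.zeros((m, n))
    for ii in range(0, m, bs):
        for kk in range(0, k, bs):
            for jj in range(0, n, bs):
                C[ii:ii+bs, jj:jj+bs] += (
                    A[ii:ii+bs, kk:kk+bs] @ B[kk:kk+bs, jj:jj+bs])
    return C
```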

Automatic translation of Fortran to JVM bytecode

Keith Seymour, Jack Dongarra
2003 Concurrency and Computation  
The goal of the translator is to generate Java implementations of legacy Fortran numerical codes in a consistent and reliable fashion. The benefits of directly generating bytecode are twofold.  ...  First, compared with generating Java source code, it provides a much more straightforward and efficient mechanism for translating Fortran GOTO statements.  ...  Correctness To date, the BLAS and LAPACK libraries have been the main testbed for f2j.  ... 
doi:10.1002/cpe.657 fatcat:fmxsxrahwfgxnb52teisvzt6eq

An updated set of basic linear algebra subprograms (BLAS)

L. Susan Blackford, Antoine Petitet, Roldan Pozo, Karin Remington, R. Clint Whaley, James Demmel, Jack Dongarra, Iain Duff, Sven Hammarling, Greg Henry, Michael Heroux, Linda Kaufman (+1 others)
2002 ACM Transactions on Mathematical Software  
We would like to thank the members of the global community who have posted comments, suggestions, and proposals to the e-mail reflector and the BLAS Technical Forum webpage.  ...  We thank Paul McMahan of the University of Tennessee for preparing the commenting and voting pages on the BLAS Technical Forum webpage.  ...  ERROR HANDLING The BLAS Technical Forum standard supports two types of error-handling capabilities: an error handler, BLAS_ERROR, and error return codes.  ... 
doi:10.1145/567806.567807 fatcat:j7rpod7g2zagtjdnq4nnesgtsa
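The two capabilities named in the snippet, a replaceable error handler and error return codes, can be mocked up for illustration; this is a Python analogy of the idea, not the standard's actual Fortran/C binding:

```python
import numpy as np

def default_blas_error(routine, arg, msg):
    """Default handler: abort with a message, like the reference XERBLA."""
    raise ValueError(f"{routine}: parameter {arg}: {msg}")

blas_error = default_blas_error   # user-replaceable, akin to BLAS_ERROR

def gemv_handler_style(A, x):
    m, n = A.shape
    if x.shape[0] != n:
        blas_error("gemv", 2, "dimension mismatch")
    return A @ x

def gemv_return_code_style(A, x):
    """Return (status, result); status 0 means success."""
    m, n = A.shape
    if x.shape[0] != n:
        return 2, None
    return 0, A @ x
```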

PB-BLAS: a set of parallel block basic linear algebra subprograms

Jaeyoung Choi, Jack J. Dongarra, David W. Walker
1996 Concurrency Practice and Experience  
The PB-BLAS are the building blocks for implementing ScaLAPACK, the distributed-memory version of LAPACK, and provide the same ease-of-use and portability for ScaLAPACK that the BLAS provide for LAPACK  ...  The PB-BLAS consist of calls to the sequential BLAS for local computations, and calls to the BLACS for communication.  ... 
doi:10.1002/(sici)1096-9128(199609)8:7<517::aid-cpe226>3.0.co;2-w fatcat:tiugqu2zi5cfxa4in5tkj4sfm4

A Framework for Batched and GPU-Resident Factorization Algorithms Applied to Block Householder Transformations [chapter]

Azzam Haidar, Tingxing Tim Dong, Stanimire Tomov, Piotr Luszczek, Jack Dongarra
2015 Lecture Notes in Computer Science  
Our approach is based on representing the algorithms as a sequence of batched BLAS routines for GPU-only execution.  ...  We describe the development of the main one-sided factorizations that work for a set of small dense matrices in parallel, and we illustrate our techniques on the QR factorization based on Householder transformations  ...  Acknowledgments This material is based upon work supported by the National Science Foundation under Grant No. ACI-, the Department of Energy, and Intel.  ... 
doi:10.1007/978-3-319-20119-1_3 fatcat:hgupli7firaqhebgiuojf35xwq
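The kernel being batched is the Householder QR of one small matrix. A NumPy sketch of that kernel plus a plain batch loop; the paper's contribution, expressing this pattern as GPU-resident batched BLAS calls, is not attempted here:

```python
import numpy as np

def householder_qr(A):
    """QR of a small dense matrix via Householder reflections (sketch)."""
    m, n = A.shape
    R = A.copy()
    Q = np.eye(m)
    for k in range(min(m - 1, n)):
        v = R[k:, k].copy()
        v[0] += np.copysign(np.linalg.norm(v), v[0])
        nv = np.linalg.norm(v)
        if nv == 0.0:
            continue
        v /= nv
        R[k:, k:] -= 2.0 * np.outer(v, v @ R[k:, k:])   # apply H on left
        Q[:, k:] -= 2.0 * np.outer(Q[:, k:] @ v, v)     # accumulate Q
    return Q, R

# The batched pattern: one fixed-size kernel over many small problems.
batch = np.random.rand(1000, 32, 16)
Q0, R0 = householder_qr(batch[0])
assert np.allclose(Q0 @ R0, batch[0])
```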

A Parallel Multi-threaded Solver for Symmetric Positive Definite Bordered-Band Linear Systems [chapter]

Peter Benner, Pablo Ezzatti, Enrique S. Quintana-Ortí, Alfredo Remón
2016 Lecture Notes in Computer Science  
allows casting the bulk of the computations in terms of efficient kernels from the Level-3 and Level-2 BLAS.  ...  The algorithms that implement this approach heavily rely on a compact storage format, tailored to this type of matrix, that reduces the memory requirements, produces a regular data access pattern, and  ...  The researcher from Universidad Jaime I was supported by the CICYT project TIN2011-23283 of the Ministerio de Economía y Competitividad and FEDER.  ... 
doi:10.1007/978-3-319-32149-3_10 fatcat:p5cfp6vvhveozd7p7hkdduooky
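For the band part of such systems, the compact storage in question is essentially the standard LAPACK band layout, which SciPy exposes directly. A sketch that packs a small SPD band matrix and solves it with a Cholesky-based banded routine (sizes illustrative):

```python
import numpy as np
from scipy.linalg import solveh_banded

n, kd = 8, 2                      # order and bandwidth
A = np.diag(4.0 * np.ones(n))     # diagonally dominant, hence SPD
for d in range(1, kd + 1):
    A += np.diag(-np.ones(n - d), d) + np.diag(-np.ones(n - d), -d)

# Upper band storage: ab[kd + i - j, j] = A[i, j] for j-kd <= i <= j.
ab = np.zeros((kd + 1, n))
for j in range(n):
    for i in range(max(0, j - kd), j + 1):
        ab[kd + i - j, j] = A[i, j]

b = np.ones(n)
x = solveh_banded(ab, b)          # O(n * kd^2) work instead of O(n^3)
assert np.allclose(A @ x, b)
```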

Two-Stage Tridiagonal Reduction for Dense Symmetric Matrices Using Tile Algorithms on Multicore Architectures

Piotr Luszczek, Hatem Ltaief, Jack Dongarra
2011 2011 IEEE International Parallel & Distributed Processing Symposium  
The obtained tile tridiagonal reduction significantly outperforms the state-of-the-art numerical libraries (10X against multithreaded LAPACK with optimized MKL BLAS and 2.5X against the commercial numerical  ...  While successful implementations have already been written for one-sided transformations (e.g., QR, LU and Cholesky factorizations) on multicore architecture, getting high performance for two-sided reductions  ...  For n = 16000, the tile LL TRD runs roughly at 28 Gflop/s, which is 10 times faster than multithreaded LAPACK with optimized MKL BLAS and 2.5 times faster than the vendor optimized MKL SBR TRD.  ... 
doi:10.1109/ipdps.2011.91 dblp:conf/ipps/LuszczekLD11 fatcat:varuhyzzl5asvm4j5vhgz3ax44
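For contrast with the two-stage scheme, the classical one-stage reduction applies Householder similarity transforms directly to the dense matrix; its memory-bound Level-2 updates are exactly what the dense-to-band then band-to-tridiagonal approach avoids. A NumPy sketch of that one-stage baseline:

```python
import numpy as np

def tridiagonalize(A):
    """One-stage Householder reduction of a symmetric matrix to
    tridiagonal form (classical baseline, unblocked sketch)."""
    T = A.astype(float).copy()
    n = T.shape[0]
    for k in range(n - 2):
        v = T[k+1:, k].copy()
        v[0] += np.copysign(np.linalg.norm(v), v[0])
        nv = np.linalg.norm(v)
        if nv == 0.0:
            continue
        v /= nv
        # Similarity transform T <- H T H with H = I - 2 v v^T.
        T[k+1:, :] -= 2.0 * np.outer(v, v @ T[k+1:, :])
        T[:, k+1:] -= 2.0 * np.outer(T[:, k+1:] @ v, v)
    return T

A = np.random.rand(6, 6)
A = (A + A.T) / 2.0
T = tridiagonalize(A)
assert np.allclose(np.linalg.eigvalsh(T), np.linalg.eigvalsh(A))
```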