SuperMatrix out-of-order scheduling of matrix operations for SMP and multi-core architectures
2007
Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures - SPAA '07
We discuss the high-performance parallel implementation and execution of dense linear algebra matrix operations on SMP architectures, with an eye towards multi-core processors with many cores. ...
We show that this facilitates the adoption of techniques akin to the dynamic scheduling and out-of-order execution common in superscalar processors, which we name SuperMatrix Out-of-Order scheduling. ...
We thank the other members of the FLAME team for their support. ...
doi:10.1145/1248377.1248397
dblp:conf/spaa/ChanQQG07
fatcat:ydn5blbqjfgijokgyjctbjbwnq
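The abstract above describes scheduling suboperations on matrix blocks out of order, much as a superscalar processor schedules instructions. The minimal Python sketch below is our own hedged reconstruction of that general idea, not the SuperMatrix code itself: each task declares the blocks it reads and writes, flow/output/anti dependencies are derived from the sequential issue order, and any task whose dependencies are satisfied may then be dispatched; with a pool of worker threads, independent tasks would execute in parallel.

    # Illustrative sketch only: block labels and task names are hypothetical,
    # and this is not the SuperMatrix implementation itself.
    from dataclasses import dataclass, field

    @dataclass
    class Task:
        name: str
        reads: set
        writes: set
        deps: set = field(default_factory=set)  # indices of tasks this one waits on

    def build_dag(tasks):
        """Derive flow (RAW), output (WAW) and anti (WAR) dependencies from the
        sequential order in which an algorithm-by-blocks issues its tasks."""
        last_writer = {}   # block -> index of its most recent writer
        readers = {}       # block -> indices of readers since that writer
        for i, t in enumerate(tasks):
            for b in t.reads:                     # RAW: wait for the last writer
                if b in last_writer:
                    t.deps.add(last_writer[b])
            for b in t.writes:                    # WAW and WAR hazards
                if b in last_writer:
                    t.deps.add(last_writer[b])
                t.deps.update(readers.get(b, set()))
            for b in t.reads:
                readers.setdefault(b, set()).add(i)
            for b in t.writes:
                last_writer[b] = i
                readers[b] = set()
        return tasks

    def run_out_of_order(tasks):
        """Dispatch any task whose dependencies are done (sequentially here; a
        real runtime would hand ready tasks to a pool of worker threads)."""
        done = set()
        while len(done) < len(tasks):
            for i, t in enumerate(tasks):
                if i not in done and t.deps <= done:
                    print("executing", t.name)
                    done.add(i)

    # The tasks a blocked Cholesky of a 2x2 grid of blocks issues, in order.
    run_out_of_order(build_dag([
        Task("POTRF A[0,0]", reads={"A00"},        writes={"A00"}),
        Task("TRSM  A[1,0]", reads={"A00", "A10"}, writes={"A10"}),
        Task("SYRK  A[1,1]", reads={"A10", "A11"}, writes={"A11"}),
        Task("POTRF A[1,1]", reads={"A11"},        writes={"A11"}),
    ]))

With a larger grid of blocks, several TRSM and update tasks become ready at the same time, which is where the out-of-order runtime gains its parallelism.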
An Algorithm-by-Blocks for SuperMatrix Band Cholesky Factorization
[chapter]
2008
Lecture Notes in Computer Science
The SuperMatrix run-time system allows an out-of-order scheduling of operations that is transparent to the programmer. ...
We pursue the scalable parallel implementation of the factorization of band matrices with medium to large bandwidth targeting SMP and multi-core architectures. ...
We thank John Gilbert and Vikram Aggarwal from the University of California at Santa Barbara for granting access to the neumann platform. ...
doi:10.1007/978-3-540-92859-1_21
fatcat:4jv4sopocncjpm3a665pn3jjh4
Scheduling of QR Factorization Algorithms on SMP and Multi-Core Architectures
2008
16th Euromicro Conference on Parallel, Distributed and Network-Based Processing (PDP 2008)
This paper examines the scalable parallel implementation of QR factorization of a general matrix, targeting SMP and multi-core architectures. ...
The SuperMatrix run-time system utilizes FLASH to assemble and represent matrices but also provides out-of-order scheduling of operations that is transparent to the programmer. ...
Acknowledgments We thank the other members of the FLAME team for their support. This research was partially sponsored by NSF grants CCF-0540926 and CCF-0702714. ...
doi:10.1109/pdp.2008.37
dblp:conf/pdp/Quintana-OrtiQCGZ08
fatcat:coti2x5np5cx7ddqxs5g3otbou
Programming matrix algorithms-by-blocks for thread-level parallelism
2009
ACM Transactions on Mathematical Software
A runtime system, SuperMatrix, uses this information to identify data dependencies between suboperations, allowing them to be scheduled to threads out-of-order and executed in parallel. ...
We show how our recently proposed LU factorization with incremental pivoting and a closely related algorithm-by-blocks for the QR factorization, both originally designed for out-of-core computation, overcome ...
We thank John Gilbert and Vikram Aggarwal from the University of California at Santa Barbara for granting access to the neumann platform. ...
doi:10.1145/1527286.1527288
fatcat:keoskxqyinc2pp47v2ly6yzski
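To make the notion of an algorithm-by-blocks concrete, here is a small numpy sketch of a right-looking Cholesky factorization over a grid of b x b tiles. The tile size, the numpy kernels, and the verification step are our own illustrative choices, not the FLAME/FLASH code the paper describes; every tile-level POTRF, TRSM, SYRK and GEMM issued inside the loops is exactly the kind of suboperation a runtime such as SuperMatrix records and later schedules out of order.

    # Illustrative numpy sketch; the tile size and the verification at the end
    # are our own choices, not the FLAME/FLASH code described in the paper.
    import numpy as np

    def cholesky_by_blocks(A, b):
        """Overwrite the lower triangle of the SPD matrix A (n x n, n divisible
        by b) with its Cholesky factor, issuing only tile-sized suboperations."""
        nt = A.shape[0] // b
        T = lambda i, j: A[i*b:(i+1)*b, j*b:(j+1)*b]   # view of tile (i, j)
        for k in range(nt):
            T(k, k)[:] = np.linalg.cholesky(T(k, k))            # POTRF on tile (k,k)
            for i in range(k + 1, nt):                          # TRSM on the panel
                T(i, k)[:] = np.linalg.solve(T(k, k), T(i, k).T).T
            for i in range(k + 1, nt):                          # trailing update
                T(i, i)[:] -= T(i, k) @ T(i, k).T               # SYRK on tile (i,i)
                for j in range(k + 1, i):
                    T(i, j)[:] -= T(i, k) @ T(j, k).T           # GEMM on tile (i,j)

    # Quick check against the unblocked factorization.
    rng = np.random.default_rng(0)
    X = rng.standard_normal((8, 8))
    A = X @ X.T + 8 * np.eye(8)
    B = A.copy()
    cholesky_by_blocks(B, 2)
    assert np.allclose(np.tril(B), np.linalg.cholesky(A))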
A dependency-aware task-based programming environment for multi-core architectures
2008
2008 IEEE International Conference on Cluster Computing
Parallel programming on SMP and multi-core architectures is hard. ...
We present the programming environment in the context of algorithms from several domains and pinpoint its benefits compared to other approaches. We discuss its execution model and its scheduler. ...
We also would like to thank Eduard Ayguadé for his detailed and insightful comments on a previous manuscript of this paper. ...
doi:10.1109/clustr.2008.4663765
dblp:conf/cluster/PerezBL08
fatcat:4uiqpbcwa5b6vpcpz2vwgavtvm
BLASX: A High Performance Level-3 BLAS Library for Heterogeneous Multi-GPU Computing
[article]
2015
arXiv
pre-print
We then present BLASX: a highly optimized multi-GPU level-3 BLAS. We adopt the concept of algorithms-by-tiles, treating a matrix tile as the basic data unit and operations on tiles as the basic task. ...
The massive and economic computing power brought forth by the emerging GPU architectures drives interest in implementation of compute-intensive level 3 BLAS on multi-GPU systems. ...
The key insight of SuperMatrix is that a matrix can be partitioned into a set of tiles. The Tomasulo algorithm [3] then schedules the operations on these tiles in an out-of-order fashion. ...
arXiv:1510.05041v1
fatcat:wblbbwbzuzffxfkhmhzyf7moem
Unleashing the high-performance and low-power of multi-core DSPs for general-purpose HPC
2012
2012 International Conference for High Performance Computing, Networking, Storage and Analysis
Take a multicore Digital Signal Processor (DSP) chip designed for cellular base stations and radio network controllers, add floating-point capabilities to support 4G networks, and out of thin air an HPC ...
The potential for HPC is clear: it promises 128 GFLOPS (single precision) for 10 Watts; it is used in millions of network-related devices and hence benefits from economies of scale; it should be simpler ...
Previous SuperMatrix implementations: The SuperMatrix runtime was originally designed and developed for SMP architectures, using OpenMP or pthreads as the underlying thread support [3]. ...
doi:10.1109/sc.2012.109
dblp:conf/sc/IgualAFSWG12
fatcat:gnkl5kwf2bad7axiszdfzeu2hq
Solving dense linear systems on platforms with multiple hardware accelerators
2009
SIGPLAN notices
Our experimental evaluation on an Intel Xeon 8-core host linked to an NVIDIA Tesla S870 platform with four GPUs delivers peak performances around 550 and 450 (single-precision) GFLOPS for the matrix-matrix
product and the Cholesky factorization, respectively, which we believe to be the best performance numbers posted on this new architecture for such operations. ...
... Tinsley Oden Faculty Fellowship Research Program of the Institute for Computational Engineering and Sciences (ICES) at UT-Austin. ...
doi:10.1145/1594835.1504196
fatcat:6xan7rjzkffetczdtrau3yg5w4
Solving dense linear systems on platforms with multiple hardware accelerators
2009
Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming - PPoPP '09
Our experimental evaluation on an Intel Xeon 8-core host linked to an NVIDIA Tesla S870 platform with four GPUs delivers peak performances around 550 and 450 (single-precision) GFLOPS for the matrix-matrix
product and the Cholesky factorization, respectively, which we believe to be the best performance numbers posted on this new architecture for such operations. ...
... Tinsley Oden Faculty Fellowship Research Program of the Institute for Computational Engineering and Sciences (ICES) at UT-Austin. ...
doi:10.1145/1504176.1504196
dblp:conf/ppopp/Quintana-OrtiIQG09
fatcat:hk2uw4hnkve7lfv5yf7k733rxi
Solving "large" dense matrix problems on multi-core processors
2009
2009 IEEE International Symposium on Parallel & Distributed Processing
... on a fast multithreaded architecture like an SMP or multi-core computer. ...
... in and out of main memory) is no more difficult than programming high-performance implementations for the case where the matrix is in memory. ...
Acknowledgements The researchers at the Universidad Jaime I were supported by projects CICYT TIN2005-09037-C02-02, TIN2008-06570-C04-01 and FEDER, and P1B-2007-19 of the Fundación Caixa-Castellón/Bancaixa ...
doi:10.1109/ipdps.2009.5161162
dblp:conf/ipps/MarquesQQG09
fatcat:iseganzxjnbq3e6wws3wwkdv3m
Analysis of dynamically scheduled tile algorithms for dense linear algebra on multicore architectures
2011
Concurrency and Computation
The PLASMA library (Parallel Linear Algebra for Scalable Multi-core Architectures) developed at the University of Tennessee tackles this challenge by using tile algorithms to achieve a finer task granularity ...
The objective of this paper is to analyze the dynamic scheduling of dense linear algebra algorithms on shared-memory, multicore architectures. ...
The SMP superscalar (SMPSs) project [5] [21] from the Barcelona Supercomputing Center is a programming environment for shared memory, multi-core architectures focused on the ease of programming, portability ...
doi:10.1002/cpe.1829
fatcat:ynqg2xr2wba2ri36bdwrbhc3qy
Dense Matrix Computations on NUMA Architectures with Distance-Aware Work Stealing
2015
Supercomputing Frontiers and Innovations
The dense numerical linear algebra algorithms of Cholesky factorization and symmetric matrix inversion are employed as representative benchmarks. ...
Performance results on a large NUMA system outperform existing state-of-the-art implementations by up to a twofold speedup for the Cholesky factorization, as well as the symmetric matrix inversion, while ...
The rigid panel-update sequence, previously described in Section 3.1, is now replaced by an out-of-order task execution flow, where computational tasks operating on tiles from different loop iterations ...
doi:10.14529/jsfi150103
fatcat:l7mujkltgzh25oy66xnyz63iei
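The out-of-order task flow described above still needs idle workers to find new work; a distance-aware work stealer prefers victims that are close in the NUMA topology so stolen tiles stay near their memory. The toy Python sketch below shows only the victim-selection step, with a hypothetical distance matrix and task names (not the paper's runtime).

    # Toy sketch of the victim-selection step only; the distance matrix and the
    # task names are hypothetical, not the runtime described in the paper.
    from collections import deque

    # Hypothetical 4-node NUMA distance matrix (the kind `numactl -H` reports).
    DIST = [[10, 16, 21, 21],
            [16, 10, 21, 21],
            [21, 21, 10, 16],
            [21, 21, 16, 10]]

    queues = [deque() for _ in DIST]      # one ready-task deque per NUMA node

    def steal_victim(me):
        """Return the closest other node that still has pending tasks, or None."""
        candidates = [n for n in range(len(DIST)) if n != me and queues[n]]
        return min(candidates, key=lambda n: DIST[me][n]) if candidates else None

    # Node 0 has run dry while nodes 1 and 2 still hold Cholesky tile tasks.
    queues[1].extend(["GEMM(3,1,0)", "SYRK(2,0)"])
    queues[2].extend(["TRSM(4,1)"])
    victim = steal_victim(0)
    print("node 0 steals from node", victim, "->", queues[victim].popleft())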
Hierarchical Task-Based Programming With StarSs
2009
The international journal of high performance computing applications
Programming models for multicore and many-core systems are listed among the main near-future challenges for computing research. ...
The preliminary results obtained when executing a matrix multiplication and a Cholesky factorization show the viability and potential of the approach and the current issues raised. ...
Since 1981 he has been lecturing on computer architecture, operating systems, computer networks and performance evaluation. ...
doi:10.1177/1094342009106195
fatcat:ykbrmwbis5hxhpqskolh4zd5a4
A scalable framework for heterogeneous GPU-based clusters
2012
Proceedings of the 24th ACM symposium on Parallelism in algorithms and architectures - SPAA '12
By overlapping computation and communication through dynamic scheduling, we are able to attain a high performance of 75 TFlops for Cholesky factorization on the heterogeneous Keeneland system [25] using ...
100 nodes, each with twelve CPU cores and three GPUs. ...
We let T(m × n) denote the number of floating-point operations needed to compute a matrix of size m × n; f_core and f_gpu denote the speeds (i.e., flop/s) of a CPU core and a GPU, respectively. ...
doi:10.1145/2312005.2312025
dblp:conf/spaa/SongD12
fatcat:hf5xby6w4bcczef53rcp3gfi5y
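The snippet above introduces the speeds f_core and f_gpu of a CPU core and a GPU; a typical use of that notation is a back-of-the-envelope split of work between the two. The short worked example below uses assumed speeds and sizes (not figures from the paper) to show the arithmetic.

    # Illustrative numbers only (assumptions, not measurements from the paper):
    # decide what fraction of the tiles the GPUs of a node should own so that
    # CPU cores and GPUs finish a sweep at roughly the same time.
    p, f_core = 12, 10e9      # 12 CPU cores at ~10 GFLOPS each (assumed)
    g, f_gpu = 3, 300e9       # 3 GPUs at ~300 GFLOPS each (assumed)

    gpu_share = g * f_gpu / (p * f_core + g * f_gpu)
    print(f"GPU share of the tiles: {gpu_share:.1%}")      # ~88.2%

    # Rough time estimate for one GEMM-style trailing-matrix update of an
    # n x n matrix with panel width b, which costs about 2*n*n*b flops.
    n, b = 20_000, 1_000
    flops = 2 * n * n * b
    print(f"estimated update time: {flops / (p*f_core + g*f_gpu):.2f} s")   # ~0.78 s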
Enabling and scaling matrix computations on heterogeneous multi-core and multi-GPU systems
2012
Proceedings of the 26th ACM international conference on Supercomputing - ICS '12
We present a new approach to utilizing all CPU cores and all GPUs on heterogeneous multicore and multi-GPU systems to support dense matrix computations efficiently. ...
Our approach is designed for achieving four objectives: a high degree of parallelism, minimized synchronization, minimized communication, and load balancing. ...
Acknowledgments We are grateful to Bonnie Brown and Samuel Crawford for their assistance with this paper. ...
doi:10.1145/2304576.2304625
dblp:conf/ics/SongTD12
fatcat:qdntltmtmfavxdcf2i4zonzq74
Showing results 1 — 15 out of 21 results