21 Hits in 4.3 sec

SuperMatrix out-of-order scheduling of matrix operations for SMP and multi-core architectures

Ernie Chan, Enrique S. Quintana-Ortí, Gregorio Quintana-Ortí, Robert van de Geijn
2007 Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures - SPAA '07  
We discuss the high-performance parallel implementation and execution of dense linear algebra matrix operations on SMP architectures, with an eye towards multi-core processors with many cores.  ...  We show that this facilitates the adoption of techniques akin to the dynamic scheduling and out-of-order execution common in superscalar processors, which we name SuperMatrix out-of-order scheduling.  ...  We thank the other members of the FLAME team for their support.  ... 
doi:10.1145/1248377.1248397 dblp:conf/spaa/ChanQQG07 fatcat:ydn5blbqjfgijokgyjctbjbwnq
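
The out-of-order scheduling described above is easiest to see on a tiled Cholesky factorization. The sketch below is not the SuperMatrix API (SuperMatrix ships with libflame); it expresses the same style of task DAG with OpenMP task dependences, with stub kernels standing in for dpotrf/dtrsm/dsyrk/dgemm so that the dependence structure, rather than the numerics, is the point.

```c
/* A minimal sketch of out-of-order scheduling of a tiled Cholesky
 * factorization, in the spirit of SuperMatrix but expressed with
 * OpenMP task dependences (this is NOT the SuperMatrix API).
 * Kernels are stubs; a real code would call dpotrf/dtrsm/dsyrk/dgemm
 * on each tile.  Build: cc -fopenmp chol_dag.c */
#include <omp.h>

#define NT 4                        /* tiles per row/column   */
#define TS 64                       /* tile dimension         */

static double A[NT][NT][TS * TS];   /* matrix stored by tiles */

static void potrf(double *akk) { (void)akk; }
static void trsm(double *akk, double *aik) { (void)akk; (void)aik; }
static void syrk(double *aik, double *aii) { (void)aik; (void)aii; }
static void gemm(double *aik, double *ajk, double *aij)
{ (void)aik; (void)ajk; (void)aij; }

int main(void)
{
    #pragma omp parallel
    #pragma omp single          /* one thread submits the task DAG */
    for (int k = 0; k < NT; k++) {
        #pragma omp task depend(inout: A[k][k][0])
        potrf(A[k][k]);
        for (int i = k + 1; i < NT; i++) {
            #pragma omp task depend(in: A[k][k][0]) depend(inout: A[i][k][0])
            trsm(A[k][k], A[i][k]);
        }
        for (int i = k + 1; i < NT; i++) {
            #pragma omp task depend(in: A[i][k][0]) depend(inout: A[i][i][0])
            syrk(A[i][k], A[i][i]);
            for (int j = k + 1; j < i; j++) {
                #pragma omp task depend(in: A[i][k][0], A[j][k][0]) \
                                 depend(inout: A[i][j][0])
                gemm(A[i][k], A[j][k], A[i][j]);
            }
        }
    }   /* tasks execute as operands become ready, not in program order */
    return 0;
}
```

Tasks from different iterations of k run concurrently whenever their tiles are disjoint, which is precisely the superscalar-style reordering the entry describes.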

An Algorithm-by-Blocks for SuperMatrix Band Cholesky Factorization [chapter]

Gregorio Quintana-Ortí, Enrique S. Quintana-Ortí, Alfredo Remón, Robert A. van de Geijn
2008 Lecture Notes in Computer Science  
The SuperMatrix run-time system allows out-of-order scheduling of operations that is transparent to the programmer.  ...  We pursue the scalable parallel implementation of the factorization of band matrices with medium to large bandwidth, targeting SMP and multi-core architectures.  ...  We thank John Gilbert and Vikram Aggarwal from the University of California at Santa Barbara for granting access to the neumann platform.  ... 
doi:10.1007/978-3-540-92859-1_21 fatcat:4jv4sopocncjpm3a665pn3jjh4

Scheduling of QR Factorization Algorithms on SMP and Multi-Core Architectures

Gregorio Quintana-Ortí, Enrique S. Quintana-Ortí, Ernie Chan, Robert A. van de Geijn, Field G. Van Zee
2008 16th Euromicro Conference on Parallel, Distributed and Network-Based Processing (PDP 2008)  
This paper examines the scalable parallel implementation of QR factorization of a general matrix, targeting SMP and multi-core architectures.  ...  The SuperMatrix run-time system utilizes FLASH to assemble and represent matrices but also provides out-of-order scheduling of operations that is transparent to the programmer.  ...  Acknowledgments We thank the other members of the FLAME team for their support. This research was partially sponsored by NSF grants CCF-0540926 and CCF-0702714.  ... 
doi:10.1109/pdp.2008.37 dblp:conf/pdp/Quintana-OrtiQCGZ08 fatcat:coti2x5np5cx7ddqxs5g3otbou

Programming matrix algorithms-by-blocks for thread-level parallelism

Gregorio Quintana-Ortí, Enrique S. Quintana-Ortí, Robert A. van de Geijn, Field G. Van Zee, Ernie Chan
2009 ACM Transactions on Mathematical Software  
A runtime system, SuperMatrix, uses this information to identify data dependencies between suboperations, allowing them to be scheduled to threads out-of-order and executed in parallel.  ...  We show how our recently proposed LU factorization with incremental pivoting and a closely related algorithm-by-blocks for the QR factorization, both originally designed for out-of-core computation, overcome  ...  We thank John Gilbert and Vikram Aggarwal from the University of California at Santa Barbara for granting access to the neumann platform.  ... 
doi:10.1145/1527286.1527288 fatcat:keoskxqyinc2pp47v2ly6yzski
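
The mechanism behind algorithms-by-blocks is storage-by-blocks (FLASH in the FLAME project): the matrix is a matrix of submatrices, so each block is contiguous in memory and can serve as one schedulable unit. The struct below is a hypothetical illustration of that layout, not the actual libflame/FLASH API.

```c
/* Sketch of storage-by-blocks: a matrix whose "elements" are
 * themselves matrices.  Hypothetical types, not the libflame/FLASH
 * API.  Error checking and deallocation omitted for brevity. */
#include <stdlib.h>

typedef struct block {
    int     m, n;      /* dimensions of this block     */
    double *data;      /* column-major storage         */
} block_t;

typedef struct blocked_matrix {
    int       mb, nb;  /* number of block rows/columns */
    block_t **blk;     /* mb-by-nb grid of blocks      */
} blocked_matrix_t;

/* Partition an m-by-n problem into bs-by-bs blocks (the last row and
 * column of blocks may be smaller).  Each block is contiguous, which
 * is what lets a runtime treat it as one schedulable unit. */
blocked_matrix_t *create_blocked(int m, int n, int bs)
{
    blocked_matrix_t *A = malloc(sizeof *A);
    A->mb  = (m + bs - 1) / bs;
    A->nb  = (n + bs - 1) / bs;
    A->blk = malloc(A->mb * sizeof *A->blk);
    for (int i = 0; i < A->mb; i++) {
        A->blk[i] = malloc(A->nb * sizeof **A->blk);
        for (int j = 0; j < A->nb; j++) {
            block_t *b = &A->blk[i][j];
            b->m = (i == A->mb - 1) ? m - i * bs : bs;
            b->n = (j == A->nb - 1) ? n - j * bs : bs;
            b->data = calloc((size_t)b->m * b->n, sizeof(double));
        }
    }
    return A;
}
```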

A dependency-aware task-based programming environment for multi-core architectures

Josep M. Pérez, Rosa M. Badia, Jesús Labarta
2008 2008 IEEE International Conference on Cluster Computing  
Parallel programming on SMP and multi-core architectures is hard.  ...  We present the programming environment in the context of algorithms from several domains and pinpoint its benefits compared to other approaches. We discuss its execution model and its scheduler.  ...  We also would like to thank Eduard Ayguadé for his detailed and insightful comments on a previous manuscript of this paper.  ... 
doi:10.1109/clustr.2008.4663765 dblp:conf/cluster/PerezBL08 fatcat:4uiqpbcwa5b6vpcpz2vwgavtvm
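
A dependency-aware runtime of this kind typically infers ordering at task-submission time by tracking, per data block, the last task that wrote it. The toy tracker below illustrates the mechanism with hypothetical structures; the real SMP superscalar runtime additionally tracks readers (for WAR edges) and applies renaming to remove false dependencies.

```c
/* Toy dependency tracker: the runtime records each block's last
 * writer and makes any later reader or writer of that block depend
 * on it (RAW and WAW edges).  Hypothetical structures only. */
#include <stdio.h>

#define MAX_DEPS 16

typedef struct task {
    int id;
    int ndeps;
    struct task *deps[MAX_DEPS];   /* tasks that must complete first */
} task_t;

typedef struct { task_t *last_writer; } block_state_t;

/* Register one argument of a newly submitted task: a read or write
 * of data block b. */
static void add_arg(task_t *t, block_state_t *b, int writes)
{
    if (b->last_writer && t->ndeps < MAX_DEPS)
        t->deps[t->ndeps++] = b->last_writer;
    if (writes)
        b->last_writer = t;
}

int main(void)
{
    block_state_t A = {0}, B = {0};
    task_t t1 = { .id = 1 }, t2 = { .id = 2 }, t3 = { .id = 3 };
    task_t *all[] = { &t1, &t2, &t3 };

    add_arg(&t1, &A, 1);   /* t1 writes A                   */
    add_arg(&t2, &A, 0);   /* t2 reads A  -> edge t1 -> t2  */
    add_arg(&t2, &B, 1);   /* t2 writes B                   */
    add_arg(&t3, &B, 1);   /* t3 writes B -> edge t2 -> t3  */

    for (int k = 0; k < 3; k++) {
        printf("task %d depends on:", all[k]->id);
        for (int i = 0; i < all[k]->ndeps; i++)
            printf(" %d", all[k]->deps[i]->id);
        printf("\n");
    }
    return 0;
}
```

Running it prints that task 2 depends on task 1 (read-after-write on A) and task 3 depends on task 2 (write-after-write on B).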

BLASX: A High Performance Level-3 BLAS Library for Heterogeneous Multi-GPU Computing [article]

Linnan Wang, Wei Wu, Jianxiong Xiao, Yi Yang
2015 arXiv   pre-print
We then present BLASX: a highly optimized multi-GPU level-3 BLAS. We adopt the concept of algorithms-by-tiles, treating a matrix tile as the basic data unit and operations on tiles as the basic task.  ...  The massive and economic computing power brought forth by the emerging GPU architectures drives interest in implementations of compute-intensive level-3 BLAS on multi-GPU systems.  ...  The key insight of SuperMatrix is that a matrix can be partitioned into a set of tiles. The Tomasulo algorithm [3] subsequently schedules these tiles in an out-of-order fashion.  ... 
arXiv:1510.05041v1 fatcat:wblbbwbzuzffxfkhmhzyf7moem
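
The algorithms-by-tiles view in this entry makes each tile operation a task. A minimal CPU-only illustration of the decomposition follows (naive kernel, toy sizes); BLASX itself dispatches such tile tasks to multiple GPUs with tile caching and communication hiding.

```c
/* Sketch of a level-3 BLAS call (C += A*B) decomposed into tile
 * tasks, each a candidate for dispatch to a CPU or GPU worker.
 * Naive kernel and toy sizes, purely for illustration. */
#include <stdio.h>

#define NT 2            /* tiles per dimension */
#define TS 4            /* tile dimension      */

typedef double tile_t[TS][TS];

static tile_t A[NT][NT], B[NT][NT], C[NT][NT];

/* One tile task: C_ij += A_ik * B_kj. */
static void gemm_tile(tile_t a, tile_t b, tile_t c)
{
    for (int i = 0; i < TS; i++)
        for (int j = 0; j < TS; j++)
            for (int k = 0; k < TS; k++)
                c[i][j] += a[i][k] * b[k][j];
}

int main(void)
{
    /* A = B = identity (by tiles) for a quick sanity check. */
    for (int i = 0; i < NT; i++)
        for (int d = 0; d < TS; d++)
            A[i][i][d][d] = B[i][i][d][d] = 1.0;

    /* NT*NT*NT tile tasks; tasks with different (i,j) are mutually
     * independent, which is what a multi-GPU runtime exploits. */
    for (int i = 0; i < NT; i++)
        for (int j = 0; j < NT; j++)
            for (int k = 0; k < NT; k++)
                gemm_tile(A[i][k], B[k][j], C[i][j]);

    printf("C[0][0][0][0] = %.1f (expect 1.0)\n", C[0][0][0][0]);
    return 0;
}
```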

Unleashing the high-performance and low-power of multi-core DSPs for general-purpose HPC

Francisco D. Igual, Murtaza Ali, Arnon Friedmann, Eric Stotzer, Timothy Wentz, Robert A. van de Geijn
2012 2012 International Conference for High Performance Computing, Networking, Storage and Analysis  
Take a multicore Digital Signal Processor (DSP) chip designed for cellular base stations and radio network controllers, add floating-point capabilities to support 4G networks, and out of thin air a HPC  ...  The potential for HPC is clear: it promises 128 GFLOPS (single precision) for 10 Watts; it is used in millions of network-related devices and hence benefits from economies of scale; it should be simpler  ...  Previous SuperMatrix implementations: The SuperMatrix runtime was originally designed and developed for SMP architectures, using OpenMP or pthreads as the underlying thread support [3].  ... 
doi:10.1109/sc.2012.109 dblp:conf/sc/IgualAFSWG12 fatcat:gnkl5kwf2bad7axiszdfzeu2hq

Solving dense linear systems on platforms with multiple hardware accelerators

Gregorio Quintana-Ortí, Francisco D. Igual, Enrique S. Quintana-Ortí, Robert A. van de Geijn
2009 SIGPLAN notices  
Our experimental evaluation on an Intel Xeon 8-core host linked to an NVIDIA Tesla S870 platform with four GPUs delivers peak performances of around 550 and 450 (single-precision) GFLOPS for the matrix-matrix product and the Cholesky factorization, respectively, which we believe to be the best performance numbers posted on this new architecture for such operations.  ...  Tinsley Oden Faculty Fellowship Research Program of the Institute for Computational Engineering and Sciences (ICES) at UT-Austin.  ... 
doi:10.1145/1594835.1504196 fatcat:6xan7rjzkffetczdtrau3yg5w4

Solving dense linear systems on platforms with multiple hardware accelerators

Gregorio Quintana-Ortí, Francisco D. Igual, Enrique S. Quintana-Ortí, Robert A. van de Geijn
2008 Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming - PPoPP '09  
Our experimental evaluation on an Intel Xeon 8-core host linked to an NVIDIA Tesla S870 platform with four GPUs delivers peak performances of around 550 and 450 (single-precision) GFLOPS for the matrix-matrix product and the Cholesky factorization, respectively, which we believe to be the best performance numbers posted on this new architecture for such operations.  ...  Tinsley Oden Faculty Fellowship Research Program of the Institute for Computational Engineering and Sciences (ICES) at UT-Austin.  ... 
doi:10.1145/1504176.1504196 dblp:conf/ppopp/Quintana-OrtiIQG09 fatcat:hk2uw4hnkve7lfv5yf7k733rxi

Solving "large" dense matrix problems on multi-core processors

Mercedes Marques, Gregorio Quintana-Orti, Enrique S. Quintana-Orti, Robert A. van de Geijn
2009 2009 IEEE International Symposium on Parallel & Distributed Processing  
...  on a fast multithreaded architecture like an SMP or multi-core computer.  ...  in and out of main memory) is no more difficult than programming high-performance implementations for the case where the matrix is in memory.  ...  Acknowledgements: The researchers at the Universidad Jaime I were supported by projects CICYT TIN2005-09037-C02-02, TIN2008-06570-C04-01 and FEDER, and P1B-2007-19 of the Fundación Caixa-Castellón/Bancaixa  ... 
doi:10.1109/ipdps.2009.5161162 dblp:conf/ipps/MarquesQQG09 fatcat:iseganzxjnbq3e6wws3wwkdv3m
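
The point of this entry is that with a tiled layout, out-of-core execution reduces to staging tiles between disk and a small in-core workspace. A minimal staging sketch, with a hypothetical file layout (tiles stored contiguously, in row-major tile order):

```c
/* Sketch of out-of-core tile staging.  File layout and function
 * names are hypothetical; a real implementation would add a tile
 * cache and overlap I/O with computation. */
#include <stdio.h>
#include <stdlib.h>

#define TS 512                           /* tile dimension */

/* Read tile (i,j) of an nt-by-nt tiled matrix from file f into buf. */
static int load_tile(FILE *f, int nt, int i, int j, double *buf)
{
    long off = ((long)i * nt + j) * TS * TS * (long)sizeof(double);
    if (fseek(f, off, SEEK_SET) != 0) return -1;
    return fread(buf, sizeof(double), TS * TS, f) == TS * TS ? 0 : -1;
}

/* Write tile (i,j) back to its slot in the file. */
static int store_tile(FILE *f, int nt, int i, int j, const double *buf)
{
    long off = ((long)i * nt + j) * TS * TS * (long)sizeof(double);
    if (fseek(f, off, SEEK_SET) != 0) return -1;
    return fwrite(buf, sizeof(double), TS * TS, f) == TS * TS ? 0 : -1;
}

int main(void)
{
    int nt = 4;
    FILE *f = tmpfile();                 /* stand-in for the matrix file */
    double *t = calloc(TS * TS, sizeof *t);
    if (!f || !t) return 1;
    store_tile(f, nt, 2, 1, t);          /* evict tile (2,1) to disk     */
    load_tile(f, nt, 2, 1, t);           /* ... and bring it back        */
    puts("tile staged out and back in");
    free(t); fclose(f);
    return 0;
}
```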

Analysis of dynamically scheduled tile algorithms for dense linear algebra on multicore architectures

Azzam Haidar, Hatem Ltaief, Asim YarKhan, Jack Dongarra
2011 Concurrency and Computation  
The PLASMA library (Parallel Linear Algebra for Scalable Multi-core Architectures) developed at the University of Tennessee tackles this challenge by using tile algorithms to achieve a finer task granularity  ...  The objective of this paper is to analyze the dynamic scheduling of dense linear algebra algorithms on shared-memory, multicore architectures.  ...  The SMP superscalar (SMPSs) project [5, 21] from the Barcelona Supercomputing Center is a programming environment for shared memory, multi-core architectures focused on ease of programming, portability  ... 
doi:10.1002/cpe.1829 fatcat:ynqg2xr2wba2ri36bdwrbhc3qy

Dense Matrix Computations on NUMA Architectures with Distance-Aware Work Stealing

2015 Supercomputing Frontiers and Innovations  
The dense numerical linear algebra algorithms of Cholesky factorization and symmetric matrix inversion are employed as representative benchmarks.  ...  Performance results on a large NUMA system outperform state-of-the-art existing implementations by up to a twofold speedup for the Cholesky factorization as well as the symmetric matrix inversion, while  ...  The rigid panel-update sequence, previously described in Section 3.1, is now replaced by an out-of-order task execution flow, where computational tasks operating on tiles from different loop iterations  ... 
doi:10.14529/jsfi150103 fatcat:l7mujkltgzh25oy66xnyz63iei
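
Distance-aware work stealing, as in this entry, biases victim selection toward NUMA-near queues. Below is a toy victim selector under an assumed distance table (the values mimic the relative distances `numactl -H` reports; the queue lengths are made up):

```c
/* Toy distance-aware victim selection for work stealing: an idle
 * worker steals from the nearest NUMA node that has pending tasks.
 * Distance table and queue state are assumptions for illustration. */
#include <stdio.h>

#define NNODES 4

static const int dist[NNODES][NNODES] = {   /* assumed NUMA distances */
    {10, 20, 30, 30},
    {20, 10, 30, 30},
    {30, 30, 10, 20},
    {30, 30, 20, 10},
};

static int queue_len[NNODES] = {0, 5, 0, 9}; /* pending tasks per node */

/* Pick a victim with work, preferring the smallest distance. */
static int pick_victim(int self)
{
    int best = -1;
    for (int n = 0; n < NNODES; n++) {
        if (n == self || queue_len[n] == 0) continue;
        if (best < 0 || dist[self][n] < dist[self][best]) best = n;
    }
    return best;
}

int main(void)
{
    /* Node 1 (distance 20) wins over node 3 (distance 30) even
     * though node 3 has more work: locality beats load here. */
    printf("worker on node 0 steals from node %d\n", pick_victim(0));
    return 0;
}
```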

Hierarchical Task-Based Programming With StarSs

Judit Planas, Rosa M. Badia, Eduard Ayguadé, Jesus Labarta, Jack Dongarra, Bernard Tourancheau
2009 The international journal of high performance computing applications  
Programming models for multicore and many-core systems are listed among the main challenges in the near future for computing research.  ...  The preliminary results obtained when executing a matrix multiplication and a Cholesky factorization show the viability and potential of the approach and the current issues raised.  ...  Since 1981 he has been lecturing on computer architecture, operating systems, computer networks and performance evaluation.  ... 
doi:10.1177/1094342009106195 fatcat:ykbrmwbis5hxhpqskolh4zd5a4

A scalable framework for heterogeneous GPU-based clusters

Fengguang Song, Jack Dongarra
2012 Proceedings of the 24th ACM symposium on Parallelism in algorithms and architectures - SPAA '12  
By overlapping computation and communication through dynamic scheduling, we are able to attain a high performance of 75 TFlops for Cholesky factorization on the heterogeneous Keeneland system [25] using 100 nodes, each with twelve CPU cores and three GPUs.  ...  We let T(m × n) denote the number of floating-point operations required to compute a matrix of size m × n; f_core and f_gpu denote the speed (i.e., flop/s) of a CPU core and a GPU, respectively.  ... 
doi:10.1145/2312005.2312025 dblp:conf/spaa/SongD12 fatcat:hf5xby6w4bcczef53rcp3gfi5y
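
In the snippet's notation, one simple use of T(m × n), f_core, and f_gpu is to split tiles between device classes in proportion to their aggregate speed. The numbers below are assumptions for illustration, not measurements from the paper:

```c
/* Worked example of speed-proportional static partitioning using the
 * snippet's f_core / f_gpu notation.  All rates are assumed values;
 * the paper's actual scheduler is dynamic. */
#include <stdio.h>

int main(void)
{
    const double f_core = 10e9;        /* assumed: 10 GFLOP/s per CPU core */
    const double f_gpu  = 300e9;       /* assumed: 300 GFLOP/s per GPU     */
    const int ncores = 12, ngpus = 3;  /* per-node config from the snippet */
    const int ntiles = 1000;           /* tile columns to distribute       */

    double total = ncores * f_core + ngpus * f_gpu;
    int gpu_tiles  = (int)(ntiles * (ngpus * f_gpu) / total + 0.5);
    int core_tiles = ntiles - gpu_tiles;

    /* With these rates the GPUs take ~88% of the tiles. */
    printf("GPUs get %d tile columns, CPU cores get %d\n",
           gpu_tiles, core_tiles);
    return 0;
}
```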

Enabling and scaling matrix computations on heterogeneous multi-core and multi-GPU systems

Fengguang Song, Stanimire Tomov, Jack Dongarra
2012 Proceedings of the 26th ACM international conference on Supercomputing - ICS '12  
We present a new approach to utilizing all CPU cores and all GPUs on heterogeneous multicore and multi-GPU systems to support dense matrix computations efficiently.  ...  Our approach is designed for achieving four objectives: a high degree of parallelism, minimized synchronization, minimized communication, and load balancing.  ...  Acknowledgments We are grateful to Bonnie Brown and Samuel Crawford for their assistance with this paper.  ... 
doi:10.1145/2304576.2304625 dblp:conf/ics/SongTD12 fatcat:qdntltmtmfavxdcf2i4zonzq74
Showing results 1 — 15 out of 21