A Left-Looking Selected Inversion Algorithm and Task Parallelism on Shared Memory Systems [article]

Mathias Jacquelin, Lin Lin, Weile Jia, Yonghua Zhao, Chao Yang
2016 arXiv   pre-print
In this paper, we describe the left-looking variant of the selected inversion algorithm, and based on task parallel method, present an efficient implementation of the algorithm for shared memory machines  ...  Given a sparse matrix A, the selected inversion algorithm is an efficient method for computing certain selected elements of A^-1.  ...  Left-looking selected inversion is able to leverage the parallelism offered by modern multicore and manycore processors in an efficient way.  ... 
arXiv:1604.02528v1 fatcat:l6suxxuhincklftlfinq7igimu

Parallel implementation of RX anomaly detection on multi-core processors: impact of data partitioning strategies

Jose M. Molero, Ester M. Garzón, Inmaculada García, Antonio Plaza, Bormin Huang, Antonio J. Plaza
2011 High-Performance Computing in Remote Sensing  
This aspect is crucial for the RX implementation since the consideration of a local or global strategy for the computation of the sample covariance matrix is expected to affect both the scalability of  ...  Anomaly detection is an important task for remotely sensed hyperspectral data exploitation.  ...  Funding from the Spanish Ministry of Science and Innovation (HYPERCOMP/EODIX project, reference AYA2008-05965-C04-02) and from Junta de Extremadura (PRI09A110 and GR10035 projects) are also very gratefully  ... 
doi:10.1117/12.897388 fatcat:bktzh4f3rjhqjky6lx4rip2kzq

Exploiting Parallelism by Data Dependency Elimination: A Case Study of Circuit Simulation Algorithms

Wei Wu, Fang Gong, Rahul Krishnan, Hao Yu, Lei He
2013 IEEE design & test  
As a result, the algorithms for circuit simulation cannot be effectively parallelized by simply unfolding "for" loops into parallel code.  ...  For example, bordered-block-diagonal (BBD) matrix formulation is deployed for the sparse MNA matrix with inverse inductance [4]; fast-multipole-method (FMM) formulation is deployed for capacitance extraction  ...  The basic idea of parallelizing Algorithm 1 is to unfold the N tasks (iterations) in the outer for loop. However, strong dependency can be identified among these tasks.  ... 
doi:10.1109/mdt.2012.2226201 fatcat:ff5h4qyj45cinbloyhupsnf3ua

A Concurrent Object-Oriented Approach to the Eigenproblem Treatment in Shared Memory Multicore Environments [chapter]

Alfonso Niño, Camelia Muñoz-Caro, Sebastián Reyes
2011 Lecture Notes in Computer Science  
We also find that a reasonable upper limit for a "small" dense matrix to be treated in actual processors is in the interval 10000-30000.  ...  This work presents an object-oriented approach to the concurrent computation of eigenvalues and eigenvectors in real symmetric and Hermitian matrices on present memory shared multicore systems.  ...  This work has been co-financed by FEDER funds and the Consejería de Educación y Ciencia de la Junta de Comunidades de Castilla-La Mancha (grant # PBI08-0008).  ... 
doi:10.1007/978-3-642-21928-3_46 fatcat:okmfdivovzefveuij4blh3xaq4

Design of the H264 application and Implementation on Heterogeneous Architectures

Chahrazed Adda, Abou Elhassen
2017 International Journal of Computer Applications  
Our approach is based on hybrid partitioning that combines both functional and data partitioning which is applied to find the most suitable processors (CPU or GPU) regarding the execution time.  ...  While the new encoding and decoding processes are similar to many previous standards, the new standard includes a number of new features and thus requires much more computation than most existing standards  ...  Inverse prediction task can be done in parallel with CAVLD, inverse quantization and inverse transform (task Parallelism).  ... 
doi:10.5120/ijca2017916056 fatcat:ciluiuzu55gctnd23iifvqagqy

Performance evaluation of R with Intel Xeon Phi coprocessor

Yaakoub El-Khamra, Niall Gaffney, David Walling, Eric Wernert, Weijia Xu, Hui Zhang
2013 2013 IEEE International Conference on Big Data  
The performance gains through parallelization increase as the data size increases, a promising result for adopting R for big data problems in the future.  ...  There are up to five times speedup gains from using MKL with 16 cores without modification to the existing code for certain computing tasks.  ...  Those tasks include cross product between matrices, linear regression, matrix decomposition, computing inverse and determinant of a matrix.  ... 
doi:10.1109/bigdata.2013.6691695 dblp:conf/bigdataconf/KhamraGWWXZ13 fatcat:vansjz4x5bddfnkae46xh4n7py

Parallel Programming Models for Dense Linear Algebra on Heterogeneous Systems

2016 Supercomputing Frontiers and Innovations  
Of interest is the evolution of the programming models for DLA libraries -in particular, the evolution from the popular LAPACK and ScaLAPACK libraries to their modernized counterparts PLASMA (for multicore  ...  hardware trends and ease of programming high-performance numerical software that current applications need -in order to motivate work and future directions for the next generation of parallel programming  ...  This paper is distributed under the terms of the Creative Commons Attribution-Non Commercial 3.0 License which permits non-commercial use, reproduction and distribution of the work without further permission  ... 
doi:10.14529/jsfi150405 fatcat:avnmwu4dozdmjksknrlznhpv7u

LU factorization for accelerator-based systems

Emmanuel Agullo, Cedric Augonnet, Jack Dongarra, Mathieu Faverge, Julien Langou, Hatem Ltaief, Stanimire Tomov
2011 2011 9th IEEE/ACS International Conference on Computer Systems and Applications (AICCSA)  
the kernels for two different machines composed of multiple recent NVIDIA Tesla S1070 (four GPUs total) and Fermi-based S2050 GPUs (three GPUs total), respectively.  ...  Multicore architectures enhanced with multiple GPUs are likely to become mainstream High Performance Computing (HPC) platforms in a near future.  ...  CONCLUSION We have presented the design and implementation of a new hybrid algorithm for performing the tile LU factorization on a multicore node enhanced with multiple GPUs.  ... 
doi:10.1109/aiccsa.2011.6126599 dblp:conf/aiccsa/AgulloADFLLT11 fatcat:d4ekr755wncsng4kkntrmgrbva

Parallelization of DQMC simulation for strongly correlated electron systems

Che-Rung Lee, I-Hsin Chung, Zhaojun Bai
2010 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS)  
From coarse-grained parallel Markov chain and task decompositions to fine-grained parallelization methods for matrix computations and Green's function calculations, the HGP scheme explores the parallelism  ...  We extend previous work with novelty by presenting a hybrid granularity parallelization (HGP) scheme that combines algorithmic and implementation techniques to speed up the DQMC simulation.  ...  The first author would like to acknowledge National Science Council of Taiwan for the support under the grant NSC98-2218-E-007-006-MY3, and National Center for High-performance Computing for using the computing  ... 
doi:10.1109/ipdps.2010.5470484 dblp:conf/ipps/LeeCB10 fatcat:b6mrhqxbbvedjggrdfraytt6j4

Alya: Computational Solid Mechanics for Supercomputers

E. Casoni, A. Jérusalem, C. Samaniego, B. Eguzkitza, P. Lafortune, D. D. Tjahjanto, X. Sáez, G. Houzeaux, M. Vázquez
2014 Archives of Computational Methods in Engineering  
Hybrid parallelization exploits the thread-level parallelism of multicore architectures, combining MPI tasks with OpenMP threads.  ...  Hybrid parallelization is especially well suited for the current trend of supercomputers, namely large clusters of multicores.  ...  The domain decomposition strategy implemented only uses parallelism at task level, which is provided by MPI.  ... 
doi:10.1007/s11831-014-9126-8 fatcat:ee43vkfeizgsha4tzo2kbizmqe

Modern multicore and manycore architectures: Modelling, optimisation and benchmarking a multiblock CFD code

Ioan Hadade, Luca di Mare
2016 Computer Physics Communications  
Modern multicore and manycore processors exhibit multiple levels of parallelism through a wide range of architectural features such as SIMD for data parallel execution or threads for core parallelism.  ...  The exploitation of multi-level parallelism is therefore crucial for achieving superior performance on current and future processors.  ...  The increase in performance for the intrinsics version on the multicore processors is due to the manual inner loop unrolling when assembling fluxes which allows for more efficient instruction level parallelism  ... 
doi:10.1016/j.cpc.2016.04.006 fatcat:ndhpdqbuonhwrhcfnumglj5sge

A survey of power and energy efficient techniques for high performance numerical linear algebra operations

Li Tan, Shashank Kothapalli, Longxiang Chen, Omar Hussaini, Ryan Bissiri, Zizhong Chen
2014 Parallel Computing  
, and summarize state-of-the-art techniques for achieving power and energy efficiency in each category individually.  ...  We summarize commonly deployed power management techniques for reducing power and energy consumption in high performance computing systems by presenting power and energy models and two fundamental types  ...  [90] investigated the trade-off between execution time and energy costs of task-parallel Cholesky and LU factorizations on a hybrid CPU-GPU platform. Anzt et al.  ... 
doi:10.1016/j.parco.2014.09.001 fatcat:twdkr2hrizebvglto6dwd7jqem

A Fast Selected Inversion Algorithm for Green's Function Calculation in Many-Body Quantum Monte Carlo Simulations

Chengming Jiang, Zhaojun Bai, Richard Scalettar
2016 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS)  
In this paper, we describe a fast selected inversion (FSI) algorithm for computing selected entries of Green's functions and present a parallel implementation using hybrid MPI/OpenMP programming.  ...  factorization; (3) using the block entries of the inverse of the reduced block pcyclic matrix as seeds to rapidly form the selected inversion in parallel.  ...  To take advantage of both distributed memory and multicore shared memory architecture, it goes naturally to employ hybrid MPI/OpenMP parallelism that uses MPI for message passing and OpenMP for frequently  ... 
doi:10.1109/ipdps.2016.69 dblp:conf/ipps/JiangBS16 fatcat:gfowjar25vec5iilcxmjlybg4u

Weighted dynamic scheduling with many parallelism grains for offloading of numerical workloads to multiple varied accelerators

Azzam Haidar, Yulu Jia, Piotr Luszczek, Stanimire Tomov, Asim YarKhan, Jack Dongarra
2015 Proceedings of the 6th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems - ScalA '15  
For example, in order to use GPUs effectively, the workload must have a greater degree of parallelism than a workload designed for a multicore-CPU.  ...  We propose a productive programming model starting from serial code, which achieves parallelism and scalability by using a task-superscalar runtime environment to adapt the computation to the available  ...  the Department of Energy, and the NVIDIA and Intel Corporations.  ... 
doi:10.1145/2832080.2832085 dblp:conf/sc/HaidarJLTYD15 fatcat:ppxzxzbmyvc4rjc6qiarh4kaly

Scalable NUMA-Aware Wilson-Dirac on Supercomputers

Claude Tadonki
2017 2017 International Conference on High Performance Computing & Simulation (HPCS)  
Designing efficient LQCD codes on modern (mostly hybrid) supercomputers requires to efficiently exploit all available levels of parallelism including accelerators.  ...  We reach nearly optimal performances on a single core and a significant scalability improvement on several NUMA nodes.  ...  Thanks to Christine Einsenbeis from INRIA for our regular discussions about LQCD implementations, and to my PhD student Adilla Susungi for the same about NUMA considerations.  ... 
doi:10.1109/hpcs.2017.56 dblp:conf/ieeehpcs/Tadonki17 fatcat:5t57tywdsnhunpxwytullhkeia