A copy of this work was available on the public web and has been preserved in the Wayback Machine. The capture dates from 2019; you can also visit the original URL.
The file type is application/pdf.
A Left-Looking Selected Inversion Algorithm and Task Parallelism on Shared Memory Systems
[article]
2016
arXiv
pre-print
In this paper, we describe the left-looking variant of the selected inversion algorithm and, based on a task-parallel method, present an efficient implementation of the algorithm for shared memory machines ...
Given a sparse matrix A, the selected inversion algorithm is an efficient method for computing certain selected elements of A^-1. ...
Left-looking selected inversion is able to leverage the parallelism offered by modern multicore and manycore processors in an efficient way. ...
arXiv:1604.02528v1
fatcat:l6suxxuhincklftlfinq7igimu
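The abstract above describes computing selected elements of A^-1 rather than the full inverse. A minimal sketch of what is being computed, for a dense matrix: solve A x = e_i only for the columns of interest and keep only the requested entries. The helper names are invented for illustration; a real selected inversion algorithm works on the sparse LU/Cholesky factors instead of dense solves.

```python
def solve(A, b):
    """Gaussian elimination with partial pivoting on a small dense matrix."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(M[i][k]))
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, n):
            f = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= f * M[k][j]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def selected_inverse_diagonal(A, indices):
    """Return {i: (A^-1)[i][i]} for the requested indices only."""
    n = len(A)
    out = {}
    for i in indices:
        e = [0.0] * n
        e[i] = 1.0                  # unit vector e_i
        out[i] = solve(A, e)[i]     # keep only the selected entry
    return out

A = [[4.0, 1.0, 0.0], [1.0, 4.0, 1.0], [0.0, 1.0, 4.0]]
print(selected_inverse_diagonal(A, [0, 2]))  # two selected diagonal entries of A^-1
```

The point of the selected inversion idea is that, when only a few entries of the inverse are needed (as in Green's function calculations), one never has to form or store the full dense inverse.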
Parallel implementation of RX anomaly detection on multi-core processors: impact of data partitioning strategies
2011
High-Performance Computing in Remote Sensing
This aspect is crucial for the RX implementation since the consideration of a local or global strategy for the computation of the sample covariance matrix is expected to affect both the scalability of ...
Anomaly detection is an important task for remotely sensed hyperspectral data exploitation. ...
Funding from the Spanish Ministry of Science and Innovation (HYPERCOMP/EODIX project, reference AYA2008-05965-C04-02) and from Junta de Extremadura (PRI09A110 and GR10035 projects) are also very gratefully ...
doi:10.1117/12.897388
fatcat:bktzh4f3rjhqjky6lx4rip2kzq
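The global strategy mentioned above scores each pixel by its Mahalanobis distance from the scene-wide mean under the sample covariance. A toy two-band sketch (the pixel values and band count are invented; real hyperspectral scenes have hundreds of bands, and inverting the covariance dominates the cost):

```python
def rx_scores(pixels):
    """Global-covariance RX: one Mahalanobis-distance score per 2-band pixel."""
    n = len(pixels)
    mean = [sum(p[k] for p in pixels) / n for k in (0, 1)]
    # 2x2 sample covariance of the whole scene (the "global" strategy)
    c = [[0.0, 0.0], [0.0, 0.0]]
    for p in pixels:
        d = [p[0] - mean[0], p[1] - mean[1]]
        for i in (0, 1):
            for j in (0, 1):
                c[i][j] += d[i] * d[j] / n
    det = c[0][0] * c[1][1] - c[0][1] * c[1][0]
    inv = [[c[1][1] / det, -c[0][1] / det],
           [-c[1][0] / det, c[0][0] / det]]
    scores = []
    for p in pixels:
        d = [p[0] - mean[0], p[1] - mean[1]]
        scores.append(sum(d[i] * inv[i][j] * d[j]
                          for i in (0, 1) for j in (0, 1)))
    return scores

pixels = [(1.0, 1.0), (1.1, 0.9), (0.9, 1.1), (1.0, 0.8), (5.0, 5.0)]
scores = rx_scores(pixels)
print(max(range(len(scores)), key=scores.__getitem__))  # index of the outlier pixel
```

A local strategy would recompute `mean` and `c` over a sliding window instead of the whole image, which is exactly the partitioning trade-off the paper studies.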
Exploiting Parallelism by Data Dependency Elimination: A Case Study of Circuit Simulation Algorithms
2013
IEEE design & test
As a result, the algorithms for circuit simulation cannot be effectively parallelized by simply unfolding "for" loops into parallel code. ...
For example, the bordered-block-diagonal (BBD) matrix formulation is deployed for the sparse MNA matrix with inverse inductance [4]; the fast-multipole-method (FMM) formulation is deployed for capacitance extraction ...
The basic idea of parallelizing Algorithm 1 is to unfold the N tasks (iterations) in the outer for loop. However, strong dependencies can be identified among these tasks. ...
doi:10.1109/mdt.2012.2226201
fatcat:ff5h4qyj45cinbloyhupsnf3ua
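The snippet above notes that loop unfolding only works when iterations are independent. A minimal sketch of the distinction, with stand-in task bodies rather than the paper's circuit-simulation code:

```python
from concurrent.futures import ThreadPoolExecutor

def serial_with_dependency(n):
    """Each iteration consumes the previous result: cannot be unfolded."""
    state = 1
    for _ in range(n):
        state = (state * 3) % 1000003
    return state

def parallel_independent(values):
    """Iterations touch disjoint data, so unfolding into tasks is safe."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(lambda v: v * v, values))

print(serial_with_dependency(5))         # 243
print(parallel_independent([1, 2, 3, 4]))  # [1, 4, 9, 16]
```

The paper's contribution is eliminating the data dependency (e.g. via reformulation) so the first kind of loop can be turned into the second.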
A Concurrent Object-Oriented Approach to the Eigenproblem Treatment in Shared Memory Multicore Environments
[chapter]
2011
Lecture Notes in Computer Science
We also find that a reasonable upper limit for a "small" dense matrix to be treated in actual processors is in the interval 10000-30000. ...
This work presents an object-oriented approach to the concurrent computation of eigenvalues and eigenvectors in real symmetric and Hermitian matrices on present shared-memory multicore systems. ...
This work has been co-financed by FEDER funds and the Consejería de Educación y Ciencia de la Junta de Comunidades de Castilla-La Mancha (grant # PBI08-0008). ...
doi:10.1007/978-3-642-21928-3_46
fatcat:okmfdivovzefveuij4blh3xaq4
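For readers unfamiliar with the eigenproblem kernel involved, a minimal power-iteration sketch for the dominant eigenpair of a real symmetric matrix. The cited work uses full eigensolvers on large dense matrices (it places the practical "small matrix" limit at 10000-30000); this toy only shows the basic building block, repeated matrix-vector products with normalization:

```python
def power_iteration(A, iters=200):
    """Dominant eigenvalue (by magnitude) and eigenvector of a symmetric A."""
    n = len(A)
    v = [1.0] * n
    lam = 0.0
    for _ in range(iters):
        w = [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = max(abs(x) for x in w)   # infinity-norm normalization
        v = [x / lam for x in w]
    return lam, v

A = [[2.0, 1.0], [1.0, 2.0]]
lam, v = power_iteration(A)
print(round(lam, 6))  # dominant eigenvalue of [[2,1],[1,2]] is 3
```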
Design of the H264 application and Implementation on Heterogeneous Architectures
2017
International Journal of Computer Applications
Our approach is based on hybrid partitioning that combines both functional and data partitioning which is applied to find the most suitable processors (CPU or GPU) regarding the execution time. ...
While the new encoding and decoding processes are similar to many previous standards, the new standard includes a number of new features and thus requires much more computation than most existing standards ...
The inverse prediction task can be done in parallel with CAVLD, inverse quantization and inverse transform (task parallelism). ...
doi:10.5120/ijca2017916056
fatcat:ciluiuzu55gctnd23iifvqagqy
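The task-parallelism pattern named above, mutually independent decode stages running concurrently and then joined, can be sketched as follows. The stage bodies are placeholders, not H.264 code, and the element-wise join is invented for illustration:

```python
from concurrent.futures import ThreadPoolExecutor

def inverse_prediction(block):    # placeholder stage
    return [x + 1 for x in block]

def inverse_quantization(block):  # placeholder stage
    return [x * 2 for x in block]

def inverse_transform(block):     # placeholder stage
    return [x - 1 for x in block]

def decode_block(block):
    """Run independent stages as concurrent tasks, then join their results."""
    with ThreadPoolExecutor(max_workers=3) as pool:
        futures = [pool.submit(stage, block)
                   for stage in (inverse_prediction,
                                 inverse_quantization,
                                 inverse_transform)]
        a, b, c = (f.result() for f in futures)
    # join: combine the three independent stage outputs element-wise
    return [x + y + z for x, y, z in zip(a, b, c)]

print(decode_block([1, 2, 3]))  # [4, 8, 12]
```

The hybrid partitioning the paper proposes would additionally decide, per stage, whether the CPU or the GPU executes it, based on measured execution time.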
Performance evaluation of R with Intel Xeon Phi coprocessor
2013
2013 IEEE International Conference on Big Data
The performance gains through parallelization increase as the data size increases, a promising result for adopting R for big data problems in the future. ...
There are up to five times speedup gains from using MKL with 16 cores, without modification to the existing code, for certain computing tasks. ...
Those tasks include cross product between matrices, linear regression, matrix decomposition, computing inverse and determinant of a matrix. ...
doi:10.1109/bigdata.2013.6691695
dblp:conf/bigdataconf/KhamraGWWXZ13
fatcat:vansjz4x5bddfnkae46xh4n7py
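The tasks listed above (cross products, inverses, determinants) are exactly the dense kernels that a tuned BLAS/LAPACK such as MKL accelerates behind R's `crossprod`, `solve` and `det`. Two of them in pure Python, for illustration only; real workloads should call an optimized library:

```python
def crossprod(X):
    """t(X) %*% X, as in R's crossprod()."""
    rows, cols = len(X), len(X[0])
    return [[sum(X[k][i] * X[k][j] for k in range(rows))
             for j in range(cols)] for i in range(cols)]

def det2(A):
    """Determinant of a 2x2 matrix."""
    return A[0][0] * A[1][1] - A[0][1] * A[1][0]

X = [[1.0, 2.0], [3.0, 4.0]]
print(crossprod(X))  # [[10.0, 14.0], [14.0, 20.0]]
print(det2(X))       # -2.0
```

The "without modification" result in the abstract holds precisely because R dispatches these operations to whatever BLAS it was linked against.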
Parallel Programming Models for Dense Linear Algebra on Heterogeneous Systems
2016
Supercomputing Frontiers and Innovations
Of interest is the evolution of the programming models for DLA libraries, in particular, the evolution from the popular LAPACK and ScaLAPACK libraries to their modernized counterparts PLASMA (for multicore ...
hardware trends and ease of programming high-performance numerical software that current applications need, in order to motivate work and future directions for the next generation of parallel programming ...
This paper is distributed under the terms of the Creative Commons Attribution-Non Commercial 3.0 License which permits non-commercial use, reproduction and distribution of the work without further permission ...
doi:10.14529/jsfi150405
fatcat:avnmwu4dozdmjksknrlznhpv7u
LU factorization for accelerator-based systems
2011
2011 9th IEEE/ACS International Conference on Computer Systems and Applications (AICCSA)
the kernels for two different machines composed of multiple recent NVIDIA Tesla S1070 (four GPUs total) and Fermi-based S2050 GPUs (three GPUs total), respectively. ...
Multicore architectures enhanced with multiple GPUs are likely to become mainstream High Performance Computing (HPC) platforms in the near future. ...
CONCLUSION We have presented the design and implementation of a new hybrid algorithm for performing the tile LU factorization on a multicore node enhanced with multiple GPUs. ...
doi:10.1109/aiccsa.2011.6126599
dblp:conf/aiccsa/AgulloADFLLT11
fatcat:d4ekr755wncsng4kkntrmgrbva
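A serial, unblocked LU sketch (Doolittle, no pivoting) showing what a tile LU factorization computes, A = L*U; the tile algorithm applies these same updates block by block so that independent tile updates can be scheduled on different cores or GPUs. This is a didactic sketch, not the paper's hybrid code:

```python
def lu(A):
    """In-place style LU without pivoting: returns unit-lower L and upper U."""
    n = len(A)
    L = [[float(i == j) for j in range(n)] for i in range(n)]
    U = [row[:] for row in A]
    for k in range(n):                      # panel/column k
        for i in range(k + 1, n):
            L[i][k] = U[i][k] / U[k][k]     # multiplier
            for j in range(k, n):
                U[i][j] -= L[i][k] * U[k][j]  # trailing update
    return L, U

A = [[4.0, 3.0], [6.0, 3.0]]
L, U = lu(A)
print(L)  # [[1.0, 0.0], [1.5, 1.0]]
print(U)  # [[4.0, 3.0], [0.0, -1.5]]
```

In the tile formulation the "multiplier" and "trailing update" steps become triangular-solve and matrix-multiply tasks on tiles, which is what makes GPU offload effective.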
Parallelization of DQMC simulation for strongly correlated electron systems
2010
2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS)
From coarse-grained parallel Markov chain and task decompositions to fine-grained parallelization methods for matrix computations and Green's function calculations, the HGP scheme explores the parallelism ...
We extend previous work with novelty by presenting a hybrid granularity parallelization (HGP) scheme that combines algorithmic and implementation techniques to speed up the DQMC simulation. ...
The first author would like to acknowledge National Science Council of Taiwan for the support under the grant NSC98-2218-E-007-006-MY3, and National Center for Highperformance Computing for using the computing ...
doi:10.1109/ipdps.2010.5470484
dblp:conf/ipps/LeeCB10
fatcat:b6mrhqxbbvedjggrdfraytt6j4
Alya: Computational Solid Mechanics for Supercomputers
2014
Archives of Computational Methods in Engineering
Hybrid parallelization exploits the thread-level parallelism of multicore architectures, combining MPI tasks with OpenMP threads. ...
Hybrid parallelization is specially well suited for the current trend of supercomputers, namely large clusters of multicores. ...
The domain decomposition strategy implemented only uses parallelism at the task level, which is provided by MPI. ...
doi:10.1007/s11831-014-9126-8
fatcat:ee43vkfeizgsha4tzo2kbizmqe
Modern multicore and manycore architectures: Modelling, optimisation and benchmarking a multiblock CFD code
2016
Computer Physics Communications
Modern multicore and manycore processors exhibit multiple levels of parallelism through a wide range of architectural features such as SIMD for data parallel execution or threads for core parallelism. ...
The exploitation of multi-level parallelism is therefore crucial for achieving superior performance on current and future processors. ...
The increase in performance for the intrinsics version on the multicore processors is due to the manual inner loop unrolling when assembling fluxes which allows for more efficient instruction level parallelism ...
doi:10.1016/j.cpc.2016.04.006
fatcat:ndhpdqbuonhwrhcfnumglj5sge
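The "manual inner loop unrolling" credited above works by keeping several independent accumulators live so the processor can overlap operations. In C this exposes instruction-level parallelism; the Python version below only illustrates the transformation itself (same result, restructured loop), since CPython will not see the ILP benefit:

```python
def dot_rolled(a, b):
    s = 0.0
    for i in range(len(a)):
        s += a[i] * b[i]          # one serial dependency chain
    return s

def dot_unrolled4(a, b):
    s0 = s1 = s2 = s3 = 0.0
    n4 = len(a) - len(a) % 4
    for i in range(0, n4, 4):     # four independent accumulator chains
        s0 += a[i] * b[i]
        s1 += a[i + 1] * b[i + 1]
        s2 += a[i + 2] * b[i + 2]
        s3 += a[i + 3] * b[i + 3]
    for i in range(n4, len(a)):   # remainder loop for leftover elements
        s0 += a[i] * b[i]
    return s0 + s1 + s2 + s3

a = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [5.0, 4.0, 3.0, 2.0, 1.0]
print(dot_rolled(a, b), dot_unrolled4(a, b))  # 35.0 35.0
```

Note that splitting the accumulator changes floating-point summation order, so rolled and unrolled versions may differ in the last bits on general data, a trade-off compilers will not make without relaxed-math flags.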
A survey of power and energy efficient techniques for high performance numerical linear algebra operations
2014
Parallel Computing
, and summarize state-of-the-art techniques for achieving power and energy efficiency in each category individually. ...
We summarize commonly deployed power management techniques for reducing power and energy consumption in high performance computing systems by presenting power and energy models and two fundamental types ...
[90] investigated the trade-off between execution time and energy costs of task-parallel Cholesky and LU factorizations on a hybrid CPU-GPU platform. Anzt et al. ...
doi:10.1016/j.parco.2014.09.001
fatcat:twdkr2hrizebvglto6dwd7jqem
A Fast Selected Inversion Algorithm for Green's Function Calculation in Many-Body Quantum Monte Carlo Simulations
2016
2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
In this paper, we describe a fast selected inversion (FSI) algorithm for computing selected entries of Green's functions and present a parallel implementation using hybrid MPI/OpenMP programming. ...
factorization; (3) using the block entries of the inverse of the reduced block pcyclic matrix as seeds to rapidly form the selected inversion in parallel. ...
To take advantage of both distributed-memory and multicore shared-memory architectures, it is natural to employ hybrid MPI/OpenMP parallelism that uses MPI for message passing and OpenMP for frequently ...
doi:10.1109/ipdps.2016.69
dblp:conf/ipps/JiangBS16
fatcat:gfowjar25vec5iilcxmjlybg4u
Weighted dynamic scheduling with many parallelism grains for offloading of numerical workloads to multiple varied accelerators
2015
Proceedings of the 6th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems - ScalA '15
For example, in order to use GPUs effectively, the workload must have a greater degree of parallelism than a workload designed for a multicore-CPU. ...
We propose a productive programming model starting from serial code, which achieves parallelism and scalability by using a task-superscalar runtime environment to adapt the computation to the available ...
the Department of Energy, and the NVIDIA and Intel Corporations. ...
doi:10.1145/2832080.2832085
dblp:conf/sc/HaidarJLTYD15
fatcat:ppxzxzbmyvc4rjc6qiarh4kaly
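A minimal sketch of weighted scheduling across varied "devices": each task goes to the device with the earliest projected finish time given that device's speed weight. The device names and weights are invented, and the cited runtime additionally varies the parallelism grain per device, which this toy omits:

```python
def schedule(task_costs, device_weights):
    """Greedy earliest-finish-time assignment. Returns {task: device}."""
    loads = [0.0] * len(device_weights)
    assignment = {}
    for t, cost in enumerate(task_costs):
        # projected finish time on each device: current load + weighted cost
        finish = [loads[d] + cost / device_weights[d]
                  for d in range(len(device_weights))]
        d = min(range(len(finish)), key=finish.__getitem__)
        assignment[t] = d
        loads[d] = finish[d]
    return assignment

# device 0: CPU (weight 1.0), device 1: GPU (4x faster)
print(schedule([1.0] * 5, [1.0, 4.0]))  # most tasks land on the faster device
```

With five equal tasks and a 4x-faster second device, the greedy rule sends four tasks to the GPU and one to the CPU, which is the load balance a weight-oblivious round-robin would miss.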
Scalable NUMA-Aware Wilson-Dirac on Supercomputers
2017
2017 International Conference on High Performance Computing & Simulation (HPCS)
Designing efficient LQCD codes on modern (mostly hybrid) supercomputers requires efficiently exploiting all available levels of parallelism, including accelerators. ...
We reach nearly optimal performances on a single core and a significant scalability improvement on several NUMA nodes. ...
Thanks to Christine Eisenbeis from INRIA for our regular discussions about LQCD implementations, and to my PhD student Adilla Susungi for the same about NUMA considerations. ...
doi:10.1109/hpcs.2017.56
dblp:conf/ieeehpcs/Tadonki17
fatcat:5t57tywdsnhunpxwytullhkeia
Showing results 1 — 15 out of 571 results