The Model Coupling Toolkit [chapter]

Robert Jacob, Jay Larson
2011 Earth System Modelling - Volume 3  
Regridding is implemented as sparse matrix-vector multiplication with matrix elements computed off-line Layered Design: Listed from Lowest to Highest Level: Vendor utilities (MPI, BLAS, shared  ...  the coupler is only shared-memory parallel US Climate Researchers have faced problems exploiting microprocessor- based parallel systems to achieve high performance for their applications This situation  ... 
doi:10.1007/978-3-642-23360-9_3 fatcat:4jwqxtmnabcljbsbymdkironr4

Selecting Multiple Order Statistics with a Graphics Processing Unit

Jeffrey D. Blanchard, Erik Opavsky, Emircan Uysaler
2016 ACM Transactions on Parallel Computing  
For large vectors, bucketMultiSelect returns thousands of order statistics in less time than sorting the vector while typically using less memory.  ...  When the sorted version of the vector is not needed, bucketMultiSelect significantly reduces computation time by eliminating a large portion of the unnecessary operations involved in sorting.  ...  Subroutine 2 then assigns the values in the vector to an appropriate bucket while fully utilizing shared memory and the independence of the computational, linear assignment for a full parallelization.  ... 
doi:10.1145/2948974 fatcat:3aieivr5gnhenmzjei4udky5au

A high-performance sorting algorithm for multicore single-instruction multiple-data processors

Hiroshi Inoue, Takao Moriyama, Hideaki Komatsu, Toshio Nakatani
2011 Software, Practice & Experience  
on PowerPC 970MP when sorting 32 million random 32-bit integers.  ...  The computational complexity for both the combsort and our vectorized combsort is O(N log N) on average, and O(N^2) in the worst case when sorting N elements.  ...  For example, the in-core sorting phase still shares more than 40% of the total computation time for sorting 8 billion 32-bit integers, or 32 GB. out-of-core merging phase increased with increasing numbers  ... 
doi:10.1002/spe.1102 fatcat:nddnypczinhpfjwybmkgkmtx6i

AA-Sort: A New Parallel Sorting Algorithm for Multi-Core SIMD Processors

Hiroshi Inoue, Takao Moriyama, Hideaki Komatsu, Toshio Nakatani
2007 Parallel Architecture and Compilation Techniques (PACT), Proceedings of the International Conference on  
In this paper, we propose a new parallel sorting algorithm, called Aligned-Access sort (AA-sort), for shared-memory multiprocessors. The AA-sort algorithm takes advantage of SIMD instructions.  ...  970MP when sorting 32 M of random 32-bit integers.  ...  Step 1 sorts four values in each vector integer va[i].  ... 
doi:10.1109/pact.2007.4336211 fatcat:m27r5taoajec3fbgkryac4cf3y

A simple, fast parallel implementation of Quicksort and its performance evaluation on SUN Enterprise 10000

P. Tsigas, Yi Zhang
2003 Eleventh Euromicro Conference on Parallel, Distributed and Network-Based Processing, 2003. Proceedings.  
This paper looks into the behavior of a simple, fine-grain parallel extension of Quicksort for cache-coherent shared address space multiprocessors.  ...  Quicksort has many nice properties: i) it is fast and general purpose; it is widely believed that Quicksort is the fastest general-purpose sorting algorithm, on average, and for a large number of elements  ...  We thank Carl Hallen and Andy Polyakov from our Supercomputing Center, for their help on our inconvenient requests for exclusive use.  ... 
doi:10.1109/empdp.2003.1183613 dblp:conf/pdp/TsigasZ03 fatcat:m4rvobv6dbe7fhjahahq32cxzq

Parallel out-of-core sorting and fast accesses to disks

Christophe Cerin, Olivier Cozette, Gil Utard, Hazem Fkaier, Mohamed Jemni
2005 International Journal of High Performance Computing and Networking  
Keywords: out of core, parallel sorting algorithms; performance evaluation and modelling of parallel integer sorting algorithms; sorting by regular sampling and by over partitioning; data distribution;  ...  We derive a new parallel sorting algorithm that is adapted to the READ 2 interface. The expected gain of using READ 2 is compared to the measured gain for one external sorting implementation.  ...  ., 1998) or some Virtual Shared Memory implementation based on remote DMA, allow nodes to share memory.  ... 
doi:10.1504/ijhpcn.2005.008035 fatcat:bbdjubqlarbydberor56zvetme

Scan Primitives for GPU Computing [article]

Shubhabrata Sengupta, Mark Harris, Yao Zhang, John D. Owens
2007 Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware - HWWS '04  
Using the scan primitives, we show novel GPU implementations of quicksort and sparse matrix-vector multiply, and analyze the performance of the scan primitives, several sort algorithms that use the scan  ...  The scan primitives are powerful, general-purpose data-parallel primitives that are building blocks for a broad range of applications.  ...  For a sort of 4M 32-bit integers, runtimes follow.  ... 
doi:10.2312/eggh/eggh07/097-106 fatcat:zbhoiatqsfazzdizmjs7yrpuku

Parallelization of GSL: Architecture, Interfaces, and Programming Models [chapter]

J. Aliaga, F. Almeida, J. M. Badía, S. Barrachina, V. Blanco, M. Castillo, U. Dorta, R. Mayo, E. S. Quintana, G. Quintana, C. Rodríguez, F. de Sande
2004 Lecture Notes in Computer Science  
In this paper we present our efforts towards the design and development of a parallel version of the Scientific Library from GNU using MPI and OpenMP.  ...  Our approach, though being a general high-level proposal, achieves for these two particular examples a performance close to that obtained by an ad hoc parallel programming implementation.  ...  User Level Programming Model Level Sequential Shared-memory Distributed-memory Hybrid fscanf() fscanf() gsl dmrd fscanf() gsl hs fscanf() gsl dmdd fscanf() gsl sort vector() gsl sm sort vector  ... 
doi:10.1007/978-3-540-30218-6_31 fatcat:eipkmkckzvbfzeubqno5x6t5iu

Accelerating Iterative SpMV for the Discrete Logarithm Problem Using GPUs [chapter]

Hamza Jeljeli
2015 Lecture Notes in Computer Science  
This central operation can be accelerated on GPUs using specific computing models and addressing patterns, which increase the arithmetic intensity while reducing irregular memory accesses.  ...  In this work, we investigate the implementation of SpMV kernels on NVIDIA GPUs, for several representations of the sparse matrix in memory.  ...  Each thread computes its partial result in shared memory, then a parallel reduction in shared memory is required to combine the per-thread results (denoted reduction_csr_v() in Algorithm 2).  ... 
doi:10.1007/978-3-319-16277-5_2 fatcat:442ccuiv2baqhjkfqsygdc67pm

Accelerating Iterative SpMV for Discrete Logarithm Problem Using GPUs [article]

Hamza Jeljeli
2014 arXiv   pre-print
This central operation can be accelerated on GPUs using specific computing models and addressing patterns, which increase the arithmetic intensity while reducing irregular memory accesses.  ...  In this work, we investigate the implementation of SpMV kernels on NVIDIA GPUs, for several representations of the sparse matrix in memory.  ...  Each thread computes its partial result in shared memory, then a parallel reduction in shared memory is required to combine the per-thread results (denoted reduction_csr_v() in Algorithm 2).  ... 
arXiv:1209.5520v4 fatcat:yucl3rzthfarxlbfxuhwz4tcdm

Bind: a Partitioned Global Workflow Parallel Programming Model [article]

Alex Kosenkov, Matthias Troyer
2016 arXiv   pre-print
computing software targeting heterogeneous distributed many-core architectures.  ...  High Performance Computing is notorious for its long and expensive software development cycle.  ...  Sorting integers using Bind's MapReduce KVPairs<int, doc_type>(local_map) .map([](int k, std::vector<doc_type>& docs) -> std::vector<std::pair<key_type, value_type>> { std::vector<std::pair<int, value_type  ... 
arXiv:1606.04830v1 fatcat:6knujtlce5amnczddui2jysiua

An orthogonal multiprocessor for parallel scientific computations

K. Hwang, P.-S. Tseng, D. Kim
1989 IEEE transactions on computers  
This OMP architecture has a simplified busing structure and partially shared memory, which compares very favorably over fully shared-memory multiprocessors using crossbar switch, multiple buses, or multistage  ...  Parallel algorithms being mapped include matrix arithmetic, linear system solver, FFT, array sorting, linear programming, and parallel PDE solutions.  ...  Parallel sorting methods are also implementable on a mesh-connected computer (MCC) that has n^2 processors. The MCC has a time complexity of O(n) in sorting n^2 numbers [29], [37].  ... 
doi:10.1109/12.8729 fatcat:sjzxxz4z2rdbpf4ulnwvo3ea4y

A synthesis of parallel out-of-core sorting programs on heterogeneous clusters

C. Cerin, H. Fkaier
2003 CCGrid 2003. 3rd IEEE/ACM International Symposium on Cluster Computing and the Grid, 2003. Proceedings.  
Since most common sort algorithms assume high-speed random access to all intermediate memory, they are unsuitable if the values to be sorted don't fit in main memory.  ...  The paper considers the problem of parallel external sorting in the context of a form of heterogeneous clusters.  ...  External Parallel Sorting Parallel sorting algorithms under the framework of out-of-core computation is not new.  ... 
doi:10.1109/ccgrid.2003.1199355 dblp:conf/ccgrid/CerinFJ03 fatcat:hmto5nws4naszoxnc44pfoghtu

Evaluating the Impact of Programming Language Features on the Performance of Parallel Applications on Cluster Architectures [chapter]

Konstantin Berlin, Jun Huan, Mary Jacob, Garima Kochhar, Jan Prins, Bill Pugh, P. Sadayappan, Jaime Spacco, Chau-Wen Tseng
2004 Lecture Notes in Computer Science  
We compare a number of programming languages (Pthreads, OpenMP, MPI, UPC, Global Arrays) on both shared and distributed-memory architectures.  ...  We evaluate the impact of programming language features on the performance of parallel applications on modern parallel architectures, particularly for the demanding case of sparse integer codes.  ...  Our conclusion is that parallel applications requiring fine-grain accesses achieve poor performance on clusters regardless of the programming paradigm or language feature used, because the amount of inherent  ... 
doi:10.1007/978-3-540-24644-2_13 fatcat:js24djykkfhohk2gmc2m4dmbdu

SIMD- and cache-friendly algorithm for sorting an array of structures

Hiroshi Inoue, Kenjiro Taura
2015 Proceedings of the VLDB Endowment  
Recently, multiway mergesort implemented with SIMD instructions has been used as a high-performance in-memory sorting algorithm for sorting integer values.  ...  , then rearrange the records based on the sorted key-index pairs.  ...  We also describe our techniques to increase data parallelism within one SIMD instruction by using 32-bit integers as the intermediate integers instead of using 64-bit integers.  ... 
doi:10.14778/2809974.2809988 fatcat:zdhoyg6v45edbcthbe7x7kmrjq