### The Model Coupling Toolkit
[chapter]

2011
*Earth System Modelling - Volume 3*

Regridding is implemented as sparse matrix-vector multiplication with matrix elements computed off-line Layered Design: Listed from Lowest to Highest Level: Vendor utilities (MPI, BLAS, shared ... the coupler is only shared-memory parallel US Climate Researchers have faced problems exploiting microprocessor-based parallel systems to achieve high performance for their applications This situation ...

doi:10.1007/978-3-642-23360-9_3
fatcat:4jwqxtmnabcljbsbymdkironr4

### Selecting Multiple Order Statistics with a Graphics Processing Unit

2016
*ACM Transactions on Parallel Computing*

For large vectors, bucketMultiSelect returns thousands of order statistics in less time than sorting the vector while typically using less memory. ... When the sorted version of the vector is not needed, bucketMultiSelect significantly reduces computation time by eliminating a large portion of the unnecessary operations involved in sorting. ... Subroutine 2 then assigns the values in the vector to an appropriate bucket while fully utilizing shared memory and the independence of the computational, linear assignment for a full parallelization. ...

doi:10.1145/2948974
fatcat:3aieivr5gnhenmzjei4udky5au

### A high-performance sorting algorithm for multicore single-instruction multiple-data processors

2011
*Software, Practice & Experience*

on PowerPC 970MP when sorting 32 million random 32-bit integers. ... The computational complexity for both the combsort and our vectorized combsort is $O(N \log N)$ on average, and $O(N^2)$ in the worst case when sorting N elements. ... For example, the in-core sorting phase still shares more than 40% of the total computation time for sorting 8 billion 32-bit integers, or 32 GB. out-of-core merging phase increased with increasing numbers ...
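
The snippet above quotes complexity bounds for combsort. As a point of reference only, here is a minimal scalar comb sort in Python. It is not the vectorized variant from the paper; the function name `comb_sort` and the shrink factor of 1.3 are assumptions chosen for illustration.

```python
def comb_sort(values):
    """Return a sorted copy of `values` using a plain (scalar) comb sort."""
    data = list(values)
    gap = len(data)
    shrink = 1.3  # assumed shrink factor; a commonly used value, not taken from the paper
    swapped = True
    # Keep passing over the data until the gap has shrunk to 1
    # and a full pass completes without any swap.
    while gap > 1 or swapped:
        gap = max(1, int(gap / shrink))
        swapped = False
        for i in range(len(data) - gap):
            # Compare elements that are `gap` positions apart.
            if data[i] > data[i + gap]:
                data[i], data[i + gap] = data[i + gap], data[i]
                swapped = True
    return data


if __name__ == "__main__":
    print(comb_sort([5, 3, 8, 1, 9, 2]))
```

Once the gap reaches 1 the loop behaves like bubble sort, so it stops only after a pass with no swaps, which guarantees a sorted result.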

### AA-Sort: A New Parallel Sorting Algorithm for Multi-Core SIMD Processors

2007
*Parallel Architecture and Compilation Techniques (PACT), Proceedings of the International Conference on*

In this paper, we propose a new parallel sorting algorithm, called Aligned-Access sort (AA-sort), for shared-memory multiprocessors. The AA-sort algorithm takes advantage of SIMD instructions. ... 970MP when sorting 32 M of random 32-bit integers. ... Step 1 sorts four values in each vector integer va[i]. ...

doi:10.1109/pact.2007.4336211
fatcat:m27r5taoajec3fbgkryac4cf3y

### A simple, fast parallel implementation of Quicksort and its performance evaluation on SUN Enterprise 10000

2003
*Eleventh Euromicro Conference on Parallel, Distributed and Network-Based Processing, 2003. Proceedings.*

This paper looks into the behavior of a simple, fine-grain parallel extension of Quicksort for cache-coherent shared address space multiprocessors. ... Quicksort has many nice properties: i) it is fast and general purpose; it is widely believed that Quicksort is the fastest general-purpose sorting algorithm, on average, and for a large number of elements ... We thank Carl Hallen and Andy Polyakov from our Supercomputing Center, for their help on our inconvenient requests for exclusive use. ...

doi:10.1109/empdp.2003.1183613
dblp:conf/pdp/TsigasZ03
fatcat:m4rvobv6dbe7fhjahahq32cxzq

### Parallel out-of-core sorting and fast accesses to disks

2005
*International Journal of High Performance Computing and Networking*

Keywords: out of core, parallel sorting algorithms; performance evaluation and modelling of parallel integer sorting algorithms; sorting by regular sampling and by over partitioning; data distribution; ... We derive a new parallel sorting algorithm that is adapted to the READ 2 interface. The expected gain of using READ 2 is compared to the measured gain for one external sorting implementation. ... ., 1998) or some Virtual Shared Memory implementation based on remote DMA, allow nodes to share memory. ...

doi:10.1504/ijhpcn.2005.008035
fatcat:bbdjubqlarbydberor56zvetme

### Scan Primitives for GPU Computing
[article]

2007
*Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware - HWWS '04*

Using the scan primitives, we show novel GPU implementations of quicksort and sparse matrix-vector multiply, and analyze the performance of the scan primitives, several sort algorithms that use the scan ... The scan primitives are powerful, general-purpose data-parallel primitives that are building blocks for a broad range of applications. ... For a sort of 4M 32-bit integers, runtimes follow. ...

doi:10.2312/eggh/eggh07/097-106
fatcat:zbhoiatqsfazzdizmjs7yrpuku

### Parallelization of GSL: Architecture, Interfaces, and Programming Models
[chapter]

2004
*Lecture Notes in Computer Science*

In this paper we present our efforts towards the design and development of a parallel version of the Scientific Library from GNU using MPI and OpenMP. ... Our approach, though being a general high-level proposal, achieves for these two particular examples a performance close to that obtained by an ad hoc parallel programming implementation. ... User Level Programming Model Level Sequential Shared-memory Distributed-memory Hybrid fscanf() fscanf() gsl dmrd fscanf() gsl hs fscanf() gsl dmdd fscanf() gsl sort vector() gsl sm sort vector ...

doi:10.1007/978-3-540-30218-6_31
fatcat:eipkmkckzvbfzeubqno5x6t5iu

### Accelerating Iterative SpMV for the Discrete Logarithm Problem Using GPUs
[chapter]

2015
*Lecture Notes in Computer Science*

This central operation can be accelerated on GPUs using specific computing models and addressing patterns, which increase the arithmetic intensity while reducing irregular memory accesses. ... In this work, we investigate the implementation of SpMV kernels on NVIDIA GPUs, for several representations of the sparse matrix in memory. ... Each thread computes its partial result in shared memory, then a parallel reduction in shared memory is required to combine the per-thread results (denoted reduction_csr_v() in Algorithm 2). ...

doi:10.1007/978-3-319-16277-5_2
fatcat:442ccuiv2baqhjkfqsygdc67pm

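
The snippet above describes per-thread partial results that are combined by a parallel reduction in shared memory. The following Python sketch emulates that pattern sequentially for a matrix in CSR format. It is an illustration under stated assumptions, not the paper's kernel: the function name `spmv_csr_vector`, the `threads_per_row` parameter, and the use of floating-point arithmetic are all hypothetical choices, and the source's `reduction_csr_v()` is not reproduced here.

```python
def spmv_csr_vector(row_ptr, col_idx, vals, x, threads_per_row=4):
    """Compute y = A @ x for a CSR matrix A, emulating one thread group per row.

    `threads_per_row` is a hypothetical parameter and must be a power of two
    so that the tree reduction below covers every partial sum.
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for row in range(n_rows):
        start, end = row_ptr[row], row_ptr[row + 1]
        # Stand-in for shared memory: one partial sum per emulated thread.
        partial = [0.0] * threads_per_row
        for t in range(threads_per_row):
            # Thread t handles nonzeros start+t, start+t+threads_per_row, ...
            for k in range(start + t, end, threads_per_row):
                partial[t] += vals[k] * x[col_idx[k]]
        # Tree reduction: halve the number of active threads at each step.
        stride = threads_per_row // 2
        while stride > 0:
            for t in range(stride):
                partial[t] += partial[t + stride]
            stride //= 2
        y[row] = partial[0]
    return y


if __name__ == "__main__":
    # CSR form of [[1, 0, 2], [0, 3, 0], [4, 5, 6]]
    row_ptr = [0, 2, 3, 6]
    col_idx = [0, 2, 1, 0, 1, 2]
    vals = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
    print(spmv_csr_vector(row_ptr, col_idx, vals, [1.0, 2.0, 3.0]))
```

On a GPU the inner loops over `t` would run concurrently, with a synchronization between reduction steps; the sequential loops here only show the data flow.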
### Accelerating Iterative SpMV for Discrete Logarithm Problem Using GPUs
[article]

2014
*arXiv*
pre-print

This central operation can be accelerated on GPUs using specific computing models and addressing patterns, which increase the arithmetic intensity while reducing irregular memory accesses. ... In this work, we investigate the implementation of SpMV kernels on NVIDIA GPUs, for several representations of the sparse matrix in memory. ... Each thread computes its partial result in shared memory, then a parallel reduction in shared memory is required to combine the per-thread results (denoted reduction_csr_v() in Algorithm 2). ...

arXiv:1209.5520v4
fatcat:yucl3rzthfarxlbfxuhwz4tcdm

### Bind: a Partitioned Global Workflow Parallel Programming Model
[article]

2016
*arXiv*
pre-print

computing software targeting heterogeneous distributed many-core architectures. ... High Performance Computing is notorious for its long and expensive software development cycle. ... Sorting integers using Bind's MapReduce `KVPairs<int, doc_type>(local_map).map([](int k, std::vector<doc_type>& docs) -> std::vector<std::pair<key_type, value_type>> { std::vector<std::pair<int, value_type` ...

### An orthogonal multiprocessor for parallel scientific computations

1989
*IEEE transactions on computers*

This OMP architecture has a simplified busing structure and partially shared memory, which compares very favorably over fully shared-memory multiprocessors using crossbar switch, multiple buses, or multistage ... Parallel algorithms being mapped include matrix arithmetic, linear system solver, FFT, array sorting, linear programming, and parallel PDE solutions. ... Parallel sorting methods are also implementable on a mesh-connected computer (MCC), that has $n^2$ processors. The MCC has a time complexity of $O(n)$ in sorting $n^2$ numbers [29], [37]. ...

doi:10.1109/12.8729
fatcat:sjzxxz4z2rdbpf4ulnwvo3ea4y

### A synthesis of parallel out-of-core sorting programs on heterogeneous clusters

2003
*CCGrid 2003. 3rd IEEE/ACM International Symposium on Cluster Computing and the Grid, 2003. Proceedings.*

Since most common sort algorithms assume high-speed random access to all intermediate memory, they are unsuitable if the values to be sorted don't fit in main memory. ... The paper considers the problem of parallel external sorting in the context of a form of heterogeneous clusters. ... External Parallel Sorting Parallel sorting algorithms under the framework of out-of-core computation is not new. ...

doi:10.1109/ccgrid.2003.1199355
dblp:conf/ccgrid/CerinFJ03
fatcat:hmto5nws4naszoxnc44pfoghtu

### Evaluating the Impact of Programming Language Features on the Performance of Parallel Applications on Cluster Architectures
[chapter]

2004
*Lecture Notes in Computer Science*

We compare a number of programming languages (Pthreads, OpenMP, MPI, UPC, Global Arrays) on both shared and distributed-memory architectures. ... We evaluate the impact of programming language features on the performance of parallel applications on modern parallel architectures, particularly for the demanding case of sparse integer codes. ... Our conclusion is that parallel applications requiring fine-grain accesses achieve poor performance on clusters regardless of the programming paradigm or language feature used, because the amount of inherent ...

doi:10.1007/978-3-540-24644-2_13
fatcat:js24djykkfhohk2gmc2m4dmbdu

### SIMD- and cache-friendly algorithm for sorting an array of structures

2015
*Proceedings of the VLDB Endowment*

Recently, multiway mergesort implemented with SIMD instructions has been used as a high-performance in-memory sorting algorithm for sorting integer values. ... , then rearrange the records based on the sorted key-index pairs. ... We also describe our techniques to increase data parallelism within one SIMD instruction by using 32-bit integers as the intermediate integers instead of using 64-bit integers. ...

doi:10.14778/2809974.2809988
fatcat:zdhoyg6v45edbcthbe7x7kmrjq

*Showing results 1 — 15 out of 19,070 results*