Fast parallel sorting under LogP: experience with the CM-5

A.C. Dusseau, D.E. Culler, K.E. Schauser, R.P. Martin
1996 IEEE Transactions on Parallel and Distributed Systems  
The model encourages the use of data layouts which minimize communication and balanced communication schedules which avoid contention.  ...  In this paper, the LogP model is used to analyze four parallel sorting algorithms (bitonic, column, radix, and sample sort).  ...  When this dependency exists, a parallel prefix can be used for each row of bins, but the 2^r scans must be done sequentially. For a large radix, this is inefficient.  ... 
doi:10.1109/71.532111 fatcat:7u4bnpp6xze4lngph2mxv3tdr4

Preserving time in large-scale communication traces

Prasun Ratn, Frank Mueller, Bronis R. de Supinski, Martin Schulz
2008 Proceedings of the 22nd annual international conference on Supercomputing - ICS '08  
We show that our representations capture sufficient information to enable what-if explorations of architectural variations and analysis for path-based timing irregularities while not requiring excessive  ...  Our lossless traces are orders of magnitude smaller, if not near constant size, regardless of the number of nodes while preserving timing information suitable for application tuning or assessing requirements  ...  For applications with larger traces, we observe an increase in runtime similar to flat tracing for applications with good intra-node compression (BT and LU) and only slightly higher with less successful  ... 
doi:10.1145/1375527.1375537 dblp:conf/ics/RatnMSS08 fatcat:55tzotwxvbhwjez26f5kywvrri

Midpoint routing algorithms for Delaunay triangulations

Weisheng Si, Albert Y. Zomaya
2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS)  
-resulting in more than 4× run time improvement for both examined applications.  ...  Scheduling Algorithms for Linear Workflow Optimization Abstract Pipelined workflows are a popular programming paradigm for parallel applications.  ...  They combine ideas to reduce communication from communication avoiding algorithms with asynchronism and dynamic task scheduling.  ... 
doi:10.1109/ipdps.2010.5470471 dblp:conf/ipps/SiZ10 fatcat:yuchdc4zp5borm5vs7j4rqgmzy

An Efficient GPU General Sparse Matrix-Matrix Multiplication for Irregular Data

Weifeng Liu, Brian Vinter
2014 IEEE 28th International Parallel and Distributed Processing Symposium  
General sparse matrix-matrix multiplication (SpGEMM) is a fundamental building block for numerous applications such as algebraic multigrid method, breadth first search and shortest path problem.  ...  performance and relative speedups on a benchmark suite composed of 23 matrices with diverse sparsity structures.  ...  ACKNOWLEDGMENT The authors would like to thank Jianbin Fang at the Delft University of Technology for supplying access to the machine with the nVidia GeForce GTX Titan GPU.  ... 
doi:10.1109/ipdps.2014.47 dblp:conf/ipps/0002V14 fatcat:jcspkoixlne2tg2eszoqz2q7ea

A framework for general sparse matrix–matrix multiplication on GPUs and heterogeneous processors

Weifeng Liu, Brian Vinter
2015 Journal of Parallel and Distributed Computing  
General sparse matrix-matrix multiplication (SpGEMM) is a fundamental building block for numerous applications such as algebraic multigrid method (AMG), breadth first search and shortest path problem.  ...  Compared with the state-of-the-art CPU and GPU SpGEMM methods, our approach delivers excellent absolute performance and relative speedups on various benchmarks multiplying matrices with diverse sparsity  ...  Acknowledgments The authors would like to thank Jianbin Fang at the Delft University of Technology for supplying access to the machine with the Intel Xeon CPU.  ... 
doi:10.1016/j.jpdc.2015.06.010 fatcat:uwkmfen7zzdpbiu2gruruefkme

$\textrm{GF}(2^m)$ Finite-Field Multipliers with Reduced Activity Variations [chapter]

Danuta Pamula, Arnaud Tisserand
2012 Lecture Notes in Computer Science  
In this work, we present GF(2^m) multipliers with reduced activity variations for asymmetric cryptography. Useful activity of typical multiplication algorithms is evaluated.  ...  It represents the mathematical power for each frequency bin and Y-axis uses the same logarithmic scale for all versions.  ...  p(i) (power for frequency bin i).  ... 
doi:10.1007/978-3-642-31662-3_11 fatcat:bxdr7zzfgfafliluucqwn7qiee

SPEC ACCEL: A Standard Application Suite for Measuring Hardware Accelerator Performance [chapter]

Guido Juckeland, William Brantley, Sunita Chandrasekaran, Barbara Chapman, Shuai Che, Mathew Colgrove, Huiyu Feng, Alexander Grund, Robert Henschel, Wen-Mei W. Hwu, Huian Li, Matthias S. Müller (+12 others)
2015 Lecture Notes in Computer Science  
Users often find it difficult to characterize and understand the performance advantage of such accelerators for their applications.  ...  The new benchmark comprises two suites of applications written in OpenCL and OpenACC and measures the performance of accelerators with respect to a reference platform.  ...  The authors thank Cloyce Spradling for his work on the SPEC harness as well as the SPEC POWER group for their work on enabling the integration of power measurements into other SPEC suites.  ... 
doi:10.1007/978-3-319-17248-4_3 fatcat:wcdquz4gqffsrihtu3olf5nuty

STXXL: standard template library for XXL data sets

R. Dementiev, L. Kettner, P. Sanders
2008 Software, Practice & Experience  
With virtual memory the application does not know where its data are located: in the main memory or in the swap file.  ...  I/O-efficient algorithms and models The operating system cannot adapt to complicated access patterns of applications dealing with massive data sets.  ...  I/O and communication can be automatically overlapped with computation stages by the scheduler of the FG environment.  ... 
doi:10.1002/spe.844 fatcat:j3ngetudlvacxdwsa6uldcidra

Stxxl: Standard Template Library for XXL Data Sets [chapter]

Roman Dementiev, Lutz Kettner, Peter Sanders
2005 Lecture Notes in Computer Science  
With virtual memory the application does not know where its data are located: in the main memory or in the swap file.  ...  I/O-efficient algorithms and models The operating system cannot adapt to complicated access patterns of applications dealing with massive data sets.  ...  I/O and communication can be automatically overlapped with computation stages by the scheduler of the FG environment.  ... 
doi:10.1007/11561071_57 fatcat:mc4vj65kjfe7jgt52w4goal2z4

A Work-Efficient Parallel Sparse Matrix-Sparse Vector Multiplication Algorithm

Ariful Azad, Aydin Buluc
2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)  
It performs well on diverse matrices and vectors with heterogeneous sparsity patterns.  ...  We design and develop a work-efficient multithreaded algorithm for sparse matrix-sparse vector multiplication (SpMSpV) where the matrix, the input vector, and the output vector are all sparse.  ...  We thank the anonymous reviewers for correctly pointing out to corner cases in our analysis and making the final paper better.  ... 
doi:10.1109/ipdps.2017.76 dblp:conf/ipps/AzadB17 fatcat:g45fmpiw2fglnc3vofc62s7ggu

Rapid digital architecture design of orthogonal matching pursuit

Benjamin Knoop, Jochen Rust, Sebastian Schmale, Dagmar Peters-Drolshagen, Steffen Paul
2016 24th European Signal Processing Conference (EUSIPCO)  
For instance, a complex-valued digital architecture for the Orthogonal Matching Pursuit (OMP) algorithm with rank-1 updating has successfully been implemented and tested, which can be utilised for the  ...  Throughout this dissertation, signal processing applications from the field of Compressed Sensing (CS) will illustrate the efficacy of the RDAM.  ...  Last but not least, the increased computational complexity of signal processing applications had further raised the need for improved electronic system-level (ESL) tools meanwhile.  ... 
doi:10.1109/eusipco.2016.7760570 dblp:conf/eusipco/KnoopRSPP16 fatcat:bo3c4fo6njeo3hjjsnwo72jbwe

Implementation of Fog computing for reliable E-health applications

Razvan Craciunescu, Albena Mihovska, Mihail Mihaylov, Sofoklis Kyriazakos, Ramjee Prasad, Simona Halunga
2015 49th Asilomar Conference on Signals, Systems and Computers  
One potential application is a nonlinear analogue of linear frequency-division multiplexing that, unlike many other fiber-optic transmission strategies, deals with both dispersion and nonlinearity unconditionally  ...  In addition, we will improve on the performance of such Coded ALOHA protocols in terms of the resource efficiency.  ...  Radix-2 algorithm with 8-parallel multi-path delay commutator.  ... 
doi:10.1109/acssc.2015.7421170 dblp:conf/acssc/CraciunescuMMKP15 fatcat:qm6mki5z6bcvrfimkmqjyrxaxm

Ping-pong beam training for reciprocal channels with delay spread

Elisabeth de Carvalho, Jorgen Bach Andersen
2015 49th Asilomar Conference on Signals, Systems and Computers  
For the scenario of targets falling in the same range bin, waveforms with a flat spatial spectrum were shown to be optimal in terms of requirement for good MC performance.  ...  Its purpose is threefold: (1) Determine the suitability of processor arrays for this kind of application. (2) Develop a runtime software infrastructure that supports streaming applications on processor  ...  Radix-2 algorithm with 8-parallel multi-path delay commutator.  ... 
doi:10.1109/acssc.2015.7421451 dblp:conf/acssc/CarvalhoA15 fatcat:mqokuvnh3zg45licnfbgxyvxfu

Performance and accuracy of criticality calculations performed using WARP – A framework for continuous energy Monte Carlo neutron transport in general 3D geometries on GPUs

Ryan M. Bergmann, Kelly L. Rowland, Nikola Radnović, Rachel N. Slaybaugh, Jasmina L. Vujić
2017 Annals of Nuclear Energy  
transport codes for both performance and accuracy.  ...  WARP compares well with the results of the production-level codes, and it is shown that on the newest hardware considered, GPU platforms running WARP are between 0.8 to 7.6 times as fast as CPU platforms  ...  Neither the United States Government nor any agency thereof, nor any of their employees, makes any warranty, express or limited, or assumes any legal liability or responsibility for the accuracy, completeness  ... 
doi:10.1016/j.anucene.2017.01.027 fatcat:ic2247qiizfwzpflmaptlsgwn4

Algorithmic choices in WARP – A framework for continuous energy Monte Carlo neutron transport in general 3D geometries on GPUs

Ryan M. Bergmann, Jasmina L. Vujić
2015 Annals of Nuclear Energy  
transport codes for both performance and accuracy.  ...  WARP compares well with the results of the production-level codes, and it is shown that on the newest hardware considered, GPU platforms running WARP are between 0.8 to 7.6 times as fast as CPU platforms  ...  Neither the United States Government nor any agency thereof, nor any of their employees, makes any warranty, express or limited, or assumes any legal liability or responsibility for the accuracy, completeness  ... 
doi:10.1016/j.anucene.2014.10.039 fatcat:tj7l4yvjk5hwfel3bbpyxb5fzy