
Multiprocessors and run-time compilation

Joel Saltz, Harry Berryman, Janet Wu
1991 Concurrency: Practice and Experience  
The run time initialization and postprocessing in the preprocessed doacross loop are relatively inexpensive compared to the preprocessing costs incurred by a parallelizing inspector (e.g.  ...  For the above cited lower triangular solve involving the incompletely factored Boeing Harwell test matrix, the preprocessed doacross loop requires 45 milliseconds.  ... 
doi:10.1002/cpe.4330030607 fatcat:o6klm2322zcylcgukrzqjd7mea

An Efficient Stream Data Processing Model for Multiuser Cryptographic Service

Li Li, Fenghua Li, Guozhen Shi, Kui Geng
2018 Journal of Electrical and Computer Engineering  
between the processing of the dependent job packages and parallel packages and hides the processing of the independent job package in the processing of the dependent job package.  ...  Increasing the pipeline depth and improving the processing performance in each stage of the pipeline are the key to improving the system performance.  ...  Acknowledgments This work was supported by the National Key R&D Program of China (no. 2017YFB0802705) and the National Natural Science Foundation of China (no. 61672515). References  ... 
doi:10.1155/2018/3917827 fatcat:7sfbna6vbjghtlicgqxahhzyvi

OpenMP: an industry standard API for shared-memory programming

L. Dagum, R. Menon
1998 IEEE Computational Science & Engineering  
They provide only a scalable interconnection network, and the burden of scalability falls on the software.  ...  Unfortunately, many in the high-performance computing world implicitly assume that the only way to achieve scalability in parallel software is with a message-passing programming model.  ...  For example, in the Doacross model, the only control structure is the doacross directive, yet this is arguably the most widely used shared-memory programming model for scientific computing.  ... 
doi:10.1109/99.660313 fatcat:hxvskz2vmvbwheklyzc2wn4xci

Run-time methods for parallelizing partially parallel loops

Lawrence Rauchwerger, Nancy M. Amato, David A. Padua
1995 Proceedings of the 9th international conference on Supercomputing - ICS '95  
Given the original loop, the compiler generates inspector code that performs run-time preprocessing of the loop's access pattern, and scheduler code that schedules (and executes) the loop iterations.  ...  We also describe a new scheme for constructing an optimal parallel execution schedule for the iterations of the loop.  ...  The basic strategy of our method is for the inspector to preprocess the memory references and determine the data dependences for each memory location accessed.  ... 
doi:10.1145/224538.224553 dblp:conf/ics/RauchwergerAP95 fatcat:zzvbbeyllndc5ovpini6ed4jym
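The inspector/executor scheme this entry describes can be sketched in a few lines. The following is a minimal Python illustration only: the access-pattern format and the wavefront (dependence-level) scheduling are simplifications for exposition, not the authors' actual implementation.

```python
# Inspector/executor sketch: the inspector scans each iteration's memory
# accesses and assigns it a wavefront level (1 + the deepest level among
# earlier iterations touching the same locations); the executor then runs
# each wavefront as a fully parallel (doall) phase.

def inspector(accesses):
    """accesses[i] = set of array locations touched by iteration i."""
    level = []
    last_level_for = {}  # location -> deepest level seen so far
    for touched in accesses:
        lvl = 1 + max((last_level_for.get(loc, 0) for loc in touched), default=0)
        level.append(lvl)
        for loc in touched:
            last_level_for[loc] = lvl
    wavefronts = {}
    for i, lvl in enumerate(level):
        wavefronts.setdefault(lvl, []).append(i)
    return [wavefronts[l] for l in sorted(wavefronts)]

def executor(accesses, body):
    for wave in inspector(accesses):
        # Iterations within one wavefront are independent; a real executor
        # would dispatch them to worker threads rather than loop serially.
        for i in wave:
            body(i)

# Example: iterations 0 and 1 are independent; iteration 2 touches both
# of their locations, so it lands in a second wavefront.
waves = inspector([{0}, {1}, {0, 1}])  # -> [[0, 1], [2]]
```

Separating the dependence analysis (inspector) from the execution (executor) is what lets the same schedule be reused when the loop's access pattern does not change across invocations.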

A scalable method for run-time loop parallelization

Lawrence Rauchwerger, Nancy M. Amato, David A. Padua
1995 International Journal of Parallel Programming  
Given the original loop, the compiler generates inspector code that performs run-time preprocessing of the loop's access pattern, and scheduler code that schedules (and executes) the loop iterations.  ...  of the loop.  ...  is a doacross loop (iterations are started in a wrapped manner) and busy waits are used to enforce certain data dependences; 4, the inspector loop sequentially traverses the access pattern; 5, the method  ... 
doi:10.1007/bf02577866 fatcat:rwplt6ri6ncizn4tb3gj6naflq

Static and dynamic evaluation of data dependence analysis techniques

P.M. Petersen, D.A. Padua
1996 IEEE Transactions on Parallel and Distributed Systems  
The tests evaluated in this study include the generalized greatest common divisor test, three variants of Banerjee's test, and the Omega test.  ...  and the growing importance of multiprocessors.  ...  However, other sources of parallelism will be exploited, including those that can be exposed by loop reversal, skewing, or transformation into doacross loops [15] , [9] .  ... 
doi:10.1109/71.544354 fatcat:q7gaa336rfbpbljaaqu2t2tw2e

The LRPD test: speculative run-time parallelization of loops with privatization and reduction parallelization

L. Rauchwerger, D.A. Padua
1999 IEEE Transactions on Parallel and Distributed Systems  
determine if it had any cross-iteration dependences; if the test fails, then the loop is reexecuted serially.  ...  As parallelizable loops arise frequently in practice, we advocate a novel framework for their identification: speculatively execute the loop as a doall and apply a fully parallel data dependence test to  ...  This work is not necessarily representative of the positions or policies of the Army or the Government.  ... 
doi:10.1109/71.752782 fatcat:wsjtf7kievftjdzsgmvckq7y2m
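The core LRPD idea quoted above — run the loop speculatively as a doall while marking accesses in shadow structures, then test afterwards for cross-iteration dependences — can be sketched as follows. This is a rough simplification: the real LRPD test also handles privatization and reduction recognition, which are omitted here.

```python
# LRPD-style sketch: execute all iterations "as a doall" while recording,
# per shadow location, which iterations read and wrote it. The test passes
# only if no location written in one iteration was touched by a different
# iteration (no cross-iteration dependence); on failure the loop would be
# re-executed serially from a saved checkpoint.

def lrpd_test(accesses):
    """accesses[i] = (reads, writes): sets of locations for iteration i."""
    readers, writers = {}, {}
    for i, (reads, writes) in enumerate(accesses):   # speculative doall phase
        for loc in reads:
            readers.setdefault(loc, set()).add(i)
        for loc in writes:
            writers.setdefault(loc, set()).add(i)
    for loc, ws in writers.items():                  # post-execution test
        touched = ws | readers.get(loc, set())
        if len(touched) > 1:   # written and touched by distinct iterations
            return False       # dependence detected: fall back to serial
    return True                # fully parallel: speculation succeeds

assert lrpd_test([(set(), {0}), (set(), {1})])         # disjoint writes: doall ok
assert not lrpd_test([(set(), {0}), ({0}, set())])     # iter 1 reads iter 0's write
```

Unlike an inspector/executor scheme, no preprocessing pass is needed; the price is wasted work whenever speculation fails and the loop must be re-run serially.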

Article summaries

A.R. Hurson, K.M. Kavi, B. Shirazi, B. Lee
1996 IEEE Parallel & Distributed Technology Systems & Applications  
With such a significant on-chip hardware capacity, concurrency is a way to reduce the computation gap between the computational power demanded by the applications and that provided by the underlying computer  ...  Instructions impose no sequencing constraints except the one on the program's data dependencies.  ...  ACKNOWLEDGMENTS This work has been supported in part by the National Science Foundation under Grants MIP-9622836 and MIP-9622593.  ... 
doi:10.1109/88.544436 fatcat:gghokp44izf65ocixorvsnjix4

OpenMP aware MHP Analysis for Improved Static Data-Race Detection [article]

Utpal Bora, Shraiysh Vaishay, Saurabh Joshi, Ramakrishna Upadrasta
2021 arXiv   pre-print
OpenMP, the de facto shared memory parallelism framework used in the HPC community, also suffers from data races.  ...  Our experiments show that the checker is comparable to the state-of-the-art in various performance metrics with around 90% accuracy, almost perfect recall, and significantly lower runtime and memory footprint  ...  around the loop T C times.  ... 
arXiv:2111.04259v1 fatcat:zo5shq6vjfes5bytqslwihxi4u

Semi-Automatic Composition of Loop Transformations for Deep Parallelism and Memory Hierarchies

Sylvain Girbal, Nicolas Vasilache, Cédric Bastoul, Albert Cohen, David Parello, Marc Sigler, Olivier Temam
2006 International Journal of Parallel Programming  
The proposed framework relies on a unified polyhedral representation of loops and statements, using normalization rules to allow flexible and expressive transformation sequencing.  ...  loop transformations, (2) to illustrate how complex transformation sequences may be needed to achieve significant performance benefits, (3) to facilitate the automatic search for program transformation  ...  Rice University, Greg Lindahl and Fred Chow from PathScale, and the UPC team at the University of California Berkeley.  ... 
doi:10.1007/s10766-006-0012-3 fatcat:czrbuhejuzht5htht4idcisewe

Parallelization of Reordering Algorithms for Bandwidth and Wavefront Reduction

Konstantinos I. Karantasis, Andrew Lenharth, Donald Nguyen, María J. Garzarán, Keshav Pingali
2014 SC14: International Conference for High Performance Computing, Networking, Storage and Analysis  
the SpMV iterations without reordering the matrix.  ...  On 16 cores of the Stampede supercomputer, our parallel RCM is 5.56 times faster on the average than a state-of-the-art sequential implementation of RCM in the HSL library.  ...  ACKNOWLEDGEMENTS The work presented in this paper has been supported by the National Science Foundation grants CNS 1111407, CNS 1406355, XPS 1337281, CCF 1337281, CCF 1218568, ACI 1216701, and CNS 1064956  ... 
doi:10.1109/sc.2014.80 dblp:conf/sc/KarantasisLNGP14 fatcat:fmfw4eikzvdtjclbgy67b7ofwa

PaSh: Light-touch Data-Parallel Shell Processing [article]

Nikos Vasilakis
2021 arXiv   pre-print
An accompanying parallelizability study of POSIX and GNU commands – two large and commonly used groups – guides the annotation language and optimized aggregator library that PaSh uses.  ...  Given a script, PaSh converts it to a dataflow graph, performs a series of semantics-preserving program transformations that expose parallelism, and then converts the dataflow graph back into a script  ...  Discussion The speedup of the preprocessing phase of the pipeline is bound by the network and IO costs since curl downloads 82GB of data.  ... 
arXiv:2007.09436v3 fatcat:4wbipg5zafgabbxmcgqstf3nkm

Efficient implementation of resource-constrained cyber-physical systems using multi-core parallelism

Olaf Neugebauer, Technische Universität Dortmund
The quest for more performance of applications and systems became more challenging in recent years.  ...  On the other side of the performance spectrum, the demand for small energy efficient systems exposed by modern IoT applications increased vastly.  ...  Thus, this type of parallelism is also called loop-level, doall or doacross parallelism.  ... 
doi:10.17877/de290r-18927 fatcat:a5qerjncuzeiflcsh6bee5pt3y
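The doall/doacross distinction mentioned in this entry can be made concrete with a small sketch, not tied to the thesis's implementation: a doall loop has no cross-iteration dependences, while in a doacross loop iteration i may start its independent work immediately but must wait for a value posted by iteration i-1.

```python
import threading

# Doacross sketch: each iteration needs one value from its predecessor
# (the cross-iteration dependence), but the rest of its work can overlap.
# Events model the post/wait synchronization a doacross schedule inserts;
# a doall loop would need no events at all.

N = 8
ready = [threading.Event() for _ in range(N)]
acc = [0] * N

def iteration(i):
    independent = i * i            # work with no cross-iteration dependence
    if i > 0:
        ready[i - 1].wait()        # wait: value from iteration i-1
    acc[i] = (acc[i - 1] if i > 0 else 0) + independent
    ready[i].set()                 # post: iteration i+1 may proceed

threads = [threading.Thread(target=iteration, args=(i,)) for i in range(N)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# acc now holds the prefix sums of the squares, computed doacross-style:
# [0, 1, 5, 14, 30, 55, 91, 140]
```

The speedup of a doacross loop is bounded by how early the post can be issued relative to the iteration's total work, which is why the entries above spend so much effort minimizing the synchronized region.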

Runtime-adaptive generalized task parallelism [article]

Kevin Streit, Universität des Saarlandes
The following quotation has been taken from a recent publication [10] on so-called speculative cross-invocation parallelization of nested loops.  ...  Consequently, not only the implementation, but also many ideas and solutions presented on the following pages are in the end the result of our collaboration.  ...  The def-use relation is used in this thesis to define reduction properties on page 77. DOACROSS is, like DOALL, one of the classical loop parallelization techniques.  ... 
doi:10.22028/d291-26876 fatcat:kzenjr4jd5dijkctxgszexbzja