A hybrid scheme for efficiently executing nested loops on multiprocessors

Chien-Min Wang, Sheng-De Wang
1992 Parallel Computing  
Wang, A hybrid scheme for efficiently executing nested loops on multiprocessors, Parallel Computing 18 (1992) 625-637.  ...  In this paper, we address the problem of scheduling parallel processors for efficiently executing nested loops.  ...  To achieve both objectives for hybrid nested loops, we propose a new scheduling scheme. It is a hybrid scheme of the run-time scheduling and the compile-time loop transformations.  ... 
doi:10.1016/0167-8191(92)90003-p fatcat:oyy52sngfbckdf43su6mp37rui

Advanced Hybrid MPI/OpenMP Parallelization Paradigms for Nested Loop Algorithms onto Clusters of SMPs [chapter]

Nikolaos Drosinos, Nectarios Koziris
2003 Lecture Notes in Computer Science  
We further apply an advanced hyperplane scheduling scheme that enables pipelined execution and the overlapping of communication with useful computation, thus leading almost to full CPU utilization.  ...  The parallelization process of nested-loop algorithms onto popular multi-level parallel architectures, such as clusters of SMPs, is not a trivial issue, since the existence of data dependencies in the  ...  In this paper we propose two hybrid MPI/OpenMP programming paradigms for the efficient parallelization of perfectly nested loop algorithms, namely a fine-grain model, as well as a coarse-grain one.  ... 
doi:10.1007/978-3-540-39924-7_30 fatcat:nsensostijb2bixnghpbxpyxau

The OpenTM Transactional Application Programming Interface

Woongki Baek, Chi Cao Minh, Martin Trautmann, Christos Kozyrakis, Kunle Olukotun
2007 Parallel Architecture and Compilation Techniques (PACT), Proceedings of the International Conference on  
We also present a portable OpenTM implementation that produces code for hardware, software, and hybrid TM systems.  ...  Overall, OpenTM provides a practical and efficient TM programming environment within the familiar scope of OpenMP.  ...  Woongki Baek is supported by an STMicroelectronics Stanford Graduate Fellowship and a Samsung Scholarship.  ... 
doi:10.1109/pact.2007.4336227 fatcat:nn7gbfngvrff5egt4jzfgurptm

Hybrid Hexagonal/Classical Tiling for GPUs

Tobias Grosser, Albert Cohen, Justin Holewinski, P. Sadayappan, Sven Verdoolaege
2014 Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization - CGO '14  
Time-tiling is necessary for the efficient execution of iterative stencil computations.  ...  We propose a time-tiling method for iterative stencil computations on GPUs. Our method does not involve redundant computations.  ...  This work is partly funded by a Google European Fellowship in Efficient Computing, by the European FP7 project CARP id. 287767, by the COPCAMS ARTEMIS project, and award 0926688 from the U.S. NSF.  ... 
doi:10.1145/2581122.2544160 fatcat:mxabceid25cobd4dekka633kna

Loop coalescing and scheduling for barrier MIMD architectures

M.T. O'Keefe, H.G. Dietz
1993 IEEE Transactions on Parallel and Distributed Systems  
Also, a more efficient scheme to generate the original loop indices from the coalesced index is proposed for the case of constant loop bounds.  ...  The basic approach employs loop coalescing, a technique for transforming a multiply-nested loop into a single loop.  ... 
doi:10.1109/71.243531 fatcat:pigu6baugzgp5drbq7iol77hce
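
As a rough illustration of the loop coalescing idea mentioned in the snippet above, the sketch below flattens a nest with constant bounds into a single loop and recovers the original indices from the coalesced index. It uses the plain division/remainder recovery, not the more efficient scheme the paper proposes; the function name `coalesced_indices` and the zero-based bounds are assumptions made for this example only.

```python
from itertools import product


def coalesced_indices(bounds):
    """Yield the index tuples of a constant-bound loop nest from one flat loop.

    Hypothetical helper for illustration: bounds[k] is the trip count of the
    k-th loop (outermost first), and every loop runs from 0 to bounds[k] - 1.
    """
    total = 1
    for n in bounds:
        total *= n

    # Single coalesced loop over all iterations of the nest.
    for flat in range(total):
        rem = flat
        idx = []
        # Peel off the innermost index first using division and remainder.
        for n in reversed(bounds):
            rem, r = divmod(rem, n)
            idx.append(r)
        yield tuple(reversed(idx))


# The coalesced loop visits the same iterations, in the same order,
# as the original nest "for i in range(2): for j in range(3): ...".
bounds = (2, 3)
nested_order = list(product(*(range(n) for n in bounds)))
assert list(coalesced_indices(bounds)) == nested_order
```

Because the flat index range can be split into contiguous chunks, a single coalesced loop is easier to divide evenly among processors than the original nest.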

Coloured and task-based stencil codes [article]

Benjamin Hazelwood, Tobias Weinzierl
2018 arXiv   pre-print
We evaluate traditional multithreading strategies on both Broadwell and KNL, study the arising assignment of tasks to threads and, from there, derive two efficient ways to parallelise stencil codes on  ...  Delegating the identification of a traversal order to a scheduler, we however rely on this scheduler to puzzle out an efficient ordering on-the-fly.  ...  We label this approach as Hyb-depend as it is a hybrid.  ... 
arXiv:1810.04033v1 fatcat:2an47dserzcodkeahdwderggxa

Improving performance of nested loops on reconfigurable array processors

Yongjoo Kim, Jongeun Lee, Toan X. Mai, Yunheung Paek
2012 ACM Transactions on Architecture and Code Optimization (TACO)  
In this paper we evaluate the overhead of such non-kernel execution times when mapping nested loops for CGRAs, and propose a novel architecture-compiler cooperative scheme to reduce the overhead, while  ...  , and can become significant if it is repeated in an outer loop of a loop nest.  ...  Table II lists the steps of a loop execution procedure for a CGRA coprocessor.  ... 
doi:10.1145/2086696.2086711 fatcat:hmtwm44bsrflhasu6bvnjeepcm

Reducing the burden of parallel loop schedulers for many‐core processors

Mahwish Arif, Hans Vandierendonck
2021 Concurrency and Computation  
This article enhances the scalability of parallel loop schedulers by specializing schedulers for fine-grain loops.  ...  We propose a low-overhead work distribution mechanism for a static scheduler that uses no atomic operations.  ...  This leads to a best-case improvement of 2.8× for linear regression. Nested parallelism We apply the hybrid static/dynamic scheduler to Ligra, 4 a graph analytics system.  ... 
doi:10.1002/cpe.6241 fatcat:4rluruunxjb4dehant4kjl354e

An ASIP for Neural Network Inference on Embedded Devices with 99% PE Utilization and 100% Memory Hidden under Low Silicon Cost

Muxuan Gao, He Chen, Dake Liu
2022 Sensors  
The scalability and system performance of our SoC extension scheme were demonstrated. The VLIW was used to execute multiple instructions in parallel.  ...  For energy efficiency, there are huge opportunities for power efficiency optimization, which involves access minimization and memory latency minimization based on on-chip memory minimization.  ...  There are six nested loops in Figure 2a, which fit the slide-window type of kernels. For other kernel types, there may not be that many nested loops.  ... 
doi:10.3390/s22103841 pmid:35632250 pmcid:PMC9146143 fatcat:prglmlzkvfc2pchvdycgnbhmvy

The Potential of Synergistic Static, Dynamic and Speculative Loop Nest Optimizations for Automatic Parallelization [article]

Riyadh Baghdadi, Albert Cohen, Cedric Bastoul, Louis-Noel Pouchet, Lawrence Rauchwerger
2011 arXiv   pre-print
involving hybrid static-dynamic schemes.  ...  This opens a large avenue for blending dynamic analyses and speculative techniques with advanced loop nest optimizations.  ...  We also plan to extend parallelism detection among acyclic control-flow regions nested into loop nests, combining affine loop transformations with decoupled software pipelining [8].  ... 
arXiv:1111.6756v1 fatcat:zmkjavzrqfav7hxycsdmaf6ypq

Hybrid Static/Dynamic Schedules for Tiled Polyhedral Programs [article]

Tian Jin, Nirmal Prajapati, Waruna Ranasinghe, Guillaume Iooss, Yun Zou, Sanjay Rajopadhye, David Wonnacott
2016 arXiv   pre-print
We present a system to express and generate code for hybrid schedules, where some constraints are automatically satisfied through the structure of the code, and the remainder are dynamically enforced at  ...  We propose a generic mechanism to implement the needed synchronization, and show it can be easily realized for a variety of targets: OpenMP, Pthreads, GPU (CUDA or OpenCL) code, languages like X10, Habanero  ...  [5] uses data-flow runtime synchronization for GPGPU code that executes tiled loop nests.  ... 
arXiv:1610.07236v1 fatcat:afpx6tyjm5c6nb3nuj3pxklkmi

An Efficient Approach for Self-scheduling Parallel Loops on Multiprogrammed Parallel Computers [chapter]

Arun Kejariwal, Alexandru Nicolau, Constantine D. Polychronopoulos
2006 Lecture Notes in Computer Science  
In this paper, we present a dynamic scheduling technique for scheduling iterations of a DOALL loop (of a single application) to achieve load balance between a given set of processors.  ...  It calls for multiprogrammed scheduling of the different jobs for effective system utilization and for keeping average response times low.  ...  We argue that it is important to account for such gaps during the scheduling of parallel tasks - iterations of a (nested) parallel loop in our case.  ... 
doi:10.1007/978-3-540-69330-7_31 fatcat:ommmuq5btfc3zjocfwslsfjhy4

Runtime scheduling of dynamic parallelism on accelerator-based multi-core systems

Filip Blagojevic, Dimitrios S. Nikolopoulos, Alexandros Stamatakis, Christos D. Antonopoulos, Matthew Curtis-Maury
2007 Parallel Computing  
We also present a new scheduling scheme for dynamic multi-grain parallelism, S-MGPS, which uses sampling of dominant execution phases to converge to the optimal scheduling algorithm.  ...  We evaluate recently introduced schedulers for event-driven execution and utilization-driven dynamic multi-grain parallelization on Cell.  ...  We thank Xizhou Feng and Kirk Cameron for providing us with the MPI version of PBPI. We are also grateful to the anonymous reviewers for their constructive feedback on earlier versions of this paper.  ... 
doi:10.1016/j.parco.2007.09.004 fatcat:zxkvfw76ljedbbqhupvvahwolm

Exploiting Fine-Grain Thread Parallelism on Multicore Architectures

P.E. Hadjidoukas, G.Ch. Philos, V.V. Dimakopoulos
2009 Scientific Programming  
In this work we present a runtime threading system which provides an efficient substrate for fine-grain parallelism, suitable for deployment in multicore platforms.  ...  The runtime system has been integrated into an OpenMP implementation to allow for transparent usage under a high level programming paradigm.  ...  The Cilk runtime system maintains a local ready queue for each processor and deploys an efficient work-stealing scheduler.  ... 
doi:10.1155/2009/249651 fatcat:tjc25fxytnfgve5o2j5ijq22uy

Performance Analysis and Optimization of Parallel Scientific Applications on CMP Cluster Systems

Xingfu Wu, Valerie Taylor, Charles Lively, Sameh Sharkawi
2008 Parallel Processing  
In terms of refinements, we use conventional techniques such as cache blocking, loop unrolling and loop fusion, and develop hybrid methods for optimizing MPI_Allreduce and MPI_Reduce.  ...  A major challenge to be addressed is efficient use of such cluster systems for large-scale scientific applications.  ...  Acknowledgements The authors would like to thank Stephane Ethier and Shirley Moore for providing the GTC code and datasets, and Dazhi Yu and Jacques Richard for providing the LBM code.  ... 
doi:10.1109/icpp-w.2008.21 dblp:conf/icppw/WuTLS08 fatcat:tb7h5un3mfec7lskh64ydgdkx4