A hybrid approach of OpenMP for clusters
2012
SIGPLAN notices
Compared to previous work, this scheme features a new runtime data flow analysis and new compiler techniques for improving data affinity and reducing communication costs. ...
We present the first fully automated compiler-runtime system that successfully translates and executes OpenMP shared-address-space programs on laboratory-size clusters, for the complete set of regular, ...
Acknowledgments This work was supported, in part, by the National Science Foundation under grants No. 0720471-CNS, 0707931-CNS, 0833115-CCF, and 0916817-CCF. ...
doi:10.1145/2370036.2145827
fatcat:6rst36xk7zgznmervp7kty2h44
A hybrid approach of OpenMP for clusters
2012
Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming - PPoPP '12
Compared to previous work, this scheme features a new runtime data flow analysis and new compiler techniques for improving data affinity and reducing communication costs. ...
We present the first fully automated compiler-runtime system that successfully translates and executes OpenMP shared-address-space programs on laboratory-size clusters, for the complete set of regular, ...
Acknowledgments This work was supported, in part, by the National Science Foundation under grants No. 0720471-CNS, 0707931-CNS, 0833115-CCF, and 0916817-CCF. ...
doi:10.1145/2145816.2145827
dblp:conf/ppopp/KwonJEM12
fatcat:ryrox34v3bfqzf5y542vmwbdh4
Description, Implementation and Evaluation of an Affinity Clause for Task Directives
[chapter]
2016
Lecture Notes in Computer Science
We then present an implementation of this proposal in the Clang-3.8 compiler, and an implementation of the corresponding extensions in our OpenMP runtime LIBKOMP. ...
Finally, we present a preliminary evaluation of this work running two task-based OpenMP kernels on a 192-core NUMA architecture, that shows noticeable improvements both in terms of performance and scalability ...
a software environment for very high performance computing. ...
doi:10.1007/978-3-319-45550-1_5
fatcat:gnodrz2aj5f65kskjh76r532qe
Automatic Scaling of OpenMP Beyond Shared Memory
[chapter]
2013
Lecture Notes in Computer Science
The present paper describes compiler algorithms and runtime techniques that provide the automatic translation of a first class of OpenMP applications: those that exhibit regular write array subscripts ...
This paper presents a compiler/runtime system that translates OpenMP programs into message passing variants and executes them on clusters up to 64 processors. ...
As well, immediate benefits can come from improved recognition of collective operations at both compile time and during the program execution, and from exploiting data affinity and advanced work partitioning ...
doi:10.1007/978-3-642-36036-7_1
fatcat:3jdhe3f5jjhrflkpiivp3se7yi
Cacheminer: A runtime approach to exploit cache locality on SMP
2000
IEEE Transactions on Parallel and Distributed Systems
Our simulation and measurement results show that our runtime approach can achieve comparable performance with the compiler optimizations for programs with regular computation and memory-access patterns ...
However, our experimental results show that our approach is able to significantly improve the memory performance for the applications with irregular computation and dynamic memory access patterns. ...
Finally, we appreciate the insightful comments and critiques from the anonymous referees, which are helpful to improve the quality and readability of the paper. ...
doi:10.1109/71.850833
fatcat:wv4tg2o76nc4xjexamt6jaathi
Toward Efficient Execution of RVC-CAL Dataflow Programs on Multicore Platforms
2018
Journal of Signal Processing Systems
The results show that the proposed method offers significant improvements over the state-of-the-art, in terms of performance and reliability. ...
In this work, a runtime for executing Dataflow Process Networks (DPN) on multicore platforms is proposed. ...
To further improve performance of the proposed runtime, the DD-algorithm could be adopted to the proposed scheduler. ...
doi:10.1007/s11265-018-1339-x
fatcat:txjhz22e3vgb3cphnql32wki7y
Knowledge-Based Adaptive Self-Scheduling
[chapter]
2012
Lecture Notes in Computer Science
The experimental results show that KASS performs 4.8% to 16.9% better than the existing self-scheduling schemes, and up to 21% better than the affinity scheduling scheme. ...
In addition, we extend KASS to apply on loop nests and adjust the chunk sizes at runtime. ...
An experimental study was performed to compare the KASS algorithm with classic self-scheduling algorithms (GSS, TSS, and FSS), static scheduling, and affinity scheduling algorithm. ...
doi:10.1007/978-3-642-35606-3_3
fatcat:xgawargcgjbvtfyaxdlpf5ctju
Programming Distributed Memory Sytems Using OpenMP
2007
2007 IEEE International Parallel and Distributed Processing Symposium
First, we describe a combined compile-time/runtime system that uses an underlying Software Distributed Shared Memory System and exploits repetitive data access behavior in both regular and irregular program ...
We present a compiler algorithm to detect such repetitive data references and an API to an underlying software distributed shared memory system to orchestrate the learning and proactive reuse of communication ...
We evaluate the combined compile-time/runtime system on a selection of OpenMP applications, exhibiting both regular and irregular data reference patterns, resulting in average performance improvement of ...
doi:10.1109/ipdps.2007.370397
dblp:conf/ipps/BasumallikME07
fatcat:cdpbjy7ghndcxa6kh5zlryc6q4
Scheduling Dynamic OpenMP Applications over Multicore Architectures
[chapter]
2008
Lecture Notes in Computer Science
Approaching the theoretical performance of hierarchical multicore machines requires a very careful distribution of threads and data among the underlying non-uniform architecture in order to minimize cache ...
While it is acknowledged that OpenMP can enhance the quality of thread scheduling on such architectures in a portable way, by transmitting precious information about the affinities between threads and ...
and performance. ...
doi:10.1007/978-3-540-79561-2_15
fatcat:n5pkgkq7jzhhpostmjt4xt4oje
Rescheduling for Locality in Sparse Matrix Computations
[chapter]
2001
Lecture Notes in Computer Science
However, sparse matrix computations have non-affine loop bounds and indirect memory references which prohibit the use of compile time loop transformations. ...
This paper describes an algorithm to tile at runtime called serial sparse tiling. ...
For dense matrix computations, compile time loop transformations such as tiling or blocking [17] can be used to improve data locality. ...
doi:10.1007/3-540-45545-0_23
fatcat:jrkyz42nzbaf7hxuqifaojnt2u
Preliminary evaluation of dynamic load balancing using loop re-partitioning on Omni/SCASH
2003
CCGrid 2003. 3rd IEEE/ACM International Symposium on Cluster Computing and the Grid, 2003. Proceedings.
Using our dynamic load balancing mechanisms, we expect that programmers can have load imbalances adjusted automatically by the runtime system without explicit definition of data and task placements in ...
Such a commodity cluster environment, there may be incremental upgrade due to several reasons, such as rapid progress in processor technologies, or user needs and it may cause the performance heterogeneity ...
The generated program is compiled by the native back-end compiler and linked with the runtime library. ...
doi:10.1109/ccgrid.2003.1199402
dblp:conf/ccgrid/SakaeSMH03
fatcat:n77y5o66lbenhb4iccj2oko2sa
Automatic runtime calculation of communications for data-parallel expressions with periodic conditions
2018
Concurrency and Computation
Our technique moves to runtime part of the compile-time analysis typically used to generate communication code for affine expressions, introducing a complete new technique that also supports the periodic ...
It makes transparent to the programmer the management of aggregated communications for the chosen data partition. ...
COST Program Action IC1305: Network for Sustainable Ultrascale Computing (NESUS). ...
doi:10.1002/cpe.4430
fatcat:mbzxinlmdnfibhhvakj6do35p4
A performance model for fine-grain accesses in UPC
2006
Proceedings 20th IEEE International Parallel & Distributed Processing Symposium
The correspondence between remote references and communication events depends on the internals of the compiler and runtime system. This correspondence is often hidden from application developers. ...
Three simple UPC applications modeled using this approach usually yielded performance predictions within 15 percent of actual running times. ...
Caching improves MuPC performance for the vector and coalesce benchmarks and it reduces performance for the baseline write benchmark. Berkeley UPC successfully coalesces reads. ...
doi:10.1109/ipdps.2006.1639302
dblp:conf/ipps/ZhangS06
fatcat:ccfnf34kyjfe3facx6i4lgdu54
A technique to automatically determine Ad-hoc communication patterns at runtime
2017
Parallel Computing
Applying this technique at runtime has the advantage of being independent of compile-time decisions, such as the tile size chosen for each process. ...
The proposed technique computes at runtime exact coarse-grained communications for distributed message-passing processes. ...
Action IC1305: Network for Sustainable Ultrascale Computing (NESUS), and by the computing facilities of Extremadura Research Centre for Advanced Technologies (CETA-CIEMAT), funded by the European Regional ...
doi:10.1016/j.parco.2017.08.009
fatcat:66usedjgaze4tjngawddcnfyxe
Compile-time composition of run-time data and iteration reorderings
2003
SIGPLAN notices
To exploit locality in such applications, prior work has developed run-time reorderings to transform the computation and data. ...
We would like to thank Hwansoo Han for making kernels and run-time inspector code available. We would also like to thank the students of CSE238 at UCSD for comments and suggestions. ...
ACKNOWLEDGEMENTS This work was supported by an AT&T Labs Graduate Research Fellowship, a Lawrence Livermore National Labs LLNL grant, and in part by NSF Grant CCR-9808946. ...
doi:10.1145/780822.781142
fatcat:vfyfb57huvfo5m2v2mznxowh7i