A hybrid approach of OpenMP for clusters

Okwan Kwon, Fahed Jubair, Rudolf Eigenmann, Samuel Midkiff
2012 Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming - PPoPP '12  
Compared to previous work, this scheme features a new runtime data flow analysis and new compiler techniques for improving data affinity and reducing communication costs.  ...  We present the first fully automated compiler-runtime system that successfully translates and executes OpenMP shared-address-space programs on laboratory-size clusters, for the complete set of regular,  ...  Acknowledgments This work was supported, in part, by the National Science Foundation under grants No. 0720471-CNS, 0707931-CNS, 0833115-CCF, and 0916817-CCF.  ... 
doi:10.1145/2145816.2145827 dblp:conf/ppopp/KwonJEM12 fatcat:ryrox34v3bfqzf5y542vmwbdh4
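
The snippet below is a minimal, hypothetical sketch of the translation idea the cluster-OpenMP entry above describes: a shared-address-space parallel loop is block-partitioned across MPI ranks, with an all-gather standing in for the communication the compiler/runtime would compute. It is not the paper's actual code-generation scheme; the array size and the computation are invented.

```c
#include <mpi.h>
#include <stdlib.h>

#define N 1024   /* invented problem size */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double *a = calloc(N, sizeof(double));
    int *counts = malloc(nprocs * sizeof(int));
    int *displs = malloc(nprocs * sizeof(int));

    /* Block-partition the iteration space: rank r owns [displs[r], displs[r]+counts[r]). */
    int chunk = (N + nprocs - 1) / nprocs;
    for (int r = 0; r < nprocs; r++) {
        int lb = r * chunk;  if (lb > N) lb = N;
        int ub = lb + chunk; if (ub > N) ub = N;
        displs[r] = lb;
        counts[r] = ub - lb;
    }

    /* Originally: #pragma omp parallel for -- each rank executes only its block. */
    for (int i = displs[rank]; i < displs[rank] + counts[rank]; i++)
        a[i] = 2.0 * i;

    /* Exchange the written blocks so every rank again sees the whole "shared" array. */
    MPI_Allgatherv(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                   a, counts, displs, MPI_DOUBLE, MPI_COMM_WORLD);

    free(a); free(counts); free(displs);
    MPI_Finalize();
    return 0;
}
```

In the actual systems, the runtime data flow analysis decides which written regions each rank must communicate, so the exchange is typically much smaller than this full all-gather.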

Description, Implementation and Evaluation of an Affinity Clause for Task Directives [chapter]

Philippe Virouleau, Adrien Roussel, François Broquedis, Thierry Gautier, Fabrice Rastello, Jean-Marc Gratien
2016 Lecture Notes in Computer Science  
We then present an implementation of this proposal in the Clang-3.8 compiler, and an implementation of the corresponding extensions in our OpenMP runtime LIBKOMP.  ...  Finally, we present a preliminary evaluation of this work running two task-based OpenMP kernels on a 192-core NUMA architecture, which shows noticeable improvements in terms of both performance and scalability  ...  a software environment for very high performance computing.  ... 
doi:10.1007/978-3-319-45550-1_5 fatcat:gnodrz2aj5f65kskjh76r532qe
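
As an illustration of the task/data affinity idea, here is a minimal sketch using the affinity clause later standardized in OpenMP 5.0; the paper's proposed syntax and its LIBKOMP implementation may differ, and compilers are free to treat the clause as a hint and ignore it. Block count and size are invented.

```c
#include <stdlib.h>

#define NBLOCKS 64
#define BLOCKSZ 4096

static void work_on(double *block, int n) {
    for (int j = 0; j < n; j++)
        block[j] = block[j] * 0.5 + 1.0;
}

int main(void) {
    double *blocks[NBLOCKS];
    for (int i = 0; i < NBLOCKS; i++)
        blocks[i] = calloc(BLOCKSZ, sizeof(double));

    #pragma omp parallel
    #pragma omp single
    for (int i = 0; i < NBLOCKS; i++) {
        double *b = blocks[i];
        /* Hint: prefer running this task close to the memory holding b[0:BLOCKSZ],
         * e.g. on the NUMA node that owns the block. */
        #pragma omp task affinity(b[0:BLOCKSZ]) firstprivate(b)
        work_on(b, BLOCKSZ);
    }

    for (int i = 0; i < NBLOCKS; i++)
        free(blocks[i]);
    return 0;
}
```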

Automatic Scaling of OpenMP Beyond Shared Memory [chapter]

Okwan Kwon, Fahed Jubair, Seung-Jai Min, Hansang Bae, Rudolf Eigenmann, Samuel P. Midkiff
2013 Lecture Notes in Computer Science  
The present paper describes compiler algorithms and runtime techniques that provide the automatic translation of a first class of OpenMP applications: those that exhibit regular write array subscripts  ...  This paper presents a compiler/runtime system that translates OpenMP programs into message passing variants and executes them on clusters of up to 64 processors.  ...  In addition, immediate benefits can come from improved recognition of collective operations at both compile time and during program execution, and from exploiting data affinity and advanced work partitioning  ... 
doi:10.1007/978-3-642-36036-7_1 fatcat:3jdhe3f5jjhrflkpiivp3se7yi

Cacheminer: A runtime approach to exploit cache locality on SMP

Yong Yan, Xiaodong Zhang
2000 IEEE Transactions on Parallel and Distributed Systems  
Our simulation and measurement results show that our runtime approach can achieve performance comparable to compiler optimizations for programs with regular computation and memory-access patterns  ...  However, our experimental results show that our approach is able to significantly improve memory performance for applications with irregular computation and dynamic memory access patterns.  ...  Finally, we appreciate the insightful comments and critiques from the anonymous referees, which helped improve the quality and readability of the paper.  ... 
doi:10.1109/71.850833 fatcat:wv4tg2o76nc4xjexamt6jaathi
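
A minimal sketch of the general idea behind runtime cache-locality grouping, not Cacheminer's actual algorithm or API: hash each pending task by the block of data it will touch, then run tasks hitting the same block back to back so that block stays in cache. All sizes and the access pattern are invented.

```c
#include <stdio.h>

#define NTASKS   16
#define BLOCK    4      /* elements per "cache block" in this toy example */
#define NBUCKETS 4

typedef struct { int first_elem; } task_t;   /* element the task starts at */

int main(void) {
    task_t tasks[NTASKS];
    for (int t = 0; t < NTASKS; t++)
        tasks[t].first_elem = (t * 7) % (NBUCKETS * BLOCK);  /* scattered accesses */

    /* Bucket task ids by the block their data falls into. */
    int bucket[NBUCKETS][NTASKS], count[NBUCKETS] = {0};
    for (int t = 0; t < NTASKS; t++) {
        int b = tasks[t].first_elem / BLOCK;
        bucket[b][count[b]++] = t;
    }

    /* Execute bucket by bucket: tasks sharing a block run consecutively. */
    for (int b = 0; b < NBUCKETS; b++)
        for (int k = 0; k < count[b]; k++)
            printf("block %d -> task %d\n", b, bucket[b][k]);
    return 0;
}
```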

Toward Efficient Execution of RVC-CAL Dataflow Programs on Multicore Platforms

Ilkka Hautala, Jani Boutellier, Teemu Nyländen, Olli Silvén
2018 Journal of Signal Processing Systems  
The results show that the proposed method offers significant improvements over the state of the art in terms of performance and reliability.  ...  In this work, a runtime for executing Dataflow Process Networks (DPN) on multicore platforms is proposed.  ...  To further improve the performance of the proposed runtime, the DD-algorithm could be adopted in the proposed scheduler.  ... 
doi:10.1007/s11265-018-1339-x fatcat:txjhz22e3vgb3cphnql32wki7y

Knowledge-Based Adaptive Self-Scheduling [chapter]

Yizhuo Wang, Weixing Ji, Feng Shi, Qi Zuo, Ning Deng
2012 Lecture Notes in Computer Science  
The experimental results show that KASS performs 4.8% to 16.9% better than the existing self-scheduling schemes, and up to 21% better than the affinity scheduling scheme.  ...  In addition, we extend KASS to apply to loop nests and adjust the chunk sizes at runtime.  ...  An experimental study was performed to compare the KASS algorithm with classic self-scheduling algorithms (GSS, TSS, and FSS), static scheduling, and the affinity scheduling algorithm.  ... 
doi:10.1007/978-3-642-35606-3_3 fatcat:xgawargcgjbvtfyaxdlpf5ctju
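
For context, the sketch below shows classic guided self-scheduling (GSS), one of the baselines KASS is compared against, not the KASS algorithm itself: each idle thread grabs a chunk of roughly remaining/nthreads iterations, so chunks shrink as the loop drains and load imbalance stays bounded. In a real multithreaded implementation the counters would be shared and updated atomically.

```c
#include <stdio.h>

/* Return the next [start, start+size) chunk, or size 0 when the loop is done.
 * Chunk size is remaining/nthreads, rounded down with a minimum of one iteration. */
static int gss_next_chunk(int *next, int *remaining, int nthreads, int *start) {
    if (*remaining == 0) return 0;
    int size = *remaining / nthreads;
    if (size < 1) size = 1;
    *start = *next;
    *next += size;
    *remaining -= size;
    return size;
}

int main(void) {
    int n = 100, nthreads = 4;
    int next = 0, remaining = n, start, size;
    while ((size = gss_next_chunk(&next, &remaining, nthreads, &start)) > 0)
        printf("chunk [%d, %d)\n", start, start + size);
    return 0;
}
```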

Programming Distributed Memory Systems Using OpenMP

Ayon Basumallik, Seung-Jai Min, Rudolf Eigenmann
2007 IEEE International Parallel and Distributed Processing Symposium  
First, we describe a combined compile-time/runtime system that uses an underlying Software Distributed Shared Memory System and exploits repetitive data access behavior in both regular and irregular program  ...  We present a compiler algorithm to detect such repetitive data references and an API to an underlying software distributed shared memory system to orchestrate the learning and proactive reuse of communication  ...  We evaluate the combined compile-time/runtime system on a selection of OpenMP applications, exhibiting both regular and irregular data reference patterns, resulting in average performance improvement of  ... 
doi:10.1109/ipdps.2007.370397 dblp:conf/ipps/BasumallikME07 fatcat:cdpbjy7ghndcxa6kh5zlryc6q4
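
A minimal sketch of exploiting repetitive data access behavior, not the paper's compiler/S-DSM interface: on the first execution of a region, record which remote elements an indirect access pattern touches; on later executions, reuse that cached schedule to prefetch them in one aggregated step instead of faulting on each access. The index array and the prefetch stub are invented.

```c
#include <stdio.h>

#define N 16

static int idx[N] = {3, 9, 3, 12, 9, 1, 7, 12, 3, 9, 1, 7, 5, 12, 9, 3};

/* Stand-in "prefetch": a real S-DSM runtime would fetch remote pages in bulk here. */
static void prefetch(const int *sched, int n) {
    printf("prefetching %d remote elements:", n);
    for (int k = 0; k < n; k++) printf(" %d", sched[k]);
    printf("\n");
}

int main(void) {
    int sched[N], nsched = 0;
    char seen[N] = {0};

    for (int timestep = 0; timestep < 3; timestep++) {
        if (timestep == 0) {
            /* First execution: learn the access pattern. */
            for (int i = 0; i < N; i++)
                if (!seen[idx[i]]) { seen[idx[i]] = 1; sched[nsched++] = idx[i]; }
        } else {
            /* Pattern is repetitive: reuse the cached communication schedule. */
            prefetch(sched, nsched);
        }
        /* ... the actual computation over x[idx[i]] would run here ... */
    }
    return 0;
}
```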

Scheduling Dynamic OpenMP Applications over Multicore Architectures [chapter]

François Broquedis, François Diakhaté, Samuel Thibault, Olivier Aumage, Raymond Namyst, Pierre-André Wacrenier
2008 Lecture Notes in Computer Science  
Approaching the theoretical performance of hierarchical multicore machines requires a very careful distribution of threads and data among the underlying non-uniform architecture in order to minimize cache  ...  While it is acknowledged that OpenMP can enhance the quality of thread scheduling on such architectures in a portable way, by transmitting precious information about the affinities between threads and  ...  and performance.  ... 
doi:10.1007/978-3-540-79561-2_15 fatcat:n5pkgkq7jzhhpostmjt4xt4oje
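
The sketch below expresses a similar intent with standard OpenMP binding clauses rather than the paper's scheduler: spread the outer teams across the machine and keep each team's threads close together, e.g. within one NUMA node or shared cache. It is only an illustration of the affinity information involved, not the mechanism the paper proposes.

```c
#include <omp.h>
#include <stdio.h>

int main(void) {
    omp_set_max_active_levels(2);   /* allow one level of nested parallelism */

    #pragma omp parallel num_threads(4) proc_bind(spread)    /* spread teams across the machine */
    {
        int outer = omp_get_thread_num();
        #pragma omp parallel num_threads(4) proc_bind(close) /* keep each team's threads nearby */
        {
            printf("outer %d, inner %d, place %d\n",
                   outer, omp_get_thread_num(), omp_get_place_num());
        }
    }
    return 0;
}
```

Running with, for example, OMP_PLACES=cores defines the places that the spread/close policies bind to.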

Rescheduling for Locality in Sparse Matrix Computations [chapter]

Michelle Mills Strout, Larry Carter, Jeanne Ferrante
2001 Lecture Notes in Computer Science  
However, sparse matrix computations have non-affine loop bounds and indirect memory references, which prohibit the use of compile-time loop transformations.  ...  This paper describes an algorithm, called serial sparse tiling, to tile at runtime.  ...  For dense matrix computations, compile-time loop transformations such as tiling or blocking [17] can be used to improve data locality.  ... 
doi:10.1007/3-540-45545-0_23 fatcat:jrkyz42nzbaf7hxuqifaojnt2u

Preliminary evaluation of dynamic load balancing using loop re-partitioning on Omni/SCASH

Y. Sakae, S. Matsuoka, M. Sato, H. Harada
2003 Proceedings of the 3rd IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid 2003)  
Using our dynamic load balancing mechanisms, we expect that programmers can have load imbalances adjusted automatically by the runtime system without explicit definition of data and task placements in  ...  In such a commodity cluster environment, there may be incremental upgrades for several reasons, such as rapid progress in processor technologies or changing user needs, and this may cause performance heterogeneity  ...  The generated program is compiled by the native back-end compiler and linked with the runtime library.  ... 
doi:10.1109/ccgrid.2003.1199402 dblp:conf/ccgrid/SakaeSMH03 fatcat:n77y5o66lbenhb4iccj2oko2sa
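
A minimal sketch of the loop re-partitioning idea, not Omni/SCASH's actual runtime: split an iteration space across nodes in proportion to measured relative speeds, so faster nodes receive larger blocks. The speed values are invented; in practice they would come from timing the previous sweep.

```c
#include <stdio.h>

int main(void) {
    const int n = 1000;        /* iterations to distribute */
    const int nodes = 4;
    double speed[4] = {1.0, 1.0, 2.0, 0.5};   /* hypothetical measured relative speeds */

    double total = 0.0;
    for (int p = 0; p < nodes; p++) total += speed[p];

    int lb = 0;
    double acc = 0.0;
    for (int p = 0; p < nodes; p++) {
        acc += speed[p];
        int ub = (int)(n * acc / total + 0.5);   /* proportional upper bound */
        printf("node %d: iterations [%d, %d)\n", p, lb, ub);
        lb = ub;
    }
    return 0;
}
```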

Automatic runtime calculation of communications for data-parallel expressions with periodic conditions

Ana Moreton-Fernandez, Arturo Gonzalez-Escribano
2018 Concurrency and Computation  
Our technique moves part of the compile-time analysis typically used to generate communication code for affine expressions to runtime, introducing a completely new technique that also supports the periodic  ...  It makes the management of aggregated communications for the chosen data partition transparent to the programmer.  ...  COST Program Action IC1305: Network for Sustainable Ultrascale Computing (NESUS).  ... 
doi:10.1002/cpe.4430 fatcat:mbzxinlmdnfibhhvakj6do35p4
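
A minimal, hypothetical sketch of the kind of runtime calculation this entry automates, not the paper's actual library: for a 1-D block distribution and an access a[(i+SHIFT) mod N] with periodic (wrap-around) boundaries, each process works out at runtime which rank owns every element it needs but does not hold, i.e. the exact receives to post. Sizes and the shift are made up.

```c
#include <stdio.h>

#define N      24
#define NPROCS 4
#define SHIFT  2   /* the expression accesses a[(i + 2) % N] */

int main(void) {
    int block = N / NPROCS;                 /* assume N divisible by NPROCS */
    for (int rank = 0; rank < NPROCS; rank++) {
        int lb = rank * block, ub = lb + block;
        int need_from[NPROCS] = {0};

        /* Inspect the local iterations and classify off-block accesses by owner. */
        for (int i = lb; i < ub; i++) {
            int j = (i + SHIFT) % N;        /* periodic index */
            int owner = j / block;
            if (owner != rank) need_from[owner]++;
        }

        for (int p = 0; p < NPROCS; p++)
            if (need_from[p] > 0)
                printf("rank %d receives %d elements from rank %d\n",
                       rank, need_from[p], p);
    }
    return 0;
}
```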

A performance model for fine-grain accesses in UPC

Zhang Zhang, S.R. Seidel
2006 Proceedings of the 20th IEEE International Parallel & Distributed Processing Symposium  
The correspondence between remote references and communication events depends on the internals of the compiler and runtime system. This correspondence is often hidden from application developers.  ...  Three simple UPC applications modeled using this approach usually yielded performance predictions within 15 percent of actual running times.  ...  Caching improves MuPC performance for the vector and coalesce benchmarks and it reduces performance for the baseline write benchmark. Berkeley UPC successfully coalesces reads.  ... 
doi:10.1109/ipdps.2006.1639302 dblp:conf/ipps/ZhangS06 fatcat:ccfnf34kyjfe3facx6i4lgdu54
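
A minimal sketch of the flavor of model this entry describes; the coefficients and reference counts below are invented, not measured. The prediction is a weighted sum of access counts, with per-class costs for local, remote, and coalesced remote references obtained from microbenchmarks of a given compiler/runtime pair.

```c
#include <stdio.h>

int main(void) {
    /* Hypothetical per-access costs in microseconds, from microbenchmarks. */
    double c_local = 0.01, c_remote = 5.0, c_remote_coalesced = 0.8;

    /* Hypothetical reference counts gathered for one application phase. */
    long n_local = 1000000, n_remote = 20000, n_remote_coalesced = 5000;

    double predicted_us = n_local * c_local
                        + n_remote * c_remote
                        + n_remote_coalesced * c_remote_coalesced;

    printf("predicted phase time: %.1f ms\n", predicted_us / 1000.0);
    return 0;
}
```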

A technique to automatically determine Ad-hoc communication patterns at runtime

Ana Moreton-Fernandez, Arturo Gonzalez-Escribano, Diego R. Llanos
2017 Parallel Computing  
Applying this technique at runtime has the advantage of being independent of compile-time decisions, such as the tile size chosen for each process.  ...  The proposed technique computes at runtime exact coarse-grained communications for distributed message-passing processes.  ...  Action IC1305: Network for Sustainable Ultrascale Computing (NE-SUS), and by the computing facilities of Extremadura Research Centre for Advanced Technologies (CETA-CIEMAT), funded by the European Regional  ... 
doi:10.1016/j.parco.2017.08.009 fatcat:66usedjgaze4tjngawddcnfyxe

Compile-time composition of run-time data and iteration reorderings

Michelle Mills Strout, Larry Carter, Jeanne Ferrante
2003 SIGPLAN notices  
To exploit locality in such applications, prior work has developed run-time reorderings to transform the computation and data.  ...  We would like to thank Hwansoo Han for making kernels and run-time inspector code available. We would also like to thank the students of CSE238 at UCSD for comments and suggestions.  ...  ACKNOWLEDGEMENTS This work was supported by an AT&T Labs Graduate Research Fellowship, a Lawrence Livermore National Labs LLNL grant, and in part by NSF Grant CCR-9808946.  ... 
doi:10.1145/780822.781142 fatcat:vfyfb57huvfo5m2v2mznxowh7i
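
The sketch below shows one of the run-time reordering transformations this line of work composes: an inspector sorts the iterations of an irregular loop by the index-array entry they touch, and an executor then runs them in that locality-friendly order. The data-reordering half and the paper's composition framework are not shown, and the index array is invented.

```c
#include <stdio.h>
#include <stdlib.h>

#define N 8   /* iterations */
#define M 6   /* elements of x */

static int idx[N] = {5, 0, 3, 0, 5, 2, 3, 1};  /* indirect accesses x[idx[i]] */

/* Compare iteration ids by the x element they touch. */
static int cmp_by_target(const void *a, const void *b) {
    return idx[*(const int *)a] - idx[*(const int *)b];
}

int main(void) {
    double x[M] = {0}, y[N];
    int order[N];

    /* Inspector: build and sort the iteration order at runtime. */
    for (int i = 0; i < N; i++) order[i] = i;
    qsort(order, N, sizeof(int), cmp_by_target);

    /* Executor: run iterations in the new order so accesses to x are clustered. */
    for (int k = 0; k < N; k++) {
        int i = order[k];
        y[i] = 2.0 * x[idx[i]];
    }

    double sum = 0.0;
    for (int i = 0; i < N; i++) sum += y[i];
    printf("reordered iterations:");
    for (int k = 0; k < N; k++) printf(" %d", order[k]);
    printf("  (checksum %.1f)\n", sum);
    return 0;
}
```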