
Effective communication coalescing for data-parallel applications

Daniel Chavarría-Miranda, John Mellor-Crummey
2005 Proceedings of the tenth ACM SIGPLAN symposium on Principles and practice of parallel programming - PPoPP '05  
Communication coalescing is a static optimization that can reduce both communication frequency and redundant data transfer in compiler-generated code for regular, data-parallel applications.  ...  We present an algorithm for coalescing communication that arises when generating code for regular, data-parallel applications written in High-Performance Fortran (HPF).  ...  It is a challenging application to parallelize effectively due to the potential for generating many small messages between processors.  ... 
doi:10.1145/1065944.1065948 dblp:conf/ppopp/Chavarria-MirandaM05 fatcat:s2vo63poznhz3jvbwniyfs4o2y
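As a toy illustration of the idea behind communication coalescing (not the paper's HPF compiler algorithm), many small per-element sends destined for the same processor can be batched into one larger message, cutting message count and per-message overhead:

```python
from collections import defaultdict

def coalesce(messages):
    """Group per-element messages by destination rank so each
    destination receives one combined payload instead of many
    small ones (runtime toy model; the paper does this statically)."""
    batches = defaultdict(list)
    for dest, payload in messages:
        batches[dest].append(payload)
    return dict(batches)

# Ten single-element sends to two ranks become two batched sends.
msgs = [(0, i) for i in range(5)] + [(1, i) for i in range(5)]
batched = coalesce(msgs)
assert len(batched) == 2 and batched[0] == [0, 1, 2, 3, 4]
```

The payoff is largest when per-message latency dominates, which is exactly the fine-grained, many-small-messages regime the abstract describes.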

Active pebbles

Jeremiah James Willcock, Torsten Hoefler, Nicholas Gerard Edmonds, Andrew Lumsdaine
2011 Proceedings of the 16th ACM symposium on Principles and practice of parallel programming - PPoPP '11  
A variety of programming models exist to support large-scale, distributed memory, parallel computation.  ...  Fine-grained, irregular, and unstructured applications such as those found in biology, social network analysis, and graph theory are less well supported.  ...  Conclusion Active Pebbles consists of two parts, a programming model which allows fine-grained, unstructured, data-driven applications to be expressed at their natural level of granularity, and an execution  ... 
doi:10.1145/1941553.1941601 dblp:conf/ppopp/WillcockHEL11 fatcat:skwkwupkmnhxlk3msu7dd2jdti

Communication optimizations for fine-grained UPC applications

Wei-Yu Chen, C. Iancu, K. Yelick
2005 14th International Conference on Parallel Architectures and Compilation Techniques (PACT'05)  
In this paper we present three optimization techniques for global address space programs with fine-grained communication: redundancy elimination, use of split-phase communication, and communication coalescing  ...  applications.  ...  The matrix data is distributed along columns, and communication occurs in the form of accesses to elements on the same row. For this benchmark coalescing is not applicable.  ... 
doi:10.1109/pact.2005.13 dblp:conf/IEEEpact/ChenIY05 fatcat:d36dl2nkyzcpthtp5sxjlzuyv4

A Simple BSP-based Model to Predict Execution Time in GPU Applications

Marcos Amaris, Daniel Cordeiro, Alfredo Goldman, Raphael Y. de Camargo
2015 2015 IEEE 22nd International Conference on High Performance Computing (HiPC)  
The main idea of the BSP model is to treat communication and computation as abstractions of a parallel system.  ...  The Bulk Synchronous Parallel (BSP) model is a bridging model for parallel computation that allows algorithmic analysis of programs on parallel computers using performance modeling.  ...  ACKNOWLEDGMENT We would like to thank Cleber Silva Ferreira da Luz for the source-code of the maximum subarray problem.  ... 
doi:10.1109/hipc.2015.34 dblp:conf/hipc/AmarisCGC15 fatcat:e34dq32mxjd7fap7ukcqa3a2xa
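The standard BSP cost formula that such models build on prices one superstep as w + g·h + L, where w is the maximum local work, h the maximum number of words any processor sends or receives, g the per-word communication cost, and L the barrier-synchronization latency. A minimal sketch (machine parameters here are illustrative, not taken from the paper):

```python
def bsp_superstep_cost(w, h, g, L):
    """Predicted time of one BSP superstep: local work, plus the
    h-relation communication cost, plus barrier synchronization."""
    return w + g * h + L

def bsp_program_cost(supersteps, g, L):
    """Total predicted time: sum over supersteps given as (w, h) pairs."""
    return sum(bsp_superstep_cost(w, h, g, L) for w, h in supersteps)

# Two supersteps on a hypothetical machine with g=4 and L=100.
assert bsp_program_cost([(1000, 50), (500, 20)], g=4, L=100) == 1980
```
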

On the Robust Mapping of Dynamic Programming onto a Graphics Processing Unit

Shucai Xiao, Ashwin M. Aji, Wu-chun Feng
2009 2009 15th International Conference on Parallel and Distributed Systems  
Graphics processing units (GPUs) have been widely used to accelerate algorithms that exhibit massive data parallelism or task parallelism.  ...  When such parallelism is not inherent in an algorithm, computational scientists resort to simply replicating the algorithm on every multiprocessor of an NVIDIA GPU, for example, to create such parallelism  ...  Acknowledgments We would like to thank Heshan Lin, Jeremy Archuleta, Tom Scogland, and Song Huang for their technical support and feedback on the manuscript.  ... 
doi:10.1109/icpads.2009.110 dblp:conf/icpads/XiaoAF09 fatcat:i7feinjdkncalglj667eew3zba

AM++: A Generalized Active Message Framework

Jeremiah James Willcock, Torsten Hoefler, Nicholas Gerard Edmonds, Andrew Lumsdaine
2010 Proceedings of the 19th international conference on Parallel architectures and compilation techniques - PACT '10  
Active messages have proven to be an effective approach for certain communication problems in high performance computing.  ...  Our library allows message handlers to be run in an explicit loop that can be optimized and vectorized by the compiler and that can also be executed in parallel on multicore architectures.  ...  We also thank Prabhanjan Kambadur and Laura Hopkins for helpful discussions.  ... 
doi:10.1145/1854273.1854323 dblp:conf/IEEEpact/WillcockHEL10 fatcat:6vcf4fc2ovhexe73w7pbx56bwa

Instruction set extensions for photonic synchronous coalesced accesses

Paul Keltcher, David Whelihan, Jeffrey Hughes
2013 2013 IEEE High Performance Extreme Computing Conference (HPEC)  
This operation is described, and its ISA implications explored in the context of the distributed matrix transpose, which exhibits a high degree of data non-locality and is difficult to parallelize efficiently  ...  This lack of explicit synchrony, caused by limitations of metal interconnect, limits parallel efficiency.  ...  In that paper, the Synchronous Coalesced Access (SCA), which alleviates the effects of non-locality by reorganizing data in-flight in a photonic waveguide, is introduced.  ... 
doi:10.1109/hpec.2013.6670326 dblp:conf/hpec/KeltcherWH13 fatcat:y7fki3y375fsvpdumbgldwy4ze

A comparative analysis of microarchitecture effects on CPU and GPU memory system behavior

Joel Hestness, Stephen W. Keckler, David A. Wood
2014 2014 IEEE International Symposium on Workload Characterization (IISWC)  
This paper presents a detailed comparison of memory access behavior for parallel applications executing on each core type in a tightly controlled heterogeneous CPU-GPU processor simulation.  ...  This characterization indicates that applications are typically designed with similar algorithmic structures for CPU and GPU cores, and each core type's memory access path has a similar locality filtering  ...  We perform this analysis for data-parallel applications.  ... 
doi:10.1109/iiswc.2014.6983054 dblp:conf/iiswc/HestnessKW14 fatcat:k76obdosvfhi5aftguivbbyhbe

Approaching Long Genomic Regions and Large Recombination Rates with msParSm as an Alternative to MaCS

Carlos Montemuiño, Antonio Espinosa, Juan C. Moure, Gonzalo Vera, Porfidio Hernández, Sebastián Ramos-Onsins
2016 Evolutionary Bioinformatics  
The msParSm application is an evolution of msPar, the parallel version of the coalescent simulation program ms, which removes the limitation on simulating long stretches of DNA sequences with large recombination rates, without compromising the accuracy of the standard coalescence.  ...  The speedup is defined as S p = T 1 /T p , where p is the number of processors, T 1 is the execution time of the sequential application, and T p is the execution time of the parallel application with p  ... 
doi:10.4137/ebo.s40268 pmid:27721650 pmcid:PMC5047705 fatcat:2j2bbo5einhbrcpnhg3au7zxpi
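The speedup definition quoted above is easy to compute directly; the sketch below also adds the standard companion metric, parallel efficiency E_p = S_p / p, which the abstract does not mention but which follows from the same quantities:

```python
def speedup(t1, tp):
    """S_p = T_1 / T_p: sequential time over parallel time on p processors."""
    return t1 / tp

def efficiency(t1, tp, p):
    """E_p = S_p / p: fraction of ideal linear speedup achieved."""
    return speedup(t1, tp) / p

# A run taking 120 s sequentially and 20 s on 8 processors:
assert speedup(120, 20) == 6.0
assert efficiency(120, 20, 8) == 0.75
```
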

The Anatomy of Large-Scale Distributed Graph Algorithms [article]

Jesun Sahariar Firoz, Thejaka Amila Kanewala, Marcin Zalewski, Martina Barnas, Andrew Lumsdaine
2015 arXiv   pre-print
The performance analysis becomes a truly experimental science, even more challenging in the presence of massive irregularity and data dependency.  ...  To begin this process, we provide an initial set of recommendations for describing DGA results based on our analysis of the current state of the field.  ...  For example, Cray MPI provides an option for starting progression pthreads that perform internal MPI progress in parallel with the application threads.  ... 
arXiv:1507.06702v1 fatcat:yolns423c5fxhcsgkoeghrxgte

Combining Static and Dynamic Data Coalescing in Unified Parallel C

Michail Alvanos, Montse Farreras, Ettore Tiotto, Jose Nelson Amaral, Xavier Martorell
2016 IEEE Transactions on Parallel and Distributed Systems  
These languages allow fine-grained communication and lead to programs that perform many fine-grained accesses to data.  ...  This paper addresses important limitations in the code generation for Partitioned Global Address Space (PGAS) languages.  ...  Parallel languages and programming models must provide simple means for developing applications that can run on parallel systems without sacrificing performance.  ... 
doi:10.1109/tpds.2015.2405551 fatcat:isr4fuw6nvfpzfo4abngauwame

Expressing graph algorithms using generalized active messages

Nick Edmonds, Jeremiah Willcock, Andrew Lumsdaine
2013 Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programming - PPoPP '13  
The data-driven nature of graph applications necessitates a more complex application stack incorporating runtime optimization.  ...  Practical implementations and performance results are provided for a number of representative algorithms.  ...  Programming Model Message passing, an effective programming model for regular HPC applications, provides a clear separation of address spaces and makes all communication explicit.  ... 
doi:10.1145/2442516.2442549 dblp:conf/ppopp/EdmondsWL13 fatcat:o37koj7wkbcfzm6zlls3n63f3y

Enhancing Performance Portability of MPI Applications through Annotation-Based Transformations

Md. Ziaul Haque, Qing Yi, James Dinan, Pavan Balaji
2013 2013 42nd International Conference on Parallel Processing  
MPI is the de facto standard for portable parallel programming on high-end systems.  ...  We use our annotation-based approach to optimize several benchmark kernels, and we demonstrate that the framework is effective at automatically improving performance portability for MPI applications.  ...  ACKNOWLEDGMENTS This work was supported through resource grants from the Argonne Leadership Computing Facility, the Argonne Laboratory Computing Resource Center, the Oak Ridge National Center for Computational  ... 
doi:10.1109/icpp.2013.77 dblp:conf/icpp/HaqueYDB13 fatcat:t432ks4ahba25byi2sshbz777y

A performance model for fine-grain accesses in UPC

Zhang Zhang, S.R. Seidel
2006 Proceedings 20th IEEE International Parallel & Distributed Processing Symposium  
UPC's implicit communication and fine-grain programming style make application performance modeling a challenging task.  ...  The correspondence between remote references and communication events depends on the internals of the compiler and runtime system. This correspondence is often hidden from application developers.  ...  Note that the effects of these optimizations are not disjoint. For example, remote access caching can sometimes provide the effect of coalescing multiple accesses to the same remote thread.  ... 
doi:10.1109/ipdps.2006.1639302 dblp:conf/ipps/ZhangS06 fatcat:ccfnf34kyjfe3facx6i4lgdu54

A Fine-Grained Pipelined Implementation of LU Decomposition on SIMD Processors [chapter]

Kai Zhang, ShuMing Chen, Wei Liu, Xi Ning
2013 Lecture Notes in Computer Science  
By transforming non-coalesced memory accesses into a coalesced form, the proposed algorithm achieves high pipeline parallelism and highly efficient memory access.  ...  The fine-grained algorithm exploits the data dependences of the native algorithm to expose fine-grained parallelism across all computation resources.  ...  The new algorithm respects the data dependences of the native algorithm and effectively transforms non-coalesced memory accesses into a coalesced form, which increases the pipeline parallelism,  ... 
doi:10.1007/978-3-642-40820-5_4 fatcat:ysyroxx5abgixozjkpj5s7wyam
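A common way to turn non-coalesced accesses into coalesced ones (a generic layout transformation, not the paper's SIMD-specific LU algorithm) is to rearrange data so that adjacent lanes touch adjacent addresses, e.g. switching from array-of-structs to struct-of-arrays indexing:

```python
def aos_indices(n_points, field, n_fields):
    """Array-of-structs layout: lane i reads element i*n_fields + field.
    The stride is n_fields, so adjacent lanes hit non-adjacent addresses
    (non-coalesced)."""
    return [i * n_fields + field for i in range(n_points)]

def soa_indices(n_points, field, n_fields):
    """Struct-of-arrays layout: lane i reads element field*n_points + i.
    The stride is 1, so adjacent lanes hit adjacent addresses (coalesced)."""
    return [field * n_points + i for i in range(n_points)]

# With 4 points of 3 fields each, lanes reading field 1:
assert aos_indices(4, 1, 3) == [1, 4, 7, 10]   # strided, non-coalesced
assert soa_indices(4, 1, 3) == [4, 5, 6, 7]    # contiguous, coalesced
```

With unit-stride (struct-of-arrays) indexing, a SIMD or GPU memory system can serve all lanes from one wide transaction instead of one transaction per lane.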