
Characterizing MPI and Hybrid MPI+Threads Applications at Scale: Case Study with BFS

Abdelhalim Amer, Huiwei Lu, Pavan Balaji, Satoshi Matsuoka
2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing  
With the increasing prominence of many-core architectures and decreasing per-core resources on large supercomputers, a number of application developers are investigating the use of hybrid MPI+threads  ...  In this paper, we use a distributed implementation of the breadth-first search algorithm in order to understand the performance characteristics of MPI-only and MPI+threads models at scale.  ...  In this work, we studied the MPI-only and hybrid MPI+threads models using the BFS algorithm at very large scale.  ... 
doi:10.1109/ccgrid.2015.93 dblp:conf/ccgrid/AmerLBM15 fatcat:h6llvqiuovb43c43mgyobcbb5e
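
A minimal sketch of the distinction the snippet draws (not code from the paper): an MPI-only run simply initializes MPI, whereas a hybrid MPI+threads run requests a thread level such as MPI_THREAD_MULTIPLE so that several threads per rank can issue communication; the BFS-specific parts are omitted.

    /* Hedged sketch: requesting thread support for a hybrid MPI+threads run. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided, rank;

        /* An MPI-only application would just call MPI_Init(); a hybrid one
         * asks for MPI_THREAD_MULTIPLE so all threads may call MPI. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (provided < MPI_THREAD_MULTIPLE && rank == 0)
            fprintf(stderr, "thread level %d < MPI_THREAD_MULTIPLE\n", provided);

        /* ... per-rank BFS over the local graph partition would run here ... */

        MPI_Finalize();
        return 0;
    }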

A case study in top-down performance estimation for a large-scale parallel application

Ilya Sharapov, Robert Kroeger, Guy Delamarter, Razvan Cheveresan, Matthew Ramsay
2006 Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming - PPoPP '06  
Low-level analysis is complemented with scalability estimates based on modeling MPI, OpenMP and I/O activity in the code.  ...  For GTC, we identify the important phases of the iteration and perform low-level analysis that includes instruction tracing and component simulations of processor and memory systems.  ...  including Russ Brown, Lodewijk Bonebakker, Larry Brisson, John Busch, Chris Feucht, John Fredricksen, Ilya Gluhovsky, George Herman, Pranay Koka, Michael Koster, Eugene Loh, Brian O'Krafka, Andrew Over and  ... 
doi:10.1145/1122971.1122985 dblp:conf/ppopp/SharapovKDCR06 fatcat:2vuaiivpendazfoo5tymauu2ay

Parallel Breadth-First Search on Distributed Memory Systems [article]

Aydin Buluc, Kamesh Madduri
2011 arXiv   pre-print
For both approaches, we also present hybrid versions with intra-node multithreading.  ...  vertices and 68.7 billion edges with skewed degree distribution.  ...  Acknowledgments Discussions with John R. Gilbert, Steve Reinhardt, and Adam Lugowski greatly improved our understanding of casting BFS iterations into sparse linear algebra.  ... 
arXiv:1104.4518v2 fatcat:a7nvtwil35dbtohgpnsfldeeki
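
The acknowledgments mention casting BFS iterations into sparse linear algebra; the serial sketch below illustrates that general idea only, assuming a CSR adjacency structure: one BFS level written as "adjacency matrix times sparse frontier vector", where "addition" keeps the first discovered parent. It is not the paper's distributed implementation.

    /* Illustrative serial sketch: one BFS level as a sparse matrix-vector
     * product over a Boolean-like semiring.  Graph stored in CSR form. */

    /* rowptr/colidx: CSR adjacency; frontier: vertices of the current level;
     * parent[v] == -1 marks unvisited.  Returns the next frontier's size. */
    int bfs_level(const int *rowptr, const int *colidx,
                  const int *frontier, int nf,
                  int *parent, int *next_frontier)
    {
        int nnext = 0;
        for (int i = 0; i < nf; i++) {          /* each nonzero of the frontier */
            int u = frontier[i];
            for (int k = rowptr[u]; k < rowptr[u + 1]; k++) {
                int v = colidx[k];              /* neighbor = column index      */
                if (parent[v] == -1) {          /* "addition" keeps first parent */
                    parent[v] = u;
                    next_frontier[nnext++] = v;
                }
            }
        }
        return nnext;
    }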

Parallel breadth-first search on distributed memory systems

Aydin Buluç, Kamesh Madduri
2011 Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis - SC '11  
For both approaches, we also present hybrid versions with intra-node multithreading.  ...  vertices and 68.7 billion edges with skewed degree distribution.  ...  We experiment with two parallel programming models: "Flat MPI" with one process per core, and a hybrid implementation with one or more MPI processes within a node.  ... 
doi:10.1145/2063384.2063471 dblp:conf/sc/BulucM11 fatcat:cn4tlzqd4ndqlhekngx76hjvhy

Resilience Design Patterns - A Structured Approach to Resilience at Extreme Scale (version 1.0) [article]

Saurabh Hukerikar, Christian Engelmann
2016 arXiv   pre-print
The overall goal of this work is to enable a systematic methodology for the design and evaluation of resilience technologies in extreme-scale HPC systems that keep scientific applications running to a  ...  We identify the commonly occurring problems and solutions used to deal with faults, errors and failures in HPC systems.  ...  Each block contains at least a primary design and exceptional case handler along with an adjudicator.  ... 
arXiv:1611.02717v2 fatcat:sumkgkwokzaonemt6oxnyxysra
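
The snippet notes that each pattern block combines a primary design, an exceptional-case handler, and an adjudicator; the hypothetical C sketch below (all names invented for illustration) shows one way such a block could be wired together.

    /* Hypothetical structure (names invented): a primary design, an
     * exceptional-case handler, and an adjudicator that accepts or
     * rejects the produced result. */
    #include <stdbool.h>

    typedef struct {
        int  (*primary)(const void *in, void *out);          /* normal path      */
        int  (*handler)(const void *in, void *out);          /* exceptional path */
        bool (*adjudicate)(const void *in, const void *out); /* accept result?   */
    } resilience_block;

    int run_block(const resilience_block *b, const void *in, void *out)
    {
        if (b->primary(in, out) == 0 && b->adjudicate(in, out))
            return 0;                       /* primary result accepted           */
        return b->handler(in, out);         /* otherwise take the exception path */
    }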

Graphite

Mohammad Hasanzadeh Mofrad, Rami Melhem, Yousuf Ahmad, Mohammad Hammoud
2020 Proceedings of the VLDB Endowment  
MPI part), then splits each partition into subpartitions among the threads of each process as a method to scale up within a machine (the X part).  ...  Consequently, it contrasts with the traditional MPI + X parallelism model, which utilizes process-based partitioning to distribute data among processes as a way to scale out on a cluster of machines (the  ...  The conventional MPI + X parallelism model [4, 52, 59] is a hybrid scheme with: (1) a Message Passing Interface (MPI) [26, 46] used for horizontal scaling (or scaling out) across cluster nodes, and  ... 
doi:10.14778/3380750.3380751 fatcat:obcasaxxfnadnhhinw5arnli4a
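
The snippet contrasts Graphite's scheme with the conventional MPI + X model, in which MPI processes own data partitions (scaling out) and threads work on subpartitions within a machine (scaling up); the minimal MPI+OpenMP illustration of that conventional two-level split below uses an invented array-based workload, not Graphite's code.

    /* Minimal MPI+OpenMP sketch of the conventional "MPI + X" scheme: each MPI
     * process owns one partition (scale out), and its OpenMP threads split that
     * partition into subpartitions (scale up).  Sizes are illustrative. */
    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        long n = 1L << 24;                   /* global problem size (illustrative) */
        long chunk = n / nprocs;             /* partition owned by this process    */
        double *part = malloc(chunk * sizeof *part);

        #pragma omp parallel for             /* the "X" part: threads take         */
        for (long i = 0; i < chunk; i++)     /* subpartitions of the local chunk   */
            part[i] = (double)(rank * chunk + i);

        free(part);
        MPI_Finalize();
        return 0;
    }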

Improving sparse data movement performance using multiple paths on the Blue Gene/Q supercomputer

Huy Bui, Eun-Sung Jung, Venkatram Vishwanath, Andrew Johnson, Jason Leigh, Michael E. Papka
2016 Parallel Computing  
We demonstrate the efficacy of our solutions through a set of microbenchmarks and application benchmarks on Mira scaling up to 131,072 compute cores.  ...  The results show that our approach achieves up to 5X improvement in achievable throughput compared with the default mechanisms.  ...  This research used resources of the Argonne Leadership Computing Facility (ALCF) at Argonne National Laboratory. We thank the ALCF team for discussions and help related to this paper.  ... 
doi:10.1016/j.parco.2015.09.002 fatcat:4ze4glj42neqhdyec6yv4xfx3a

The Case for Strong Scaling in Deep Learning: Training Large 3D CNNs with Hybrid Parallelism [article]

Yosuke Oyama, Naoya Maruyama, Nikoli Dryden, Erin McCarthy, Peter Harrington, Jan Balewski, Satoshi Matsuoka, Peter Nugent, Brian Van Essen
2020 arXiv   pre-print
Our comprehensive performance studies show that good weak and strong scaling can be achieved for both networks using up to 2K GPUs.  ...  We present scalable hybrid-parallel algorithms for training large-scale 3D convolutional neural networks.  ...  In prior work, LBANN has been optimized to provide parallel I/O using both MPI and multi-threading, but was limited to a single MPI rank per sample.  ... 
arXiv:2007.12856v1 fatcat:okzrjrohtzdgfpnsdgjl4yfnki

Pardicle: Parallel Approximate Density-Based Clustering

Md. Mostofa Ali Patwary, Nadathur Satish, Narayanan Sundaram, Fredrik Manne, Salman Habib, Pradeep Dubey
2014 SC14: International Conference for High Performance Computing, Networking, Storage and Analysis  
Our experiments on astrophysics and synthetic massive datasets (8.5 billion numbers) show that our approximate algorithm is up to 56x faster than exact algorithms with almost identical quality (Omega-Index  ...  We demonstrate near-linear speedup on shared memory (15x using 16 cores, single node Intel Xeon processor) and distributed memory (3917x using 4096 cores, multinode) computers, with 2x additional performance  ...  For our algorithm to scale and be accurate, we need to minimize the number of such cases. The most important factor here is how thread ownership is done.  ... 
doi:10.1109/sc.2014.51 dblp:conf/sc/PatwarySSMHD14 fatcat:5qzt72v3cbcbtemhokcjttnf34

Exploring HPC and Big Data Convergence: A Graph Processing Study on Intel Knights Landing

Alexandru Uta, Ana Lucia Varbanescu, Ahmed Musaafir, Chris Lemaire, Alexandru Iosup
2018 IEEE International Conference on Cluster Computing (CLUSTER)  
The hardware is currently different, and fast evolving: big data uses machines with modest numbers of fat cores per socket, large caches, and much memory, whereas HPC uses machines with larger numbers  ...  In this work, we investigate the convergence of big data and HPC infrastructure for one of the most challenging application domains, the highly irregular graph processing.  ...  The infrastructure was kindly provided by Intel as a gift, by the Dutch Supercomputing Center through the HPGraph NWO grant, and by the Dutch DAS5 Supercomputing infrastructure co-sponsored by NWO.  ... 
doi:10.1109/cluster.2018.00019 dblp:conf/cluster/UtaVMLI18 fatcat:fmhqysujyrgf7jzr4ooxn6s7xy

In situ and in-transit analysis of cosmological simulations

Brian Friesen, Ann Almgren, Zarija Lukić, Gunther Weber, Dmitriy Morozov, Vincent Beckner, Marcus Day
2016 Computational Astrophysics and Cosmology  
We demonstrate this approach in the compressible gasdynamics/N-body code Nyx, a hybrid MPI + OpenMP code based on the BoxLib framework, used for large-scale cosmological simulations.  ...  The other consists of partitioning processes into disjoint MPI groups, with one performing the simulation and periodically sending data to the other 'sidecar' group, which post-processes it while the simulation  ...  Most figures in this work were generated with matplotlib (Hunter 2007). This work made extensive use of the NASA Astrophysics Data System and of the astro-ph preprint archive at arXiv.org.  ... 
doi:10.1186/s40668-016-0017-2 pmid:31149559 pmcid:PMC6511997 fatcat:tpxxzs6uqbgylmtviitls6uxim
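
The second approach mentioned in the snippet partitions processes into disjoint MPI groups, one running the simulation and one acting as the 'sidecar' analysis group; a hedged sketch of forming such disjoint groups with MPI_Comm_split follows (the 1/8 split and other details are illustrative, not Nyx's configuration).

    /* Hedged sketch: split the world communicator into a simulation group
     * and a smaller 'sidecar' post-processing group. */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int is_sidecar = (rank >= size - size / 8);   /* last ~1/8 of ranks */
        MPI_Comm group_comm;
        MPI_Comm_split(MPI_COMM_WORLD, is_sidecar, rank, &group_comm);

        if (is_sidecar) {
            /* post-process data periodically received from the simulation group */
        } else {
            /* advance the simulation; periodically send data to the sidecar     */
        }

        MPI_Comm_free(&group_comm);
        MPI_Finalize();
        return 0;
    }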

The Peano software - parallel, automaton-based, dynamically adaptive grid traversals [article]

Tobias Weinzierl
2018 arXiv   pre-print
The traversal can exploit regular grid subregions and shared memory as well as distributed memory systems with almost no modifications to a serial application code.  ...  We further sketch the supported application types and the two data storage schemes realized, before we detail high-performance computing aspects and lessons learned.  ...  Thanks are due to all the scientists and students who contributed to the software in terms of software fragments, applications, extensions and critical remarks.  ... 
arXiv:1506.04496v6 fatcat:iwgintogxjgiviybhvymz5xxou

The Peano Software—Parallel, Automaton-based, Dynamically Adaptive Grid Traversals

Tobias Weinzierl
2019 ACM Transactions on Mathematical Software  
Thanks are due to all the scientists and students who contributed to the software in terms of software fragments, applications, extensions and critical remarks.  ...  Notably, thanks are due to Hans-Joachim Bungartz and his group at Technische Universität München who provided the longest-term environment for the development of this code.  ...  a hybrid of DFS and BFS.  ... 
doi:10.1145/3319797 fatcat:5kohpbz3w5eqhfmjg67qngdsla

Modern server ARM processors for supercomputers: A64FX and others. Initial data of benchmarks

Mikhail Borisovich Kuzminsky
2022 Program Systems: Theory and Applications  
The performance of the A64FX is compared against corresponding data for Intel Xeon Skylake and Cascade Lake, and AMD EPYC with Zen 2 and 3 (Rome and Milan), as well as Nvidia V100 and A100 GPUs.  ...  The HPC performance review focuses primarily on benchmarks and applications for the A64FX, which supports longer vectors than other ARM processors and has higher peak performance.  ...  The performance obtained on the A64FX (the dependence on the number of OpenMP threads and MPI ranks was also studied) was higher than that of a two-processor server with Xeon Skylake (24 cores  ... 
doi:10.25209/2079-3316-2022-13-1-131-194 fatcat:fr4ypewxnfgb5h2jtvuacxhhuq

An Adaptive Parallel Algorithm for Computing Connected Components [article]

Chirag Jain, Patrick Flick, Tony Pan, Oded Green, Srinivas Aluru
2017 arXiv   pre-print
Using large graphs with diverse topologies from domains including metagenomics, web crawl, social graph and road networks, we show that our hybrid implementation is efficient and scalable for each of the  ...  To address this challenge, we employ a heuristic that allows the algorithm to quickly predict the type of the network by computing the degree distribution and follow the optimal hybrid route.  ...  ACKNOWLEDGMENT We thank George Slota for sharing the implementation of Multistep method and helping us reproduce previous results.  ... 
arXiv:1607.06156v3 fatcat:nclknvs2vbgejggkeqmzjpqd4i
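
The snippet says the algorithm predicts the network type from its degree distribution before choosing a route; the sketch below is a hypothetical stand-in for such a heuristic, classifying a graph as skewed or regular from a simple max-to-mean degree ratio (the measure and threshold are invented, not the paper's).

    /* Hypothetical heuristic: guess whether a graph is skewed (social/web-like)
     * or regular (road/mesh-like) from its degree distribution. */
    #include <stddef.h>

    enum graph_kind { GRAPH_SKEWED, GRAPH_REGULAR };

    enum graph_kind classify(const size_t *degree, size_t n)
    {
        size_t max_deg = 0;
        double sum = 0.0;
        for (size_t i = 0; i < n; i++) {
            if (degree[i] > max_deg) max_deg = degree[i];
            sum += (double)degree[i];
        }
        double mean = (n > 0) ? sum / (double)n : 0.0;
        /* strongly skewed distributions have a max degree far above the mean */
        return (mean > 0.0 && (double)max_deg > 100.0 * mean)
                   ? GRAPH_SKEWED : GRAPH_REGULAR;
    }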