
Poster reception---Optimized collectives for PGAS languages with one-sided communication

Dan Bonachea, Paul Hargrove, Rajesh Nishtala, Michael Welcome, Katherine Yelick
2006 Proceedings of the 2006 ACM/IEEE conference on Supercomputing - SC '06  
support (e.g. hardware broadcast) • Collective interface specifically designed for PGAS Languages • Data movement: Broadcast, Scatter, Gather, Gather-All, Transpose • Computational: Reduce, Prefix-Reduce • Superset of collective support in UPC and Titanium languages • Extensible  ... 
doi:10.1145/1188455.1188604 dblp:conf/sc/BonacheaHNWY06 fatcat:s7pvxh7sgve5rmis3dg6353r2u
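
The collectives listed in this entry have direct analogues in MPI's interface. As a minimal illustration of the same semantics (not the PGAS/GASNet collective API described in the poster), the broadcast and prefix-reduce operations look like this in MPI:

    /* Minimal MPI analogues of the collectives named above (broadcast and
     * prefix-reduce); an illustration of the semantics only, not the PGAS
     * collective API from the poster. */
    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int value = rank;
        MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);              /* Broadcast */

        int prefix = 0;
        MPI_Scan(&rank, &prefix, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD); /* Prefix-Reduce */

        MPI_Finalize();
        return 0;
    }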

UCX: An Open Source Framework for HPC Network APIs and Beyond

Pavel Shamis, Manjunath Gorentla Venkata, M. Graham Lopez, Matthew B. Baker, Oscar Hernandez, Yossi Itigin, Mike Dubman, Gilad Shainer, Richard L. Graham, Liran Liss, Yiftah Shahar, Sreeram Potluri (+8 others)
2015 2015 IEEE 23rd Annual Symposium on High-Performance Interconnects  
UCX design provides the ability to tailor its APIs and network functionality to suit a wide variety of application domains and hardware.  ...  With UCX, we achieved a message exchange latency of 0.89 µs, a bandwidth of 6138.5 MB/s, and a message rate of 14 million messages per second.  ...  We want to thank Stephen Poole, a co-founder of this project, who helped us with countless hours of technical discussions that made this project a reality.  ... 
doi:10.1109/hoti.2015.13 dblp:conf/hoti/ShamisVLBHIDSGL15 fatcat:63cwg7qguvfp5olammokrdk7ze
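
Latency and message-rate figures like the ones quoted above are typically obtained with a ping-pong microbenchmark. A minimal sketch at the MPI level (UCX sits underneath such runtimes; this does not use the ucp_* API directly):

    /* Ping-pong latency sketch: rank 0 sends an 8-byte message to rank 1
     * and waits for the echo; half the round-trip time is the one-way latency. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        char buf[8] = {0};
        const int iters = 10000;

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(buf, 8, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, 8, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, 8, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, 8, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t1 = MPI_Wtime();
        if (rank == 0)
            printf("one-way latency: %.2f us\n", (t1 - t0) / (2.0 * iters) * 1e6);
        MPI_Finalize();
        return 0;
    }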

Efficient RDMA-based multi-port collectives on multi-rail QsNetII clusters

Ying Qian, A. Afsahi
2006 Proceedings 20th IEEE International Parallel & Distributed Processing Symposium  
The proposed multi-port all-to-all performs better than the native elan_alltoall by a factor of 2.19 for 16 KB messages. Moreover, we have also proposed two algorithms for the scatter operation.  ...  Many scientific applications use MPI collective communications intensively.  ...  Acknowledgments The authors would like to thank the anonymous referees for their insightful comments.  ... 
doi:10.1109/ipdps.2006.1639563 dblp:conf/ipps/QianA06 fatcat:e4qplaqdzjb4vc4c6x6nwfcloy
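
The all-to-all and scatter collectives optimized in this work have the following standard semantics at the MPI level; this is only a usage sketch, not the authors' RDMA-based multi-port algorithms:

    /* Standard MPI all-to-all and scatter: each of the p processes contributes
     * one int per peer in the all-to-all, and the root hands out one int per
     * process in the scatter. */
    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, p;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &p);

        int *sendbuf = malloc(p * sizeof(int));
        int *recvbuf = malloc(p * sizeof(int));
        for (int i = 0; i < p; i++) sendbuf[i] = rank * 100 + i;

        MPI_Alltoall(sendbuf, 1, MPI_INT, recvbuf, 1, MPI_INT, MPI_COMM_WORLD);

        int item;
        MPI_Scatter(sendbuf, 1, MPI_INT, &item, 1, MPI_INT, 0, MPI_COMM_WORLD);

        free(sendbuf); free(recvbuf);
        MPI_Finalize();
        return 0;
    }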

Design of High Performance MVAPICH2: MPI2 over InfiniBand

W. Huang, G. Santhanaraman, H.-W. Jin, Q. Gao, D.K. Panda
2006 Sixth IEEE International Symposium on Cluster Computing and the Grid (CCGRID'06)  
MPICH2 provides a layered architecture for implementing MPI-2. In this paper, we provide a new design for implementing MPI-2 over InfiniBand by extending the MPICH2 ADI3 layer.  ...  Our new design aims to achieve high performance by providing a multi-communication method framework that can utilize appropriate communication channels/devices to attain optimal performance without compromising  ...  In the future, the framework we propose can also be extended to incorporate optimized algorithms for collectives that directly utilize the hardware capabilities of InfiniBand, such as hardware multicast  ... 
doi:10.1109/ccgrid.2006.32 dblp:conf/ccgrid/HuangSJGP06 fatcat:2pusrbyiqndvta2kqj3w3ujuiq

Scalable Hierarchical Aggregation and Reduction Protocol (SHARP)™ Streaming-Aggregation Hardware Design and Evaluation [chapter]

Richard L. Graham, Lion Levi, Devendar Burredy, Gil Bloch, Gilad Shainer, David Cho, George Elias, Daniel Klein, Joshua Ladd, Ophir Maor, Ami Marelli, Valentin Petrov (+3 others)
2020 Lecture Notes in Computer Science  
...  MPI_Allreduce(), which is used to gather equal-sized vectors from all members of the collective group, produce a single output vector, and return this to all members of the group.  ...  This paper describes the new hardware-based streaming-aggregation capability added to Mellanox's Scalable Hierarchical Aggregation and Reduction Protocol in its HDR InfiniBand switches.  ...  They include the message size restrictions imposed by InfiniBand and HCA capabilities, such as the gather/scatter capabilities.  ... 
doi:10.1007/978-3-030-50743-5_3 fatcat:zcadrld7hzcm3izseyc65o5ld4
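
The MPI_Allreduce() semantics described in the snippet, where every rank contributes an equal-sized vector and all ranks receive the combined result, look as follows at the application level. This is the plain MPI call; an in-network implementation such as SHARP changes where the reduction happens, not the interface:

    /* Every rank contributes a 4-element vector; all ranks receive the
     * element-wise sum. */
    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double in[4]  = {rank, rank, rank, rank};
        double out[4] = {0};
        MPI_Allreduce(in, out, 4, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        MPI_Finalize();
        return 0;
    }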

Optimized Broadcast for Deep Learning Workloads on Dense-GPU InfiniBand Clusters: MPI or NCCL? [article]

Ammar Ahmad Awan, Ching-Hsiang Chu, Hari Subramoni, Dhabaleswar K. Panda
2017 arXiv   pre-print
In addition, the proposed designs provide up to 7% improvement over NCCL-based solutions for data parallel training of the VGG network on 128 GPUs using Microsoft CNTK.  ...  However, with the advent of MPI+CUDA applications and CUDA-Aware MPI runtimes like MVAPICH2 and OpenMPI, it has become important to address efficient communication schemes for such dense Multi-GPU nodes  ...  NCCL's API closely resembles the MPI interface and provides communication primitives for broadcast, all-gather, reduce, reduce-scatter, and all-reduce.  ... 
arXiv:1707.09414v1 fatcat:lqh3x46v7jcqvkxjdasjxxxqda
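
As the snippet notes, NCCL's primitives mirror MPI's collectives. A minimal host-side sketch of a broadcast followed by an allreduce, assuming a communicator created elsewhere (e.g. with ncclCommInitRank), a valid CUDA stream, and device-resident buffers; the function name is illustrative:

    /* Hedged sketch: NCCL broadcast and allreduce on device buffers. */
    #include <cuda_runtime.h>
    #include <nccl.h>

    void broadcast_and_allreduce(float *d_buf, float *d_out, size_t n,
                                 ncclComm_t comm, cudaStream_t stream) {
        /* Root (rank 0) broadcasts its buffer to all ranks in the communicator. */
        ncclBcast(d_buf, n, ncclFloat, 0, comm, stream);

        /* Element-wise sum across all ranks; result available on every rank. */
        ncclAllReduce(d_buf, d_out, n, ncclFloat, ncclSum, comm, stream);

        /* NCCL calls are asynchronous with respect to the host; wait on the stream. */
        cudaStreamSynchronize(stream);
    }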

pMR: A high-performance communication library [article]

Peter Georg, Daniel Richtmann, Tilo Wettig
2017 arXiv   pre-print
Its lightweight nature, which avoids some of the unnecessary overhead introduced by MPI, allows us to improve the communication performance of applications without any algorithmic or complicated implementation  ...  We present a novel high-performance communication library that can be used as a de facto drop-in replacement for MPI in existing software.  ...  In the case of QPACE 2, which uses InfiniBand FDR for off-chip communication, this includes using Remote Direct Memory Access (RDMA) capabilities.  ... 
arXiv:1701.08521v1 fatcat:wv3hhyswyjefze3pkvydobjise

Towards A Data Centric System Architecture: SHARP

2017 Supercomputing Frontiers and Innovations  
The use of UD-Multicast to distribute aggregation results is introduced, reducing the latency of an eight-byte MPI_Allreduce() across 128 nodes by 16%.  ...  Use of reduction trees that avoid the inter-socket bus further improves the eight-byte MPI_Allreduce() latency across 128 nodes, with 28 processes per node, by 18%.  ...  These include technologies such as SHARP for handling data reduction and aggregation, hardware-based tag matching, and network hardware data gather/scatter capabilities.  ... 
doi:10.14529/jsfi170401 fatcat:ul33psqlf5bltnzbzcdbbcmlni

High Performance Remote Memory Access Communication: The Armci Approach

J. Nieplocha, V. Tipparaju, M. Krishnan, D. K. Panda
2006 The international journal of high performance computing applications  
Special emphasis is placed on the latency hiding mechanisms and ability to optimize noncontiguous data transfers.  ...  Department of Energy (DOE) Advanced Computational Testing and Simulation Toolkit project and currently used and advanced as a part of the run-time layer of the DOE project, Programming Models for Scalable  ...  The RMA model is closely aligned with RDMA capabilities of modern networks (Infiniband, Myrinet, VIA, Elan), which provide hardware support to read from or write to remote memory locations.  ... 
doi:10.1177/1094342006064504 fatcat:374gflhodrgxrpaznnc3zs5fna
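
The one-sided put/get model referred to here maps directly onto RDMA. As a hedged illustration using MPI-3 RMA rather than ARMCI's own calls, a remote write into another rank's exposed window looks like this (run with at least two ranks):

    /* One-sided communication sketch: every rank exposes a window; rank 0
     * writes into rank 1's memory without rank 1 making a matching call. */
    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double local[4] = {0};
        MPI_Win win;
        MPI_Win_create(local, sizeof(local), sizeof(double),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        MPI_Win_fence(0, win);
        if (rank == 0) {
            double v[4] = {1, 2, 3, 4};
            MPI_Put(v, 4, MPI_DOUBLE, 1, 0, 4, MPI_DOUBLE, win);  /* write into rank 1 */
        }
        MPI_Win_fence(0, win);

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }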

Performance Evaluation of Soft RoCE over 1 Gigabit Ethernet

Gurkirat Kaur
2013 IOSR Journal of Computer Engineering  
InfiniBand is a well-known technology, which provides high bandwidth and low latency and makes optimal use of built-in features like RDMA.  ...  This paper presents the heterogeneous Linux cluster configuration & evaluates its performance using Intel's MPI Benchmarks.  ...  Figure 8: Gatherv test. In this test, all processes input X bytes and the root gathers X * p bytes, where p is the number of processes.  ... 
doi:10.9790/0661-1548187 fatcat:pcdlvrptlnfabpwuo2ry6de2se
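
The Gatherv test described in the snippet exercises MPI_Gatherv: each of the p processes contributes X bytes and the root collects X * p bytes. A minimal sketch with an assumed X of 1024 bytes:

    /* Each process sends X bytes; the root gathers X bytes from each of the
     * p processes into a contiguous X * p byte buffer. */
    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, p;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &p);

        const int X = 1024;                 /* bytes per process (assumed) */
        char *sendbuf = malloc(X);
        char *recvbuf = NULL;
        int *counts = NULL, *displs = NULL;

        if (rank == 0) {
            recvbuf = malloc((size_t)X * p);
            counts  = malloc(p * sizeof(int));
            displs  = malloc(p * sizeof(int));
            for (int i = 0; i < p; i++) { counts[i] = X; displs[i] = i * X; }
        }

        MPI_Gatherv(sendbuf, X, MPI_CHAR,
                    recvbuf, counts, displs, MPI_CHAR, 0, MPI_COMM_WORLD);

        MPI_Finalize();
        return 0;
    }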

Automatic datatype generation and optimization

Fredrik Kjolstad, Torsten Hoefler, Marc Snir
2012 SIGPLAN notices  
MPI Datatypes provide an alternative by describing noncontiguous data layouts. This allows sophisticated hardware to retrieve data directly from application data structures.  ...  We have implemented the algorithm in a tool that transforms packing code to MPI Datatypes, and evaluated it by transforming 90 packing codes from the NAS Parallel Benchmarks.  ...  However, modern network hardware such as InfiniBand provides support for transferring non-contiguous data (scatter/gather).  ... 
doi:10.1145/2370036.2145878 fatcat:zzxpf7zc4vbzrfbi4r6b7k7cwu
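
A noncontiguous layout of the kind discussed here is described to MPI with a derived datatype, so the library (or scatter/gather-capable hardware) can fetch the data without a user-side packing loop. A minimal sketch for sending one column of a row-major matrix (hand-written datatype use, not the paper's automatic generation tool):

    /* Describe one column of a row-major rows x cols double matrix as a
     * strided MPI datatype and send it without packing. */
    #include <mpi.h>

    void send_column(double *matrix, int rows, int cols, int col,
                     int dest, MPI_Comm comm) {
        MPI_Datatype column;
        MPI_Type_vector(rows, 1, cols, MPI_DOUBLE, &column);  /* rows blocks of 1, stride cols */
        MPI_Type_commit(&column);

        MPI_Send(&matrix[col], 1, column, dest, 0, comm);

        MPI_Type_free(&column);
    }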

Automatic datatype generation and optimization

Fredrik Kjolstad, Torsten Hoefler, Marc Snir
2012 Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming - PPoPP '12  
MPI Datatypes provide an alternative by describing noncontiguous data layouts. This allows sophisticated hardware to retrieve data directly from application data structures.  ...  We have implemented the algorithm in a tool that transforms packing code to MPI Datatypes, and evaluated it by transforming 90 packing codes from the NAS Parallel Benchmarks.  ...  However, modern network hardware such as InfiniBand provides support for transferring non-contiguous data (scatter/gather).  ... 
doi:10.1145/2145816.2145878 dblp:conf/ppopp/KjolstadHS12 fatcat:k7cdzvlsjnbv7kyggmymcvotya

RDMA read based rendezvous protocol for MPI over InfiniBand

Sayantan Sur, Hyun-Wook Jin, Lei Chai, Dhabaleswar K. Panda
2006 Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming - PPoPP '06  
Most high-performance MPI implementations use a Rendezvous Protocol for efficient transfer of large messages. This protocol can be designed using either RDMA Write or RDMA Read.  ...  In this paper, we propose several mechanisms to exploit RDMA Read and selective interrupt-based asynchronous progress to provide better computation/communication overlap on InfiniBand clusters.  ...  Further, our application evaluation with Linpack (HPL) and NAS-SP (Class C) reveals that MPI_Wait time is reduced by around 30% and 28%, respectively, for a 36-node InfiniBand cluster.  ... 
doi:10.1145/1122971.1122978 dblp:conf/ppopp/SurJCP06 fatcat:mj56if6ozvei3fkaz54nzzwvta
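
In an RDMA-Read-based rendezvous, the receiver learns the sender's buffer address and rkey from a small control message and then pulls the payload itself. A minimal verbs-level sketch of that read, assuming the queue pair, memory registration, and remote address/rkey exchange have already happened (the helper name is illustrative):

    /* Receiver-side RDMA Read: pull `len` bytes from the sender's registered
     * buffer into a local registered buffer. */
    #include <infiniband/verbs.h>
    #include <stdint.h>
    #include <string.h>

    int post_rdma_read(struct ibv_qp *qp, struct ibv_mr *mr, void *local_buf,
                       uint32_t len, uint64_t remote_addr, uint32_t remote_rkey) {
        struct ibv_sge sge;
        memset(&sge, 0, sizeof(sge));
        sge.addr   = (uintptr_t)local_buf;
        sge.length = len;
        sge.lkey   = mr->lkey;

        struct ibv_send_wr wr, *bad_wr = NULL;
        memset(&wr, 0, sizeof(wr));
        wr.opcode              = IBV_WR_RDMA_READ;
        wr.sg_list             = &sge;
        wr.num_sge             = 1;
        wr.send_flags          = IBV_SEND_SIGNALED;   /* generate a completion */
        wr.wr.rdma.remote_addr = remote_addr;
        wr.wr.rdma.rkey        = remote_rkey;

        return ibv_post_send(qp, &wr, &bad_wr);       /* 0 on success */
    }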

Ibdxnet: Leveraging InfiniBand in Highly Concurrent Java Applications [article]

Stefan Nothaas, Kevin Beineke, Michael Schoettner
2018 arXiv   pre-print
We compared DXNet with the Ibdxnet transport to the MPI implementations FastMPJ and MVAPICH2.  ...  Furthermore, DXNet scales well on a high-load all-to-all communication with up to 8 nodes, achieving a total aggregated message rate of 43.4 million messages per second for small messages and a throughput  ...  When attaching a buffer to a WR, it is attached as a scatter-gather element (SGE) of a scatter-gather list (SGL).  ... 
arXiv:1812.01963v1 fatcat:ddo477fzzfbilmlqjajyjq4oli
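
Attaching buffers to a work request through a scatter-gather list, as described in the snippet, looks roughly like this at the verbs level (a C sketch of the underlying verbs calls, independent of Ibdxnet's Java-side wrappers; the helper name is illustrative):

    /* Two registered buffers attached to one send work request as a
     * two-element scatter-gather list; the HCA gathers both on the wire. */
    #include <infiniband/verbs.h>
    #include <stdint.h>
    #include <string.h>

    int post_gathered_send(struct ibv_qp *qp,
                           void *buf0, uint32_t len0, uint32_t lkey0,
                           void *buf1, uint32_t len1, uint32_t lkey1) {
        struct ibv_sge sgl[2];
        memset(sgl, 0, sizeof(sgl));
        sgl[0].addr = (uintptr_t)buf0; sgl[0].length = len0; sgl[0].lkey = lkey0;
        sgl[1].addr = (uintptr_t)buf1; sgl[1].length = len1; sgl[1].lkey = lkey1;

        struct ibv_send_wr wr, *bad_wr = NULL;
        memset(&wr, 0, sizeof(wr));
        wr.opcode     = IBV_WR_SEND;
        wr.sg_list    = sgl;          /* the scatter-gather list (SGL) */
        wr.num_sge    = 2;            /* two scatter-gather elements (SGEs) */
        wr.send_flags = IBV_SEND_SIGNALED;

        return ibv_post_send(qp, &wr, &bad_wr);
    }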

Optimizing communication overlap for high-speed networks

Costin C. Iancu, Erich Strohmaier
2007 Proceedings of the 12th ACM SIGPLAN symposium on Principles and practice of parallel programming - PPoPP '07  
Finding an optimal, performance-portable implementation when using non-blocking communication primitives is non-trivial and intimidating to many application developers.  ...  Implementations based on parameters chosen by the models are able to hide over 90% of communication overhead in all cases.  ...  For a given problem setting, due to lower overhead and latency, the Quadrics hardware achieves the best results when using smaller messages than for the InfiniBand hardware.  ... 
doi:10.1145/1229428.1229436 dblp:conf/ppopp/IancuS07 fatcat:44xivstxi5dbzl3ydluhn2msce
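
The overlap being modeled comes from non-blocking primitives: post the transfer, perform independent computation while the network makes progress, then wait. A minimal MPI sketch (the callback parameter is just a stand-in for application work):

    /* Overlap pattern: start a non-blocking exchange, do independent
     * computation while the transfer is in flight, then wait for completion. */
    #include <mpi.h>

    void exchange_with_overlap(double *sendbuf, double *recvbuf, int n,
                               int peer, MPI_Comm comm,
                               void (*compute_independent)(void)) {
        MPI_Request reqs[2];

        MPI_Irecv(recvbuf, n, MPI_DOUBLE, peer, 0, comm, &reqs[0]);
        MPI_Isend(sendbuf, n, MPI_DOUBLE, peer, 0, comm, &reqs[1]);

        compute_independent();   /* work that does not touch the buffers */

        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    }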
Showing results 1–15 of 428