Optimizing bandwidth limited problems using one-sided communication and overlap

C. Bell, D. Bonachea, R. Nishtala, K. Yelick
2006 Proceedings 20th IEEE International Parallel & Distributed Processing Symposium  
In this paper we show that the one-sided communication model used in these languages also has a significant performance advantage for bandwidth-limited applications.  ...  Our optimizations rely on aggressively overlapping communication with computation but spreading communication events throughout the course of the local computation.  ...  Optimizing Bandwidth-Limited Applications In this section we consider a problem that is often hailed as the canonical example of a problem limited by bisection bandwidth, the 3D FFT.  ... 
doi:10.1109/ipdps.2006.1639320 dblp:conf/ipps/BellBNY06 fatcat:33xs5aiegvgezkxiw77wn2ob6u

Scaling communication-intensive applications on BlueGene/P using one-sided communication and overlap

Rajesh Nishtala, Paul H. Hargrove, Dan O. Bonachea, Katherine A. Yelick
2009 2009 IEEE International Symposium on Parallel & Distributed Processing  
We demonstrate that the PGAS model, using a new port of the Berkeley UPC compiler and GASNet one-sided communication layer, outperforms two-sided (MPI) communication in both microbenchmarks and a case  ...  /P communication layer for supporting one-sided communication and PGAS languages.  ...  Acknowledgements We would like to thank Michael Blocksome, Douglas Miller, Sameer Kumar and the entire IBM DCMF team for their support in helping us port GASNet to BG/P.  ... 
doi:10.1109/ipdps.2009.5161076 dblp:conf/ipps/NishtalaHBY09 fatcat:wqopvxul3jeezk3ppoub7kdztm

Optimizing communication overlap for high-speed networks

Costin C. Iancu, Erich Strohmaier
2007 Proceedings of the 12th ACM SIGPLAN symposium on Principles and practice of parallel programming - PPoPP '07  
We believe that algorithm design and optimization techniques that hide latency by taking advantage of communication overlap will facilitate obtaining good parallel efficiency and performance on the highly  ...  We believe that due to the levels of concurrencies proposed for Petascale systems, efficient use of non-blocking communication including overlapping will be one of the keys for achieving good performance  ...  Despite the lower latency and higher bandwidth on Elan networks, when using non-blocking communication, scalability is affected by the small TLB size and limited memory footprint.  ... 
doi:10.1145/1229428.1229436 dblp:conf/ppopp/IancuS07 fatcat:44xivstxi5dbzl3ydluhn2msce

Optimization and performance evaluation of the IDR iterative Krylov solver on GPUs

Hartwig Anzt, Moritz Kreutzer, Eduardo Ponce, Gregory D Peterson, Gerhard Wellein, Jack Dongarra
2016 The international journal of high performance computing applications  
We improve data locality, combine it with an efficient sparse matrix vector kernel, and investigate the potential of overlapping computation with communication as well as the possibility of concurrent  ...  A comprehensive performance evaluation is conducted using a suitable performance model.  ...  Larger problems provide more parallelism, which brings the achieved bandwidth closer to the maximum bandwidth the roofline performance model is based on (see Section 6).  ... 
doi:10.1177/1094342016646844 fatcat:oyteuyn6qfdw7cf5ya25uag5jq

Speeding up NGB with distributed file streaming framework

Bingchen Li, Kang Chen, Zhiteng Huang, H.L. Rajic, R.H. Kuhn
2006 Proceedings 20th IEEE International Parallel & Distributed Processing Symposium  
By studying I/O patterns of NGB codes we have identified program locations where it is possible to overlap computation and data workflow phases.  ...  In addition to the challenges it provides, it also offers new opportunities for optimization.  ...  Acknowledgements The authors would like to thank Eric Huang and Wenguang Chen for their comments during the early stages of this study.  ... 
doi:10.1109/ipdps.2006.1639655 dblp:conf/ipps/LiCHRK06 fatcat:kgdl735csjdppmgknutpykeww4

Parallel Sparse Matrix-Vector Multiplication as a Test Case for Hybrid MPI+OpenMP Programming

Gerald Schubert, Georg Hager, Holger Fehske, Gerhard Wellein
2011 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum  
Starting from the observation that nonblocking MPI is not able to hide communication cost using standard MPI implementations, we demonstrate that explicit overlap of communication and computation can be  ...  achieved by using a dedicated communication thread, which may run on a virtual core.  ...  Keller and T. Schönemeyer for valuable discussions, A. Basermann for providing the RCM transformation, and K. Stüben and H. J. Plum for providing and supporting the AMG test case.  ... 
doi:10.1109/ipdps.2011.332 dblp:conf/ipps/SchubertHFW11 fatcat:64mqcflvdbf7blzuwamb3hbkgi

Finite Duration Root Nyquist Pulses with Maximum In-Band Fractional Energy

G. Nigam, R. Singh, A.K. Chaturvedi
2010 IEEE Communications Letters  
We design root Nyquist pulses having maximum in-band fractional energy for a given finite bandwidth. The problem of maximizing the ratio of in-band energy to total energy has been dealt with earlier.  ...  But an exact solution could not be found since it involved optimization of a quadratic objective function with quadratic constraints.  ...  We have limitations of channel bandwidth, so we need to concentrate most of its energy in a finite bandwidth to use the maximum spectral resources.  ... 
doi:10.1109/lcomm.2010.09.100314 fatcat:h4qh2c5fqvcxhdtgsn5j5r2mce

Wideband Printed Monopole Design Using a Genetic Algorithm

M. John, M. J. Ammann
2007 IEEE Antennas and Wireless Propagation Letters  
The parasitic elements optimize the effective feedgap between the radiator on one side and the groundplane on the other side.  ...  ANTENNA GEOMETRY The microstrip-fed GA plate monopole is printed on one side of FR4 substrate of 1.52 mm thickness and metalization  ... 
doi:10.1109/lawp.2007.891962 fatcat:iqkjktpuyjcwbcdbvyalptzvoe

A preliminary evaluation of the hardware acceleration of the Cray Gemini interconnect for PGAS languages and comparison with MPI

Hongzhang Shan, Nicholas J. Wright, John Shalf, Katherine Yelick, Marcus Wagner, Nathan Wichmann
2011 Proceedings of the second international workshop on Performance modeling, benchmarking and simulation of high performance computing systems - PMBS '11  
rate, aggregate bandwidth, and computation and communication overlap capability.  ...  The study also reveals important information about how to optimize one-sided Gemini communication.  ...  We also measured the messaging rate using get instead of put for CAF. In the bandwidth limit, as one might expect, the get and put performance is identical.  ... 
doi:10.1145/2088457.2088467 fatcat:4fxwk2i35ngwdcnwbotdpsmv3i

Productivity and performance using partitioned global address space languages

Katherine Yelick, Parry Husbands, Costin Iancu, Amir Kamil, Rajesh Nishtala, Jimmy Su, Michael Welcome, Tong Wen, Dan Bonachea, Wei-Yu Chen, Phillip Colella, Kaushik Datta (+4 others)
2007 Proceedings of the 2007 international workshop on Parallel symbolic computation - PASCO '07  
Both compilers use a source-to-source strategy that translates the parallel languages to C with calls to a communication layer called GASNet.  ...  The result is portable high-performance compilers that run on a large variety of shared and distributed memory multiprocessors.  ...  HAND-OPTIMIZED BENCHMARKS The performance benefits of one-sided communication are not limited to microbenchmarks.  ... 
doi:10.1145/1278177.1278183 dblp:conf/issac/YelickBCCDDGHHHIKNSWW07 fatcat:hpedjb24vvfkbpi7fbawt6xf4u

A Simulation Framework to Automatically Analyze the Communication-Computation Overlap in Scientific Applications

Vladimir Subotic, Jose Carlos Sancho, Jesus Labarta, Mateo Valero
2010 2010 IEEE International Conference on Cluster Computing  
Valgrind instruments the legacy MPI application and generates the execution traces, then Dimemas uses the obtained traces and reconstructs the application's time-behavior on a configurable parallel platform  ...  of simulated time behaviors, that further allows useful comparisons of the non-overlapped and the overlapped executions.  ...  One solution to optimize network usage is to overlap communication delays with useful computation of the application.  ... 
doi:10.1109/cluster.2010.33 dblp:conf/cluster/SuboticSLV10 fatcat:egsuajz4mzfy3awxwju657xqwe

Analyzing communication models for distributed thread-collaborative processors in terms of energy and time

Benjamin Klenk, Lena Oden, Holger Froning
2015 2015 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)  
In this work, we analyze data movement optimizations for distributed heterogeneous systems based on CPUs and GPUs.  ...  Insights include that (1) specialized models offer substantial advantages for a variety of workloads, (2) thread-collaborative models only seem to be limited by reduced overlap possibilities, and (3) a  ...  ACKNOWLEDGMENT We gratefully acknowledge the generous support of this research effort by Nvidia, Xilinx Inc, and the EXTOLL Corporation.  ... 
doi:10.1109/ispass.2015.7095817 dblp:conf/ispass/KlenkOF15 fatcat:nztvqp3njfb3bidilhlnnbxxay

Parameter Hub: a Rack-Scale Parameter Server for Distributed Deep Neural Network Training [article]

Liang Luo, Jacob Nelson, Luis Ceze, Amar Phanishayee, Arvind Krishnamurthy
2018 arXiv   pre-print
and communication resources.  ...  We found that timely training requires high performance parameter servers (PSs) with optimized network stacks and gradient processing pipelines, as well as server and network hardware with balanced computation  ...  a communication-bound workload.  ... 
arXiv:1805.07891v1 fatcat:jrur6u3vjfgrxpfi6lialuhoru


Anrin Chakraborti, Radu Sion
2016 Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security - CCS'16  
Although several tree-based ORAMs such as PathORAM [8] and RingORAM [6] have achieved near-optimal bandwidth for single client scenarios, their low overall throughput due to high latency of access - as  ...  with privacy (position map) and designing everything else using append-only data structures that can be then merged securely in a separate eviction step.  ...  RingORAM [6] further optimizes PathORAM [8] for practical deployment by reducing the bandwidth complexity constants. Problem Definition.  ... 
doi:10.1145/2976749.2989062 dblp:conf/ccs/ChakrabortiS16 fatcat:2iwjhh2vbzczhnfifgdbjwkkpm

Performance portable optimizations for loops containing communication operations

Costin Iancu, Wei Chen, Katherine Yelick
2008 Proceedings of the 22nd annual international conference on Supercomputing - ICS '08  
Effective use of communication networks is critical to the performance and scalability of parallel applications.  ...  Studies of well-tuned programs have suggested that PGAS languages are effective at utilizing modern networks because their one-sided communication is a good match to the underlying network hardware.  ...  We are aware of only one other compiler effort to exploit overlap for loop nests using one sided communication. This is work performed by Paek and presented in his PhD Thesis.  ... 
doi:10.1145/1375527.1375567 dblp:conf/ics/IancuCY08 fatcat:ufzgdktrj5civkicf27kfuenoy