748 Hits in 5.4 sec

Probabilistic Communication and I/O Tracing with Deterministic Replay at Scale

Xing Wu, Karthik Vijayakumar, Frank Mueller, Xiaosong Ma, Philip C. Roth
2011 2011 International Conference on Parallel Processing  
This work contributes Scala-H-Trace, which features more aggressive trace compression than any previous approach, particularly for applications that do not show strict regularity in SPMD behavior.  ...  Even with the aggressively compressed histogram-based traces, our replay times are within 12% to 15% of the runtime of original codes.  ...  As a result, post-processing/analysis can be performed without decompression. We utilize this concept of structure preserving compression in Scala-H-Trace.  ... 
doi:10.1109/icpp.2011.50 dblp:conf/icpp/WuVMMR11 fatcat:qd6yfdhjmnagvftvvvvvkkdfju

Performance prediction with skeletons

Sukhdeep Sodhi, Jaspal Subhlok, Qiang Xu
2007 Cluster Computing  
The goal of this research is accurate performance estimation in heterogeneous and shared computational grids.  ...  The performance skeleton of an application is a short running program whose performance in any scenario reflects the performance of the application it represents.  ...  Tradeoffs between the degree of compression and compression time are possible and may be necessary for long-running programs with frequent communication calls.  ... 
doi:10.1007/s10586-007-0039-2 fatcat:63y3r57hrjfxrnywvo6suxcsg4

Memory Compression Techniques for Network Address Management in MPI

Yanfei Guo, Charles J. Archer, Michael Blocksome, Scott Parker, Wesley Bland, Ken Raffenetti, Pavan Balaji
2017 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)  
In current MPI implementations the management and lookup of such network addresses use memory sizes that are proportional to the number of processes in each communicator.  ...  AV-Rankmap takes advantage of logical patterns in rank-address mapping that most applications naturally tend to have, and it exploits the fact that some parts of network address structures are naturally  ...  in both the network address and rank-address mapping structures; and (4) it performs such memory compression with no practically observable performance degradation.  ... 
doi:10.1109/ipdps.2017.18 dblp:conf/ipps/GuoABPBRB17 fatcat:wjzl7i6qezdcdhn775phyhmkji

A study of the effects of machine geometry and mapping on distributed transpose performance

Maria Eleftheriou, Blake G. Fitch, Aleksandr Rayshubskiy, T.J. Christopher Ward, Phillip Heidelberger, Robert S. Germain
2008 Proceedings of the 2008 conference on Computing frontiers - CF '08  
Measurements also show that the proposed approach is effective in improving Particle-Mesh-based N-body simulation performance significantly at the limits of scalability.  ...  Performance measurements of the standalone 3D FFT on two communication protocols, MPI and BG/L ADE [19] are presented.  ...  in the FFT phases • construction of the appropriate group communicators in the MPI based approach.  ... 
doi:10.1145/1366230.1366243 dblp:conf/cf/EleftheriouFRWHG08 fatcat:azncncvaqvbovj5lc5zyk3vmym

CYPRESS: Combining Static and Dynamic Analysis for Top-Down Communication Trace Compression

Jidong Zhai, Jianfei Hu, Xiongchao Tang, Xiaosong Ma, Wenguang Chen
2014 SC14: International Conference for High Performance Computing, Networking, Storage and Analysis  
This tree naturally contains crucial iterative computing features such as the loop structure, allowing subsequent runtime compression to "fill in", in a "top-down" manner, event details into the known  ...  Communication traces are increasingly important, both for parallel applications' performance analysis/optimization, and for designing next-generation HPC systems.  ...  In China, this work has been partially supported by the National High-Tech Research and Development Plan (863 project) 2012AA010901, NSFC projects 61232008 and 61103021, MSRA joint project FY14-RES-SPONSOR  ... 
doi:10.1109/sc.2014.17 dblp:conf/sc/ZhaiHTMC14 fatcat:bjel5i4a2vaabp6vsdjao4ozla

Indexes and Computation over Compressed Structured Data (Dagstuhl Seminar 13232)

Sebastian Maneth, Gonzalo Navarro, Marc Herbstritt
2013 Dagstuhl Reports  
In this talk we review the main practical results on the use of list update algorithms for data compression.  ...  The aim was to bring together researchers from various research directions of compression and indexing of structured data.  ...  Specifically, for any i and j, we wish to create a data structure that reports the positions of the largest k elements in A[i..j] in decreasing order, without accessing A at query time.  ... 
doi:10.4230/dagrep.3.6.22 dblp:journals/dagstuhl-reports/ManethN13 fatcat:b35at6erjbe63hvelnqnrt4jle
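The snippet above states the range top-k problem precisely. A succinct index would answer such queries without touching A at query time; purely for illustrating the query semantics, here is a brute-force sketch (function name and example array are my own, not from the seminar):

```python
import heapq

def range_top_k(A, i, j, k):
    """Naive range top-k: return the positions of the k largest
    elements of A[i..j] (inclusive), in decreasing order of value.
    This brute force scans the range; an index structure would not."""
    window = [(A[p], p) for p in range(i, j + 1)]
    top = heapq.nlargest(k, window)  # sorted by value, descending
    return [p for _, p in top]

A = [3, 1, 4, 1, 5, 9, 2, 6]
print(range_top_k(A, 2, 6, 2))  # → [5, 4]: A[5]=9 and A[4]=5 lead A[2..6]
```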

Record-and-Replay Techniques for HPC Systems: A Survey

2018 Supercomputing Frontiers and Innovations  
In the high performance computing (HPC) environment, all the above factors must be considered in concert, thus presenting additional implementation challenges.  ...  This manuscript answers three questions through this survey: What are the gaps in the existing space of record-and-replay techniques?  ...  Acknowledgements This work was performed under the auspices of the U.S. Department of Energy by LLNL under contract DE-AC52-07NA27344 (LLNL-JRNL-749010).  ... 
doi:10.14529/jsfi180102 fatcat:imgvuajy7bcarcu7zihvjhzft4

Compression and Sieve: Reducing Communication in Parallel Breadth First Search on Distributed Memory Systems [article]

Huiwei Lv, Guangming Tan, Mingyu Chen, Ninghui Sun
2012 arXiv   pre-print
In this paper we substantially reduce the communication cost in distributed BFS by compressing and sieving the messages.  ...  Second, we propose a novel distributed directory algorithm, cross directory, to sieve the redundant data in messages.  ...  There is a space-time tradeoff among these compression schemes. Compared to WAH, LZ77 is slower but has a better compression ratio.  ... 
arXiv:1208.5542v1 fatcat:qshsx4z3zbgrli3lvtah2ppqza
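The space-time tradeoff mentioned in the snippet can be seen on a sparse bitmap, the kind of visited-vertex data a BFS frontier exchanges. As a sketch only: a plain run-length coder stands in for WAH (real WAH packs runs into word-aligned 32-bit fill/literal words), and zlib's DEFLATE stands in for LZ77-family compression; the example bitmap is invented:

```python
import zlib

def rle_encode(bits):
    """Toy run-length coder over a bit string, standing in for
    WAH-style fill words. Returns a list of (bit, run_length) pairs."""
    runs = []
    prev, count = bits[0], 0
    for b in bits:
        if b == prev:
            count += 1
        else:
            runs.append((prev, count))
            prev, count = b, 1
    runs.append((prev, count))
    return runs

# A sparse bitmap: one visited vertex among many unvisited ones.
bitmap = "0" * 1000 + "1" + "0" * 1000
runs = rle_encode(bitmap)            # fast, coarse: 3 runs
lz = zlib.compress(bitmap.encode())  # DEFLATE = LZ77 + Huffman: slower, tighter
print(len(runs), len(lz))
```

Both shrink the 2001-character bitmap dramatically; the LZ77-style pass costs more time per byte, which is the tradeoff the paper weighs.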

Loop Chaining: A Programming Abstraction for Balancing Locality and Parallelism

Christopher D. Krieger, Michelle Mills Strout, Catherine Olschanowsky, Andrew Stone, Stephen Guzik, Xinfeng Gao, Carlo Bertolli, Paul H.J. Kelly, Gihan Mudalige, Brian Van Straalen, Sam Williams
2013 2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum  
In this work, we introduce a new abstraction called loop chaining in which a sequence of parallel and/or reduction loops that explicitly share data are grouped together into a chain.  ...  The flexibility of being able to schedule across loops enables better management of the data locality and parallelism tradeoff.  ...  Moreover, synchronization points (reductions or MPI communication) endemic to high-performance distributed numerical methods demand one restructure algorithms in order to reduce data movement and improve  ... 
doi:10.1109/ipdpsw.2013.68 dblp:conf/ipps/KriegerSOSGGBKMSW13 fatcat:uoexj5o3mjhr7e7cu76znqy7q4
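The loop-chaining idea in the snippet, scheduling across a sequence of loops that share data, can be caricatured in a few lines. This is not the authors' API, just a minimal sketch with invented names: two loops sharing an intermediate array run either back to back or tile by tile, so each tile of the shared data is consumed while still hot in cache:

```python
def unfused(a):
    # Two separate loops that explicitly share the intermediate array b.
    b = [x * 2 for x in a]       # loop 1: produce b
    return [x + 1 for x in b]    # loop 2: consume b

def chained(a, tile=4):
    # "Chained" schedule: both loops advance tile by tile, so each
    # tile of b is consumed right after it is produced. Tile size is
    # an arbitrary illustration parameter.
    out = []
    for start in range(0, len(a), tile):
        b_tile = [x * 2 for x in a[start:start + tile]]  # loop 1 on a tile
        out.extend(x + 1 for x in b_tile)                # loop 2 on the same tile
    return out

data = list(range(10))
assert unfused(data) == chained(data)  # same result, different schedule
```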

Advances, Applications and Performance of the Global Arrays Shared Memory Programming Toolkit

Jarek Nieplocha, Bruce Palmer, Vinod Tipparaju, Manojkumar Krishnan, Harold Trease, Edoardo Aprà
2006 The international journal of high performance computing applications  
GA was created to provide application programmers with an interface that allows them to distribute data while maintaining the type of global index space and programming syntax similar to what is available  ...  High performance implementations of ARMCI were developed under the ACTS project within a year for the predominant parallel systems used in the US in 1999 [16, 51] and it has been expanded and supported  ...  some cases directly became involved in the toolkit development.  ... 
doi:10.1177/1094342006064503 fatcat:qurmguotbzbvbhwncci2ytodhq

Dual-level parallelism for high-order CFD methods

Suchuan Dong, George Em Karniadakis
2004 Parallel Computing  
A hybrid two-level parallel paradigm with MPI/OpenMP is presented in the context of high-order methods and implemented in the spectral/hp element framework to take advantage of the hierarchical structures  ...  while the pure MPI model performs the best on the IBM SP3 and on the Compaq Alpha Cluster.  ...  The inherent hierarchical structures in CFD problems suggest a multi-level parallelization strategy. At the top-most level are groups of MPI processes. Each group computes one random mode.  ... 
doi:10.1016/j.parco.2003.05.020 fatcat:xtxnsubu2jgsfkr63uiokz4s2q

Supporting efficient execution in heterogeneous distributed computing environments with cactus and globus

Gabrielle Allen, Thomas Dramlitsch, Ian Foster, Nicholas T. Karonis, Matei Ripeanu, Edward Seidel, Brian Toonen
2001 Proceedings of the 2001 ACM/IEEE conference on Supercomputing (CDROM) - Supercomputing '01  
Improvements in the performance of processors and networks make it both feasible and interesting to treat collections of workstations, servers, clusters, and supercomputers as integrated computational  ...  We have used this framework to perform record-setting computations in numerical relativity, running across four supercomputers and achieving scaling of 88% (1140 CPUs) and 63% (1500 CPUs).  ...  This material is based in part upon work supported by the National Science Foundation under Grant No. 9975020.  ... 
doi:10.1145/582034.582086 dblp:conf/sc/AllenDFKRST01 fatcat:q7rn55jonzakhjvgrdf6xzgoci

MPI-IO/L: efficient remote I/O for MPI-IO via logistical networking

Jonghyun Lee, R. Ross, S. Atchley, M. Beck, R. Thakur
2006 Proceedings 20th IEEE International Parallel & Distributed Processing Symposium  
We show the performance tradeoffs between various remote I/O approaches implemented in the system, which can help scientists identify preferable I/O options for their own applications.  ...  This work presents MPI-IO/L, a remote I/O facility for MPI-IO by using Logistical Networking.  ...  This approach resembles the disk compression utility and requires extra network bandwidth and storage space. Efficient Noncontiguous I/O Support.  ... 
doi:10.1109/ipdps.2006.1639305 dblp:conf/ipps/LeeRABT06 fatcat:zyk7tdg5abfztdx37thzcfsy5i

The Anatomy of Large-Scale Distributed Graph Algorithms [article]

Jesun Sahariar Firoz, Thejaka Amila Kanewala, Marcin Zalewski, Martina Barnas, Andrew Lumsdaine
2015 arXiv   pre-print
The performance analysis becomes a truly experimental science, even more challenging in the presence of massive irregularity and data dependency.  ...  The increasing complexity of the software/hardware stack of modern supercomputers results in an explosion of parameters.  ...  Data Structures for Algorithm Progress Algorithms use data structures to maintain intermediate state. Pearce et al.  ... 
arXiv:1507.06702v1 fatcat:yolns423c5fxhcsgkoeghrxgte

Performance of Natural I/O Applications [chapter]

Stevan Vlaovic, Richard Uhlig
2000 Workload Characterization for Computer System Design  
As a group, the natural I/O applications outperformed the SPEC95 benchmarks when it comes to overall branch predictor performance.  ...  In addition to the four natural I/O applications, five integer SPEC95 benchmarks were used for comparison: go, gcc, perl, compress, and vortex.  ...  This illustrates the performance-accuracy tradeoff that we mentioned earlier. [Figure 9: misses per instruction (mpi) vs. cache size, for full and accelerated speech workloads]  ... 
doi:10.1007/978-1-4615-4387-9_6 fatcat:d5o534nydvgl5ps6rkuxd5x5he