Probabilistic Communication and I/O Tracing with Deterministic Replay at Scale
2011
2011 International Conference on Parallel Processing
This work contributes Scala-H-Trace, which features more aggressive trace compression than any previous approach, particularly for applications that do not show strict regularity in SPMD behavior. ...
Even with the aggressively compressed histogram-based traces, our replay times are within 12% to 15% of the runtime of the original codes. ...
As a result, post-processing/analysis can be performed without decompression. We utilize this concept of structure preserving compression in Scala-H-Trace. ...
doi:10.1109/icpp.2011.50
dblp:conf/icpp/WuVMMR11
fatcat:qd6yfdhjmnagvftvvvvvkkdfju
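The histogram idea in this entry can be pictured with a small standalone sketch (plain C; the bin layout and recording interface are invented here, not taken from Scala-H-Trace): instead of logging every message size, the tracer keeps a fixed set of per-bin counters, so trace size is independent of event count.

```c
/* Sketch of histogram-based trace recording (illustrative only; not
 * the Scala-H-Trace implementation). Bin boundaries are made up. */
#include <stdio.h>
#include <stdlib.h>

#define NBINS 5
static long bin_count[NBINS];                 /* events per bin */
static const size_t bin_hi[NBINS] = {         /* upper byte bound per bin */
    256, 4096, 65536, 1048576, (size_t)-1
};

/* Record one communication event by size rather than storing it verbatim. */
static void record_size(size_t bytes) {
    for (int b = 0; b < NBINS; b++)
        if (bytes <= bin_hi[b]) { bin_count[b]++; return; }
}

int main(void) {
    /* Simulated per-event message sizes that would otherwise bloat a trace. */
    size_t sizes[] = {48, 100, 100, 2000, 2048, 70000, 70000, 70000};
    for (size_t i = 0; i < sizeof sizes / sizeof *sizes; i++)
        record_size(sizes[i]);

    /* The trace stores only NBINS counters, however many events occurred. */
    for (int b = 0; b < NBINS; b++)
        printf("bin %d (<= %zu bytes): %ld events\n", b, bin_hi[b], bin_count[b]);
    return 0;
}
```

A replayer can then draw representative sizes from the bins, which is consistent with replay times staying close to, but not exactly matching, the original runs.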
Performance prediction with skeletons
2007
Cluster Computing
The goal of this research is accurate performance estimation in heterogeneous and shared computational grids. ...
The performance skeleton of an application is a short running program whose performance in any scenario reflects the performance of the application it represents. ...
Tradeoffs between the degree of compression and compression time are possible and may be necessary for long running programs with frequent communication calls. ...
doi:10.1007/s10586-007-0039-2
fatcat:63y3r57hrjfxrnywvo6suxcsg4
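The skeleton concept in this entry can be pictured as a short MPI program that keeps the application's compute/communicate shape at a reduced iteration count. A minimal sketch, with iteration count, message size, and the synthetic compute phase all assumed rather than derived from any real trace:

```c
/* Sketch of a performance skeleton: a compressed compute/communicate
 * loop whose timing is meant to mirror the full application's.
 * All parameters here are placeholders, not generated from a trace. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double buf[1024] = {0};
    int right = (rank + 1) % size, left = (rank + size - 1) % size;

    double t0 = MPI_Wtime();
    for (int it = 0; it < 100; it++) {            /* scaled-down iterations */
        for (int i = 0; i < 1024; i++)            /* synthetic compute phase */
            buf[i] = buf[i] * 0.5 + 1.0;
        MPI_Sendrecv_replace(buf, 1024, MPI_DOUBLE,   /* ring exchange */
                             right, 0, left, 0,
                             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    if (rank == 0)
        printf("skeleton time: %.3f s\n", MPI_Wtime() - t0);
    MPI_Finalize();
    return 0;
}
```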
Memory Compression Techniques for Network Address Management in MPI
2017
2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
In current MPI implementations the management and lookup of such network addresses uses memory sizes that are proportional to the number of processes in each communicator. ...
AV-Rankmap takes advantage of logical patterns in rank-address mapping that most applications naturally tend to have, and it exploits the fact that some parts of network address structures are naturally ...
in both the network address and rank-address mapping structures; and (4) it performs such memory compression with no practically observable performance degradation. ...
doi:10.1109/ipdps.2017.18
dblp:conf/ipps/GuoABPBRB17
fatcat:wjzl7i6qezdcdhn775phyhmkji
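One way to picture the pattern exploitation this entry describes is a toy C sketch (the affine check, the 64-bit "address" type, and the table contents are illustrative inventions, not the paper's AV-Rankmap code): when the rank-to-address table is affine, it collapses from O(P) entries to two integers.

```c
/* Toy sketch: collapse a rank->address table to (base, stride) when the
 * mapping is affine, the kind of logical pattern this entry exploits. */
#include <stdint.h>
#include <stdio.h>

/* Returns 1 and fills base/stride if addr[i] == base + i*stride for all i. */
static int detect_affine(const uint64_t *addr, int n,
                         uint64_t *base, int64_t *stride) {
    if (n < 2) return 0;
    *base = addr[0];
    *stride = (int64_t)(addr[1] - addr[0]);
    for (int i = 2; i < n; i++)
        if ((int64_t)(addr[i] - addr[0]) != (int64_t)i * *stride)
            return 0;
    return 1;
}

int main(void) {
    uint64_t table[] = {0x1000, 0x1040, 0x1080, 0x10c0};  /* 4 ranks */
    uint64_t base; int64_t stride;
    if (detect_affine(table, 4, &base, &stride))  /* O(P) table -> O(1) */
        printf("base=0x%llx stride=%lld\n",
               (unsigned long long)base, (long long)stride);
    return 0;
}
```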
A study of the effects of machine geometry and mapping on distributed transpose performance
2008
Proceedings of the 2008 conference on Computing frontiers - CF '08
Measurements also show that the proposed approach is effective in improving Particle-Mesh-based N-body simulation performance significantly at the limits of scalability. ...
Performance measurements of the standalone 3D FFT on two communication protocols, MPI and BG/L ADE [19], are presented. ...
in the FFT phases; construction of the appropriate group communicators in the MPI-based approach. ...
doi:10.1145/1366230.1366243
dblp:conf/cf/EleftheriouFRWHG08
fatcat:azncncvaqvbovj5lc5zyk3vmym
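The "appropriate group communicators" mentioned in this entry are, in a standard 2D decomposition, row and column sub-communicators. A minimal sketch using MPI_Comm_split; the grid shape is an assumed example, not taken from the paper:

```c
/* Sketch: row/column communicators for a distributed transpose on a
 * 2D process grid. The grid width COLS is an assumed example value. */
#include <mpi.h>
#include <stdio.h>

#define COLS 4   /* assumed grid width; run with a multiple of 4 ranks */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int row = rank / COLS, col = rank % COLS;
    MPI_Comm row_comm, col_comm;
    /* Processes with the same color land in the same sub-communicator;
     * all-to-all exchanges inside these groups implement the per-phase
     * exchanges of a 2D-decomposed transpose or FFT. */
    MPI_Comm_split(MPI_COMM_WORLD, row, col, &row_comm);
    MPI_Comm_split(MPI_COMM_WORLD, col, row, &col_comm);

    int row_rank, col_rank;
    MPI_Comm_rank(row_comm, &row_rank);
    MPI_Comm_rank(col_comm, &col_rank);
    printf("global %d -> row %d pos %d, col %d pos %d\n",
           rank, row, row_rank, col, col_rank);

    MPI_Comm_free(&row_comm);
    MPI_Comm_free(&col_comm);
    MPI_Finalize();
    return 0;
}
```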
CYPRESS: Combining Static and Dynamic Analysis for Top-Down Communication Trace Compression
2014
SC14: International Conference for High Performance Computing, Networking, Storage and Analysis
This tree naturally contains crucial iterative computing features such as the loop structure, allowing subsequent runtime compression to "fill in", in a "top-down" manner, event details into the known ...
Communication traces are increasingly important, both for parallel applications' performance analysis/optimization, and for designing next-generation HPC systems. ...
In China, this work has been partially supported by the National High-Tech Research and Development Plan (863 project) 2012AA010901, NSFC projects 61232008 and 61103021, MSRA joint project FY14-RES-SPONSOR ...
doi:10.1109/sc.2014.17
dblp:conf/sc/ZhaiHTMC14
fatcat:bjel5i4a2vaabp6vsdjao4ozla
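The top-down "fill in" idea from this entry can be sketched as follows: static analysis supplies the loop skeleton, and the runtime stores one iteration's events plus a repeat count instead of the fully unrolled trace. An illustrative toy, with event encoding and structures invented here (this is not CYPRESS code):

```c
/* Toy sketch: trace compression against a statically known loop.
 * Events that repeat identically across iterations are stored once
 * with a count rather than once per iteration. */
#include <stdio.h>
#include <string.h>

typedef struct { char op[16]; int peer; int bytes; } Event;

int main(void) {
    /* Events observed in iteration 0 of a loop known from static analysis. */
    Event body[] = {{"MPI_Send", 1, 4096}, {"MPI_Recv", 1, 4096}};
    int body_len = 2, iters = 1000, compressible = 1;

    /* Runtime check: do later iterations replay the same event list?
     * (Here they trivially do; a real tracer compares live events.) */
    for (int it = 1; it < iters && compressible; it++) {
        Event observed[] = {{"MPI_Send", 1, 4096}, {"MPI_Recv", 1, 4096}};
        for (int e = 0; e < body_len; e++)
            if (memcmp(&observed[e], &body[e], sizeof(Event)) != 0)
                compressible = 0;
    }

    if (compressible)  /* store body once + count, not iters * body_len events */
        printf("loop x%d: %d events/iter recorded once\n", iters, body_len);
    return 0;
}
```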
Indexes and Computation over Compressed Structured Data (Dagstuhl Seminar 13232)
2013
Dagstuhl Reports
In this talk we review the main practical results on the use of list update algorithms for data compression. ...
The aim was to bring together researchers from various research directions of compression and indexing of structured data. ...
Specifically, for any i and j, we wish to create a data structure that reports the positions of the largest k elements in A[i..j] in decreasing order, without accessing A at query time. ...
doi:10.4230/dagrep.3.6.22
dblp:journals/dagstuhl-reports/ManethN13
fatcat:b35at6erjbe63hvelnqnrt4jle
Record-and-Replay Techniques for HPC Systems: A Survey
2018
Supercomputing Frontiers and Innovations
In the high performance computing (HPC) environment, all the above factors must be considered in concert, thus presenting additional implementation challenges. ...
This manuscript answers three questions through this survey: What are the gaps in the existing space of record-and-replay techniques? ...
Acknowledgements This work was performed under the auspices of the U.S. Department of Energy by LLNL under contract DE-AC52-07NA27344 (LLNL-JRNL-749010). ...
doi:10.14529/jsfi180102
fatcat:imgvuajy7bcarcu7zihvjhzft4
Compression and Sieve: Reducing Communication in Parallel Breadth First Search on Distributed Memory Systems
[article]
2012
arXiv
pre-print
In this paper we substantially reduce the communication cost in distributed BFS by compressing and sieving the messages. ...
Second, we propose a novel distributed directory algorithm, cross directory, to sieve the redundant data in messages. ...
There is a space-time tradeoff among these compression schemes. Compared to WAH, LZ77 is slower but has a better compression ratio. ...
arXiv:1208.5542v1
fatcat:qshsx4z3zbgrli3lvtah2ppqza
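The "sieve" step in this entry can be pictured as filtering already-visited vertices out of an outgoing frontier buffer using a locally cached visited bitmap, so only unseen vertices are transmitted. A toy sketch; the cross-directory protocol that keeps such caches consistent is the paper's contribution and is not reproduced here:

```c
/* Toy sketch: sieving redundant vertices from a BFS frontier message.
 * A local bitmap of visited vertices filters the outgoing buffer. */
#include <stdint.h>
#include <stdio.h>

#define TEST(bm, v) ((bm)[(v) >> 3] & (1u << ((v) & 7)))
#define SET(bm, v)  ((bm)[(v) >> 3] |= (1u << ((v) & 7)))

int main(void) {
    uint8_t seen[16] = {0};            /* visited bits for 128 vertices */
    SET(seen, 5); SET(seen, 9);        /* already known to be visited */

    int frontier[] = {3, 5, 9, 12, 5}; /* candidate vertices to send */
    int out[5], n_out = 0;

    for (int i = 0; i < 5; i++) {
        int v = frontier[i];
        if (!TEST(seen, v)) {          /* sieve: drop visited and duplicates */
            SET(seen, v);
            out[n_out++] = v;
        }
    }
    printf("sending %d of 5 vertices\n", n_out);  /* only 3 and 12 survive */
    return 0;
}
```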
Loop Chaining: A Programming Abstraction for Balancing Locality and Parallelism
2013
2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum
In this work, we introduce a new abstraction called loop chaining in which a sequence of parallel and/or reduction loops that explicitly share data are grouped together into a chain. ...
The flexibility of being able to schedule across loops enables better management of the data locality and parallelism tradeoff. ...
Moreover, synchronization points (reductions or MPI communication) endemic to high-performance distributed numerical methods demand that algorithms be restructured in order to reduce data movement and improve ...
doi:10.1109/ipdpsw.2013.68
dblp:conf/ipps/KriegerSOSGGBKMSW13
fatcat:uoexj5o3mjhr7e7cu76znqy7q4
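The payoff loop chaining is after can be seen in a hand-fused example: two loops that share data either make two sweeps over memory or, once the chain makes the cross-loop schedule visible, one. This sketch shows only that locality effect, not the abstraction's actual API:

```c
/* Sketch: the locality effect loop chaining targets. Two loops that
 * share a[] and b[] run as separate sweeps or as one fused sweep. */
#include <stdio.h>
#define N 1000000
static double a[N], b[N], c[N];

static void separate(void) {
    for (int i = 0; i < N; i++) b[i] = 2.0 * a[i];   /* sweep 1: a, b */
    for (int i = 0; i < N; i++) c[i] = b[i] + a[i];  /* sweep 2: a, b, c */
}

static void chained(void) {
    /* One sweep: a[i] and b[i] are still in cache for the second statement. */
    for (int i = 0; i < N; i++) {
        b[i] = 2.0 * a[i];
        c[i] = b[i] + a[i];
    }
}

int main(void) {
    separate();
    chained();
    printf("c[0] = %g\n", c[0]);
    return 0;
}
```

In the loop-chaining abstraction this fusion decision is made by a scheduler over the chain rather than by hand, which is what lets it balance locality against parallelism.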
Advances, Applications and Performance of the Global Arrays Shared Memory Programming Toolkit
2006
The international journal of high performance computing applications
GA was created to provide application programmers with an interface that allows them to distribute data while maintaining the type of global index space and programming syntax similar to what is available ...
High performance implementations of ARMCI were developed under the ACTS project within a year for the predominant parallel systems used in the US in 1999 [16, 51], and ARMCI has been expanded and supported ...
some cases directly became involved in the toolkit development. ...
doi:10.1177/1094342006064503
fatcat:qurmguotbzbvbhwncci2ytodhq
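The global-index-space style this entry describes looks roughly like the following against GA's documented C interface. A hedged sketch: the calls shown (GA_Initialize, NGA_Create, NGA_Put/NGA_Get, GA_Sync) follow the GA documentation, but initialization details, such as whether MA_init is required, vary across GA builds.

```c
/* Sketch of GA-style global-index access (follows the documented C
 * interface; initialization requirements vary by GA build). */
#include <mpi.h>
#include <ga.h>
#include <macdecls.h>   /* for the C_DBL type constant */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    GA_Initialize();

    int dims[2] = {1000, 1000};
    /* One distributed 2D array, addressed by global indices. */
    int g_a = NGA_Create(C_DBL, 2, dims, "A", NULL);

    double block[10 * 10];
    for (int i = 0; i < 100; i++) block[i] = 1.0;

    int lo[2] = {0, 0}, hi[2] = {9, 9}, ld[1] = {10};
    if (GA_Nodeid() == 0)
        NGA_Put(g_a, lo, hi, block, ld);  /* global indices, whoever owns them */
    GA_Sync();
    NGA_Get(g_a, lo, hi, block, ld);      /* one-sided read from any rank */

    GA_Destroy(g_a);
    GA_Terminate();
    MPI_Finalize();
    return 0;
}
```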
Dual-level parallelism for high-order CFD methods
2004
Parallel Computing
A hybrid two-level parallel paradigm with MPI/OpenMP is presented in the context of high-order methods and implemented in the spectral/hp element framework to take advantage of the hierarchical structures ...
while the pure MPI model performs the best on the IBM SP3 and on the Compaq Alpha Cluster. ...
The inherent hierarchical structures in CFD problems suggest a multi-level parallelization strategy. At the top-most level are groups of MPI processes. Each group computes one random mode. ...
doi:10.1016/j.parco.2003.05.020
fatcat:xtxnsubu2jgsfkr63uiokz4s2q
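The two-level structure this entry describes (groups of MPI processes at the top, threads within each process below) looks schematically like the following; the group count and workload are placeholders, and the spectral/hp specifics are omitted:

```c
/* Sketch: dual-level MPI/OpenMP parallelism. MPI processes split into
 * groups (the paper's top level, one group per mode); OpenMP threads
 * parallelize work inside each process. Values are placeholders. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define NGROUPS 4   /* assumed: one MPI group per computed mode */

int main(int argc, char **argv) {
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Comm group;                     /* top level: MPI process groups */
    MPI_Comm_split(MPI_COMM_WORLD, rank % NGROUPS, rank, &group);

    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)   /* second level: threads */
    for (int i = 0; i < 1000000; i++)
        sum += 1.0 / (1.0 + i);

    double gsum;                        /* reduce within the group only */
    MPI_Allreduce(&sum, &gsum, 1, MPI_DOUBLE, MPI_SUM, group);
    printf("rank %d group %d gsum %.3f\n", rank, rank % NGROUPS, gsum);

    MPI_Comm_free(&group);
    MPI_Finalize();
    return 0;
}
```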
Supporting efficient execution in heterogeneous distributed computing environments with cactus and globus
2001
Proceedings of the 2001 ACM/IEEE conference on Supercomputing (CDROM) - Supercomputing '01
Improvements in the performance of processors and networks make it both feasible and interesting to treat collections of workstations, servers, clusters, and supercomputers as integrated computational ...
We have used this framework to perform record-setting computations in numerical relativity, running across four supercomputers and achieving scaling of 88% (1140 CPUs) and 63% (1500 CPUs). ...
This material is based in part upon work supported by the National Science Foundation under Grant No. 9975020. ...
doi:10.1145/582034.582086
dblp:conf/sc/AllenDFKRST01
fatcat:q7rn55jonzakhjvgrdf6xzgoci
MPI-IO/L: efficient remote I/O for MPI-IO via logistical networking
2006
Proceedings 20th IEEE International Parallel & Distributed Processing Symposium
We show the performance tradeoffs between various remote I/O approaches implemented in the system, which can help scientists identify preferable I/O options for their own applications. ...
This work presents MPI-IO/L, a remote I/O facility for MPI-IO by using Logistical Networking. ...
This approach resembles the disk compression utility and requires extra network bandwidth and storage space. Efficient Noncontiguous I/O Support. ...
doi:10.1109/ipdps.2006.1639305
dblp:conf/ipps/LeeRABT06
fatcat:zyk7tdg5abfztdx37thzcfsy5i
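For context, the MPI-IO interface that MPI-IO/L carries over to remote storage looks like this in its plain local form. The file name and sizes are placeholders, and nothing below is specific to Logistical Networking:

```c
/* Sketch: a plain MPI-IO collective write, the interface MPI-IO/L
 * extends to remote targets. Path and sizes are placeholders. */
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double chunk[1024];
    for (int i = 0; i < 1024; i++) chunk[i] = rank;

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "out.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Each rank writes its block at a rank-dependent offset; under a
     * remote I/O facility the same call would address remote storage. */
    MPI_Offset off = (MPI_Offset)rank * (MPI_Offset)sizeof(chunk);
    MPI_File_write_at_all(fh, off, chunk, 1024, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);
    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}
```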
The Anatomy of Large-Scale Distributed Graph Algorithms
[article]
2015
arXiv
pre-print
The performance analysis becomes a truly experimental science, even more challenging in the presence of massive irregularity and data dependency. ...
The increasing complexity of the software/hardware stack of modern supercomputers results in an explosion of parameters. ...
Data Structures for Algorithm Progress: Algorithms use data structures to maintain intermediate state. Pearce et al. ...
arXiv:1507.06702v1
fatcat:yolns423c5fxhcsgkoeghrxgte
Performance of Natural I/O Applications
[chapter]
2000
Workload Characterization for Computer System Design
As a group, the natural I/O applications outperformed the SPEC95 benchmarks when it comes to overall branch predictor performance. ...
In addition to the four natural I/O applications, five integer SPEC95 benchmarks were used for comparison: go, gcc, perl, compress, and vortex. ...
This illustrates the performance-accuracy tradeoff that we mentioned earlier. [Figure 9: misses per instruction (MPI) vs. cache size for the speech workloads (Full Speech, Acc. Speech).] ...
doi:10.1007/978-1-4615-4387-9_6
fatcat:d5o534nydvgl5ps6rkuxd5x5he
Showing results 1 — 15 out of 748 results