Characterizing and Understanding PDES Behavior on Tilera Architecture
2012
2012 ACM/IEEE/SCS 26th Workshop on Principles of Advanced and Distributed Simulation
The emergence of manycore architectures with shifting balance between computation and communication overhead can have a tremendous impact on performance and scalability of fine-grained parallel applications ...
Finally, we explore the issues of object placement and model partitioning on Tilera architecture. ...
The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of Air ...
doi:10.1109/pads.2012.10
dblp:conf/pads/JagtapBPA12
fatcat:nfrxqav7cbcctcrhclhhbxiu2y
Navigating an Evolutionary Fast Path to Exascale
2012
2012 SC Companion: High Performance Computing, Networking Storage and Analysis
The computing community is in the midst of a disruptive architectural change. ...
Therefore, as architectures, programming models, and programming mechanisms continue to evolve, the preparations described herein will provide significant performance benefits on existing and emerging ...
ACKNOWLEDGEMENTS The breadth of our work has required special efforts from a variety of entities and staff within the Department of Energy and with our industrial collaborators. ...
doi:10.1109/sc.companion.2012.55
dblp:conf/sc/BarrettHVDHLR12
fatcat:3frq3n526vccbmfcpkoneb4edu
Extracting ultra-scale Lattice Boltzmann performance via hierarchical and distributed auto-tuning
2011
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis on - SC '11
Finally, we evaluate the impact of our hierarchical tuning techniques using a variety of problem sizes via large-scale simulations on state-of-the-art Cray XT4, Cray XE6, and IBM BlueGene/P platforms. ...
Next, we present a variety of parallel optimization approaches including programming model exploration (flat MPI, MPI/OpenMP, and MPI/Pthreads), as well as data and thread decomposition strategies designed ...
Finally, to illustrate the impact of communication on performance, and provide apples-to-apples comparisons between architectures, we explore 3 progressively larger datasets: 1, 4, and 16GB per node. ...
doi:10.1145/2063384.2063458
dblp:conf/sc/WilliamsOCS11
fatcat:e7snov63dvfklehytpcqbfb7xi
Parallel Programming Model for the Epiphany Many-Core Coprocessor Using Threaded MPI
[article]
2015
arXiv
pre-print
the importance of fast inter-core communication for the architecture. ...
The Adapteva Epiphany many-core architecture comprises a 2D tiled mesh Network-on-Chip (NoC) of low-power RISC cores with minimal uncore functionality. ...
ACKNOWLEDGMENTS The authors wish to acknowledge the U.S. Army Research Laboratory-hosted Department of Defense Supercomputing Resource Center for its support of this work. ...
arXiv:1506.05442v1
fatcat:edidr7vxd5cglgeaieprywbdgm
On the Efficiency of Executing Hydro-environmental Models on Cloud
2016
Procedia Engineering
Many-core capability is provided by the OpenMP library in a hybrid configuration with MPI for cross-node data movement, and we explore the combination of these in the target setup. ...
For the MPI part, the work flow is implemented as a data-parallel execution model, with all processing elements performing the same computation, on different subdomains with thread-level, fine-grain parallelism ...
Many-core capability is provided by the MPI and OpenMP libraries in a hybrid configuration, and we explore the combination of these in the target setup. ...
doi:10.1016/j.proeng.2016.07.447
fatcat:hz6mhyojovcq5cgjxc3umk5woa
Exploring power behaviors and trade-offs of in-situ data analytics
2013
Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis on - SC '13
model based on system power and data exchange patterns, which is empirically validated; and (3) the use of the model to characterize the energy behavior of the workflow and to explore energy/performance ...
The goal of this paper is exploring data-related energy/performance trade-offs for end-to-end simulation workflows running at scale on current high-end computing systems. ...
Acknowledgments The research presented in this work is supported in part by the Director, Office of Advanced ...
doi:10.1145/2503210.2503303
dblp:conf/sc/GamellRPBKCBLGMPPK13
fatcat:fdt5gmyd6vby7cali23vpm6hb4
Benchmarking a Many-Core Neuromorphic Platform With an MPI-Based DNA Sequence Matching Algorithm
2019
Electronics
across the many cores of the platform. ...
Experimental results indicate that the SpiNNaker parallel architecture allows a linear performance increase with the number of used cores and shows better scalability compared to a general-purpose multi-core ...
Conflicts of Interest: The authors declare no conflict of interest. ...
doi:10.3390/electronics8111342
fatcat:yjsmlxwqtrh2pht53mcz3wux2e
MPI Collectives on Modern Multicore Clusters: Performance Optimizations and Communication Characteristics
2008
2008 Eighth IEEE International Symposium on Cluster Computing and the Grid (CCGRID)
Conclusions and Future Work: Optimizing MPI collective communication on emerging multicore clusters is the key to obtaining good performance speed-ups for many parallel applications. ...
Understanding the impact of these architectures on communication performance is crucial to designing efficient collective algorithms. ...
doi:10.1109/ccgrid.2008.87
dblp:conf/ccgrid/MamidalaKDP08
fatcat:if6w35nuxfcgthgj25crgcn2ia
Initial study of multi-endpoint runtime for MPI+OpenMP hybrid programming model on multi-core systems
2014
Proceedings of the 19th ACM SIGPLAN symposium on Principles and practice of parallel programming - PPoPP '14
State-of-the-art MPI libraries rely on locks to guarantee thread-safety. This discourages application developers from using multiple threads to perform MPI operations. ...
In this paper, we propose a high performance, lock-free multiendpoint MPI runtime, which can achieve up to 40% improvement for point-to-point operation and one representative collective operation with ...
Introduction: The MPI/OpenMP hybrid programming model is widely regarded as a suitable model for scaling parallel applications on emerging multi-/many-core computing architectures. ...
doi:10.1145/2555243.2555287
dblp:conf/ppopp/LuoLHKP14
fatcat:lpr6lccpbrclfgcgedvzrikptq
Optimization of Parallel Discrete Event Simulator for Multi-core Systems
2012
2012 IEEE 26th International Parallel and Distributed Processing Symposium
Results show that multithreaded implementation improves performance over the MPI version by up to a factor of 3 for the Core i7 machine and 1.2 on Magny-cours for 48-way simulation. ...
We study the performance of the simulator on two hardware platforms: a Core i7 machine and a 48-core AMD Opteron Magny-Cours system. ...
The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies and endorsements, either expressed or implied, of Air ...
doi:10.1109/ipdps.2012.55
dblp:conf/ipps/JagtapAP12
fatcat:cd7kymsruragdjxse7khyrky7a
Threaded MPI programming model for the Epiphany RISC array processor
2015
Journal of Computational Science
Using MPI we demonstrate an on-chip performance of 9.1 GFLOPS with an efficiency of 15.3 GFLOPS/W. ...
We present experimental results for matrix-matrix multiplication using MPI and highlight the importance of fast inter-core data transfers. ...
Therefore, it is interesting to explore the utility of MPI for programming on-chip parallelism. ...
doi:10.1016/j.jocs.2015.04.023
fatcat:bmycj4ivzjbifkemmle24ggl7i
Designing High Performance and Scalable MPI Intra-node Communication Support for Clusters
2006
2006 IEEE International Conference on Cluster Computing
As new processor and memory architectures advance, clusters start to be built from larger SMP systems, which makes MPI intra-node communication a critical issue in high performance computing. ...
While running the bandwidth benchmark, the measured L2 cache miss rate is reduced by half. The new design also improves the performance of MPI collective calls by up to 25%. ...
Software Distribution: The design proposed in this paper will be available for downloading in upcoming MVAPICH releases. ...
doi:10.1109/clustr.2006.311850
dblp:conf/cluster/ChaiHP06
fatcat:odigrebu7fe55p34wmzurss73m
Parallel Discrete Event Simulation for Multi-Core Systems: Analysis and Optimization
2014
IEEE Transactions on Parallel and Distributed Systems
Our results show that multithreaded implementation improves the performance over an MPI-based version by up to a factor of 3 on the Core i7, 1.4 on the AMD Magny-Cours, and 2.8 on the Tilera Tile64. ...
We study the performance of the simulator on three hardware platforms: an Intel Core i7 machine, a 48-core AMD Opteron Magny-Cours system, and a 64-core Tilera TilePro64. ...
He received his PhD from the University of Cincinnati in 1997. Dmitry Ponomarev is an Associate Professor in the Department of Computer Science at SUNY Binghamton. ...
doi:10.1109/tpds.2013.193
fatcat:bphisz5u6baobhhxgb44p2ahta
Modeling Ion Channel Kinetics with HPC
2010
2010 IEEE 12th International Conference on High Performance Computing and Communications (HPCC)
The focus of our study is to examine the step-by-step process of adapting new and existing computational biology models to multicore and distributed memory architectures. ...
Performance improvements for computational sciences such as biology, physics, and chemistry are critically dependent on advances in multicore and manycore hardware. ...
the Department of Pediatrics and The Children's Hospital Research Institute (TAB and AG). ...
doi:10.1109/hpcc.2010.46
dblp:conf/hpcc/GehrkeRBCR10
fatcat:cbihbi6lfjhmzlmcxefv7vqdoq
Performance Modeling of Gyrokinetic Toroidal Simulations for a Many-Tasking Runtime System
[chapter]
2014
Lecture Notes in Computer Science
Yet a priori estimation of the potential performance and scalability impact of such runtime systems on existing applications developed around the bulk synchronous parallel (BSP) model is not well understood ...
Conventional programming practices on multicore processors in high performance computing architectures are not universally effective in terms of efficiency and scalability for many algorithms in scientific ...
There are numerous performance studies on the MPI version of GTC [26], [27] across a wide array of architectures, making it an ideal candidate for this case study. ...
doi:10.1007/978-3-319-10214-6_7
fatcat:rpvyc7r6dbcvte6dm4q74cbl4i