Bandwidth Efficient All-reduce Operation on Tree Topologies

Pitch Patarasuk, Xin Yuan
2007 2007 IEEE International Parallel and Distributed Processing Symposium  
We evaluate the algorithm on various clusters of workstations, including a Myrinet cluster with dual-processor SMP nodes, an InfiniBand cluster with two dual-core processors SMP nodes, and an Ethernet  ...  The proposed algorithm can be applied to several contemporary cluster environments, including high-end clusters of workstations with SMP and/or multi-core nodes and low-end Ethernet switched clusters.  ...  The system model: The tree topology is a connected graph G = (V, E) with no cycle, where V is the set of nodes and E is the set of edges. There is a unique path between any two nodes.  ... 
doi:10.1109/ipdps.2007.370405 dblp:conf/ipps/PatarasukY07 fatcat:v6ouncsar5acheytrt7pwteq3y

Quantifying performance benefits of overlap using MPI-2 in a seismic modeling application

Sreeram Potluri, Dhabhaleswar K. Panda, Ping Lai, Karen Tomko, Sayantan Sur, Yifeng Cui, Mahidhar Tatineni, Karl W. Schulz, William L. Barth, Amitava Majumdar
2010 Proceedings of the 24th ACM International Conference on Supercomputing - ICS '10  
AWM-Olsen is a widely used ground motion simulation code based on a parallel finite difference solution of the 3-D velocity-stress wave equation.  ...  This application runs on tens of thousands of cores and consumes several million CPU hours on the TeraGrid Clusters every year.  ...  Gopal Santhanaraman for developing the initial one-sided design and thank Texas Advanced Computing Center (TACC) at The University of Texas at Austin for providing HPC resources on the Ranger that have  ... 
doi:10.1145/1810085.1810092 dblp:conf/ics/PotluriLTSCTSBMP10 fatcat:el2ik747ffhlbjtwu3w6fkdl6y

Deterministic graph-theoretic algorithm for detecting modules in biological interaction networks

Roger L. Chang, Feng Luo, Stuart Johnson, Richard H. Scheuermann
2010 International Journal of Bioinformatics Research and Applications  
A recent approach, Modules of Networks (MoNet), introduced an intuitive module definition and a clear detection method based on a ranked list of edges generated by the Girvan-Newman (G-N) algorithm.  ...  Such deficiencies limit meaningful analysis of a network.  ...  This research was supported by the National Institutes of Health N01-AI40076 and N01-AI40041. F. L. is supported by NSF EPSCoR grant EPS-0447660.  ... 
doi:10.1504/ijbra.2010.032115 pmid:20223734 fatcat:gvrb6ybgvjelbhbcsnn3dygleu

Optimization and Augmentation for Data Parallel Contour Trees

Hamish Carr, Oliver Rubel, Gunther H Weber, James Ahrens
2021 IEEE Transactions on Visualization and Computer Graphics  
performance on average 6 times faster than the state-of-the-art parallel algorithm in the TTK topological toolkit.  ...  We therefore introduce a representation called the hyperstructure that enables efficient searches through the contour tree and use it to construct a fully augmented contour tree in data parallel, with  ...  ACKNOWLEDGMENTS We acknowledge EPSRC Grant EP/J013072/1 and the University of Leeds for the first author's study leave at Los Alamos National Laboratory.  ... 
doi:10.1109/tvcg.2021.3064385 pmid:33684039 fatcat:vuuzegvi5vbxtmws2mdsshawby

Understanding Application Performance via Micro-benchmarks on Three Large Supercomputers: Intrepid, Ranger and Jaguar

Abhinav Bhatelé, Lukasz Wesolowski, Eric Bohm, Edgar Solomonik, Laxmikant V. Kalé
2010 The international journal of high performance computing applications  
The peak unidirectional bandwidth on each torus link is 425 MB/s which gives a total of 5.1 GB/s shared between 4 cores of each node.  ...  The nodes can be used in three different modes: (1) VN mode, where one process runs on each core, (2) DUAL mode where two processes run per node and multiple threads can be fired per process and (3) SMP  ...  Accounts on Jaguar were made available via the Performance Evaluation and Analysis Consortium End Station, a DOE INCITE project.  ... 
doi:10.1177/1094342010370603 fatcat:dxhenihwgvfsfa5fsdw3it67lu

KNN-DBSCAN: a DBSCAN in high dimensions [article]

Youguang Chen, William Ruys, George Biros
2020 arXiv   pre-print
One of the most successful and broadly used algorithms is DBSCAN, a density-based clustering algorithm.  ...  We can cluster one billion points in 3D in less than one second on 28K cores on the Frontera system at the Texas Advanced Computing Center (TACC).  ...  In [2], the author presents a theoretical analysis of the performance and convergence of DBSCAN with increasing n.  ... 
arXiv:2009.04552v1 fatcat:suklblx7dvawzjz2lee7azjc7a

A Parallel Algorithm for Exact Bayesian Structure Discovery in Bayesian Networks [article]

Yetian Chen, Jin Tian, Olga Nikolova, Srinivas Aluru
2016 arXiv   pre-print
) in the Bayesian network is n and the in-degree (the number of parents) per node is bounded by a constant d.  ...  Using dynamic programming (DP), the fastest known sequential algorithm computes the exact posterior probabilities of structural features in O(2(d+1)n2^n) time and space, if the number of nodes (variables  ...  Acknowledgments This work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant number ACI-1053575.  ... 
arXiv:1408.1664v3 fatcat:utsr4ncwdrfr7ncwkv75pezziq

Distributed-Memory Hierarchical Compression of Dense SPD Matrices

Chenhan D. Yu, Severin Reiz, George Biros
2018 SC18: International Conference for High Performance Computing, Networking, Storage and Analysis  
We present different usage scenarios on a selection of SPD matrices that are related to graphs, neural-networks, and covariance operators.  ...  But GOFMM supports only shared memory parallelism. In this paper, we use the message passing interface (MPI) and extend the ideas of GOFMM to the distributed memory setting.  ...  Runtime dependency analysis: To help exploit the parallelism of tree-based algorithms in different granularities, MPI-GOFMM employs a self-contained runtime system.  ... 
doi:10.1109/sc.2018.00018 fatcat:fawrwh3lfjeebg5meygmykxbn4

D5.1: Market and Technology Watch Report Year 1

Jean-Philippe Nominé
2016 Zenodo  
It is thus the continuation of a well-established effort, using assessment of the HPC market based on market surveys, supercomputing conferences, and exchanges with vendors and between experts involved  ...  It aims at delivering information and guidance useful for decision makers at different levels.  ...  Especially Mellanox with its Spectrum line of switches (Figure 28) supports the building of systems with edge switches and server cards running on 25Gb/s while the uplinks from edge switches to core  ... 
doi:10.5281/zenodo.6801690 fatcat:zpnjoenqkvb2te74rvci326vba

Performance of the Unstructured-Mesh, SWAN+ADCIRC Model in Computing Hurricane Waves and Surge

J. C. Dietrich, S. Tanaka, J. J. Westerink, C. N. Dawson, R. A. Luettich, M. Zijlema, L. H. Holthuijsen, J. M. Smith, L. G. Westerink, H. J. Westerink
2011 Journal of Scientific Computing  
cores.  ...  The performance is tested on a variety of platforms, via the examination of output file requirements and management, and the establishment of wall-clock times and scalability using up to 9,216  ...  Acknowledgements This work was supported by awards from the National Science Foundation (DMS-0915223, OCI-0749015 and OCI-0746232).  ... 
doi:10.1007/s10915-011-9555-6 fatcat:jduwqtgtu5citc4x2hgechc544


Joaquin Chung, Wojciech Zacherek, AJ Wisniewski, Zhengchun Liu, Tekin Bicer, Rajkumar Kettimuthu, Ian Foster
2022 Proceedings of the 31st International Symposium on High-Performance Parallel and Distributed Computing  
In response, we propose here SciStream, a middlebox-based architecture with control protocols to enable efficient and secure memory-to-memory data streaming between producers and consumers that lack direct  ...  But efficient and secure memory-to-memory data streaming is challenging to realize in practice, because of a lack of direct external network connectivity for scientific instruments and because of authentication  ...  We perform our experiments at TACC because it has nodes with IB support.  ... 
doi:10.1145/3502181.3531475 fatcat:pbrg6mcx7zei3nv54vt5iqkh34

PVFMM: A Parallel Kernel Independent FMM for Particle and Volume Potentials

Dhairya Malhotra, George Biros
2015 Communications in Computational Physics  
We measure efficiency of our method in terms of CPU cycles per unknown for different accuracies and different kernels.  ...  We also demonstrate scalability of our implementation up to several thousand processor cores on the Stampede platform at the Texas Advanced Computing Center.  ...  This material is based upon work supported by AFOSR grants FA9550-12-10484 and FA9550-11-10339; and NSF grants CCF-1337393, OCI-1029022, and OCI-1047980; and by the U.S.  ... 
doi:10.4208/cicp.020215.150515sw fatcat:bagey7wevzgtzg4ocx5syyotce

A Multistaged Hyperparallel Optimization of the Fuzzy-Logic Mechanistic Model of Molecular Regulation [article]

Paul Aiyetan
2020 bioRxiv   pre-print
Motivation: Although it circumvents hyperparameter estimation of ordinary differential equation (ODE) based models and the complexities of many other models, the computational time complexity of a fuzzy  ...  This undermines the benefits inherent in the simplicity and strength of the fuzzy logic-based molecular regulatory inference approach.  ...  supported by the National Science Foundation at the Texas Advanced Computing Center (TACC) at The University of Texas at Austin, are greatly acknowledged.  ... 
doi:10.1101/2020.09.28.315986 fatcat:4fxbbpkg5vbrnkfdm5gifzsv6m

Strong Scaling of OpenACC enabled Nek5000 on several GPU based HPC systems [article]

Jonathan Vincent, Jing Gong, Martin Karp, Adam Peplinski, Niclas Jansson, Artur Podobas, Andreas Jocksch, Jie Yao, Fazle Hussain, Stefano Markidis, Matts Karlsson, Dirk Pleiter (+2 others)
2021 arXiv   pre-print
The test case considered consists of a direct numerical simulation of fully-developed turbulent flow in a straight pipe, at two different Reynolds numbers Re_τ=360 and Re_τ=550, based on friction velocity  ...  The performance results show that speed-up between 3-5 can be achieved using the GPU accelerated version compared with the CPU version on these different systems.  ...  For a 3d decomposition of the mesh the situation is a different one. We imagine every block to be a cube and merge 8 of those cubes into one of double edge length.  ... 
arXiv:2109.03592v3 fatcat:6e75xxahnfhpxn3plml6lradf4

QPACE 2 and Domain Decomposition on the Intel Xeon Phi [article]

Paul Arts, Jacques Bloch, Peter Georg, Benjamin Glaessle, Simon Heybrock, Yu Komatsubara, Robert Lohmayer, Simon Mages, Bernhard Mendl, Nils Meyer, Alessio Parcianello, Dirk Pleiter (+6 others)
2015 arXiv   pre-print
We give some general recommendations for how to write high-performance code for the Xeon Phi and then discuss our implementation of a domain-decomposition-based solver and present a number of benchmarks  ...  We give an overview of QPACE 2, which is a custom-designed supercomputer based on Intel Xeon Phi processors, developed in a collaboration of Regensburg University and Eurotech.  ...  The obvious way to improve upon the strong-scaling behavior is to switch to another algorithm, based on domain decomposition, that moves less data between KNC and memory, and between different KNCs.  ... 
arXiv:1502.04025v1 fatcat:s24tkzmwivhynfqu3x3cbohlpy