An extensible global address space framework with decoupled task and data abstractions

Sriram Krishnamoorthy, Umit Catalyurek, Jarek Nieplocha, Atanas Rountev, P. Sadayappan
2006 Proceedings 20th IEEE International Parallel & Distributed Processing Symposium  
Locality-aware load balancing of tasks in the task pool is achieved through judicious mapping via hyper-graph partitioning, as well as dynamic task/data migration.  ...  The use of the framework for implementation of parallel block-sparse tensor computations in the context of a quantum chemistry application is illustrated.  ...  Acknowledgments We thank the National Science Foundation for the support of this research through grants 0121676, 0403342, and 0509467, and the U.S.  ... 
doi:10.1109/ipdps.2006.1639577 dblp:conf/ipps/KrishnamoorthyCNRS06 fatcat:ofshuvpcqbaidmh4tljzvn67aq

Inspector/executor load balancing algorithms for block-sparse tensor contractions

David Ozog, Sameer Shende, Allen Malony, Jeff R. Hammond, James Dinan, Pavan Balaji
2013 Proceedings of the 27th international ACM conference on International conference on supercomputing - ICS '13  
to centralized dynamic load balancing.  ...  Architecture-specific and empirically driven performance models of the dominant SORT and DGEMM routines serve as a cost estimator for a once-per-simulation static partitioning process.  ...  Total MFLOPS for each task in a single CCSD T2 tensor contraction for a water monomer simulation. This is a good overall indicator of load imbalance for this particular tensor contraction.  ... 
doi:10.1145/2464996.2467282 dblp:conf/ics/OzogSMHDB13 fatcat:3jk6oyhbqngnzntnwescslxrb4

Cross-scale efficient tensor contractions for coupled cluster computations through multiple programming model backends

Khaled Z. Ibrahim, Evgeny Epifanovsky, Samuel Williams, Anna I. Krylov
2017 Journal of Parallel and Distributed Computing  
These calculations are dominated by a sequence of tensor contractions, motivating the development of numerical libraries for such operations.  ...  handling load-imbalance, tasking and bulk synchronous models.  ...  We would like to thank the anonymous reviewers for providing suggestions to get a better performance of NWChem runs and improve the presentation in this manuscript.  ... 
doi:10.1016/j.jpdc.2017.02.010 fatcat:mcrxnl4b2vaslg7r35sodabn3e

Scioto: A Framework for Global-View Task Parallelism

James Dinan, Sriram Krishnamoorthy, D. Brian Larkins, Jarek Nieplocha, P. Sadayappan
2008 2008 37th International Conference on Parallel Processing  
Through task parallelism, the Scioto framework provides a solution for overcoming irregularity, load imbalance, and heterogeneity as well as dynamic mapping of computation onto emerging architectures.  ...  We introduce Scioto, Shared Collections of Task Objects, a lightweight framework for providing task management on distributed memory machines under one-sided and global-view parallel programming models.  ...  Scioto offers an alternative means for expressing parallelism through shared collections of task objects and provides locality-aware dynamic load balancing.  ... 
doi:10.1109/icpp.2008.44 dblp:conf/icpp/DinanKLNS08 fatcat:d2uk46pzwndglgq3vdfnlmqage

NumS: Scalable Array Programming for the Cloud [article]

Melih Elibol, Vinamra Benara, Samyu Yagati, Lianmin Zheng, Alvin Cheung, Michael I. Jordan, Ion Stoica
2022 arXiv   pre-print
Coupled with a heuristic for load balanced data layouts, our approach is capable of attaining communication lower bounds on some common numerical operations, and our empirical study shows that LSHS enhances  ...  However, many of these tools rely on dynamic schedulers optimized for abstract task graphs, which often encounter memory and network bandwidth-related bottlenecks due to sub-optimal data and operator placement  ...  Acknowledgments Thank you to Amazon Core AI and Microsoft for supporting the evaluation of NumS by providing AWS and Azure credits.  ... 
arXiv:2206.14276v2 fatcat:izlxioiiujdi3pil6qng3auvle

Cyclops Tensor Framework: Reducing Communication and Eliminating Load Imbalance in Massively Parallel Contractions

Edgar Solomonik, Devin Matthews, Jeff Hammond, James Demmel
2013 2013 IEEE 27th International Symposium on Parallel and Distributed Processing  
Cyclops (cyclic-operations) Tensor Framework (CTF) is a distributed library for tensor contractions.  ...  The mapping framework decides on the best mapping for each tensor contraction at run-time via explicit calculations of memory usage and communication volume.  ...  Since this dynamically scheduled scheme is not load balanced, NWChem uses dynamic load balancing among the processors.  ... 
doi:10.1109/ipdps.2013.112 dblp:conf/ipps/SolomonikMHD13 fatcat:ql44ablwgvgotea3zl6cuknnda

A massively parallel tensor contraction framework for coupled-cluster computations

Edgar Solomonik, Devin Matthews, Jeff R. Hammond, John F. Stanton, James Demmel
2014 Journal of Parallel and Distributed Computing  
Each contraction may be executed via matrix multiplication on a properly ordered and structured tensor. However, data transpositions are often needed to reorder the tensors for each contraction.  ...  We present a distributed-memory numerical library (Cyclops Tensor Framework (CTF)) that automatically manages tensor blocking and redistribution to perform any user-specified contractions.  ...  tensor contractions • a load-balanced blocking scheme for symmetric tensors • optimized redistribution kernels for symmetric tensors • an expressive and compact tensor domain specific language (DSL) •  ... 
doi:10.1016/j.jpdc.2014.06.002 fatcat:76at7oi2vfhbxfc6tmbzoe2xyy

Accelerating NWChem Coupled Cluster Through Dataflow-Based Execution [chapter]

Heike Jagode, Anthony Danalis, George Bosilca, Jack Dongarra
2016 Lecture Notes in Computer Science  
Furthermore, we argue how the CC algorithms can be easily decomposed into finer grained tasks (compared to the original version of NWChem); and how data distribution and load balancing are decoupled and  ...  We demonstrate performance acceleration by more than a factor of two in the execution of the entire CC component of NWChem, concluding that the utilization of dataflow-based execution for CC methods enables  ...  The Dynamic Load-balanced Tensor Contractions framework [8] has been designed with the goal to provide dynamic task partitioning for tensor contraction expressions.  ... 
doi:10.1007/978-3-319-32149-3_35 fatcat:6bwtlgnw3nb5pjaanvikxs4y6q

GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding [article]

Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, Zhifeng Chen
2020 arXiv   pre-print
It provides an elegant way to express a wide range of parallel computation patterns with minimal changes to the existing model code.  ...  GShard is a module composed of a set of lightweight annotation APIs and an extension to the XLA compiler.  ...  Acknowledgements We would like to thank the Google Brain and Translate teams for their useful input and insightful discussions, entire XLA and Lingvo development teams for their foundational contributions  ... 
arXiv:2006.16668v1 fatcat:tucpisgorneq3gbikveukhqxri

Distributed-memory multi-GPU block-sparse tensor contraction for electronic structure

Thomas Herault, Yves Robert, George Bosilca, Robert J. Harrison, Cannada A. Lewis, Edward F. Valeev, Jack J. Dongarra
2021 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS)  
In this paper, we focus on the critical element of block-sparse tensor algebra, namely binary tensor contraction, and report on an efficient and scalable implementation using the task-focused PaRSEC runtime  ...  High performance of the block-sparse tensor contraction on the Summit supercomputer is demonstrated for synthetic data as well as for real data involved in electronic structure simulations of unprecedented  ...  a good load-balance of the computations. 2) Partition into Blocks: Once the columns of B have been assigned to the processors, they are divided into blocks which are assigned to GPUs.  ... 
doi:10.1109/ipdps49936.2021.00062 fatcat:4d5uwxkmkfdatkhfo76vbgcs5a

Fine-grained Locality-aware Parallel Scheme for Anisotropic Mesh Adaptation

Hoby Rakotoarivelo, Franck Ledoux, Franck Pommereau
2016 Procedia Engineering  
Data dependencies are expressed by a graph for each kernel, and concurrency is extracted through fine-grained graph coloring.  ...  The devised scheme was evaluated on a 4 NUMA node (2-socket) machine, and a mean efficiency of 70% was reached on 32 cores for 3 kernels out of 4.  ...  Acknowledgement A special thanks to Nicolas Le-Goff for his assistance through all steps of this work.  ... 
doi:10.1016/j.proeng.2016.11.035 fatcat:zom63bc53bc43mn42dbsnebvuq

Accelerating NWChem Coupled Cluster through dataflow-based execution

Heike Jagode, Anthony Danalis, Jack Dongarra
2017 The international journal of high performance computing applications  
Furthermore, we argue how the CC algorithms can be easily decomposed into finer-grained tasks (compared with the original version of NWCHEM); and how data distribution and load balancing are decoupled  ...  We demonstrate performance acceleration by more than a factor of two in the execution of the entire CC component of NWCHEM, concluding that the utilization of dataflow-based execution for CC methods enables  ...  A portion of this research was performed using EMSL, a DOE Office of Science User Facility sponsored by the Office of Biological and Environmental Research and located at Pacific Northwest National Laboratory  ... 
doi:10.1177/1094342016672543 fatcat:nl7okrxudvcp7dmnynm7pxbxxe

Scaling up Hartree–Fock calculations on Tianhe-2

Edmond Chow, Xing Liu, Sanchit Misra, Marat Dukhan, Mikhail Smelyanskiy, Jeff R. Hammond, Yunfei Du, Xiang-Ke Liao, Pradeep Dubey
2015 The international journal of high performance computing applications  
We describe a general framework for finding a well-balanced static partitioning of the load in the presence of screening. Work stealing is used to polish the load balance.  ...  A major issue is load balance, which is made challenging due to integral screening.  ...  Computer time for development on Stampede was provided under NSF XSEDE (grant number TG-CCR140016).  ... 
doi:10.1177/1094342015592960 fatcat:6kwdfl62trcjvdh7yyjkl54ud4

A Computational-Graph Partitioning Method for Training Memory-Constrained DNNs [article]

Fareed Qararyah, Mohamed Wahib, Doğa Dikbayır, Mehmet Esat Belviranli, Didem Unat
2020 arXiv   pre-print
It partitions DNNs having billions of parameters and hundreds of thousands of operations in seconds to a few minutes.  ...  We propose ParDNN, an automatic, generic, and non-intrusive partitioning strategy for large DNN models that do not fit into single device memory. ParDNN decides a placement of DNN's underlying computational  ...  Exploration of Exascale Computing (eX3), which is financially supported by the Research Council of Norway under contract 270053.  ... 
arXiv:2008.08636v1 fatcat:eqifuw4mpbg6dafprd4ubgwsde

PaRSEC in Practice: Optimizing a Legacy Chemistry Application through Distributed Task-Based Execution

Anthony Danalis, Heike Jagode, George Bosilca, Jack Dongarra
2015 2015 IEEE International Conference on Cluster Computing  
Task-based execution has been growing in popularity as a means to deliver a good balance between performance and portability in the post-petascale era.  ...  In this paper, we discuss the use of PARSEC to convert a part of the Coupled Cluster (CC) component of the Quantum Chemistry package NWCHEM into a task-based form.  ...  Other efforts to improve load balancing and scalability of tensor contractions, such as the Cyclops Tensor Framework [25], or the Dynamic Load-balanced Tensor Contractions framework [26] offer orthogonal  ... 
doi:10.1109/cluster.2015.50 dblp:conf/cluster/DanalisJBD15 fatcat:dbteukyqzbh33k3k2okldocmme