
Exploring Multi-dimensional Hierarchical Network Topologies for Efficient Distributed Training of Trillion Parameter DL Models [article]

William Won, Saeed Rashidi, Sudarshan Srinivasan, Tushar Krishna
2021 arXiv   pre-print
overall training time for the target workload.  ...  expensive NICs required for the scale-out network.  ...  All-Reduce can be broken into a Reduce-Scatter followed by an All-Gather communication pattern.  ... 
arXiv:2109.11762v1 fatcat:52aunlyalba7dfe3jkl23eyxle

Introduction to special issue on "distributed sensor networks for real-time systems with adaptive configuration"

S.S Iyengar, K Chakrabarty, Hairong Qi
2001 Journal of the Franklin Institute  
The development of dynamic distributed networks for information gathering in unstructured environments is receiving a lot of interest, partly because of the availability of new sensor technology that makes  ...  The network structure design for traditional DSNs and for wireless ad hoc sensor networks (WASNs). 2.  ... 
doi:10.1016/s0016-0032(01)00027-8 fatcat:y6hmv4v77je23gld26kkqdez7i

Demotion-based exclusive caching through demote buffering

Jiesheng Wu, Pete Wyckoff, Dhabaleswar K. Panda
2003 Proceedings of the international workshop on Storage network architecture and parallel I/Os - SNAPI '03  
A maximum speedup of 1.4x over the original DEMOTE approach is achieved for some workloads. 1.08-1.15x speedups are achieved for two real-life workloads.  ...  We evaluate the performance of DEMOTE buffering using simulations across both synthetic and real-life workloads on different networks.  ...  ACKNOWLEDGMENT The authors would like to thank Theodore M. Wong for his simulator and for his answers to our questions on the DEMOTE exclusive caching.  ... 
doi:10.1145/1162618.1162627 fatcat:25cda2vtnnc3xmzhkqxa2573om

Towards Evidence-aware Learning Design for the Integration of ePortfolios in Distributed Learning Environments

Angélica Lozano-Álvarez, Juan I. Asensio-Pérez, Guillermo Vega-Gorgojo
2013 Proceedings of the 5th International Conference on Computer Supported Education  
This paper proposes a model aimed at enhancing the description of learning activities with information about the evidences they are expected to generate.  ...  The benefits of using ePortfolios in widespread Distributed Learning Environments are hindered by two problems: students have difficulties in selecting which learning artifacts may demonstrate the acquisition  ...  Teachers' workload increases when trying to gather those work samples.  ... 
doi:10.5220/0004382804050410 dblp:conf/csedu/Lozano-AlvarezAV13 fatcat:azkq7rc4ijhm3aykv3aemerpny

HRL: Efficient and flexible reconfigurable logic for near-data processing

Mingyu Gao, Christos Kozyrakis
2016 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA)  
HRL combines both coarse-grained and fine-grained logic blocks, separates routing networks for data and control signals, and uses specialized units to effectively support branch operations and irregular  ...  For NDP systems running MapReduce, graph processing, and deep neural networks, HRL achieves 92% of the peak performance of an NDP system based on custom accelerators for each application.  ...  The authors want to thank Raghu Prabhakar, Christina Delimitrou, and the anonymous reviewers for their insightful comments on earlier versions of this paper.  ... 
doi:10.1109/hpca.2016.7446059 dblp:conf/hpca/GaoK16 fatcat:46yt3s3vznd23ma4aaszp3jfdy

Understanding GNN Computational Graph: A Coordinated Computation, IO, and Memory Perspective [article]

Hengrui Zhang, Zhongming Yu, Guohao Dai, Guyue Huang, Yufei Ding, Yuan Xie, Yu Wang
2021 arXiv   pre-print
For GNN training which is usually performed concurrently with inference, intermediate data must be stored for the backward pass, consuming 91.9% of the total memory requirement.  ...  We reorganize operators to perform neural operations before the propagation, thus the redundant computation is eliminated. (2) Unified thread mapping for fusion.  ...  The recomputing score ComputationCost / MemoryCost is O(|log |E| |V| |) for Gather and O(1) for Scatter and ApplyEdge.  ... 
arXiv:2110.09524v1 fatcat:uo6rtox3wjfmjeav2iocgnsjse

Optimized Broadcast for Deep Learning Workloads on Dense-GPU InfiniBand Clusters: MPI or NCCL? [article]

Ammar Ahmad Awan, Ching-Hsiang Chu, Hari Subramoni, Dhabaleswar K. Panda
2017 arXiv   pre-print
In addition, the proposed designs provide up to 7% improvement over NCCL-based solutions for data parallel training of the VGG network on 128 GPUs using Microsoft CNTK.  ...  In this context, special-purpose libraries like NVIDIA NCCL have been proposed for GPU-based collective communication on dense GPU systems.  ...  NCCL's API closely resembles the MPI interface and provides communication primitives for broadcast, all-gather, reduce, reduce-scatter, and all-reduce.  ... 
arXiv:1707.09414v1 fatcat:lqh3x46v7jcqvkxjdasjxxxqda

Tensor Processing Primitives: A Programming Abstraction for Efficiency and Portability in Deep Learning Workloads [article]

Evangelos Georganas, Dhiraj Kalamkar, Sasikanth Avancha, Menachem Adelman, Cristina Anderson, Alexander Breuer, Narendra Chaudhary, Abhisek Kundu, Vasimuddin Md, Sanchit Misra, Ramanarayan Mohanty, Hans Pabst (+2 others)
2021 arXiv   pre-print
This work introduces the Tensor Processing Primitives (TPP), a programming abstraction striving for efficient, portable implementation of DL-workloads with high-productivity.  ...  Despite the advances in workload/hardware ecosystems, the programming methodology of DL-systems is stagnant.  ...  Replicate columns: takes an input column/vector, replicates it a variable number of times, and forms the output. Gather / Scatter: gathers/scatters rows/columns from input and forms the tensor. 2D Gat  ... 
arXiv:2104.05755v2 fatcat:x6n4kys3ujcn5a6ifb2bkil2ly

Efficient HPC Data Motion via Scratchpad Memory

Kayla O Seager, Ananta Tiwari, Michael A. Laurenzano, Joshua Peraza, Pietro Cicotti, Laura Carrington
2012 2012 SC Companion: High Performance Computing, Networking Storage and Analysis  
The energy required to move data accounts for a significant portion of the energy consumption of a modern supercomputer.  ...  Because the motion of each bit throughout the memory hierarchy has a large energy and performance cost, energy efficiency will improve if we can ensure that only the bits absolutely necessary for the computation  ...  ACKNOWLEDGMENT This work was supported in part by the DOE Office of Science through the Advanced Scientific Computing Research (ASCR) award titled "Thrifty: An Exascale Architecture for Energy-Proportional  ... 
doi:10.1109/sc.companion.2012.111 dblp:conf/sc/SeagerTLPCC12 fatcat:4bwayaibuvfy3erlrlqaibd2qa

Tensor Casting: Co-Designing Algorithm-Architecture for Personalized Recommendation Training [article]

Youngeun Kwon, Yunjae Lee, Minsoo Rhu
2020 arXiv   pre-print
We then propose our algorithm-architecture co-design called Tensor Casting, which enables the development of a generic accelerator architecture for tensor gather-scatter that encompasses all the key primitives  ...  As such, architectural solutions for high-performance recommendation inference have recently been the target of several prior literatures.  ...  the design cost of TPUs by specializing its microarchitecture for GEMMs.  ... 
arXiv:2010.13100v1 fatcat:kt7vrmg7ezhijgdsvoqjywwkye

SOAR: Minimizing Network Utilization with Bounded In-network Computing [article]

Raz Segal, Chen Avin, Gabriel Scalosub
2021 arXiv   pre-print
We formulate and study the problem of activating a limited number of in-network computing devices within a network, aiming at reducing the overall network utilization for a given workload.  ...  Such limitations on the number of in-network computing elements per workload arise, e.g., in incremental upgrades of network infrastructure, and are also due to requiring specialized middleboxes, or FPGAs  ...  ACKNOWLEDGMENTS The authors would like to thank the anonymous reviewers and our shepherd, Shay Vargaftik, for their valuable feedback which helped improve the paper.  ... 
arXiv:2110.14224v1 fatcat:xhvoxnd5tbd4zgyup5dv2cmtza

Design and analysis of data management in scalable parallel scripting

Zhao Zhang, Daniel S. Katz, Justin M. Wozniak, Allan Espinosa, Ian Foster
2012 2012 International Conference for High Performance Computing, Networking, Storage and Analysis  
Thus, we design and implement a scalable MTC data management system that uses aggregated compute node local storage for more efficient data movement strategies.  ...  in the shared filesystem.  ...  Computing resources were provided by the Argonne Leadership Computing Facility. We thank Dr. David Mathog (Caltech) for his support with parallel BLAST, and the ALCF support team.  ... 
doi:10.1109/sc.2012.44 dblp:conf/sc/ZhangKWEF12 fatcat:mdukzucq7jf33ebsjggjetd7qi

A Software Data Transport Framework for Trigger Applications on Clusters [article]

Timm M. Steinbeck, Volker Lindenstruth, Heinz Tilsner (Kirchhoff Institute of Physics, Ruprecht-Karls-University Heidelberg, Germany, for the ALICE Collaboration)
2003 arXiv   pre-print
The HLT system that is being designed to cope with these data rates consists of a large PC cluster, on the order of 1000 nodes, connected by a fast network.  ...  First performance tests show very promising results for the software, indicating that it can achieve an event rate for the data transport sufficiently high to satisfy ALICE's requirements.  ...  Acknowledgments Work on the ALICE High Level Trigger has been financed by the German Federal Ministry of Education and Research (BMBF) as part of its program "Förderschwerpunkt Hadronen- und Kernphysik  ... 
arXiv:cs/0306029v1 fatcat:gfdsngwnw5aibbwmjnjkvknnvm

Towards Efficient Large-Scale Graph Neural Network Computing [article]

Lingxiao Ma, Zhi Yang, Youshan Miao, Jilong Xue, Ming Wu, Lidong Zhou, Yafei Dai
2018 arXiv   pre-print
NGra presents a new SAGA-NN model for expressing deep neural networks as vertex programs with each layer in well-defined (Scatter, ApplyEdge, Gather, ApplyVertex) graph operation stages.  ...  We introduce NGra, the first parallel processing framework for graph-based deep neural networks (GNNs).  ...  a SAGA-NN (Scatter-ApplyEdge-Gather-ApplyVertex with Neural Networks) vertex-program abstraction.  ... 
arXiv:1810.08403v1 fatcat:qvybtgioife7zarswcnppu5vrm

ReGraph: Scaling Graph Processing on HBM-enabled FPGAs with Heterogeneous Pipelines [article]

Xinyu Chen, Yao Chen, Feng Cheng, Hongshi Tan, Bingsheng He, Weng-Fai Wong
2022 arXiv   pre-print
The use of FPGAs for efficient graph processing has attracted significant interest.  ...  In this paper, we re-examined the graph processing workloads and found much diversity in processing.  ...  SYSTEM ARCHITECTURE To support various graph algorithms, our system adopts the popular Gather-Apply-Scatter (GAS) model, which contains three stages for each iteration: the Scatter, the Gather, and the  ... 
arXiv:2203.02676v1 fatcat:yvkfsuxstnhczbjevtnnqpuzge