Exploring Multi-dimensional Hierarchical Network Topologies for Efficient Distributed Training of Trillion Parameter DL Models
[article]
2021
arXiv
pre-print
overall training time for the target workload. ...
expensive NICs required for the scale-out network. ...
All-Reduce can be broken into a Reduce-Scatter followed by an All-Gather communication pattern. ...
arXiv:2109.11762v1
fatcat:52aunlyalba7dfe3jkl23eyxle
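The decomposition stated in the snippet above (All-Reduce = Reduce-Scatter + All-Gather) is easy to make concrete. Below is a minimal NumPy sketch that emulates it over a handful of simulated ranks; the rank count, buffer sizes, and function name are assumptions for illustration, not code from the paper.

```python
import numpy as np

def all_reduce_via_rs_ag(rank_buffers):
    """Emulate All-Reduce as Reduce-Scatter followed by All-Gather.

    rank_buffers: list of equal-length 1-D arrays, one per simulated rank.
    Returns the fully reduced buffer for every rank (all identical).
    """
    n = len(rank_buffers)
    chunks = [np.array_split(buf, n) for buf in rank_buffers]

    # Reduce-Scatter: rank i ends up owning the sum of chunk i from all ranks.
    reduced = [sum(chunks[r][i] for r in range(n)) for i in range(n)]

    # All-Gather: every rank collects all the reduced chunks.
    result = np.concatenate(reduced)
    return [result.copy() for _ in range(n)]

bufs = [np.ones(8) * (r + 1) for r in range(4)]  # ranks hold 1s, 2s, 3s, 4s
out = all_reduce_via_rs_ag(bufs)
assert np.allclose(out[0], 10.0)                 # 1 + 2 + 3 + 4
```

In a real ring implementation each of the two phases costs n-1 steps moving 1/n of the buffer per step, which is why the split form is bandwidth-optimal.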
Introduction to special issue on "distributed sensor networks for real-time systems with adaptive configuration"
2001
Journal of the Franklin Institute
The development of dynamic distributed networks for information gathering in unstructured environments is receiving a lot of interest, partly because of the availability of new sensor technology that makes ...
The network structure design for traditional DSNs and for wireless ad hoc sensor networks (WASNs). ...
doi:10.1016/s0016-0032(01)00027-8
fatcat:y6hmv4v77je23gld26kkqdez7i
Demotion-based exclusive caching through demote buffering
2003
Proceedings of the international workshop on Storage network architecture and parallel I/Os - SNAPI '03
A maximum speedup of 1.4x over the original DEMOTE approach is achieved for some workloads. 1.08-1.15x speedups are achieved for two real-life workloads. ...
We evaluate the performance of DEMOTE buffering using simulations across both synthetic and real-life workloads on different networks. ...
ACKNOWLEDGMENT The authors would like to thank Theodore M. Wong for his simulator and for his answers to our questions on the DEMOTE exclusive caching. ...
doi:10.1145/1162618.1162627
fatcat:25cda2vtnnc3xmzhkqxa2573om
Towards Evidence-aware Learning Design for the Integration of ePortfolios in Distributed Learning Environments
2013
Proceedings of the 5th International Conference on Computer Supported Education
This paper proposes a model aimed at enhancing the description of learning activities with information about the evidence they are expected to generate. ...
The benefits of using ePortfolios in widespread Distributed Learning Environments are hindered by two problems: students have difficulties in selecting which learning artifacts may demonstrate the acquisition ...
Teachers' workload increases when trying to gather those work samples. ...
doi:10.5220/0004382804050410
dblp:conf/csedu/Lozano-AlvarezAV13
fatcat:azkq7rc4ijhm3aykv3aemerpny
HRL: Efficient and flexible reconfigurable logic for near-data processing
2016
2016 IEEE International Symposium on High Performance Computer Architecture (HPCA)
HRL combines both coarse-grained and fine-grained logic blocks, separates routing networks for data and control signals, and uses specialized units to effectively support branch operations and irregular ...
For NDP systems running MapReduce, graph processing, and deep neural networks, HRL achieves 92% of the peak performance of an NDP system based on custom accelerators for each application. ...
The authors want to thank Raghu Prabhakar, Christina Delimitrou, and the anonymous reviewers for their insightful comments on earlier versions of this paper. ...
doi:10.1109/hpca.2016.7446059
dblp:conf/hpca/GaoK16
fatcat:46yt3s3vznd23ma4aaszp3jfdy
Understanding GNN Computational Graph: A Coordinated Computation, IO, and Memory Perspective
[article]
2021
arXiv
pre-print
For GNN training, which is usually performed concurrently with inference, intermediate data must be stored for the backward pass, consuming 91.9% of the total memory requirement. ...
We reorganize operators to perform neural operations before the propagation, so that the redundant computation is eliminated. (2) Unified thread mapping for fusion. ...
The recomputing score, ComputationCost / MemoryCost, is O(log(|E|/|V|)) for Gather and O(1) for Scatter and ApplyEdge. ...
arXiv:2110.09524v1
fatcat:uo6rtox3wjfmjeav2iocgnsjse
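The operator reorganization mentioned in the snippet above, performing the neural operation before propagation, can be sketched in a few lines. The toy graph, weight matrix, and scatter-add step below are illustrative assumptions, not the paper's code; the point is that the two orderings give the same result while the reorganized one does O(|V|) rather than O(|E|) matrix multiplies.

```python
import numpy as np

# Toy graph: edges as (src, dst) pairs; 4 vertices, feature dim 3 (assumed).
edges = np.array([[0, 1], [0, 2], [1, 2], [2, 3], [3, 0]])
X = np.random.randn(4, 3)
W = np.random.randn(3, 3)

# Redundant form: transform per edge, then aggregate -> O(|E|) matmuls.
msgs = X[edges[:, 0]] @ W
out_redundant = np.zeros_like(X)
np.add.at(out_redundant, edges[:, 1], msgs)

# Reorganized form: transform per vertex first, then propagate -> O(|V|) matmuls.
XW = X @ W
out_fused = np.zeros_like(X)
np.add.at(out_fused, edges[:, 1], XW[edges[:, 0]])

assert np.allclose(out_redundant, out_fused)
```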
Optimized Broadcast for Deep Learning Workloads on Dense-GPU InfiniBand Clusters: MPI or NCCL?
[article]
2017
arXiv
pre-print
In addition, the proposed designs provide up to 7% improvement over NCCL-based solutions for data parallel training of the VGG network on 128 GPUs using Microsoft CNTK. ...
In this context, special-purpose libraries like NVIDIA NCCL have been proposed for GPU-based collective communication on dense GPU systems. ...
NCCL's API closely resembles the MPI interface and provides communication primitives for broadcast, all-gather, reduce, reduce-scatter, and all-reduce. ...
arXiv:1707.09414v1
fatcat:lqh3x46v7jcqvkxjdasjxxxqda
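As context for the primitive list in the snippet above, here is a minimal PyTorch script that exercises NCCL-backed collectives. It is a sketch under assumptions: the rendezvous environment variables and one-GPU-per-rank mapping would normally be provided by a launcher such as torchrun, and only two of the listed primitives are shown.

```python
import torch
import torch.distributed as dist

def main():
    # Assumes the launcher set RANK / WORLD_SIZE / MASTER_ADDR / MASTER_PORT.
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)

    t = torch.ones(4, device="cuda") * (rank + 1)

    dist.broadcast(t, src=0)                    # every rank gets rank 0's tensor
    dist.all_reduce(t, op=dist.ReduceOp.SUM)    # elementwise sum across ranks

    print(f"rank {rank}: {t}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```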
Tensor Processing Primitives: A Programming Abstraction for Efficiency and Portability in Deep Learning Workloads
[article]
2021
arXiv
pre-print
This work introduces the Tensor Processing Primitives (TPP), a programming abstraction striving for efficient, portable implementation of DL workloads with high productivity. ...
Despite the advances in workload/hardware ecosystems, the programming methodology of DL systems is stagnant. ...
Replicate columns: takes an input column/vector, replicates it a variable number of times, and forms the output. Gather / Scatter: gathers/scatters rows/columns from the input and forms the tensor. 2D Gat ...
arXiv:2104.05755v2
fatcat:x6n4kys3ujcn5a6ifb2bkil2ly
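To make the gather/scatter and column-replication primitives in the fragment above concrete, a NumPy analogue is shown below; the row-wise orientation, shapes, and index set are illustrative assumptions rather than the TPP API.

```python
import numpy as np

T = np.arange(12.0).reshape(4, 3)       # input tensor (assumed shape)
idx = np.array([2, 0])                  # rows to gather

gathered = T[idx]                       # Gather: pick rows, form a new tensor

scattered = np.zeros_like(T)
scattered[idx] = gathered               # Scatter: write rows back by index

replicated = np.tile(T[:, :1], (1, 3))  # Replicate columns: copy one column 3x
```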
Efficient HPC Data Motion via Scratchpad Memory
2012
2012 SC Companion: High Performance Computing, Networking Storage and Analysis
The energy required to move data accounts for a significant portion of the energy consumption of a modern supercomputer. ...
Because the motion of each bit throughout the memory hierarchy has a large energy and performance cost, energy efficiency will improve if we can ensure that only the bits absolutely necessary for the computation ...
ACKNOWLEDGMENT This work was supported in part by the DOE Office of Science through the Advanced Scientific Computing Research (ASCR) award titled "Thrifty: An Exascale Architecture for Energy-Proportional ...
doi:10.1109/sc.companion.2012.111
dblp:conf/sc/SeagerTLPCC12
fatcat:4bwayaibuvfy3erlrlqaibd2qa
Tensor Casting: Co-Designing Algorithm-Architecture for Personalized Recommendation Training
[article]
2020
arXiv
pre-print
We then propose our algorithm-architecture co-design called Tensor Casting, which enables the development of a generic accelerator architecture for tensor gather-scatter that encompasses all the key primitives ...
As such, architectural solutions for high-performance recommendation inference have recently been the target of several prior works. ...
the design cost of TPUs by specializing its microarchitecture for GEMMs. ...
arXiv:2010.13100v1
fatcat:kt7vrmg7ezhijgdsvoqjywwkye
SOAR: Minimizing Network Utilization with Bounded In-network Computing
[article]
2021
arXiv
pre-print
We formulate and study the problem of activating a limited number of in-network computing devices within a network, aiming at reducing the overall network utilization for a given workload. ...
Such limitations on the number of in-network computing elements per workload arise, e.g., in incremental upgrades of network infrastructure, and are also due to requiring specialized middleboxes, or FPGAs ...
ACKNOWLEDGMENTS The authors would like to thank the anonymous reviewers and our shepherd, Shay Vargaftik, for their valuable feedback which helped improve the paper. ...
arXiv:2110.14224v1
fatcat:xhvoxnd5tbd4zgyup5dv2cmtza
Design and analysis of data management in scalable parallel scripting
2012
2012 International Conference for High Performance Computing, Networking, Storage and Analysis
Thus, we design and implement a scalable MTC data management system that uses aggregated compute node local storage for more efficient data movement strategies. ...
in the shared filesystem. ...
Computing resources were provided by the Argonne Leadership Computing Facility. We thank Dr. David Mathog (Caltech) for his support with parallel BLAST, and the ALCF support team. ...
doi:10.1109/sc.2012.44
dblp:conf/sc/ZhangKWEF12
fatcat:mdukzucq7jf33ebsjggjetd7qi
A Software Data Transport Framework for Trigger Applications on Clusters
[article]
2003
arXiv
pre-print
The HLT system that is being designed to cope with these data rates consists of a large PC cluster, on the order of up to 1000 nodes, connected by a fast network. ...
First performance tests show very promising results for the software, indicating that it can achieve an event rate for the data transport sufficiently high to satisfy ALICE's requirements. ...
Acknowledgments Work on the ALICE High Level Trigger has been financed by the German Federal Ministry of Education and Research (BMBF) as part of its program "Förderschwerpunkt Hadronen- und Kernphysik ...
arXiv:cs/0306029v1
fatcat:gfdsngwnw5aibbwmjnjkvknnvm
Towards Efficient Large-Scale Graph Neural Network Computing
[article]
2018
arXiv
pre-print
NGra presents a new SAGA-NN model for expressing deep neural networks as vertex programs with each layer in well-defined (Scatter, ApplyEdge, Gather, ApplyVertex) graph operation stages. ...
We introduce NGra, the first parallel processing framework for graph-based deep neural networks (GNNs). ...
a SAGA-NN (Scatter-ApplyEdge-Gather-ApplyVertex with Neural Networks) vertex-program abstraction. ...
arXiv:1810.08403v1
fatcat:qvybtgioife7zarswcnppu5vrm
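A schematic rendering of the four SAGA-NN stages named in the snippets above, under assumed dense-feature shapes; this is an illustrative sketch, not NGra's actual API.

```python
import numpy as np

def saga_nn_layer(edges, X, W_edge, W_vertex):
    """One layer in SAGA-NN style: Scatter -> ApplyEdge -> Gather -> ApplyVertex."""
    src, dst = edges[:, 0], edges[:, 1]

    edge_in = X[src]                           # Scatter: vertex data onto edges
    edge_out = edge_in @ W_edge                # ApplyEdge: per-edge neural op

    acc = np.zeros((X.shape[0], W_edge.shape[1]))
    np.add.at(acc, dst, edge_out)              # Gather: accumulate at vertices

    return np.tanh(acc @ W_vertex)             # ApplyVertex: per-vertex neural op

edges = np.array([[0, 1], [1, 2], [2, 0]])
X = np.random.randn(3, 4)
out = saga_nn_layer(edges, X, np.random.randn(4, 4), np.random.randn(4, 4))
```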
ReGraph: Scaling Graph Processing on HBM-enabled FPGAs with Heterogeneous Pipelines
[article]
2022
arXiv
pre-print
The use of FPGAs for efficient graph processing has attracted significant interest. ...
In this paper, we re-examined graph processing workloads and found considerable diversity in their processing characteristics. ...
SYSTEM ARCHITECTURE To support various graph algorithms, our system adopts the popular Gather-Apply-Scatter (GAS) model, which contains three stages for each iteration: the Scatter, the Gather, and the ...
arXiv:2203.02676v1
fatcat:yvkfsuxstnhczbjevtnnqpuzge
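The Gather-Apply-Scatter (GAS) model mentioned in the snippet above is commonly illustrated with PageRank; the toy version below is an assumed formulation for illustration, not ReGraph's pipeline code.

```python
import numpy as np

def pagerank_gas(edges, n, iters=20, d=0.85):
    """PageRank expressed in GAS stages (illustrative sketch)."""
    src, dst = edges[:, 0], edges[:, 1]
    out_deg = np.bincount(src, minlength=n).astype(float)
    rank = np.full(n, 1.0 / n)

    for _ in range(iters):
        contrib = rank[src] / out_deg[src]                     # Scatter along edges
        acc = np.bincount(dst, weights=contrib, minlength=n)   # Gather at vertices
        rank = (1 - d) / n + d * acc                           # Apply vertex update
    return rank

edges = np.array([[0, 1], [1, 2], [2, 0], [2, 1]])
print(pagerank_gas(edges, n=3))
```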