Page 2041 of Mathematical Reviews Vol. , Issue 99c
[page]
1991
Mathematical Reviews
Although useful heuristics have been proposed for such systems, we present the first exact analysis of broadcast and summation on hierarchical ring architectures.”
99c:68020 68M10
Misic, Jelena (PRC-HKST-C ...
The computations of both our potential function and our packet-canceling policy are totally local in nature.”
99c:68019 68M10
Michail, Amir (1-WA-CE; Seattle, WA)
Optimal broadcast and summation on hierarchical ...
A Unified Architecture for Accelerating Distributed DNN Training in Heterogeneous GPU/CPU Clusters
2020
USENIX Symposium on Operating Systems Design and Implementation
Summation Service can be accelerated by AVX instructions and can be efficiently run on CPUs, while DNN model-related optimizer algorithms are run on GPUs for computation acceleration. ...
It introduces a Summation Service abstraction for aggregating gradients, which is common for all the optimizers. ...
We find that running the full optimizers on CPU can be a bottleneck. We divide the computation of optimizers and only put summation on CPUs. We will elaborate the rationale of this design in §5. ...
dblp:conf/osdi/JiangZLYCG20
fatcat:udl2ksqorfetdaznv5jzhlonte
Broadcast and Weight: An Integrated Network For Scalable Photonic Spike Processing
2014
Journal of Lightwave Technology
We propose an on-chip optical architecture to support massive parallel communication among high-performance spiking laser neurons. ...
Broadcast-and-weight is a new approach for combining neuromorphic processing and optoelectronic physics, a pairing that is found to yield a variety of advantageous features. ...
Hierarchical organization of the waveguide broadcast architecture showing a scalable modular structure. Colored rectangles represent PNNs. ...
doi:10.1109/jlt.2014.2345652
fatcat:4mwhpxwj4ra3bow77rxxqr4wve
A Highly Efficient Distributed Deep Learning System for Automatic Speech Recognition
2019
Interspeech 2019
Further, we proposed a Hierarchical-ADPSGD (H-ADPSGD) system in which learners on the same computing node construct a super learner via a fast allreduce implementation, and super learners deploy ADPSGD ...
On a 64 Nvidia V100 GPU cluster connected via a 100Gb/s Ethernet network, our system is able to train SWB-2000 to reach a 7.6% WER on the Hub5-2000 Switchboard (SWB) test-set and a 13.2% WER on the Callhome ...
Fig. 3: System architecture for H-ADPSGD (learners in local Sync-Rings; GPUs in the ADPSGD-Ring). ...
doi:10.21437/interspeech.2019-2700
dblp:conf/interspeech/ZhangCFSKBK0P19
fatcat:gkkbrfpmsrbgtfvvuwoaglirje
Communication Performance of Mesh- and Ring-Based NoCs
2008
Seventh International Conference on Networking (ICN 2008)
Two potential topologies of networks on chip (NoC) are investigated, a ring-based network and 2D-mesh, due to their easy manufacturability on a chip. ...
As multi-core systems begin to appear, their possible applications, parallel performance and onchip interconnection networks have to be clarified, analyzed and optimized. ...
of Czech Republic, "Architectures of Embedded Systems Network", GA102/05/0467, and "Security-Oriented Research in Information Technology" Ministry of Education, MSM 0021630528. ...
doi:10.1109/icn.2008.53
dblp:conf/icn/Dvorak08
fatcat:wf6qbpooqbbs5d36pmklwvgitm
Software Libraries for Linear Algebra Computations on High Performance Computers
1995
SIAM Review
Block-partitioned versions of the Cholesky and LU factorizations are presented, and optimization issues associated with the implementation of the LU factorization algorithm on distributed memory concurrent ...
These block operations can be optimized for each architecv ...
Acknowledgments The authors are grateful for the comments and suggestions of the anonymous referees. ...
doi:10.1137/1037042
fatcat:zuhqkhn2tbbfhgris6os2t6l7y
Design of Low Energy, High Performance Synchronous and Asynchronous 64-Point FFT
2013
Design, Automation & Test in Europe Conference & Exhibition (DATE), 2013
A case study exploring multi-frequency design is presented for a low energy and high performance FFT circuit implementation. An FFT architecture with concurrent data stream computation is selected. ...
Asynchronous and synchronous implementations of a 16-point and a 64-point FFT circuit were designed and compared for energy, performance and area. ...
A primary aspect of optimizing performance of an asynchronous architecture is to calculate the critical paths and focus on those. ...
doi:10.7873/date.2013.062
dblp:conf/date/LeeVTS13
fatcat:utnv7racm5gerpuib2qfna5tgm
DeepScaffold: a comprehensive tool for scaffold-based de novo drug discovery using deep learning
[article]
2019
arXiv
pre-print
skeletons, as well as scaffolds with specifications on side-chain properties. ...
The model can generalize the learned chemical rules of adding atoms and bonds to a given scaffold. ...
MLP broadcast is a linear layer with BN-ReLU-Linear architecture. • Gathering: The information broadcasted to each edge is again gathered to each node. ...
arXiv:1908.07209v4
fatcat:vbp2yoo5pjhufi3i5t4sym23k4
Parallel computer vision on a reconfigurable multiprocessor network
1997
IEEE Transactions on Parallel and Distributed Systems
The architecture is shown to contain 2D mesh topologies of varying sizes and also a single one-factor of the Boolean hypercube in any given configuration. ...
A large class of algorithms for the 2D mesh and the Boolean n-cube are shown to map efficiently on the proposed architecture without loss of performance. ...
Sarmad Abbasi for his comments and discussions on some of the algorithms and the proofs of some of the theorems in this paper. ...
doi:10.1109/71.584095
fatcat:o7q2qhadlzaljiu4ehukq3fw7y
Towards Parallel Computing on the Internet: Applications, Architectures, Models and Programming Tools
[article]
2006
arXiv
pre-print
In this survey we cover the three fundamental aspects -- application, architecture and model, and we show how they have been developed over the last decade. ...
A number of parallel computing models exist that address this for traditional parallel architectures, and there are a number of emerging models that attempt to do this for large scale Internet-based systems ...
: Broadcast and summation. ...
arXiv:cs/0612105v2
fatcat:cgttdbvuurbvbb2zjxqz6h5ehy
Amplitude Phase Shift Keying Constellation Design and its Applications to Satellite Digital Video Broadcasting
[chapter]
2010
Digital Video
of the relative radius and phase shift of each -th ring with respect to the inner ring to the optimized ones found in the equiprobable case (see Section 3.2 and Tables 1-3), a new constellation design ...
To this end, assuming equiprobable constellation points on each -th ring which allows different a priori probabilities on different rings, a new APSK constellation design optimization problem is formulated ...
techniques making use of MIMO, hierarchical modulation and lossy compression. ...
doi:10.5772/8042
fatcat:unej7mdy3najzn33a5qg67s5de
Exploring Multi-dimensional Hierarchical Network Topologies for Efficient Distributed Training of Trillion Parameter DL Models
[article]
2021
arXiv
pre-print
Further, we address different challenges of the DNN training on hierarchical networks. ...
DNN sizes and training samples are constantly growing, making training of such workloads more challenging. Distributed training is a solution to reduce the training time. ...
All-Gather, on the other hand, broadcasts data residing on each NPU to all other NPUs. ...
arXiv:2109.11762v1
fatcat:52aunlyalba7dfe3jkl23eyxle
Design of a large-scale Gbit/s MAN using a cyclic reservation-based MAC protocol
2000
Journal of Systems Architecture
In this paper, a large-scale Gbit/s metropolitan area network (MAN) based on hierarchical ring topologies has been investigated. ...
The network is constituted by backbone and local rings, which are connected by bridges. ...
Network architecture The single hierarchical ring shown in Fig. 1 consists of a number of local rings, which can connect a large number of user nodes. ...
doi:10.1016/s1383-7621(00)00013-8
fatcat:rl7fx6ccknf6fp2xzdsfyopxfe
DataScalar: A memory-centric approach to computing
1999
Journal of Systems Architecture
All processors run the same program, broadcasting operands they own to the other processors when needed, and performing any tasks that can be accomplished entirely on-chip without off-chip communication ...
Each node accesses operands in its fast local memory and broadcasts them to the other nodes. ...
Acknowledgments The authors thank Alain Kägi, Scott Breach, Babak Falsafi, Steve Reinhardt, and T.N. Vijaykumar for their helpful discussions and intellectual contributions to this work. ...
doi:10.1016/s1383-7621(98)00048-4
fatcat:zjhccit4w5g4pdkibko5tsiqbi
UNION: A unified inter/intra-chip optical network for chip multiprocessors
2010
2010 IEEE/ACM International Symposium on Nanoscale Architectures
Jointly designing communication architectures for both interchip and intrachip communication could, however, potentially yield better solutions. ...
Traditionally, to maximize design flexibility, interchip and intrachip communication architectures are separately designed under different constraints. ...
[24] proposed Firefly architecture as a hybrid hierarchical on-chip network. ...
doi:10.1109/nanoarch.2010.5510930
dblp:conf/nanoarch/WuY0LNWX10
fatcat:65ctsqm2qbcavofpqodtfznwcq