
Page 2041 of Mathematical Reviews Vol. , Issue 99c [page]

1991 Mathematical Reviews  
Although useful heuristics have been proposed for such systems, we present the first exact analysis of broadcast and summation on hierarchical ring architectures.” 99c:68020 68M10 Misic, Jelena (PRC-HKST-C  ...  The computations of both our potential function and our packet-canceling policy are totally local in nature.” 99c:68019 68M10 Michail, Amir (1-WA-CE; Seattle, WA) Optimal broadcast and summation on hierarchical  ... 

A Unified Architecture for Accelerating Distributed DNN Training in Heterogeneous GPU/CPU Clusters

Yimin Jiang, Yibo Zhu, Chang Lan, Bairen Yi, Yong Cui, Chuanxiong Guo
2020 USENIX Symposium on Operating Systems Design and Implementation  
Summation Service can be accelerated by AVX instructions and can be efficiently run on CPUs, while DNN model-related optimizer algorithms are run on GPUs for computation acceleration.  ...  It introduces a Summation Service abstraction for aggregating gradients, which is common for all the optimizers.  ...  We find that running the full optimizers on CPU can be a bottleneck. We divide the computation of optimizers and only put summation on CPUs. We will elaborate the rationale of this design in §5.  ... 
dblp:conf/osdi/JiangZLYCG20 fatcat:udl2ksqorfetdaznv5jzhlonte

Broadcast and Weight: An Integrated Network For Scalable Photonic Spike Processing

Alexander N. Tait, Mitchell A. Nahmias, Bhavin J. Shastri, Paul R. Prucnal
2014 Journal of Lightwave Technology  
We propose an on-chip optical architecture to support massive parallel communication among high-performance spiking laser neurons.  ...  Broadcast-and-weight is a new approach for combining neuromorphic processing and optoelectronic physics, a pairing that is found to yield a variety of advantageous features.  ...  Hierarchical organization of the waveguide broadcast architecture showing a scalable modular structure. Colored rectangles represent PNNs.  ... 
doi:10.1109/jlt.2014.2345652 fatcat:4mwhpxwj4ra3bow77rxxqr4wve

A Highly Efficient Distributed Deep Learning System for Automatic Speech Recognition

Wei Zhang, Xiaodong Cui, Ulrich Finkler, George Saon, Abdullah Kayi, Alper Buyuktosunoglu, Brian Kingsbury, David Kung, Michael Picheny
2019 Interspeech 2019  
Further, we proposed a Hierarchical-ADPSGD (H-ADPSGD) system in which learners on the same computing node construct a super learner via a fast allreduce implementation, and super learners deploy ADPSGD  ...  On a 64 Nvidia V100 GPU cluster connected via a 100Gb/s Ethernet network, our system is able to train SWB-2000 to reach a 7.6% WER on the Hub5-2000 Switchboard (SWB) test-set and a 13.2% WER on the Callhome  ...  Fig. 3: System architecture for H-ADPSGD.  ... 
doi:10.21437/interspeech.2019-2700 dblp:conf/interspeech/ZhangCFSKBK0P19 fatcat:gkkbrfpmsrbgtfvvuwoaglirje

Communication Performance of Mesh- and Ring-Based NoCs

Vaclav Dvorak
2008 Seventh International Conference on Networking (ICN 2008)  
Two potential topologies of networks on chip (NoC) are investigated, a ring-based network and 2D-mesh, due to their easy manufacturability on a chip.  ...  As multi-core systems begin to appear, their possible applications, parallel performance and on-chip interconnection networks have to be clarified, analyzed and optimized.  ...  of Czech Republic, "Architectures of Embedded Systems Network", GA102/05/0467, and "Security-Oriented Research in Information Technology" Ministry of Education, MSM 0021630528.  ... 
doi:10.1109/icn.2008.53 dblp:conf/icn/Dvorak08 fatcat:wf6qbpooqbbs5d36pmklwvgitm

Software Libraries for Linear Algebra Computations on High Performance Computers

Jack J. Dongarra, David W. Walker
1995 SIAM Review  
Block-partitioned versions of the Cholesky and LU factorizations are presented, and optimization issues associated with the implementation of the LU factorization algorithm on distributed memory concurrent  ...  These block operations can be optimized for each architecv  ...  Acknowledgments The authors are grateful for the comments and suggestions of the anonymous referees.  ... 
doi:10.1137/1037042 fatcat:zuhqkhn2tbbfhgris6os2t6l7y

Design of Low Energy, High Performance Synchronous and Asynchronous 64-Point FFT

William Lee, Vikas S. Vij, Anthony R. Thatcher, Kenneth S. Stevens
2013 Design, Automation & Test in Europe Conference & Exhibition (DATE), 2013  
A case study exploring multi-frequency design is presented for a low energy and high performance FFT circuit implementation. An FFT architecture with concurrent data stream computation is selected.  ...  Asynchronous and synchronous implementations of a 16-point and a 64-point FFT circuit were designed and compared for energy, performance and area.  ...  A primary aspect of optimizing performance of an asynchronous architecture is to calculate the critical paths and focus on those.  ... 
doi:10.7873/date.2013.062 dblp:conf/date/LeeVTS13 fatcat:utnv7racm5gerpuib2qfna5tgm

DeepScaffold: a comprehensive tool for scaffold-based de novo drug discovery using deep learning [article]

Yibo Li, Jianxing Hu, Yanxing Wang, Jielong Zhou, Liangren Zhang and Zhenming Liu
2019 arXiv   pre-print
skeletons, as well as scaffolds with specifications on side-chain properties.  ...  The model can generalize the learned chemical rules of adding atoms and bonds to a given scaffold.  ...  MLP broadcast is a linear layer with BN-ReLU-Linear architecture. • Gathering: The information broadcasted to each edge is again gathered to each node.  ... 
arXiv:1908.07209v4 fatcat:vbp2yoo5pjhufi3i5t4sym23k4

Parallel computer vision on a reconfigurable multiprocessor network

S.M. Bhandarkar, H.R. Arabnia
1997 IEEE Transactions on Parallel and Distributed Systems  
The architecture is shown to contain 2D mesh topologies of varying sizes and also a single one-factor of the Boolean hypercube in any given configuration.  ...  A large class of algorithms for the 2D mesh and the Boolean n-cube are shown to map efficiently on the proposed architecture without loss of performance.  ...  Sarmad Abbasi for his comments and discussions on some of the algorithms and the proofs of some of the theorems in this paper.  ... 
doi:10.1109/71.584095 fatcat:o7q2qhadlzaljiu4ehukq3fw7y

Towards Parallel Computing on the Internet: Applications, Architectures, Models and Programming Tools [article]

Elankovan Sundararajan, Aaron Harwood
2006 arXiv   pre-print
In this survey we cover the three fundamental aspects -- application, architecture and model, and we show how they have been developed over the last decade.  ...  A number of parallel computing models exist that address this for traditional parallel architectures, and there are a number of emerging models that attempt to do this for large scale Internet-based systems  ...  : Broadcast and summation.  ... 
arXiv:cs/0612105v2 fatcat:cgttdbvuurbvbb2zjxqz6h5ehy

Amplitude Phase Shift Keying Constellation Design and its Applications to Satellite Digital Video Broadcasting [chapter]

Konstantinos P., Riccardo De, Nader Alagha, Alfonso Martinez, Albert Guillén i Fàbregas
2010 Digital Video  
of the relative radius and phase shift of each -th ring with respect to the inner ring to the optimized ones found in the equiprobable case (see Section 3.2 and Tables 1-3), a new constellation design  ...  To this end, assuming equiprobable constellation points on each -th ring which allows different a priori probabilities on different rings, a new APSK constellation design optimization problem is formulated  ...  techniques making use of MIMO, hierarchical modulation and lossy compression.  ... 
doi:10.5772/8042 fatcat:unej7mdy3najzn33a5qg67s5de

Exploring Multi-dimensional Hierarchical Network Topologies for Efficient Distributed Training of Trillion Parameter DL Models [article]

William Won, Saeed Rashidi, Sudarshan Srinivasan, Tushar Krishna
2021 arXiv   pre-print
Further, we address different challenges of the DNN training on hierarchical networks.  ...  DNN sizes and training samples are constantly growing, making training of such workloads more challenging. Distributed training is a solution to reduce the training time.  ...  All-Gather, on the other hand, broadcasts data residing on each NPU to all other NPUs.  ... 
arXiv:2109.11762v1 fatcat:52aunlyalba7dfe3jkl23eyxle

Design of a large-scale Gbit/s MAN using a cyclic reservation-based MAC protocol

Wen-Fong Wang, Wen-Shyang Hwang, Jun-Yao Wang
2000 Journal of Systems Architecture  
In this paper, a large-scale Gbit/s metropolitan area network (MAN) based on hierarchical ring topologies has been investigated.  ...  The network is constituted by backbone and local rings, which are connected by bridges.  ...  Network architecture The single hierarchical ring shown in Fig. 1 consists of a number of local rings, which can connect a large number of user nodes.  ... 
doi:10.1016/s1383-7621(00)00013-8 fatcat:rl7fx6ccknf6fp2xzdsfyopxfe

DataScalar: A memory-centric approach to computing

Stefanos Kaxiras, Doug Burger, James R. Goodman
1999 Journal of Systems Architecture  
All processors run the same program, broadcasting operands they own to the other processors when needed, and performing any tasks that can be accomplished entirely on-chip without off-chip communication  ...  Each node accesses operands in its fast local memory and broadcasts them to the other nodes.  ...  Acknowledgments The authors thank Alain Kägi, Scott Breach, Babak Falsafi, Steve Reinhardt, and T.N. Vijaykumar for their helpful discussions and intellectual contributions to this work.  ... 
doi:10.1016/s1383-7621(98)00048-4 fatcat:zjhccit4w5g4pdkibko5tsiqbi

UNION: A unified inter/intra-chip optical network for chip multiprocessors

Xiaowen Wu, Yaoyao Ye, Wei Zhang, Weichen Liu, Mahdi Nikdast, Xuan Wang, Jiang Xu
2010 2010 IEEE/ACM International Symposium on Nanoscale Architectures  
Jointly designing communication architectures for both interchip and intrachip communication could, however, potentially yield better solutions.  ...  Traditionally, to maximize design flexibility, interchip and intrachip communication architectures are separately designed under different constraints.  ...  [24] proposed Firefly architecture as a hybrid hierarchical on-chip network.  ... 
doi:10.1109/nanoarch.2010.5510930 dblp:conf/nanoarch/WuY0LNWX10 fatcat:65ctsqm2qbcavofpqodtfznwcq