A simplified design strategy for mapping image processing algorithms on a SIMD torus

Guna Seetharaman
1995 Theoretical Computer Science  
It is proposed to enhance and simplify the programming of a two dimensional (2-D) torus (and mesh) connected SIMD array of simple processing elements (PEs) by introducing two dedicated communication registers  ...  A new SIMD algorithm to transpose a matrix using only two buffers at each PE is described.  ...  The importance of arbitrarily many one-to-one communication between processors for many image processing tasks has been emphasized. Algorithm for transposing a matrix has been described.  ... 
doi:10.1016/0304-3975(95)95694-f fatcat:qocx77hozfcm5dmigtqaor7kke

Page 1121 of Mathematical Reviews Vol. , Issue 96b [page]

1996 Mathematical Reviews  
Gilbert Crombez (B-GHNT-A; Ghent) 96b:68185 68U10 68M10 68Q22 68Q35 Seetharaman, Guna (1-SWLA-CV; Lafayette, LA) A simplified design strategy for mapping image processing algorithms on a SIMD torus.  ...  Summary: “We propose to enhance and simplify the programming of a two-dimensional (2-D) torus (and mesh) connected SIMD array of simple processing elements (PEs) by introducing two dedicated communication  ... 

BlueGene/L applications: Parallelism On a Massive Scale

Bronis R. de Supinski, Martin Schulz, Vasily V. Bulatov, William Cabot, Bor Chan, Andrew W. Cook, Erik W. Draeger, James N. Glosli, Jeffrey A. Greenough, Keith Henderson, Alison Kubota, Steve Louis (+30 others)
2008 The international journal of high performance computing applications  
of the BG/L design.  ...  BG/L has led the last four Top500 lists with a Linpack rate of 280.6 Tflop/s for the full machine installed at LLNL and is expected to remain the fastest computer in the next few editions.  ...  Custom torus mappings were used to improve performance on 8K and 32K nodes.  ... 
doi:10.1177/1094342007085025 fatcat:s5h4ai3mvvciploic4ljt7n434

Using emulations to enhance the performance of parallel architectures

B. Obrenic, M.C. Herbordt, A.L. Rosenberg, C.C. Weems
1999 IEEE Transactions on Parallel and Distributed Systems  
The vehicle for this demonstration is a suite of algorithms that endow an N-processor bit-serial processor array e with a “meta-instruction” GAUGE k, which (logically) reconfigures e into an N/k-processor  ...  We instantiate our technique in detail for arrays based on topologies with quite disparate characteristics: the hypercube, the de Bruijn network, and a genre of mesh with reconfigurable buses.  ...  Herbordt The authors are grateful to Fred Annexstein and Marc Baumslag for valuable discussions concerning the algorithms presented herein and to the anonymous reviewers for many suggestions concerning  ... 
doi:10.1109/71.808155 fatcat:ywjp6sqk5jai7cswtp7gc7xezq

Customized High Performance and Energy Efficient Communication Networks for AI Chips

Wei Gao, Pingqiang Zhou
2019 IEEE Access  
The convolutional and deep neural networks are prevalent machine learning algorithms for real-world applications.  ...  This paper introduces the communication network in AI chips and the strategy of mapping neural network to chips with the extensible hierarchical architecture.  ...  For instance, in [9], it is reported that compared with a CPU (SIMD), the GPU can achieve one to two orders of speedup; while the fully-customized ASIC accelerator DaDiannao can further improve the processing  ... 
doi:10.1109/access.2019.2916338 fatcat:oepm3giaz5d3hnusryrgfaj2ze

Systolic opportunities for multidimensional data streams

S.M. Chai, D.S. Wills
2002 IEEE Transactions on Parallel and Distributed Systems  
Performance comparisons for a set of signal processing algorithms show that systolic arrays that consider planar data streams in the design process are up to three times faster than traditional arrays.  ...  A synthesis technique using dependence graphs, data partitioning, and computation mapping is developed to handle planar data streams and to systematically design arrays with area I/O.  ...  Multimedia enhanced processors with SIMD extensions have been designed to harness the data parallelism in the image processing chain.  ... 
doi:10.1109/71.995819 fatcat:vxlbao3ggvekpi3o4rt6evovb4

AN OVERVIEW OF THE BLUEGENE/L SYSTEM SOFTWARE ORGANIZATION

GEORGE ALMÁSI, RALPH BELLOFATTO, JOSÉ BRUNHEROTO, CĂLIN CAŞCAVAL, JOSÉ G. CASTAÑOS, PAUL CRUMLEY, C. CHRISTOPHER ERWAY, DEREK LIEBER, XAVIER MARTORELL, JOSÉ E. MOREIRA, RAMENDRA SAHOO, ALDA SANOMIYA (+2 others)
2003 Parallel Processing Letters  
The Blue Gene/L supercomputer will use system-on-a-chip integration and a highly scalable cellular architecture.  ...  In this paper we present our vision of a software architecture that faces up to these challenges, and the simulation framework that we have used for our experiments.  ...  We use one boot image for all the compute nodes and another boot image for all the I/O nodes.  ... 
doi:10.1142/s0129626403001513 fatcat:2whcow3pfrerbogoz5icecmhpy

An Overview of the Blue Gene/L System Software Organization [chapter]

George Almási, Ralph Bellofatto, José Brunheroto, Călin Caşcaval, José G. Castaños, Luis Ceze, Paul Crumley, C. Christopher Erway, Joseph Gagliano, Derek Lieber, Xavier Martorell, José E. Moreira (+2 others)
2003 Lecture Notes in Computer Science  
The Blue Gene/L supercomputer will use system-on-a-chip integration and a highly scalable cellular architecture.  ...  In this paper we present our vision of a software architecture that faces up to these challenges, and the simulation framework that we have used for our experiments.  ...  We use one boot image for all the compute nodes and another boot image for all the I/O nodes.  ... 
doi:10.1007/978-3-540-45209-6_79 fatcat:3z2vgesyhvdkjdgqyks7gzxvke

Massively parallel computing: A Sandia perspective

David E Womble, Sudip S Dosanjh, Bruce Hendrickson, Michael A Heroux, Steve J Plimpton, James L Tomkins, David S Greenberg
1999 Parallel Computing  
This paper provides a perspective on the state of the field, colored by the authors' experiences using large scale parallel machines at Sandia National Laboratories.  ...  We address trends in hardware, system software and algorithms, and we also offer our view of the forces shaping the parallel computing industry.  ...  In short the debugger must become a distributed program. The partition model provides a simple strategy for the debugger.  ... 
doi:10.1016/s0167-8191(99)00068-x fatcat:fohy7xwhffecbldzj4pwmwr5k4

Realizing reconfigurable mesh algorithms on softcore arrays

Heiner Giefers, Marco Platzner
2008 2008 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation  
The reconfigurable mesh is a very popular model for massively parallel computation for which a large body of algorithms with exceptionally low runtime complexities exists.  ...  In this paper, we present the mapping of reconfigurable mesh algorithms to softcore arrays.  ...  There is a wide range of application domains for which reconfigurable mesh algorithms exist, including arithmetic, sorting, selection, graph algorithms, computational geometry and image processing.  ... 
doi:10.1109/icsamos.2008.4664845 dblp:conf/samos/GiefersP08 fatcat:353d4qdnxjgafpgkes6vymuzpm

EARLY EXPERIENCES WITH THE 360TF IBM BLUE GENE/L PLATFORM

G. BHANOT, J. M. DENNIS, J. EDWARDS, W. GRABOWSKI, M. GUPTA, K. JORDAN, R. D. LOFT, J. SEXTON, A. ST-CYR, S. J. THOMAS, H. M. TUFO, T. VORAN (+2 others)
2008 International Journal of Computational Methods  
The 3D moist primitive equations are solved on the cubed-sphere with a hybrid pressure η vertical coordinate using an Emanuel convective parameterization for moist processes.  ...  Results obtained on a 32 rack Blue Gene/L system (65,536 processors, 183.5 TeraFlops peak) show sustained performance of 8.0 TeraFlops on 32,768 processors for the moist Held-Suarez test problem in coprocessor  ...  BG/L uses this dense packaging along with the system-on-a-chip design of the PowerPC node to integrate a torus interconnect and dedicated networks for collective operations, and uses a novel software architecture  ... 
doi:10.1142/s0219876208001443 fatcat:wkhqlayfjreytnjgfb7y7vyrxa

Least-squares fitting of analytic primitives on a GPU

Meghashyam Panyam Mohan Ram, Thomas R. Kurfess, Thomas M. Tucker
2008 Journal of manufacturing systems  
The least squares fit algorithms for the circle, sphere, cylinder, plane, cone and torus have been implemented on the GPU using NVIDIA's Compute Unified Device Architecture (CUDA).  ...  The combination of large stream of data and the need for 3D vector operations make the primitive shape fit algorithms excellent candidates for processing via a GPU.  ...  It is a very important aspect of the design and manufacturing process.  ... 
doi:10.1016/j.jmsy.2008.07.004 fatcat:eoeg4oejhbhulgxc4pfabcwz3i

A methodology and a case-study for Network-on-Chip based MP-SoC architectures

Sergio V. Tota, Mario R. Casu, Paolo Motto, Massimo Ruo Roch, Maurizio Zamboni
2007 Proceedings of the Second International Conference on Nano-Networks  
The proposed design flow has been used in the implementation of a multiprocessor Network-on-Chip based system, the NoCRay graphic accelerator.  ...  The system uses 8 Tensilica LX processors and has been physically implemented on a Xilinx Virtex-4 LX-160 FPGA reporting a 17.3M equivalent gate-count.  ...  The communication between processors is based on a Network-on-Chip approach with a folded-torus topology and a switch based on the deflection-routing algorithm [14].  ... 
doi:10.4108/icst.nanonet2007.2122 dblp:conf/nanonet/TotaCMRZ07 fatcat:s46jofenizbxnjdebafoiyvhay

Accelerating GAN training using highly parallel hardware on public cloud

Renato Cardoso, Dejan Golubovic, Ignacio Peluaga Lozada, Ricardo Rocha, João Fernandes, Sofia Vallecorsa, C. Biscarat, S. Campana, B. Hegner, S. Roiser, C.I. Rovelli, G.A. Stewart
2021 EPJ Web of Conferences  
More specifically, we parallelize the training process on multiple GPUs and Google Tensor Processing Units (TPU) and we compare two algorithms: the TensorFlow built-in logic and a custom loop, optimised  ...  This work explores different types of cloud services to train a Generative Adversarial Network (GAN) in a parallel environment, using TensorFlow data parallel strategy.  ...  models are being tested as fast alternatives to Monte Carlo based simulation and anomaly detection algorithms are being explored to design searches for rare new-physics processes.  ... 
doi:10.1051/epjconf/202125102073 fatcat:m3wphsqmyzdzzamwqjthsiatwu

Entropy thresholding and its parallel algorithm on the reconfigurable array of processors with wider bus networks

Shung-Shing Lee, Shi-Jinn Horng, Horng-Ren Tsai
1999 IEEE Transactions on Image Processing  
Second, we derive a constant time parallel algorithm to solve this problem on the reconfigurable array of processors with wider bus networks (RAPWBN).  ...  Instead of increasing the number of processors, we extend the number of buses to increase the power of a parallel processing system.  ...  It fully depends on the algorithm designed and the system architecture proposed. Mesh-connected computers (MCC's) are one well-known example of a parallel processing system [20] .  ... 
doi:10.1109/83.784435 pmid:18267540 fatcat:oikc4ygfp5gnta6ss3xfpejz4m