A simplified design strategy for mapping image processing algorithms on a SIMD torus
1995
Theoretical Computer Science
It is proposed to enhance and simplify the programming of a two dimensional (2-D) torus (and mesh) connected SIMD array of simple processing elements (PEs) by introducing two dedicated communication registers ...
A new SIMD algorithm to transpose a matrix using only two buffers at each PE is described. ...
The importance of arbitrarily many one-to-one communications between processors for many image processing tasks has been emphasized. An algorithm for transposing a matrix has been described. ...
doi:10.1016/0304-3975(95)95694-f
fatcat:qocx77hozfcm5dmigtqaor7kke
Page 1121 of Mathematical Reviews Vol. , Issue 96b
[page]
1996
Mathematical Reviews
Gilbert Crombez (B-GHNT-A; Ghent)
96b:68185 68U10 68M10 68Q22 68Q35
Seetharaman, Guna (1-SWLA-CV; Lafayette, LA)
A simplified design strategy for mapping image processing algorithms on a SIMD torus. ...
Summary: “We propose to enhance and simplify the programming of a two-dimensional (2-D) torus (and mesh) connected SIMD array of simple processing elements (PEs) by introducing two dedicated communication ...
BlueGene/L applications: Parallelism On a Massive Scale
2008
The international journal of high performance computing applications
of the BG/L design. ...
BG/L has led the last four Top500 lists with a Linpack rate of 280.6 Tflop/s for the full machine installed at LLNL and is expected to remain the fastest computer in the next few editions. ...
Custom torus mappings were used to improve performance on 8K and 32K nodes. ...
doi:10.1177/1094342007085025
fatcat:s5h4ai3mvvciploic4ljt7n434
Using emulations to enhance the performance of parallel architectures
1999
IEEE Transactions on Parallel and Distributed Systems
The vehicle for this demonstration is a suite of algorithms that endow an N-processor bit-serial processor array A with a “meta-instruction” GAUGE k, which (logically) reconfigures A into an N/k-processor ...
We instantiate our technique in detail for arrays based on topologies with quite disparate characteristics: the hypercube, the de Bruijn network, and a genre of mesh with reconfigurable buses. ...
The authors are grateful to Fred Annexstein and Marc Baumslag for valuable discussions concerning the algorithms presented herein and to the anonymous reviewers for many suggestions concerning ...
doi:10.1109/71.808155
fatcat:ywjp6sqk5jai7cswtp7gc7xezq
Customized High Performance and Energy Efficient Communication Networks for AI Chips
2019
IEEE Access
The convolutional and deep neural networks are prevalent machine learning algorithms for real-world applications. ...
This paper introduces the communication network in AI chips and the strategy of mapping neural network to chips with the extensible hierarchical architecture. ...
For instance, in [9], it is reported that compared with a CPU (SIMD), the GPU can achieve one to two orders of speedup; while the fully-customized ASIC accelerator DaDianNao can further improve the processing ...
doi:10.1109/access.2019.2916338
fatcat:oepm3giaz5d3hnusryrgfaj2ze
Systolic opportunities for multidimensional data streams
2002
IEEE Transactions on Parallel and Distributed Systems
Performance comparisons for a set of signal processing algorithms show that systolic arrays that consider planar data streams in the design process are up to three times faster than traditional arrays. ...
A synthesis technique using dependence graphs, data partitioning, and computation mapping is developed to handle planar data streams and to systematically design arrays with area I/O. ...
Multimedia enhanced processors with SIMD extensions have been designed to harness the data parallelism in the image processing chain. ...
doi:10.1109/71.995819
fatcat:vxlbao3ggvekpi3o4rt6evovb4
AN OVERVIEW OF THE BLUEGENE/L SYSTEM SOFTWARE ORGANIZATION
2003
Parallel Processing Letters
The Blue Gene/L supercomputer will use system-on-a-chip integration and a highly scalable cellular architecture. ...
In this paper we present our vision of a software architecture that faces up to these challenges, and the simulation framework that we have used for our experiments. ...
We use one boot image for all the compute nodes and another boot image for all the I/O nodes. ...
doi:10.1142/s0129626403001513
fatcat:2whcow3pfrerbogoz5icecmhpy
An Overview of the Blue Gene/L System Software Organization
[chapter]
2003
Lecture Notes in Computer Science
The Blue Gene/L supercomputer will use system-on-a-chip integration and a highly scalable cellular architecture. ...
In this paper we present our vision of a software architecture that faces up to these challenges, and the simulation framework that we have used for our experiments. ...
We use one boot image for all the compute nodes and another boot image for all the I/O nodes. ...
doi:10.1007/978-3-540-45209-6_79
fatcat:3z2vgesyhvdkjdgqyks7gzxvke
Massively parallel computing: A Sandia perspective
1999
Parallel Computing
This paper provides a perspective on the state of the field, colored by the authors' experiences using large scale parallel machines at Sandia National Laboratories. ...
We address trends in hardware, system software and algorithms, and we also offer our view of the forces shaping the parallel computing industry. ...
In short the debugger must become a distributed program. The partition model provides a simple strategy for the debugger. ...
doi:10.1016/s0167-8191(99)00068-x
fatcat:fohy7xwhffecbldzj4pwmwr5k4
Realizing reconfigurable mesh algorithms on softcore arrays
2008
2008 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation
The reconfigurable mesh is a very popular model for massively parallel computation for which a large body of algorithms with exceptionally low runtime complexities exists. ...
In this paper, we present the mapping of reconfigurable mesh algorithms to softcore arrays. ...
There is a wide range of application domains for which reconfigurable mesh algorithms exist, including arithmetic, sorting, selection, graph algorithms, computational geometry and image processing. ...
doi:10.1109/icsamos.2008.4664845
dblp:conf/samos/GiefersP08
fatcat:353d4qdnxjgafpgkes6vymuzpm
EARLY EXPERIENCES WITH THE 360TF IBM BLUE GENE/L PLATFORM
2008
International Journal of Computational Methods
The 3D moist primitive equations are solved on the cubed-sphere with a hybrid pressure η vertical coordinate using an Emanuel convective parameterization for moist processes. ...
Results obtained on a 32 rack Blue Gene/L system (65,536 processors, 183.5 TeraFlops peak) show sustained performance of 8.0 TeraFlops on 32,768 processors for the moist Held-Suarez test problem in coprocessor ...
BG/L uses this dense packaging along with the system-on-a-chip design of the PowerPC node to integrate a torus interconnect and dedicated networks for collective operations, and uses a novel software architecture ...
doi:10.1142/s0219876208001443
fatcat:wkhqlayfjreytnjgfb7y7vyrxa
Least-squares fitting of analytic primitives on a GPU
2008
Journal of manufacturing systems
The least squares fit algorithms for the circle, sphere, cylinder, plane, cone and torus have been implemented on the GPU using NVIDIA's Compute Unified Device Architecture (CUDA). ...
The combination of large stream of data and the need for 3D vector operations make the primitive shape fit algorithms excellent candidates for processing via a GPU. ...
It is a very important aspect of the design and manufacturing process. ...
doi:10.1016/j.jmsy.2008.07.004
fatcat:eoeg4oejhbhulgxc4pfabcwz3i
A methodology and a case-study for Network-on-Chip based MP-SoC architectures
2007
Proceedings of the Second International Conference on Nano-Networks
The proposed design flow has been used in the implementation of a multiprocessor Network-on-Chip based system, the NoCRay graphic accelerator. ...
The system uses 8 Tensilica LX processors and has been physically implemented on a Xilinx Virtex-4 LX-160 FPGA reporting a 17.3M equivalent gate-count. ...
The communication between processors is based on a Network-on-Chip approach with a folded-torus topology and a switch based on the deflection-routing algorithm [14]. ...
doi:10.4108/icst.nanonet2007.2122
dblp:conf/nanonet/TotaCMRZ07
fatcat:s46jofenizbxnjdebafoiyvhay
Accelerating GAN training using highly parallel hardware on public cloud
2021
EPJ Web of Conferences
More specifically, we parallelize the training process on multiple GPUs and Google Tensor Processing Units (TPU) and we compare two algorithms: the TensorFlow built-in logic and a custom loop, optimised ...
This work explores different types of cloud services to train a Generative Adversarial Network (GAN) in a parallel environment, using the TensorFlow data parallel strategy. ...
models are being tested as fast alternatives to Monte Carlo based simulation and anomaly detection algorithms are being explored to design searches for rare new-physics processes. ...
doi:10.1051/epjconf/202125102073
fatcat:m3wphsqmyzdzzamwqjthsiatwu
Entropy thresholding and its parallel algorithm on the reconfigurable array of processors with wider bus networks
1999
IEEE Transactions on Image Processing
Second, we derive a constant time parallel algorithm to solve this problem on the reconfigurable array of processors with wider bus networks (RAPWBN). ...
Instead of increasing the number of processors, we extend the number of buses to increase the power of a parallel processing system. ...
It fully depends on the algorithm designed and the system architecture proposed. Mesh-connected computers (MCC's) are one well-known example of a parallel processing system [20]. ...
doi:10.1109/83.784435
pmid:18267540
fatcat:oikc4ygfp5gnta6ss3xfpejz4m