Threaded MPI programming model for the Epiphany RISC array processor
2015
Journal of Computational Science
Using MPI exploits the similarities between the Epiphany architecture and a networked parallel distributed cluster. ...
We present experimental results for matrix-matrix multiplication using MPI and highlight the importance of fast inter-core data transfers. ...
The 2D mesh topology of the RISC array network creates a device-scale architecture that resembles a classic parallel distributed cluster of serial processors, where the Message Passing Interface (MPI) ...
doi:10.1016/j.jocs.2015.04.023
fatcat:bmycj4ivzjbifkemmle24ggl7i
Custom FPGA-based soft-processors for sparse graph acceleration
2015
2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP)
We interconnect a 2D array of these lightweight processors with a packet-switched network-on-chip to enable fine-grained operand routing along the graph edges and provide custom send/receive instructions ...
ZC706 board (100 processor design) across a range of matrix datasets. ...
Using Boost pre-processor parameterization [10], we are able to generate multiple instances of the processor to build 2D meshes of required dimensions. • Instruction Memory and Execute Stage: We specify ...
doi:10.1109/asap.2015.7245698
dblp:conf/asap/Kapre15
fatcat:mqos2rxf4zdkxcsq2hf6q3xji4
Analysis of Partitioning Models and Metrics in Parallel Sparse Matrix-Vector Multiplication
[chapter]
2014
Lecture Notes in Computer Science
Our experiments show that the partitioning metrics influence the performance greatly in a distributed memory setting. ...
We carry out experiments with up to 512 processors and investigate the results with regression analysis. ...
In particular, the processor $P_k$ performs scalar multiply-add operations using local $a_{ij}$'s for which $\mu(x_j) = P_k$ and there is no a i with µ(x ) = $P_k$. ...
doi:10.1007/978-3-642-55195-6_16
fatcat:viwsc75mibb4vovzm4yvjvnlpi
Parallel Programming Model for the Epiphany Many-Core Coprocessor Using Threaded MPI
[article]
2015
arXiv
pre-print
We report benchmark results for the threaded MPI implementation of four algorithms (dense matrix-matrix multiplication, N-body particle interaction, a five-point 2D stencil update, and 2D FFT) and highlight ...
The Adapteva Epiphany many-core architecture comprises a 2D tiled mesh Network-on-Chip (NoC) of low-power RISC cores with minimal uncore functionality. ...
Army Research Laboratory-hosted Department of Defense Supercomputing Resource Center for its support of this work. ...
arXiv:1506.05442v1
fatcat:edidr7vxd5cglgeaieprywbdgm
Towards Structured Parallel Computing on Architecture-Independent Parallel Algorithm Design for Distributed-Memory Architectures
1996
Journal of computer and system sciences (Print)
) for the algorithm, and design of emulations of the virtual networks on physical networks. ...
In a paper by Gao, a general theory of portable optimality of parallel algorithms is presented. ...
I thank Maria Klawe and Nick Pippenger for valuable comments and for criticism on a draft of this paper. ...
doi:10.1006/jcss.1996.0053
fatcat:36p2jze2gzee7nibaxrja7fl7m
Matrix decomposition on the star graph
1997
IEEE Transactions on Parallel and Distributed Systems
computation complexity and uses O(Nn) communication time to decompose a matrix of order N on a star graph of dimension n, where N ≥ (n − 1)!. ...
pivot row and multipliers column broadcasts. ...
In this approach, most of the existing matrix distribution methods can be viewed as instances of a more general distribution function called 2D matrix distribution [3], [4]. ...
doi:10.1109/71.605767
fatcat:poyomvnha5dt3kz4tatolf7qm4
The design and implementation of the TRIPS prototype chip
2005
2005 IEEE Hot Chips XVII Symposium (HCS)
Vector add (limited by load/store bandwidth) | 74 | 6.51 | 3.04 | vadd
Secure hash (mostly sequential algorithm) | 80 | 2.10 | 2.28 | sha
Matrix multiply | 72 | 4.05 | 1.68 | matrix
2D discrete cosine ...
8 GB of SDRAM (NUMA)
PPC 440GP
FPGA
PowerPC 440GP used as control processor and host interface
2D chip-to-chip (C2C) network connects multiple TRIPS chips
Intended for exploration of parallel ...
doi:10.1109/hotchips.2005.7476592
fatcat:nqakqbamazeidbhg6klqdtkguy
JAMPI: Efficient Matrix Multiplication in Spark Using Barrier Execution Mode
2020
Big Data and Cognitive Computing
By combining distributed message passing (using asynchronous network IO), OpenJDK's new auto-vectorization and Spark's barrier execution mode, we can add non-map/reduce-based algorithms, such as Cannon's ...
The new barrier mode in Apache Spark allows for embedding distributed deep learning training as a Spark stage to simplify the distributed training workflow. ...
It is known, for instance, that the memory requirement for each processor increases as we add processors to a computation. ...
doi:10.3390/bdcc4040032
fatcat:tnuh62oddbdp7gubaz3fch5jdm
A Survey on Dynamically Reconfigurable Processors
2006
IEICE transactions on communications
Hideharu Amano, Member. Summary: Dynamically reconfigurable processors consist of an array of processing elements whose functions and interconnections can be dynamically changed. 9 commercial ...
systems are picked up, and their array structures, processing elements and interconnection architectures are classified. ...
Name | Interconnect
CS2112 | Tile base, 2D-bus
DAPDNA-2 | Segment base, 2D-bus
FE-GA | 2D-mesh direct, Crossbar for memories
Cluster machine | 3-stage switch
DRP-1 | Tile base, 2D-bus
Kilocore KC256 | Crossbar ...
doi:10.1093/ietcom/e89-b.12.3179
fatcat:z7uep5s5jfehtkgepanutfwgye
Layer Based Partition for Matrix Multiplication on Heterogeneous Processor Platforms
[article]
2018
arXiv
pre-print
In this paper, we propose a new method that schedules matrix multiplication on heterogeneous processor platforms with the mixed co-design goal of minimizing the total communication volume and the multiplication ...
To summarize, this is a promising perspective of tackling matrix multiplication problems on heterogeneous processor platforms. ...
The mesh network is heterogeneous, with each link speed and processor speed independently generated. ...
arXiv:1812.06329v1
fatcat:y4gwgyvc3bf4pjeclsae42t3de
Highly Parallel Sparse Matrix-Matrix Multiplication
[article]
2010
arXiv
pre-print
Generalized sparse matrix-matrix multiplication is a key primitive for many high performance graph algorithms as well as some linear solvers such as multigrid. ...
Our algorithms are based on two-dimensional block distribution of sparse matrices where serial sections use a novel hypersparse kernel for scalability. ...
The distribution of matrix A on a single processor row is shown in Figure 27. ...
arXiv:1006.2183v1
fatcat:ej4x646fvnatzjd5z2e2uodtau
The spatial computer: A model for energy-efficient parallel computation
[article]
2022
arXiv
pre-print
We show matching energy lower and upper bounds for many foundational problems, including sorting, median selection, and matrix multiplication. ...
We also show how to simulate PRAM algorithms in our model and how to obtain results for a more complex model that introduces the size of the local memories of the processors as a parameter. ...
Finally, in each time step a processor can perform a constant number of arithmetic and logic operations on its memory, and generate an independent, uniformly distributed word-sized number. ...
arXiv:2205.04934v1
fatcat:lxqr3up5w5au7aj3fheurpop2a
Controlling a physical model with a 2D force matrix
2007
Proceedings of the 7th international conference on New interfaces for musical expression - NIME '07
In this paper we describe our work towards an instrument for percussion synthesis, in which a waveguide mesh is both excited and damped by a 2D matrix of forces from a sensor. ...
By emulating a drum skin both as controller and sound generator, our instrument has reproduced some of the expressive qualities of hand drumming. ...
We add exponential damping per junction by simply multiplying the mesh with a damping matrix at each sample. ...
doi:10.1145/1279740.1279742
dblp:conf/nime/JonesS07
fatcat:kmf3b43fzrg5dirulko6eu3ggy
Scaling Block Conjugate Gradient Variants Orthomin and Orthodir
[article]
2019
Zenodo
We investigate 1D- and 2D-partitioning of the sparse coefficient matrix for encapsulating the minimization of the communication overhead as well as one- and two-constraint partitioning for computational load ...
Two different parallel codes for Orthomin and Orthodir variants are developed. ...
We acknowledge PRACE for awarding us access to resource JUWELS based in Germany at Jülich Supercomputing Centre (JSC). We acknowledge UHeM for awarding us access to resource Sariyer based in Turkey. ...
doi:10.5281/zenodo.2670068
fatcat:w5f32sn2tvf4phk6ss3mgy2ycq
SmartCell: An Energy Efficient Coarse-Grained Reconfigurable Architecture for Stream-Based Applications
2009
EURASIP Journal on Embedded Systems
This paper presents SmartCell, a novel coarse-grained reconfigurable architecture, which tiles a large number of processor elements with reconfigurable interconnection fabrics on a single chip. ...
It is concluded that SmartCell system is a promising reconfigurable and energy efficient architecture for stream processing. ...
Acknowledgments This work has been supported in part by the Defense Advanced Research Projects Agency (DARPA) Young Faculty Award under Grant W911NF-07-1-0191-P00001, and by the National Science Foundation ...
doi:10.1155/2009/518659
fatcat:hftuf2y3nvcenjbbcui3o5zqea