
Threaded MPI programming model for the Epiphany RISC array processor

David Richie, James Ross, Song Park, Dale Shires
2015 Journal of Computational Science  
Using MPI exploits the similarities between the Epiphany architecture and a networked parallel distributed cluster.  ...  We present experimental results for matrix-matrix multiplication using MPI and highlight the importance of fast inter-core data transfers.  ...  The 2D mesh topology of the RISC array network creates a device-scale architecture that resembles a classic parallel distributed cluster of serial processors, where the Message Passing Interface (MPI)  ... 
doi:10.1016/j.jocs.2015.04.023 fatcat:bmycj4ivzjbifkemmle24ggl7i

Custom FPGA-based soft-processors for sparse graph acceleration

Nachiket Kapre
2015 2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP)  
We interconnect a 2D array of these lightweight processors with a packet-switched network-on-chip to enable fine-grained operand routing along the graph edges and provide custom send/receive instructions  ...  ZC706 board (100 processor design) across a range of matrix datasets.  ...  Using Boost pre-processor parameterization [10], we are able to generate multiple instances of the processor to build 2D meshes of required dimensions. • Instruction Memory and Execute Stage: We specify  ... 
doi:10.1109/asap.2015.7245698 dblp:conf/asap/Kapre15 fatcat:mqos2rxf4zdkxcsq2hf6q3xji4

Analysis of Partitioning Models and Metrics in Parallel Sparse Matrix-Vector Multiplication [chapter]

Kamer Kaya, Bora Uçar, Ümit V. Çatalyürek
2014 Lecture Notes in Computer Science  
Our experiments show that the partitioning metrics influence the performance greatly in a distributed memory setting.  ...  We carry out experiments with up to 512 processors and investigate the results with regression analysis.  ...  In particular, the processor P k performs scalar multiply-add operations using local a ij 's for which µ(x j ) = P k and there is no a i with µ(x ) = P k .  ... 
doi:10.1007/978-3-642-55195-6_16 fatcat:viwsc75mibb4vovzm4yvjvnlpi

Parallel Programming Model for the Epiphany Many-Core Coprocessor Using Threaded MPI [article]

James A. Ross, David A. Richie, Song J. Park, Dale R. Shires
2015 arXiv   pre-print
We report benchmark results for the threaded MPI implementation of four algorithms (dense matrix-matrix multiplication, N-body particle interaction, a five-point 2D stencil update, and 2D FFT) and highlight  ...  The Adapteva Epiphany many-core architecture comprises a 2D tiled mesh Network-on-Chip (NoC) of low-power RISC cores with minimal uncore functionality.  ...  Army Research Laboratory-hosted Department of Defense Supercomputing Resource Center for its support of this work.  ... 
arXiv:1506.05442v1 fatcat:edidr7vxd5cglgeaieprywbdgm

Towards Structured Parallel Computing on Architecture-Independent Parallel Algorithm Design for Distributed-Memory Architectures

Feng Gao
1996 Journal of computer and system sciences (Print)  
) for the algorithm, and design of emulations of the virtual networks on physical networks.  ...  In a paper by Gao, a general theory of portable optimality of parallel algorithms is presented.  ...  I thank Maria Klawe and Nick Pippenger for valuable comments and for criticism on a draft of this paper.  ... 
doi:10.1006/jcss.1996.0053 fatcat:36p2jze2gzee7nibaxrja7fl7m

Matrix decomposition on the star graph

A.-E. Al-Ayyoub, K. Day
1997 IEEE Transactions on Parallel and Distributed Systems  
computation complexity and uses O(Nn) communication time to decompose a matrix of order N on a star graph of dimension n, where N ≥ (n − 1)!.  ...  pivot row and multipliers column broadcasts.  ...  In this approach, most of the existing matrix distribution methods can be viewed as instances of a more general distribution function called 2D matrix distribution [3], [4].  ... 
doi:10.1109/71.605767 fatcat:poyomvnha5dt3kz4tatolf7qm4

The design and implementation of the TRIPS prototype chip

Robert McDonald, Doug Burger, Steve Keckler
2005 2005 IEEE Hot Chips XVII Symposium (HCS)  
Vector add (limited by load/store bandwidth) 74 6.51 3.04 vadd Secure hash (mostly sequential algorithm) 80 2.10 2.28 sha Matrix multiply 72 4.05 1.68 matrix 2D discrete cosine  ...  8 GB of SDRAM (NUMA) PPC 440GP FPGA PowerPC 440GP used as control processor and host interface 2D chip-to-chip (C2C) network connects multiple TRIPS chips Intended for exploration of parallel  ... 
doi:10.1109/hotchips.2005.7476592 fatcat:nqakqbamazeidbhg6klqdtkguy

JAMPI: Efficient Matrix Multiplication in Spark Using Barrier Execution Mode

Tamas Foldi, Chris von Csefalvay, Nicolas A. Perez
2020 Big Data and Cognitive Computing  
By combining distributed message passing (using asynchronous network IO), OpenJDK's new auto-vectorization and Spark's barrier execution mode, we can add non-map/reduce-based algorithms, such as Cannon's  ...  The new barrier mode in Apache Spark allows for embedding distributed deep learning training as a Spark stage to simplify the distributed training workflow.  ...  It is known, for instance, that the memory requirement for each processor increases as we add processors to a computation.  ... 
doi:10.3390/bdcc4040032 fatcat:tnuh62oddbdp7gubaz3fch5jdm

A Survey on Dynamically Reconfigurable Processors

Hideharu Amano
2006 IEICE transactions on communications  
Dynamically reconfigurable processors consist of an array of processing elements whose functions and interconnections can be dynamically changed. 9 commercial  ...  systems are picked up, and their array structures, processing elements and interconnection architectures are classified.  ...  Name Interconnect CS2112 Tile base, 2D-bus DAPDNA-2 Segment base, 2D-bus FE-GA 2D-mesh direct, Crossbar for memories Cluster machine 3-stage switch DRP-1 Tile base, 2D-bus Kilocore KC256 Crossbar  ... 
doi:10.1093/ietcom/e89-b.12.3179 fatcat:z7uep5s5jfehtkgepanutfwgye

Layer Based Partition for Matrix Multiplication on Heterogeneous Processor Platforms [article]

Yang Liu, Li Shi, Junwei Zhang, Thomas G. Robertazzi
2018 arXiv   pre-print
In this paper, we propose a new method that schedules matrix multiplication on heterogeneous processor platforms with the mixed co-design goal of minimizing the total communication volume and the multiplication  ...  To summarize, this is a promising perspective of tackling matrix multiplication problems on heterogeneous processor platforms.  ...  The mesh network is heterogeneous, with each link speed and processor speed independently generated.  ... 
arXiv:1812.06329v1 fatcat:y4gwgyvc3bf4pjeclsae42t3de

Highly Parallel Sparse Matrix-Matrix Multiplication [article]

Aydın Buluç, John R. Gilbert
2010 arXiv   pre-print
Generalized sparse matrix-matrix multiplication is a key primitive for many high performance graph algorithms as well as some linear solvers such as multigrid.  ...  Our algorithms are based on two-dimensional block distribution of sparse matrices where serial sections use a novel hypersparse kernel for scalability.  ...  The distribution of matrix A on a single processor row is shown in Figure 27 .  ... 
arXiv:1006.2183v1 fatcat:ej4x646fvnatzjd5z2e2uodtau

The spatial computer: A model for energy-efficient parallel computation [article]

Lukas Gianinazzi, Tal Ben-Nun, Saleh Ashkboos, Yves Baumann, Piotr Luczynski, Torsten Hoefler
2022 arXiv   pre-print
We show matching energy lower and upper bounds for many foundational problems, including sorting, median selection, and matrix multiplication.  ...  We also show how to simulate PRAM algorithms in our model and how to obtain results for a more complex model that introduces the size of the local memories of the processors as a parameter.  ...  Finally, in each time step a processor can perform a constant number of arithmetic and logic operations on its memory, and generate an independent, uniformly distributed word-sized number.  ... 
arXiv:2205.04934v1 fatcat:lxqr3up5w5au7aj3fheurpop2a

Controlling a physical model with a 2D force matrix

Randy Jones, Andrew Schloss
2007 Proceedings of the 7th international conference on New interfaces for musical expression - NIME '07  
In this paper we describe our work towards an instrument for percussion synthesis, in which a waveguide mesh is both excited and damped by a 2D matrix of forces from a sensor.  ...  By emulating a drum skin both as controller and sound generator, our instrument has reproduced some of the expressive qualities of hand drumming.  ...  We add exponential damping per junction by simply multiplying the mesh with a damping matrix at each sample.  ... 
doi:10.1145/1279740.1279742 dblp:conf/nime/JonesS07 fatcat:kmf3b43fzrg5dirulko6eu3ggy

Scaling Block Conjugate Gradient Variants Orthomin and Orthodir [article]

Cevdet Aykanat, Oguz Selvitopi, M. Ozan Karsavuran
2019 Zenodo  
We investigate 1D- and 2D-partitioning of the sparse coefficient matrix for encapsulating the minimization of the communication overhead as well as one- and two-constraint partitioning for computational load  ...  Two different parallel codes for Orthomin and Orthodir variants are developed.  ...  We acknowledge PRACE for awarding us access to resource JUWELS based in Germany at Jülich Supercomputing Centre (JSC). We acknowledge UHeM for awarding us access to resource Sariyer based in Turkey.  ... 
doi:10.5281/zenodo.2670068 fatcat:w5f32sn2tvf4phk6ss3mgy2ycq

SmartCell: An Energy Efficient Coarse-Grained Reconfigurable Architecture for Stream-Based Applications

Cao Liang, Xinming Huang
2009 EURASIP Journal on Embedded Systems  
This paper presents SmartCell, a novel coarse-grained reconfigurable architecture, which tiles a large number of processor elements with reconfigurable interconnection fabrics on a single chip.  ...  It is concluded that SmartCell system is a promising reconfigurable and energy efficient architecture for stream processing.  ...  Acknowledgments This work has been supported in part by the Defense Advanced Research Projects Agency (DARPA) Young Faculty Award under Grant W911NF-07-1-0191-P00001, and by the National Science Foundation  ... 
doi:10.1155/2009/518659 fatcat:hftuf2y3nvcenjbbcui3o5zqea