3,223 Hits in 5.3 sec

Engineering Boolean Matrix Multiplication for Multiple-Accelerator Shared-Memory Architectures [article]

Matti Karppa, Petteri Kaski
2019 arXiv   pre-print
We engineer high-performance open-source algorithm implementations for contemporary multiple-accelerator shared-memory architectures, with the objective of time-and-energy-efficient scaling up to input  ...  execution on accelerator hardware, (c) low-level engineering of the innermost block products for the specific target hardware, and (d) structuring the top-level shared-memory implementation to feed the  ...  At first reading, it may be convenient to assume that π is the identity permutation.  ... 
arXiv:1909.01554v1 fatcat:njvf7t2nybhnnfpzjrqqxbxzta
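The abstract's innermost block products are Boolean matrix products over (OR, AND). As a hedged illustration of the core idea only (a minimal sequential CPU sketch, not the paper's accelerator implementation), rows can be bit-packed into machine words so that a single word-wide OR processes many columns at once:

```python
# Bit-packed Boolean matrix product C = A * B over (OR, AND).
# Each row is a Python int used as a bitset: bit j set => entry (i, j) is 1.

def bool_matmul(a_rows, b_rows):
    """a_rows: rows of A as ints; b_rows: rows of B as ints."""
    c_rows = []
    for a in a_rows:
        acc = 0
        k = 0
        while a:
            if a & 1:            # A[i][k] == 1: OR in the whole k-th row of B
                acc |= b_rows[k]
            a >>= 1
            k += 1
        c_rows.append(acc)
    return c_rows
```

Each `|=` touches an entire row of B at word granularity, which is the scalar analogue of the block products the paper engineers for accelerator hardware.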


2019 Figshare  
Thus, it is worthwhile to define a programmable accelerator for training and running a CNN. The paper describes the organization and the architecture of a hybrid system based on a Map-Reduce architecture.  ...  They are marked by limitations due to their overly general and ad hoc structure and architecture. We propose an accelerator with a Map-Reduce architecture.  ...  State of the Art: The currently used accelerators for CNN applications are based on Nvidia or Intel's Xeon Phi parallel engines.  ... 
doi:10.6084/m9.figshare.8264096 fatcat:bjgx6caa5zh7jldz5wlrzhxceu

Memcomputing: fusion of memory and computing

Yi Li, Yaxiong Zhou, Zhuorui Wang, Xiangshui Miao
2018 Science China Information Sciences  
Memcomputing: fusion of memory and computing. Sci China Inf Sci, 2018, 61(6): 060424, https://doi.  ...  To accelerate convolution computation or matrix-vector multiplication, the implementation of the dot-product engine in crossbars is an energy-efficient parallel solution for hardware deep learning and neuromorphic  ...  The first two attributes open an intriguing opportunity for the fusion of memory and computing to develop non-von Neumann architectures.  ... 
doi:10.1007/s11432-017-9313-6 fatcat:kgvmf3w3wnczjm5s7dlhlr3zwu
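The dot-product engine mentioned in the abstract maps vector-matrix multiplication onto a resistive crossbar: input voltages on the rows and crosspoint conductances yield column currents I_j = Σ_i V_i·G_ij (Ohm's law per device, Kirchhoff's current law per column). A hedged, purely functional sketch of that idealized behavior (device physics and non-idealities such as wire resistance are omitted):

```python
# Idealized memristive crossbar: voltages drive the rows, G[i][j] is the
# conductance at crosspoint (i, j); each column current is the dot product
# I[j] = sum_i V[i] * G[i][j].

def crossbar_mvm(voltages, conductances):
    n_cols = len(conductances[0])
    currents = [0.0] * n_cols
    for v, g_row in zip(voltages, conductances):
        for j, g in enumerate(g_row):
            currents[j] += v * g   # Ohm's law, summed by Kirchhoff's law
    return currents
```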

Efficient Spatial Processing Element Control via Triggered Instructions

Angshuman Parashar, Michael Pellauer, Michael Adler, Bushra Ahsan, Neal Crago, Daniel Lustig, Vladimir Pavlov, Antonia Zhai, Mohit Gambhir, Aamer Jaleel, Randy Allmon, Rachid Rayess (+2 others)
2014 IEEE Micro  
Recently, single-instruction, multiple-data (SIMD) and single-instruction, multiple-thread (SIMT) accelerators such as GPGPUs have been shown to be effective as offload engines when paired with  ...  In a multicore system, the typical approach is to use shared memory for the queue buffering, along with sophisticated polling mechanisms such as memory monitors.  ...  Bushra Ahsan is a component design engineer at Intel. Her research focuses on memory systems architecture design and workloads for spatial architectures.  ... 
doi:10.1109/mm.2014.14 fatcat:idejsg2kovdmhoune77bqhgi5m

Automata Processor Architecture and Applications: A Survey

Nourah A. Almubarak, Anwar Alshammeri, Imtiaz Ahmad
2016 International Journal of Grid and Distributed Computing  
One such technology is the recently introduced Micron Automata Processor (AP), which is a novel and powerful reconfigurable non-von Neumann processor that can be used for direct implementation of multiple  ...  The AP is a promising future technology which provides new operations and new avenues for exploiting parallelism to accelerate the growing and important class of automata-based algorithms.  ...  All paths of the routing matrix are operating simultaneously on the same input symbol, and the memory arrays are distributed throughout the chip, providing O(1) lookup for a 48 Kbit memory word [13]  ... 
doi:10.14257/ijgdc.2016.9.4.05 fatcat:stay6tyhrzgdhfn4ptnv56oxq4
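The snippet's key point — all paths operate simultaneously on the same input symbol — is the defining step of nondeterministic automaton evaluation. A hedged software analogue (the AP does this in hardware via its routing matrix and distributed memory arrays; `delta` here is an illustrative transition table, not Micron's programming interface):

```python
# AP-style evaluation: every active state consumes the same input symbol in
# the same step, and all enabled transitions fire "in parallel".
# delta: dict mapping (state, symbol) -> set of successor states.

def run_automaton(delta, start_states, accept_states, text):
    active = set(start_states)
    for sym in text:
        nxt = set()
        for s in active:                      # conceptually simultaneous
            nxt |= delta.get((s, sym), set())
        active = nxt
    return bool(active & accept_states)
```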

AIDA: Associative DNN Inference Accelerator [article]

Leonid Yavits, Roman Kaplan, Ran Ginosar
2018 arXiv   pre-print
We propose AIDA, an inference engine for accelerating fully-connected (FC) layers of a Deep Neural Network (DNN).  ...  AIDA is an associative in-memory processor, where the bulk of data never leaves the confines of the memory arrays, and processing is performed in-situ.  ...  A large number of CNN accelerators (DaDianNao [11], Angel-Eye [5], TrueNorth [9]), some of which share with AIDA architectural techniques such as bit-serial arithmetic (Cnvlutin [12]), or processing  ... 
arXiv:1901.04976v1 fatcat:gvpprlb4ozhpfamcmtiidloa5a
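AIDA is said to share bit-serial arithmetic with designs such as Cnvlutin. As a hedged illustration of what "bit-serial" means here (a scalar shift-add sketch processing one multiplier bit per cycle — not AIDA's word-parallel, in-memory realization):

```python
# Bit-serial multiply: consume the multiplier y one bit per "cycle",
# accumulating a shifted copy of the multiplicand x. Associative in-memory
# designs apply this step to many words in parallel inside the array.

def bit_serial_mul(x, y, bits=8):
    acc = 0
    for t in range(bits):        # one bit of y per cycle
        if (y >> t) & 1:
            acc += x << t        # add the shifted multiplicand
    return acc
```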

Memristive Device Based Circuits for Computation-in-Memory Architectures

Muath Abu Lebdeh, Uljana Reinsalu, Hoang Anh Du Nguyen, Stephan Wong, Said Hamdioui
2019 2019 IEEE International Symposium on Circuits and Systems (ISCAS)  
This paper addresses memristive circuit designs for CIM architectures.  ...  Computation-in-Memory (CIM) architecture based on memristive devices is one of the alternative computing architectures being explored to address these limitations.  ...  More complex designs of vector-matrix multiplication (e.g., the Dot Product Engine [13] and ISAAC [14]) utilize analog circuits in the periphery (e.g., ADCs and DACs) to perform arithmetic multiplications  ... 
doi:10.1109/iscas.2019.8702542 dblp:conf/iscas/LebdehRNWH19 fatcat:x5fjol3iyjfpzfqbpbtquxv4cy


An Efficient and Scalable RDF Indexing Strategy based on B-Hashed-Bitmap Algorithm using CUDA

Sharmi Sankar, Munesh Singh, Awny Sayed, Jihad Alkhalaf Bani-Younis
2014 International Journal of Computer Applications  
The crucial sparse matrix part of the proposed index is benchmarked against different CUDA memory implementations to derive optimal matrix processing options.  ...  Benchmarking the data provides promising results for a B+ tree based index coupled with hashing and sparse matrix implementations.  ...  Benchmarking the Boolean sparse matrix against different memory factors: efficient tiling methods using shared memory that take memory bank conflicts into account are evaluated, with the following results  ... 
doi:10.5120/18216-9221 fatcat:qtjlxp7dgje3nfdfk5vie47iqu

Sparse Matrix Multiplication On An Associative Processor [article]

L. Yavits, A. Morad, R. Ginosar
2017 arXiv   pre-print
Sparse matrix multiplication is an important component of linear algebra computations.  ...  Implementing sparse matrix multiplication on an associative processor (AP) enables a high level of parallelism, where a row of one matrix is multiplied in parallel with the entire second matrix, and where  ...  ACKNOWLEDGMENT This research was partially funded by the Intel Collaborative Research Institute for Computational Intelligence and by Hasso-Plattner-Institut.  ... 
arXiv:1705.07282v1 fatcat:n3tr6cnkpjemvmm7oycra7kav4
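The formulation in the abstract — one row of A multiplied in parallel with the entire matrix B — is the row-times-matrix schedule for sparse multiplication: each nonzero A[i][k] scales the whole k-th row of B, and the scaled rows are accumulated into row i of C. A hedged sequential sketch of that schedule (the AP performs the per-row step associatively across all columns at once; the dict-of-dicts format is illustrative):

```python
# Row-times-matrix sparse multiply: C[i][:] = sum_k A[i][k] * B[k][:].
# Matrices are dict-of-dicts holding only nonzero entries.

def sparse_matmul(A, B):
    C = {}
    for i, row in A.items():
        acc = {}
        for k, a_ik in row.items():          # each nonzero of row i of A ...
            for j, b_kj in B.get(k, {}).items():
                acc[j] = acc.get(j, 0) + a_ik * b_kj   # ... scales row k of B
        if acc:
            C[i] = acc
    return C
```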

ReHy: A ReRAM-based Digital/Analog Hybrid PIM Architecture for Accelerating CNN Training

Hai Jin, Cong Liu, Haikun Liu, Ruikun Luo, Jiahong Xu, Fubing Mao, Xiaofei Liao
2021 IEEE Transactions on Parallel and Distributed Systems  
Resistive random access memory (ReRAM) has been widely used in PIM architectures due to its extremely high efficiency for accelerating matrix-vector multiplications through analog computing.  ...  We exploit the capability of ReRAM for Boolean logic operations to design the DPIM architecture.  ...  In our hybrid architecture, we use a ReRAM-based PIM accelerator and general-purpose processors for data processing. The main memory is shared by traditional CPUs and PIM accelerators.  ... 
doi:10.1109/tpds.2021.3138087 fatcat:7ysqhvgmbvcl5eycecvl465lgy

Real-time Simulation and Optimization of Elastic Aircraft Vehicle Based on Multi-GPU Workstation

Binxing Hu, Xingguo Li
2019 IEEE Access  
Accordingly, performance is enhanced through adaptive use of hardware resources and rational use of shared memory.  ...  Meanwhile, an innovative parallel algorithm for the element stiffness matrix based on a finite element model is designed for the GPU architecture.  ...  [23] proposed using CUDA technology to accelerate sparse matrix-vector multiplication under a Hadoop architecture.  ... 
doi:10.1109/access.2019.2946684 fatcat:v2arypyusbgdtglub5t2b5dyoi

Memristive Boltzmann machine: A hardware accelerator for combinatorial optimization and deep learning

Mahdi Nazm Bojnordi, Engin Ipek
2016 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA)  
The proposed accelerator exploits the electrical properties of RRAM to realize in situ, fine-grained parallel computation within memory arrays, thereby eliminating the need for exchanging data between  ...  This paper examines a new class of hardware accelerators for large-scale combinatorial optimization and deep learning based on memristive Boltzmann machines.  ...  The authors would like to thank anonymous reviewers for useful feedback. This work was supported in part by NSF grant CCF-1533762.  ... 
doi:10.1109/hpca.2016.7446049 dblp:conf/hpca/BojnordiI16 fatcat:exms5os62rbvrm3qti6k74dieu
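The accelerator described above evaluates each unit's local field as an in-situ dot product inside the RRAM arrays and updates units stochastically. A hedged software model of that update rule for an Ising-style energy E(s) = -Σ w_ij s_i s_j (simulated annealing with a geometric cooling schedule; all parameters are illustrative, not the paper's hardware configuration):

```python
import math
import random

# Boltzmann-machine-style annealing: each step picks a unit, computes its
# local field (the dot product a memristive array would evaluate in situ),
# and sets the unit's state with the Gibbs probability at temperature t.

def anneal(weights, steps=2000, t_start=2.0, t_end=0.05, seed=0):
    rng = random.Random(seed)
    n = len(weights)
    s = [rng.choice([-1, 1]) for _ in range(n)]
    for step in range(steps):
        t = t_start * (t_end / t_start) ** (step / steps)    # geometric cooling
        i = rng.randrange(n)
        field = sum(weights[i][j] * s[j] for j in range(n))  # in-array dot product
        p_up = 1.0 / (1.0 + math.exp(-2.0 * field / t))      # Gibbs probability
        s[i] = 1 if rng.random() < p_up else -1
    return s
```

For two ferromagnetically coupled units, low final temperature drives the states into alignment, i.e. toward the energy minimum of the coupling matrix.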

Lightweight DMA management mechanisms for multiprocessors on FPGA

Antonino Tumeo, Matteo Monchiero, Gianluca Palermo, Fabrizio Ferrandi, Donatella Sciuto
2008 2008 International Conference on Application-Specific Systems, Architectures and Processors  
This paper presents a multiprocessor system on FPGA that adopts Direct Memory Access (DMA) mechanisms to move data between the external memory and the local memory of each processor.  ...  This interface allows programming the embedded multiprocessor architecture on FPGA with simple DMAs, using the same DMA techniques adopted on high-performance multiprocessors with complex DMA controllers  ...  The RGB to YUV color conversion and the Matrix Multiplication obtain the most significant speedup.  ... 
doi:10.1109/asap.2008.4580191 dblp:conf/asap/TumeoMPFS08 fatcat:k5o3rmecb5gcjijl4ukk56bbky

RESPARC: A Reconfigurable and Energy-Efficient Architecture with Memristive Crossbars for Deep Spiking Neural Networks [article]

Aayush Ankit, Abhronil Sengupta, Priyadarshini Panda, Kaushik Roy
2017 arXiv   pre-print
RESPARC advances this by proposing a complete system for SNN acceleration and its subsequent analysis.  ...  Furthermore, RESPARC is a technology-aware architecture that maps a given SNN topology to the most optimized MCA size for the given crossbar technology.  ...  This necessitates partitioning the connectivity matrix to map it across multiple MCAs.  ... 
arXiv:1702.06064v1 fatcat:mv75rfmiu5g7ri3hnrcru7jele
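Partitioning a connectivity matrix across multiple memristive crossbar arrays (MCAs), as the snippet notes, amounts to tiling the weight matrix to the crossbar dimensions. A hedged sketch of such a tiling (tile sizes and the grid-index layout are illustrative, not RESPARC's actual mapper):

```python
# Tile an R x C connectivity (weight) matrix into crossbar-sized blocks,
# keyed by their (tile_row, tile_col) position in the grid. Edge tiles may
# be smaller than tile_rows x tile_cols.

def partition(matrix, tile_rows, tile_cols):
    tiles = {}
    n_rows, n_cols = len(matrix), len(matrix[0])
    for r0 in range(0, n_rows, tile_rows):
        for c0 in range(0, n_cols, tile_cols):
            tiles[(r0 // tile_rows, c0 // tile_cols)] = [
                row[c0:c0 + tile_cols]
                for row in matrix[r0:r0 + tile_rows]
            ]
    return tiles
```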
Showing results 1 — 15 out of 3,223 results