Design of a parallel vector access unit for SDRAM memory systems

B.K. Mathew, S.A. McKee, J.B. Carter, A. Davis
Proceedings Sixth International Symposium on High-Performance Computer Architecture. HPCA-6 (Cat. No.PR00550)  
This paper describes a Parallel Vector Access unit (PVA), the vector memory subsystem that efficiently "gathers" sparse, strided data structures in parallel on a multibank SDRAM memory.  ...  On unit-stride vectors, PVA performance equals or exceeds that of an SDRAM system optimized for cache line fills. On vectors with larger strides, the PVA is up to 32.8 times faster.  ...  Discussion In this paper, we have described the design of a Parallel Vector Access unit (PVA) for the Impulse smart memory controller.  ... 
doi:10.1109/hpca.2000.824337 dblp:conf/hpca/MathewMCD00 fatcat:hksgwynnfrc25le6qsuj7jhcva

PVMC: Programmable Vector Memory Controller

Tassadaq Hussain, Oscar Palomar, Osman Unsal, Adrian Cristal, Eduard Ayguade, Mateo Valero
2014 2014 IEEE 25th International Conference on Application-Specific Systems, Architectures and Processors  
In this work, we propose a Programmable Vector Memory Controller (PVMC), which boosts noncontiguous vector data accesses by integrating descriptors of memory patterns, a specialized local memory, a memory  ...  When compared with a baseline vector system, the results show that the PVMC system transfers data sets up to 2.2x to 14.9x faster, achieves between 2.16x to 3.18x of speedup for 5 applications and consumes  ...  The design uses a Multi DRAM Access Unit that manages memory accesses of multiple SDRAM modules.  ... 
doi:10.1109/asap.2014.6868668 dblp:conf/asap/HussainPUCAV14 fatcat:bfz7dlvbrzcctjie54eihfhiju

Memory Controller for Vector Processor

Tassadaq Hussain, Oscar Palomar, Osman S. Ünsal, Adrian Cristal, Eduard Ayguadé
2016 Journal of Signal Processing Systems  
In this work, we propose an Advanced Programmable Vector Memory Controller (PVMC), which boosts noncontiguous vector data accesses by integrating descriptors of memory patterns, a specialized on-chip memory  ...  We compare the performance of a system with vector and scalar processors without PVMC.  ...  The design uses a Multi DRAM Access Unit that manages memory accesses of multiple SDRAM modules.  ... 
doi:10.1007/s11265-016-1215-5 fatcat:fhvdnzm5dfbxddxwomq2mmabhq

AMC: Advanced Multi-accelerator Controller

Tassadaq Hussain, Amna Haider, Shakaib A. Gursal, Eduard Ayguadé
2015 Parallel Computing  
Thus, a system demands a memory manager and a scheduler that improves performance by managing and scheduling the multi-accelerator's memory access patterns efficiently.  ...  The rapid advancement, use of diverse architectural features and introduction of High Level Synthesis (HLS) tools in FPGA technology have enhanced the capacity of data-level parallelism on a chip.  ...  The Load unit accesses the 3D-Stencil vector at the start of each row of the 3D-Memory volume. For the following vector access, the Load unit transfers control to the Update unit.  ... 
doi:10.1016/j.parco.2014.10.003 fatcat:z7xne5erxjbihk54ns6kjwjpve

Algorithmic foundations for a parallel vector access memory system

Binu K. Mathew, Sally A. McKee, John B. Carter, Al Davis
2000 Proceedings of the twelfth annual ACM symposium on Parallel algorithms and architectures - SPAA '00  
The Parallel Vector Access (PVA) unit exploits the regularity of vectors or streams to access them efficiently in parallel on a multi-bank SDRAM memory system.  ...  This paper presents mathematical foundations for the design of a memory controller subcomponent that helps to bridge the processor/memory performance gap for applications with strided access patterns.  ...  Acknowledgments The authors thank Ganesh Gopalakrishnan for his contributions to the early stages of this project.  ... 
doi:10.1145/341800.341819 dblp:conf/spaa/MathewMCD00 fatcat:subxkqmdkjagfpblkvggt3n22q

FPGA Acceleration of Recurrent Neural Network Based Language Model

Sicheng Li, Chunpeng Wu, Hai Li, Boxun Li, Yu Wang, Qinru Qiu
2015 2015 IEEE 23rd Annual International Symposium on Field-Programmable Custom Computing Machines  
A multi-thread based computation engine is utilized which can successfully mask the long memory latency and reuse frequent accessed data.  ...  At architectural level, we improve the parallelism of RNN training scheme and reduce the computing resource requirement for computation efficiency enhancement.  ...  Figure 8. The thread management unit in a computation engine.  ... 
doi:10.1109/fccm.2015.50 dblp:conf/fccm/LiWLLWQ15 fatcat:dk66yqbdfvc2niu2acs3rwfn3q

Performance characteristics of MAUI

Justin Teller, Charles B. Silio, Bruce Jacob
2005 Proceedings of the 2005 workshop on Memory system performance - MSP '05  
Because the "intelligence" of the MAUI intelligent memory system architecture is located in the memory-controller, logic and DRAM are not required to be integrated into a single chip, and use of off-the-shelf  ...  Simulation results show single-threaded application speedup of over 100% is possible, and suggest that a total system speedup of about 300% is possible in a multi-threaded environment.  ...  The availability of a MAUI enhanced memory system may significantly change the way that operating systems are designed and implemented.  ... 
doi:10.1145/1111583.1111590 dblp:conf/ACMmsp/TellerSJ05 fatcat:j5ckax5325dtzlbpd7ictbc7re

Traffic shaping for an FPGA based SDRAM controller with complex QoS requirements

Sven Heithecker, Rolf Ernst
2005 Proceedings of the 42nd annual conference on Design automation - DAC '05  
Today high-end video and multimedia processing applications require huge amounts of memory. For cost reasons, the usage of conventional dynamic RAM (SDRAM) is preferred.  ...  In [8], a multi-stream DDR-SDRAM controller IP covering combinations of low latency requirements for processor cache access, hard realtime constraints for periodic video signals and hard real-time bursty  ...  [12] provide optimization heuristics for known memory access patterns of a single processor. None of these schedulers support vastly different access types at close to peak SDRAM bandwidth.  ... 
doi:10.1145/1065579.1065729 dblp:conf/dac/HeitheckerE05 fatcat:t3u7vpngzrdrxfwdsc5irlamym

Traffic shaping for an FPGA based SDRAM controller with complex QoS requirements

S. Heithecker, R. Ernst
2005 Proceedings. 42nd Design Automation Conference, 2005.  
Today high-end video and multimedia processing applications require huge amounts of memory. For cost reasons, the usage of conventional dynamic RAM (SDRAM) is preferred.  ...  In [8], a multi-stream DDR-SDRAM controller IP covering combinations of low latency requirements for processor cache access, hard realtime constraints for periodic video signals and hard real-time bursty  ...  [12] provide optimization heuristics for known memory access patterns of a single processor. None of these schedulers support vastly different access types at close to peak SDRAM bandwidth.  ... 
doi:10.1109/dac.2005.193876 fatcat:ohdh5e24rvcxnbxrelhhjs3lsq

Cache optimization for an embedded MPEG-4 video decoder

Hongxing Guo, Tao Sheng, Weiping Sun, Jingli Zhou, Shengsheng Yu
2006 2006 8th International Conference on Signal Processing  
However, the high data rate, large sizes, and distinctive memory access patterns of MPEG-4 video decoders exert a particular strain on cache.  ...  Due to its importance, a cache-based memory allocation mode is proposed to make full use of the SRAM.  ...  In addition, the authors thank the reviewers for their helpful comments and constructive criticism.  ... 
doi:10.1109/icosp.2006.345501 fatcat:nlixzqjg4vcivevpat7mijhziy

FPGA architecture and implementation of sparse matrix–vector multiplication for the finite element method

Yousef Elkurdi, David Fernández, Evgueni Souleimanov, Dennis Giannacopoulos, Warren J. Gross
2008 Computer Physics Communications  
For 8 GB/s of memory bandwidth typical of recent FPGA systems, this architecture can achieve 1.5 GFLOPS sustained performance.  ...  We present an architecture and implementation of an FPGA-based sparse matrix-vector multiplier (SMVM) for use in the iterative solution of large, sparse systems of equations arising from FEM applications  ...  Jonathan Rose, for their guidance and support with the TM4 system.  ... 
doi:10.1016/j.cpc.2007.11.014 fatcat:n676arqruvaxbcmxlaf22tv5yi

FPGA and GPU implementation of large scale SpMV

Yi Shan, Tianji Wu, Yu Wang, Bo Wang, Zilong Wang, Ningyi Xu, Huazhong Yang
2010 2010 IEEE 8th Symposium on Application Specific Processors (SASP)  
In the FPGA implementation, we designed the task partition and memory hierarchy according to the analysis of datasets scale and their access pattern.  ...  Sparse matrix-vector multiplication (SpMV) is a fundamental operation for many applications.  ...  The matrix data access can utilize the high bandwidth of the DDRx SDRAM for continuous access, but the demand for random access for the vector pushes us to duplicate many SRAMs in order to retain low access  ... 
doi:10.1109/sasp.2010.5521144 dblp:conf/sasp/ShanWWWWXY10 fatcat:o2idlrhklrcv3hjz3somrk5wpy

An Efficient Reference Frame Storage Scheme for H.264 HDTV Decoder

Peng Zhang, Wen Gao, Di Wu, Don Xie
2006 2006 IEEE International Conference on Multimedia and Expo  
Pixel duplication completely eliminates the possibility of an access crossing word boundary and therefore substantially increases the memory bandwidth efficiency.  ...  L-C correlated mapping exploits address relationships between the luma and chroma reference pixels and largely reduces bank conflict overhead of memory accesses.  ...  The throughput of processing units can be improved by advanced ASIC process and design methodology exploiting parallelism.  ... 
doi:10.1109/icme.2006.262511 dblp:conf/icmcs/ZhangGWX06 fatcat:57qroccodjevfgyy53aemdwreu

FPGA Implementation of ECT Digital System for Imaging Conductive Materials

Wael Deabes
2019 Algorithms  
Therefore, a reconfigurable segmented parallel inner product architecture for the parallel matrix multiplication is proposed.  ...  The proposed system achieves high performance in terms of speed and small design density.  ...  Acknowledgments: The author would like to thank the Deanship of Scientific Research at Umm Al-Qura University for the financial support (Project No.: 43308004).  ... 
doi:10.3390/a12020028 fatcat:76mhxiasgvgylnbvghe45fnuzq

Methods for Power/Throughput/Area Optimization of H.264/AVC Decoding

Ke Xu, Tsu-Ming Liu, Jiun-In Guo, Chiu-Sing Choy
2009 Journal of Signal Processing Systems  
This paper presents methods for efficient optimization of ASIC implementation for H.264/AVC video decoding. A systematic approach in optimization is presented in a top-down flow.  ...  Finally, we provide the design guidelines for ASIC implementation based on the analysis and our design experiences of five dedicated decoder chips.  ...  Table 9. Chip power evaluation (excluding SDRAM).  ...  Table 10. System power evaluation (including SDRAM).  ... 
doi:10.1007/s11265-009-0408-6 fatcat:dhvbpcut5bgonhbe5y2dbetc5m