
High performance reconfigurable computing for numerical simulation and deep learning

Lin Gan, Ming Yuan, Jinzhe Yang, Wenlai Zhao, Wayne Luk, Guangwen Yang
2020 CCF Transactions on High Performance Computing  
Data from the memory will be streamed through different pipelines for higher computing efficiencies.  ...  gate array (FPGA) technology, bring a brand-new and completely different computing pattern that mostly relies on a data flow computing model for achieving better performance.  ...  For example, the Arria10 GX1150 FPGA has more than 1500 floating-point cores, and the maximum floating-point performance of the same FPGA series can reach 1.5 TFLOPS.  ... 
doi:10.1007/s42514-020-00032-x fatcat:mbnb73zazzgohhe4quhuqlryky

Bandwidth Enhancement between Graphics Processing Units on the Peripheral Component Interconnect Bus

2015 Journal of Electrical and Electronics Engineering  
In this paper we show that special-purpose compression algorithms designed for scientific floating point data can be used to enhance the bandwidth between 2 graphics processing unit (GPU) devices on the  ...  Parallel data compression is a difficult topic but compression has been used successfully to improve the communication between parallel message passing interface (MPI) processes on high performance computing  ...  Martin Burtscher for providing the "GFC" implementation and the LU dataset [8]. This work was partially supported by the strategic grant POSDRU/159/1.  ... 
doaj:9e77d75401c44ad4a27208e731316bf3 fatcat:pthnt7kbsvc77gzwkdnqe7zwpe

TerseCades: Efficient Data Compression in Stream Processing

Gennady Pekhimenko, Chuanxiong Guo, Myeongjae Jeon, Peng Huang, Lidong Zhou
2018 USENIX Annual Technical Conference  
This work is the first systematic investigation of stream processing with data compression: we have not only identified a set of factors that influence the benefits and overheads of compression, but have  ...  also demonstrated that compression can be effective for stream processing, both in the ability to process in larger windows and in throughput.  ...  Lossy Compression for Floating Point Data. The nature of floating point value representation makes it difficult to get high compression ratio from classical Base-Delta encoding.  ... 
dblp:conf/usenix/PekhimenkoGJHZ18 fatcat:gw2kvz53zzdr7abqisntvaj7rm

Exploring super-resolution implementations across multiple platforms

Brian Leung, Seda Ogrenci Memik
2013 EURASIP Journal on Advances in Signal Processing  
A high-performance FPGA can have comparable performance and rival the GPGPU in some cases.  ...  A major issue with higher quality video is that either more data bandwidth or storage resources must be dedicated for transferring or storing the video.  ...  In addition to having applications for streaming video, super-resolution has many applications in numerous fields.  ... 
doi:10.1186/1687-6180-2013-116 fatcat:mmn2t4l63za4ffot7vhez3ytna

Cascading Deep Pipelines to Achieve High Throughput in Numerical Reduction Operations

Mingjie Lin, Shaoyi Cheng, John Wawrzynek
2010 2010 International Conference on Reconfigurable Computing and FPGAs  
For a wide variety of convex optimization problems found in the decoding stage of compressive sensing, when comparing running the L1-MAGIC software package [1] on a 2.4 GHz Core 2 Duo Intel processor,  ...  This work proposes a cascaded and pipelined (CAP) reconfigurable architecture to achieve high throughput in executing numerical reduction operations commonly found in many scientific computations by (1  ...  Floating-Point Unit One challenge of this study is performing floating-point computations with FPGAs.  ... 
doi:10.1109/reconfig.2010.70 dblp:conf/reconfig/LinCW10 fatcat:cinef2gg4bcbhllb5g6dnhznge

Solving the Global Atmospheric Equations through Heterogeneous Reconfigurable Platforms

Lin Gan, Haohuan Fu, Wayne Luk, Chao Yang, Wei Xue, Xiaomeng Huang, Youhui Zhang, Guangwen Yang
2015 ACM Transactions on Reconfigurable Technology and Systems  
Moreover, by using fixed-point and reduced-precision floating point arithmetic, we manage to build a fully pipelined mixed-precision design on a single FPGA, which can perform 428 floating-point and 235  ...  Through a careful adjustment of the computational domains, we achieve a balanced resource utilization and a further improvement of the overall performance.  ...  To decrease Band_r, we can either use the hardware (de)compression scheme, or optimize the algorithm to decrease the number of streams and the data bytes.  ... 
doi:10.1145/2629581 fatcat:gq2vc7wnyzhaxkzm4vnlso526y

Synergistic CPU-FPGA Acceleration of Sparse Linear Algebra [article]

Mohammadreza Soltaniyeh, Richard P. Martin, Santosh Nagarakatte
2020 arXiv   pre-print
This paper describes REAP, a software-hardware approach that enables high performance sparse linear algebra computations on a cooperative CPU-FPGA platform.  ...  The computation is optimized on the FPGA for effective resource utilization with pipelining.  ...  ACKNOWLEDGMENT This paper is based on work supported in part by NSF CAREER Award CCF1453086, NSF Award CCF-1917897, NSF Award CCF-1908798, NSF CNS-1836901, NSF OAC-1925482, and Intel Corporation.  ... 
arXiv:2004.13907v1 fatcat:ced5hi2yfvgg3lyau2xrdrtqjq

A scalable sparse matrix-vector multiplication kernel for energy-efficient sparse-blas on FPGAs

Richard Dorrance, Fengbo Ren, Dejan Marković
2014 Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays - FPGA '14  
This paper describes an FPGA-based SpMxV kernel that is scalable to efficiently utilize the available memory bandwidth and computing resources.  ...  However, the computational throughput of these libraries for sparse matrices tends to be significantly lower than that of dense matrices, mostly due to the fact that the compression formats required to  ...  The authors would like to thank Yuta Toriyama and Fang-Li Yuan of UCLA for their helpful discussions.  ... 
doi:10.1145/2554688.2554785 dblp:conf/fpga/DorranceRM14 fatcat:vyj6zdpwujddzje2pqcsxpytty

Performance Evaluation of Finite-Difference Time-Domain (FDTD) Computation Accelerated by FPGA-based Custom Computing Machine

Kentaro SANO, Yoshiaki HATSUDA, Luzhou WANG, Satoru YAMAMOTO
2009 Interdisciplinary Information Sciences  
This paper evaluates the performance of the 2D FDTD computation on our FPGA-based array processor.  ...  So far, we have proposed the systolic computational-memory architecture for custom computing machines tailored for numerical computations with difference schemes, and implemented the array-processor based  ...  For scalable memory-size, we are planning to introduce external-memory support, connection of multiple FPGAs and compression of floating-point data.  ... 
doi:10.4036/iis.2009.67 fatcat:2onxfdmqbbbxdls2rl3v2zu264

Fractal video compression in OpenCL: An evaluation of CPUs, GPUs, and FPGAs as acceleration platforms

D. Chen, D. Singh
2013 2013 18th Asia and South Pacific Design Automation Conference (ASP-DAC)  
Fractal compression is an efficient technique for image and video encoding that uses the concept of self-referential codes.  ...  We demonstrate that the core computation implemented on the FPGA through OpenCL is 3x faster than a high-end GPU and 114x faster than a multi-core CPU, with significant power advantages.  ...  It contains 32 floating-point computing units, or Stream Processors (SP), and 4 special function units for transcendental calculations.  ... 
doi:10.1109/aspdac.2013.6509612 dblp:conf/aspdac/ChenS13 fatcat:72dcd7veqvabtiwydwa7pfqtjq

CEAZ: Accelerating Parallel I/O Via Hardware-Algorithm Co-Designed Adaptive Lossy Compression [article]

Chengming Zhang, Sian Jin, Tong Geng, Jiannan Tian, Ang Li, Dingwen Tao
2021 arXiv   pre-print
However, little work has been done for effectively offloading lossy compression onto FPGA-based SmartNICs to reduce the compression overhead.  ...  As parallel computers continue to grow to exascale, the amount of data that needs to be saved or transmitted is exploding.  ...  Background and Motivation Floating-Point Data Compression Floating-point data compression has been studied for decades.  ... 
arXiv:2106.13306v2 fatcat:42fvquu3trcgxncxwdl5izksra

A Review of FPGA‐Based Custom Computing Architecture for Convolutional Neural Network Inference

Peng Xiyuan, Yu Jinxiang, Yao Bowen, Liu Liansheng, Peng Yu
2021 Chinese journal of electronics  
At present, the performance of general processors cannot meet the requirement for CNN models with high computation complexity and large number of parameters.  ...  In this paper, the mainstream methods of CNN structure design, hardware-oriented model compression and FPGA-based custom architecture design are summarized, and the improvement of CNN inference performance  ...  data stream.  ... 
doi:10.1049/cje.2020.11.002 fatcat:vt4n4x67k5g6bhkywe7rhm7tda


Paul Grigoras, Pavel Burovskiy, Wayne Luk
2016 Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays - FPGA '16  
To improve the performance and applicability of FPGA-based SpMV, we propose an approach for exploiting properties of the input matrix to generate optimised custom architectures.  ...  The architectures generated by our approach are between 3.8 and 48 times faster than the worst case architectures for each matrix, showing the benefits of instance specific design for SpMV.  ...  For [7] we optimistically halve the performance of the design, although the resource cost of floating point units increases quadratically with word length and using double precision storage for x reduces  ... 
doi:10.1145/2847263.2847338 dblp:conf/fpga/GrigorasBL16 fatcat:ar2gnotaizhsrh4fgnwngkie2q

A Massively Parallel Digital Learning Processor

Hans Peter Graf, Srihari Cadambi, Igor Durdanovic, Venkata Jakkula, Murugan Sankaradass, Eric Cosatto, Srimat T. Chakradhar
2008 Neural Information Processing Systems  
The memory bandwidth thus scales with the number of VPEs, while the main data flows are local, keeping power dissipation low.  ...  This performance is more than an order of magnitude higher than that of any FPGA implementation reported so far.  ...  Operations with Computation: All loads and stores from the FPGA to off-chip memory are performed concurrently with computations. • High Off-chip Memory Bandwidth: 6 independent data ports, each 32 bits  ... 
dblp:conf/nips/GrafCDJSCC08 fatcat:edqizph455bv3khutz7yfl5mdu

State-of-the-art in Heterogeneous Computing

Andre R. Brodtkorb, Christopher Dyken, Trond R. Hagen, Jon M. Hjelmervik, Olaf O. Storaasli
2010 Scientific Programming  
With the increase of fine-grained parallelism in high-performance computing, as well as the introduction of parallelism in workstations, there is an acute need for a good overview and understanding of  ...  Node level heterogeneous architectures have become attractive during the last decade for several reasons: compared to traditional symmetric CPUs, they offer high peak performance and are energy and/or  ...  ACKNOWLEDGEMENTS The authors would like to thank Gernot Ziegler at NVIDIA Corporation, Knut-Andreas Lie and Johan Seland at SINTEF ICT, and Praveen Bhaniramka and Gaurav Garg at Visualization Experts Limited for  ... 
doi:10.1155/2010/540159 fatcat:xu4n5ubgfzh3bobd445cmg7qyu