Acceleration techniques and evaluation on multi-core CPU, GPU and FPGA for image processing and super-resolution

Georgios Georgis, George Lentaris, Dionysios Reisis
2016 Journal of Real-Time Image Processing  
The FPGA design leads to a scalable architecture performing four (4x) times faster than the real-time on low-end Xilinx Virtex 5 devices and sixty-nine times (69x) faster than the real-time on the Virtex  ...  The proposed techniques accelerate GPU reconstruction of Ultra-High Definition content, by achieving three (3x) times faster than the real-time performance on mid-range and previous generation devices  ...  Table 6: Quality performance of the SIL-SEABI implementations on CPU, GPU and FPGA platforms.  ... 
doi:10.1007/s11554-016-0619-6 fatcat:3xkd4eex3bexdgjw4p7sjbbmbe

Mapping a data-flow programming model onto heterogeneous platforms

Alina Sbîrlea, Yi Zou, Zoran Budimlić, Jason Cong, Vivek Sarkar
2012 Proceedings of the 13th ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, Tools and Theory for Embedded Systems - LCTES '12  
We demonstrate a working example that maps a pipeline of medical image-processing algorithms onto a prototype heterogeneous platform that includes CPUs, GPUs and FPGAs.  ...  In this paper we explore mapping of a high-level macro data-flow programming model called Concurrent Collections (CnC) onto heterogeneous platforms in order to achieve high performance and low energy consumption  ...  Acknowledgments We thank the Center for Domain Specific Computing (NSF Expeditions in Computing Award CCF-0926127) that funded this work.  ... 
doi:10.1145/2248418.2248428 dblp:conf/lctrts/SbirleaZBCS12 fatcat:pt3s2jlcibehho65hstsw65ahm

Energy-efficient FPGA Implementation of the k-Nearest Neighbors Algorithm Using OpenCL

Fahad Muslim, Alexandros Demian, Liang Ma, Luciano Lavagno, Affaq Qamar
2016 Position Papers of the 2016 Federated Conference on Computer Science and Information Systems  
High-level Synthesis (HLS) simplifies FPGA programming by allowing designers to program FPGAs in several high-level languages e.g. C/C++, OpenCL and SystemC.  ...  Furthermore, using an FPGA-specific OpenCL coding style and providing appropriate HLS directives can yield an FPGA implementation comparable to a GPU also in terms of execution time.  ...  This work is also supported in part by the European Commission through the ECOSCALE project (H2020-ICT-671632).  ... 
doi:10.15439/2016f327 dblp:conf/fedcsis/MuslimDMLQ16 fatcat:c7gspjezb5ek3dx2hmudvkedkm

Enabling development of OpenCL applications on FPGA platforms

Kavya Shagrithaya, Krzysztof Kepa, Peter Athanas
2013 2013 IEEE 24th International Conference on Application-Specific Systems, Architectures and Processors  
The increased development time, level of experience needed by the developers, lower turns per day and difficulty involved in faster iterations over designs affect the time-to-market for many solutions.  ...  The flow uses Xilinx AutoESL tool to obtain the design specification for compute cores. An architecture provided integrates the cores with memory and host interfaces.  ...  The compute devices in a platform can be CPU, GPU, DSP, FPGA or any other accelerator.  ... 
doi:10.1109/asap.2013.6567546 dblp:conf/asap/ShagrithayaKA13 fatcat:5cb6mpbe35htjax7vn5skzbrwa

A Scalable Runtime for the ECOSCALE Heterogeneous Exascale Hardware Platform

Paul Harvey, Konstantin Bakanov, Ivor Spence, Dimitrios S. Nikolopoulos
2016 Proceedings of the 6th International Workshop on Runtime and Operating Systems for Supercomputers - ROSS '16  
This position paper presents the design of a new runtime for a new heterogeneous hardware platform being developed to explore energy efficient, high performance computing.  ...  In particular, this work explores the use of FPGAs to achieve both the power and performance goals of exascale, as well as utilising the runtime to automatically effect dynamic configuration and reconfiguration  ...  An accelerator may be a CPU, GPU, FPGA, or co-processor such as the Xeon Phi [10].  ... 
doi:10.1145/2931088.2931090 dblp:conf/hpdc/HarveyBSN16 fatcat:cr5mxbiwpncynfl3kx6fpx2o7a


Henry Wong, Hong Wang, Anne Bracy, Ethan Schuchman, Tor M. Aamodt, Jamison D. Collins, Perry H. Wang, Gautham Chinya, Ankur Khandelwal Groen, Hong Jiang
2008 Proceedings of the 17th international conference on Parallel architectures and compilation techniques - PACT '08  
Pangaea is a heterogeneous CMP design for non-rendering workloads that integrates IA32 CPU cores with non-IA32 GPU-class multicores, extending the current state-of-the-art CPU-GPU integration that physically  ...  We implement Pangaea and the current CPU-GPU designs in fully-functional synthesizable RTL based on the production quality RTL of an IA32 CPU and an Intel GMA X4500 GPU.  ...  Henry Wong and Tor Aamodt are partly supported by the Natural Sciences and Engineering Research Council of Canada.  ... 
doi:10.1145/1454115.1454125 dblp:conf/IEEEpact/WongBSACWCGJW08 fatcat:p37zbpaobza7pngzkxogk37fyy

Towards facilities for modeling and synthesis of architectures for resource allocation problem in systems engineering

Stephen Creff, Jérôme Le Noir, Eric Lenormand, Sébastien Madelénat
2020 Proceedings of the 24th ACM Conference on Systems and Software Product Line: Volume A  
Exploring architectural design space is often beyond human capacity and makes architectural design a difficult task.  ...  More specifically, this work reports on the use of the Clafer modeling language and its gateway to the CSP Choco Solver, on an industrial case study of heterogeneous hardware resource allocation (GPP-GPGPU-FPGA  ...  This work discusses a possible approach to compute allocation schemes for hardware platforms with CPUs, GPUs and FPGAs nodes.  ... 
doi:10.1145/3382025.3414963 dblp:conf/splc/CreffNLM20 fatcat:tmwmzafabfadlo32ygqytsemha

Optimizing CNN-based Hyperspectral Image Classification on FPGAs [article]

Shuanglong Liu, Ringo S.W. Chu, Xiwei Wang, Wayne Luk
2019 arXiv   pre-print
Besides, previous CNN models used in HSI are not specially designed for efficient implementation on embedded devices such as FPGAs.  ...  A customized architecture which enables the proposed algorithm to be mapped effectively onto FPGA resources is then proposed to support real-time on-board classification with low power consumption.  ...  Besides, we propose and optimize the hardware architecture to accelerate our proposed network in FPGA by parallel processing, data pre-fetching and design space exploration.  ... 
arXiv:1906.11834v1 fatcat:arcbhexooja6hhmm4j5z4sgbei

Programming Heterogeneous Systems from an Image Processing DSL [article]

Jing Pu, Steven Bell, Xuan Yang, Jeff Setter, Stephen Richardson, Jonathan Ragan-Kelley, Mark Horowitz
2016 arXiv   pre-print
Using its FPGA with two low-power ARM cores, our design achieves up to 6x higher performance and 8x lower energy compared to the quad-core ARM CPU on an NVIDIA Tegra K1, and 3.5x higher performance with  ...  We address this problem by extending the image processing language, Halide, so users can specify which portions of their applications should become hardware accelerators, and then we provide a compiler  ...  Lime [2] goes a step further by providing a unified language for CPU, GPU, and FPGA, with semantics to delineate boundaries between computation blocks.  ... 
arXiv:1610.09405v1 fatcat:p2qq2gcifnez7mtrswcl2h2vfy

Comparing performance, productivity and scalability of the TILT overlay processor to OpenCL HLS

Rafat Rashid, J. Gregory Steffan, Vaughn Betz
2014 2014 International Conference on Field-Programmable Technology (FPT)  
FPGA.  ...  The time required for initial hardware compilation of these TILT designs and configuration of the target application onto the overlay is roughly comparable to the compile times of the OpenCL HLS designs  ...  The host offloads the parallel compute intensive second portion defined within kernels onto accelerator(s) such as CPUs, GPUs and recently FPGAs [7].  ... 
doi:10.1109/fpt.2014.7082748 dblp:conf/fpt/RashidSB14 fatcat:4xo72zlg6fgh7nj4v63pc2yqe4

Apps with Hardware: Enabling Run-time Architectural Customization in Smart Phones

Michael Coughlin, Ali Ismail, Eric Keller
2016 USENIX Annual Technical Conference  
We present our prototype smart phone using the Zedboard, which pairs a Xilinx Zynq FPGA with an embedded Cortex A9, running an Android-based system which we extended to provide run-time system support  ...  We introduce a novel mechanism to enable sharing the FPGA in a practical manner by leveraging the unique deployment model of mobile applications, namely that deployment is via an app store, where we introduce  ...  This research was supported in part by NSF SaTC grant number 1406192.  ... 
dblp:conf/usenix/CoughlinIK16 fatcat:l2327ch37vgppj6hhgbaq3dyyy

Efficient Machine Learning, Compilers, and Optimizations for Embedded Systems [article]

Xiaofan Zhang, Yao Chen, Cong Hao, Sitao Huang, Yuhong Li, Deming Chen
2022 arXiv   pre-print
Challenges also come from the diverse application-specific requirements, including real-time responses, high-throughput performance, and reliable inference accuracy.  ...  Deep Neural Networks (DNNs) have achieved great success in a massive number of artificial intelligence (AI) applications by delivering high-quality computer vision, natural language processing, and virtual  ...  There is a great amount of hardware-aware work, each of which often adopts a specific hardware device (CPU, GPU, embedded/mobile device) and requires a different hardware-cost metric (e.g., prioritizes  ... 
arXiv:2206.03326v1 fatcat:th66tbqxibez7hmctl2ytdiroa

Parallel Programming Models for Heterogeneous Many-Cores : A Survey [article]

Jianbin Fang, Chun Huang, Tao Tang, Zheng Wang
2020 arXiv   pre-print
While heterogeneous many-core design offers the potential for energy-efficient high-performance, such potential can only be unlocked if the application programs are suitably parallel and can be made to  ...  Intel has been developing oneAPI that includes DPC++ (an implementation of SYCL with extensions) for its CPUs, GPUs and FPGAs [27].  ...  Recently, Intel has turned to implementing OpenCL for its CPUs, GPUs and FPGAs, and made its partial implementation open to the public [22].  ... 
arXiv:2005.04094v1 fatcat:e2psrdnyajh3hih3znnjjbezae

Dynamic SIMD Parallel Execution on GPU from High-Level Dataflow Synthesis

Aurelien Bloch, Simone Casale-Brunet, Marco Mattavelli
2022 Journal of Low Power Electronics and Applications  
Nonetheless, such a design method might not be enough on its own to achieve the desired performance goals, and supporting tools are useful to be able to efficiently explore the design space so as to optimize  ...  Developing and fine-tuning software programs for heterogeneous hardware such as CPU/GPU processing platforms comprise a highly complex endeavor that demands considerable time and effort of software engineers  ... 
doi:10.3390/jlpea12030040 fatcat:2rhk5lszrrcxxdvgaolihqc57m

HybridDNN: A Framework for High-Performance Hybrid DNN Accelerator Design and Implementation [article]

Hanchen Ye, Xiaofan Zhang, Zhize Huang, Gengsheng Chen, Deming Chen
2020 arXiv   pre-print
Novel techniques include a highly flexible and scalable architecture with a hybrid Spatial/Winograd convolution (CONV) Processing Engine (PE), a comprehensive design space exploration tool, and a complete  ...  Experimental results show that the accelerators generated by HybridDNN can deliver 3375.7 and 83.3 GOPS on a high-end FPGA (VU9P) and an embedded FPGA (PYNQ-Z1), respectively, which achieve a 1.8x higher  ...  ACKNOWLEDGMENTS This work is supported in part by the IBM-Illinois Center for Cognitive Computing Systems Research (C3SR) and  ... 
arXiv:2004.03804v1 fatcat:2r7ymftbordw5odrfndowsuxg4