Programming Heterogeneous Systems from an Image Processing DSL [article]

Jing Pu, Steven Bell, Xuan Yang, Jeff Setter, Stephen Richardson, Jonathan Ragan-Kelley, Mark Horowitz
2016 arXiv   pre-print
Starting with Halide not only provides a very high-level functional description of the hardware, but also allows our compiler to generate the complete software program including the sequential part of  ...  We address this problem by extending the image processing language, Halide, so users can specify which portions of their applications should become hardware accelerators, and then we provide a compiler  ...  that pulls accelerator tasks directly from a memory buffer without CPU intervention, as opposed to having a background thread pushing tasks to the accelerator.  ... 
arXiv:1610.09405v1 fatcat:p2qq2gcifnez7mtrswcl2h2vfy

Extending Halide to Improve Software Development for Imaging DSPs

Sander Vocke, Henk Corporaal, Roel Jordans, Rosilde Corvino, Rick Nas
2017 ACM Transactions on Architecture and Code Optimization (TACO)  
I propose a set of extensions and modifications to Halide in order to support DSPs in combination with arbitrary C compilers, including a template solution to support diverse target instruction sets and  ...  heterogeneous scratchpad memories.  ...  This enables programmers (and potentially autoschedulers) to push the performance of multiple programs beyond that of the individual versions.  ... 
doi:10.1145/3106343 fatcat:lhwzjx4levafvbbxxl4mrzn2o4

Pushing the Level of Abstraction of Digital System Design: a Survey on How to Program FPGAs

Emanuele Del Sozzo, Davide Conficconi, Alberto Zeni, Mirko Salaris, Donatella Sciuto, Marco D. Santambrogio
2022 ACM Computing Surveys  
They are state-of-the-art for prototyping, telecommunications, embedded, and an emerging alternative for cloud-scale acceleration.  ...  We review these abstraction solutions, provide a timeline, and propose a taxonomy for each abstraction trend: programming models for HDLs; IP-based or System-based toolchains for HLS; application, architecture  ...  ACKNOWLEDGEMENTS The authors are grateful for feedbacks from Reviewers and NECSTLab members, with a particular mention to A. Damiani, A. Parravicini, E. D'Arnese, F. Carloni, F. Peverelli, and R.  ... 
doi:10.1145/3532989 fatcat:nsk5lwvt3vba5fbxmaj7sgpwru

TVM: An Automated End-to-End Optimizing Compiler for Deep Learning [article]

Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Meghan Cowan, Haichen Shen, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, Arvind Krishnamurthy
2018 arXiv   pre-print
We also demonstrate TVM's ability to target new accelerator back-ends, such as the FPGA-based generic deep learning accelerator.  ...  TVM solves optimization challenges specific to deep learning, such as high-level operator fusion, mapping to arbitrary hardware primitives, and memory latency hiding.  ...  This work was supported in part by a Google PhD Fellowship for Tianqi Chen, ONR award #N00014-16-1-2795, NSF under grants CCF-1518703, CNS-1614717, and CCF-1723352, and gifts from Intel (under the CAPA program  ... 
arXiv:1802.04799v3 fatcat:e6htzyqaqjhpnm3yyi6xl3mdoq

Survey and design of paleozoic: a high-performance compiler tool chain for deep learning inference accelerator

Zihan Liu, Jingwen Leng, Guandong Lu, Chenhui Wang, Quan Chen, Minyi Guo
2020 CCF Transactions on High Performance Computing  
However, different vendors adopt different accelerator architectures, making it challenging for the compiler tool-chain to generate and optimize high-performance codes.  ...  Moreover, the current tool-chains provided by the vendors are either highly abstract, which makes it hard to optimize or contain too many hardware-related details, which makes it inconvenient to program  ...  This work was supported by National Key R&D Program of China (2019YFF0302600) and the National Natural Science Foundation of China (NSFC) Grant (61702328 and 61832006).  ... 
doi:10.1007/s42514-020-00044-7 fatcat:75s4i3vq5fcfzdct6hgw2wg2va

TC-CIM: Empowering Tensor Comprehensions for Computation in Memory

Andi Drebes, Lorenzo Chelini, Oleksandr Zinenko, Albert Cohen, Henk Corporaal, Tobias Grosser, Kanishkan Vadivel, Nicolas Vasilache
2020 Zenodo  
We demonstrate the programmability of memristor-based accelerators with TC-CIM, a fully-automatic, end-to-end compilation flow from Tensor Comprehensions, a mathematical notation for tensor operations,  ...  Our results show that TC-CIM reliably recognizes tensor operations commonly used in ML workloads across multiple benchmarks in order to offload these operations to the accelerator.  ...  agreement, id. 780215 and the NeMeCo grant agreement, id. 676240 as well as through Polly Labs (Xilinx Inc, Facebook Inc, and ARM Holdings) and the Swiss National Science Foundation through the Ambizione program  ... 
doi:10.5281/zenodo.3736308 fatcat:k4yrzeo4jjdwvdrizc77poxblq

Decoupling algorithms from schedules for easy optimization of image processing pipelines

Jonathan Ragan-Kelley, Andrew Adams, Sylvain Paris, Marc Levoy, Saman Amarasinghe, Frédo Durand
2012 ACM Transactions on Graphics  
Our compiler targets SIMD units, multiple cores, and complex memory hierarchies.  ...  We demonstrate the power of this representation by expressing a range of recent image processing applications in an embedded domain specific language called Halide, and compiling them for ARM, x86, and  ...  A chain of Halide functions can be JIT compiled and used immediately, or it can be compiled to an object file and header to be used by some other program (which need not link against Halide).  ... 
doi:10.1145/2185520.2185528 fatcat:mdfbhjwc5zatfoq4mt6p3j7m7a

Decoupling algorithms from schedules for easy optimization of image processing pipelines

Jonathan Ragan-Kelley, Andrew Adams, Sylvain Paris, Marc Levoy, Saman Amarasinghe, Frédo Durand
2012 ACM Transactions on Graphics  
Our compiler targets SIMD units, multiple cores, and complex memory hierarchies.  ...  We demonstrate the power of this representation by expressing a range of recent image processing applications in an embedded domain specific language called Halide, and compiling them for ARM, x86, and  ...  A chain of Halide functions can be JIT compiled and used immediately, or it can be compiled to an object file and header to be used by some other program (which need not link against Halide).  ... 
doi:10.1145/2185520.2335383 fatcat:smf4xdhjpzhn3ggl2qyilol22y

Stateful Dataflow Multigraphs: A Data-Centric Model for Performance Portability on Heterogeneous Architectures [article]

Tal Ben-Nun, Johannes de Fine Licht, Alexandros Nikolaos Ziogas, Timo Schneider, Torsten Hoefler
2020 arXiv   pre-print
The ubiquity of accelerators in high-performance computing has driven programming complexity beyond the skill-set of the average domain scientist.  ...  To maintain performance portability in the future, it is imperative to decouple architecture-specific programming paradigms from the underlying scientific computations.  ...  This information is later used to generate exact memory copy calls to/from accelerators.  ... 
arXiv:1902.10345v3 fatcat:4aerjkgf2fguhlbcbrw7g2uw5e

Frame-based Programming, Stream-Based Processing for Medical Image Processing Applications

Joost Hoozemans, Rob de Jong, Steven van der Vlugt, Jeroen Van Straten, Uttam Kumar Elango, Zaid Al-Ars
2019 Journal of Signal Processing Systems  
Second, we use softcore VLIW processors, that are targetable by a C compiler and have hardware debugging capabilities, to evaluate and debug the software before moving to a High-Level Synthesis flow.  ...  Our proposed platform allows both software developers and hardware designers to test iterations in a matter of seconds (compilation time) instead of hours (synthesis or circuit simulation time).  ...  To facilitate the optimization process and to aid design space exploration for various target execution platforms, the Halide programming language and compiler can be used to generate code from a functional  ... 
doi:10.1007/s11265-018-1422-3 pmid:30873259 pmcid:PMC6390719 fatcat:cmnzbsesdvhwbkqgf46bpdb4my

Relay: A High-Level Compiler for Deep Learning [article]

Jared Roesch, Steven Lyubomirsky, Marisa Kirisame, Logan Weber, Josh Pollock, Luis Vega, Ziheng Jiang, Tianqi Chen, Thierry Moreau, Zachary Tatlock
2019 arXiv   pre-print
We present Relay, a new compiler framework for DL. Relay's functional, statically typed intermediate representation (IR) unifies and generalizes existing DL IRs to express state-of-the-art models.  ...  Our evaluation demonstrates Relay's competitive performance for a broad class of models and devices (CPUs, GPUs, and emerging accelerators).  ...  to the FPGA-based accelerator. (2) Push-button quantization: Relay can take a fp32 model and convert its parameters to int8 in order to enable inference on specialized accelerators. (3) Accelerator-friendly  ... 
arXiv:1904.08368v2 fatcat:edkc2clp6jdu7d6fceuoj7e4g4

LIFT: A functional data-parallel IR for high-performance GPU code generation

Michel Steuwer, Toomas Remmelg, Christophe Dubach
2017 2017 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)  
This paper describes how Lift IR programs are compiled into efficient OpenCL code.  ...  However, compiling high-level programs into efficient low-level parallel code is challenging.  ...  High-level languages such as Accelerate [16], Delite [20], StreamIt [21] or Halide [17] have been proposed to ease programming of GPUs.  ... 
doi:10.1109/cgo.2017.7863730 fatcat:hzy2zqwhvndznb6424rji53y2m

Automatic Systolic Array Generation using Reusable Blocks

Liancheng Jia, Liqiang Lu, Xuechao Wei, Yun Liang
2020 IEEE Micro  
The first DF/DC modules connect to off-chip memory. Since the bandwidth between off-chip memory and the systolic accelerator is often limited, input data is reused multiple times in DFs.  ...  The advantage of DSL over pure HLS is the separation of computation and dataflow definition in DSL programming, which was first proposed by an image processing DSL Halide [6]. Following works extends Halide  ... 
doi:10.1109/mm.2020.2997611 fatcat:tizuuj2q2nf3lntkbvja3dhl7u

Achieving high-performance the functional way: a functional pearl on expressing high-performance optimizations as rewrite strategies

Bastian Hagedorn, Johannes Lenfers, Thomas Kœhler, Xueying Qin, Sergei Gorlatch, Michel Steuwer
2020 Proceedings of the ACM on Programming Languages (PACMPL)  
In some systems such as Halide or TVM, a separate schedule specifies how the program should be optimized. Unfortunately, these schedules are not written in well-defined programming languages.  ...  This results in a portability nightmare that is particularly problematic given the accelerating trend towards specialized hardware devices to further increase efficiency.  ...  programs using Halide.  ... 
doi:10.1145/3408974 fatcat:f72dfpyvpja63nauihpuknf3ue

Tensor Comprehensions: Framework-Agnostic High-Performance Machine Learning Abstractions [article]

Nicolas Vasilache, Oleksandr Zinenko, Theodoros Theodoridis, Priya Goyal, Zachary DeVito, William S. Moses, Sven Verdoolaege, Andrew Adams, Albert Cohen
2018 arXiv   pre-print
Our contributions include (1) a language close to the mathematics of deep learning called Tensor Comprehensions, (2) a polyhedral Just-In-Time compiler to convert a mathematical description of a deep learning  ...  DAG into a CUDA kernel with delegated memory management and synchronization, also providing optimizations such as operator fusion and specialization for specific sizes, (3) a compilation cache populated  ...  Using such an approach, DSL compilers such as Halide [72] for image processing show impressive results.  ... 
arXiv:1802.04730v3 fatcat:2ef5ete4mvao5bz43h7z7dtlwi