Automatically scheduling halide image processing pipelines

Ravi Teja Mullapudi, Andrew Adams, Dillon Sharlet, Jonathan Ragan-Kelley, Kayvon Fatahalian
2016 ACM Transactions on Graphics  
Halide programmers need only provide a high-level strategy for mapping an image processing pipeline to a parallel machine (a schedule), and the Halide compiler carries out the mechanical task of generating  ...  Unfortunately, designing high-performance schedules for complex image processing pipelines requires substantial knowledge of modern hardware architecture and code-optimization techniques.  ...  Parallelism Lab (supported by Oracle, AMD, Intel, and NVIDIA), and by equipment donations from NVIDIA.  ... 
doi:10.1145/2897824.2925952 fatcat:nr2o5mqsmncutdyxiuzrnont5e

A Stream Processing Framework for On-Line Optimization of Performance and Energy Efficiency on Heterogeneous Systems

Benjamin Ranft, Oliver Denninger, Philip Pfaffe
2014 IEEE International Parallel & Distributed Processing Symposium Workshops
Scheduling is automatically adapted on-line to continuously optimize performance and energy efficiency.  ...  Scheduling should thus consider the performance of each processor as well as competing workloads and varying inputs.  ...  For that reason and due to the merely moderate results of pipelining in [2] and [8] , Pipeline is the only strategy not to implement automatic performance and energy optimization in the class itself  ... 
doi:10.1109/ipdpsw.2014.119 dblp:conf/ipps/RanftDP14 fatcat:7yca7spnsbbprmk67o4yvmgcxe

CPU-GPU heterogeneous implementations of depth-based foreground detection

Younchang Choi, Jinseong Kim, Jaehak Kim, Yongwha Chung, Daihee Park, Sungju Lee
2018 IEICE Electronics Express  
the relative performance between the CPU and GPU.  ...  ) by balancing the total workload between CPU and GPU.  ...  Then, we applied the pipeline scheduling strategy by determining a computing device for each task and balancing the total execution times of CPU and GPU simultaneously.  ... 
doi:10.1587/elex.15.20170950 fatcat:hier7uoqrzhkdhyh6njkbzhf3i

Run-time Adaptation to Heterogeneous Processing Units for Real-time Stereo Vision

Benjamin Ranft, Oliver Denninger
2012 IEEE 14th International Conference on High Performance Computing and Communication & 2012 IEEE 9th International Conference on Embedded Software and Systems
On this basis, we develop and implement further strategies for heterogeneous systems and automatic adaptation to the hardware available at run-time.  ...  Each approach is described concerning i. a. the propagation of data to processors and its relation to established methods.  ...  Stereo vision is the process of recovering 3D structure from images of two side-by-side cameras, making it an important basis for environmental perception.  ... 
doi:10.1109/hpcc.2012.232 dblp:conf/hpcc/RanftD12 fatcat:ezz5rn7e3jb67nc247nujjruve

A Distributed Framework for Low-Latency OpenVX over the RDMA NoC of a Clustered Manycore

Julien Hascoët, Benoît Dupont de Dinechin, Karol Desnos, Jean-François Nezan
2018 IEEE High Performance extreme Computing Conference (HPEC)
OpenVX is a standard proposed by the Khronos group for cross-platform acceleration of computer vision and deep learning applications.  ...  OpenVX abstracts the target processor architecture complexity and automates the implementation of processing pipelines through high-level optimizations.  ...  By contrast to OpenCV, the Khronos OpenVX standard [3] proposes a graph-based approach for the structured design of computer vision pipelines, where images flow as arcs between nodes, and nodes correspond  ... 
doi:10.1109/hpec.2018.8547736 dblp:conf/hpec/HascoetDDN18 fatcat:ri7iejxc2ffmhk3sntlilwxniq

Halide

Jonathan Ragan-Kelley, Andrew Adams, Dillon Sharlet, Connelly Barnes, Sylvain Paris, Marc Levoy, Saman Amarasinghe, Frédo Durand
2017 Communications of the ACM  
Its model is simple enough to do so often in only a few lines of code, and small changes generate efficient implementations for x86, ARM, Graphics Processors (GPUs), and specialized image processors, all  ...  We propose a new programming language for image processing pipelines, called Halide, that separates the algorithm from its schedule.  ...  Most notably, Zalman Stern, Steven Johnson, and Patricia Suriana are full-time developers on the project at Google and are responsible for a large amount of the current code.  ... 
doi:10.1145/3150211 fatcat:4vhmxunjofam7daaaeiw5ssc7a

Comparing Energy Efficiency of CPU, GPU and FPGA Implementations for Vision Kernels

Murad Qasaimeh, Kristof Denolf, Jack Lo, Kees Vissers, Joseph Zambreno, Phillip H. Jones
2019 IEEE International Conference on Embedded Software and Systems (ICESS)
While for more complicated kernels and complete vision pipelines, the FPGA outperforms the others with energy/frame reduction ratios of 1.2-22.3×.  ...  The Visionworks library applies many optimization techniques to boost performance, such as buffer reuse, kernel fusion, efficient use of streaming and CUDA textures, automatic scheduling across processing  ... 
doi:10.1109/icess.2019.8782524 dblp:conf/icess/QasaimehDLVZJ19 fatcat:s2bfurzoi5cn3j523kmhk3ozcy

Rethinking Training from Scratch for Object Detection [article]

Yang Li, Hong Zhang, Yu Zhang
2021 arXiv   pre-print
Specifically, we propose a new training pipeline for object detection that follows 'pre-training and fine-tuning', utilizing low resolution images within target dataset to pre-training detector then load  ...  Under this situation, we discover that the widely adopted large resizing strategy e.g. resize image to (1333, 800) is important for fine-tuning but it's not necessary for pre-training.  ...  data diminishes the value of ImageNet pre-training. 3 Methodology Our aim is to setup a fast training pipeline for object detection.  ... 
arXiv:2106.03112v1 fatcat:tigh77gq2jbovjbci2r5cgbi6e

Programming Heterogeneous Systems from an Image Processing DSL [article]

Jing Pu, Steven Bell, Xuan Yang, Jeff Setter, Stephen Richardson, Jonathan Ragan-Kelley, Mark Horowitz
2016 arXiv   pre-print
Specialized image processing accelerators are necessary to deliver the performance and energy efficiency required by important applications in computer vision, computational photography, and augmented  ...  that uses this code to automatically create the accelerator along with the "glue" code needed for the user's application to access this hardware.  ...  C HLS tools raise the design level by decoupling clock timing and automatically scheduling pipelines and other resources.  ... 
arXiv:1610.09405v1 fatcat:p2qq2gcifnez7mtrswcl2h2vfy

A Unified Optimization Approach for CNN Model Inference on Integrated GPUs

Leyuan Wang, Zhi Chen, Yizhi Liu, Yao Wang, Lianmin Zheng, Mu Li, Yida Wang
2019 Proceedings of the 48th International Conference on Parallel Processing - ICPP 2019  
ACKNOWLEDGMENT The authors thank the anonymous reviewers of the paper for valuable comments.  ...  The authors are also grateful to Frank Chen and Long Gao for providing devices for experiments, and Tianqi Chen for technical assistance. The entire work was done at AWS.  ...  As an overview of the pipeline, we wrote an optimized schedule template (Section 3.2.2), and then used AutoTVM [6] as well as graph tuner [26] to search the best schedules for different workloads (  ... 
doi:10.1145/3337821.3337839 dblp:conf/icpp/WangCLWZLW19 fatcat:ptvsneujwjdmhesvcrune7rqwy

Extending Halide to Improve Software Development for Imaging DSPs

Sander Vocke, Henk Corporaal, Roel Jordans, Rosilde Corvino, Rick Nas
2017 ACM Transactions on Architecture and Code Optimization (TACO)  
Recent advances in automatic scheduling may enable programmers to write efficient imaging code for any CPU, GPU or DSP target with Halide support, without detailed knowledge about any of these architectures  ... 
doi:10.1145/3106343 fatcat:lhwzjx4levafvbbxxl4mrzn2o4

Halide

Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, Saman Amarasinghe
2013 Proceedings of the 34th ACM SIGPLAN conference on Programming language design and implementation - PLDI '13  
Image processing pipelines combine the challenges of stencil computations and stream programs.  ...  We present a systematic model of the tradeoff space fundamental to stencil pipelines, a schedule representation which describes concrete points in this space for each stage in an image processing pipeline  ...  Acknowledgments Eric Chan provided feedback and inspiration throughout the design of Halide, and helped compare our local Laplacian filters implementation to his in Camera Raw.  ... 
doi:10.1145/2491956.2462176 dblp:conf/pldi/Ragan-KelleyBAPDA13 fatcat:tr3fzvh5arbbbo4nn2iqpivdaa

Halide

Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, Saman Amarasinghe
2013 SIGPLAN notices  
Image processing pipelines combine the challenges of stencil computations and stream programs.  ...  We present a systematic model of the tradeoff space fundamental to stencil pipelines, a schedule representation which describes concrete points in this space for each stage in an image processing pipeline  ...  Acknowledgments Eric Chan provided feedback and inspiration throughout the design of Halide, and helped compare our local Laplacian filters implementation to his in Camera Raw.  ... 
doi:10.1145/2499370.2462176 fatcat:afs2mud2unentdmcazyg2qhiqq

Scalable analysis of Big pathology image data cohorts using efficient methods and high-performance computing strategies

Tahsin Kurc, Xin Qi, Daihou Wang, Fusheng Wang, George Teodoro, Lee Cooper, Michael Nalisnik, Lin Yang, Joel Saltz, David J. Foran
2015 BMC Bioinformatics  
Conclusions: Our work demonstrates efficient CBIR algorithms and high performance computing can be leveraged for efficient analysis of large microscopy images to meet the challenges of clinically salient  ...  Results: The proposed tools and methods take advantage of state-of-the-art parallel machines and efficient content-based image searching strategies.  ...  Contract OCI-0910735, and the Nautilus system at the University of Tennessee's Center for Remote Data Analysis and Visualization supported by NSF Award ARRA-NSF-OCI-0906324.  ... 
doi:10.1186/s12859-015-0831-6 pmid:26627175 pmcid:PMC4667532 fatcat:lnnhszkk4vhi7d3yoi74t37e2m

Whale: Efficient Giant Model Training over Heterogeneous GPUs [article]

Xianyan Jia, Le Jiang, Ang Wang, Wencong Xiao, Ziji Shi, Jie Zhang, Xinyuan Li, Langshi Chen, Yong Li, Zhen Zheng, Xiaoyong Liu, Wei Lin
2022 arXiv   pre-print
We present Whale, a general and efficient distributed training framework for giant models.  ...  The Whale runtime utilizes those annotations and performs graph optimizations to transform a local deep learning DAG graph for distributed multi-GPU execution.  ...  We would also like to thank the M6 team and all users of Whale for their help and suggestions.  ... 
arXiv:2011.09208v3 fatcat:zetefqb6o5htlhgp7gajhubxuy