Efficient Mapping of Streaming Applications for Image Processing on Graphics Cards [chapter]

Richard Membarth, Hritam Dutta, Frank Hannig, Jürgen Teich
2019 Advances in Biochemical Engineering/Biotechnology  
In the last decade, there has been a dramatic growth in research and development of massively parallel commodity graphics hardware in both academia and industry. Graphics card architectures provide an optimal platform for the parallel execution of many number-crunching loop programs from fields like image processing or linear algebra. However, it is hard to map such algorithms efficiently to the graphics hardware, even with detailed insight into the architecture. This paper presents a ... image processing algorithm and shows the efficient mapping of this type of algorithm to graphics hardware, as well as double buffering concepts to hide memory transfers. Furthermore, the impact of the execution configuration is illustrated, and a method is proposed to determine the best configuration offline. Using CUDA as the programming model, it is demonstrated that the image processing algorithm is significantly accelerated and that a speedup of more than 145× can be achieved on NVIDIA's Tesla C1060 compared to a parallelized implementation on a Xeon Quad Core. For deployment in a streaming application with continuously arriving data, it is shown that the memory transfer overhead to the graphics card is reduced by a factor of six using double buffering.

Since the hardware configuration varies between GPUs, the best block configuration changes as well. Therefore, we propose a method that always uses the best configuration for the GPU at run-time. We explore the configuration space for each graphics card model offline and store the result in a database. Later, at run-time, the program identifies the model of the GPU and uses the configuration retrieved from the database. In this way, there is no overhead at run-time and no penalty when a different GPU is used. In addition, the binary code size can be kept nearly as small as the original binary size.

[Figure: Double Buffering Support. Labels: application, mapping, final application, double buffering support (device driver), communication support, computation kernel.]
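The offline exploration and run-time lookup described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `explore_offline`, `build_database`, `lookup_config`, and `toy_cost`, the device name `gpu-model-a`, and the fallback configuration are all hypothetical, and the cost function is a stand-in for timing a real kernel on real hardware.

```python
from typing import Callable, Dict, Iterable, Tuple

# A block configuration is modelled as (threads in x, threads in y).
BlockConfig = Tuple[int, int]


def explore_offline(candidates: Iterable[BlockConfig],
                    measure: Callable[[BlockConfig], float]) -> BlockConfig:
    """Offline step: return the candidate with the lowest measured cost."""
    return min(candidates, key=measure)


def build_database(models: Dict[str, Callable[[BlockConfig], float]],
                   candidates: Iterable[BlockConfig]) -> Dict[str, BlockConfig]:
    """Run the exploration once per GPU model and store the winner by name."""
    candidate_list = list(candidates)
    return {name: explore_offline(candidate_list, measure)
            for name, measure in models.items()}


def lookup_config(db: Dict[str, BlockConfig], device_name: str,
                  fallback: BlockConfig = (16, 16)) -> BlockConfig:
    """Run-time step: a single dictionary lookup, no measurement.

    The fallback for unknown models is an assumption of this sketch.
    """
    return db.get(device_name, fallback)


def toy_cost(cfg: BlockConfig) -> float:
    """Hypothetical cost model; a real exploration would time the kernel."""
    threads = cfg[0] * cfg[1]
    return abs(threads - 128) + cfg[1]


candidates = [(16, 16), (32, 4), (64, 2), (128, 1)]
database = build_database({"gpu-model-a": toy_cost}, candidates)
```

At run-time the program would only call `lookup_config(database, name)` with the detected device name, so the exploration cost is paid once, offline. In a CUDA program the name could be taken from the `name` field filled in by `cudaGetDeviceProperties`.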
doi:10.1007/978-3-662-58834-5_1 fatcat:tz3azu6gb5hj5mysy2whv7grui