Parallel Deblocking Filtering in MPEG-4 AVC/H.264 on Massively Parallel Architectures

Bart Pieters, Charles-Frederik J. Hollemeersch, Jan De Cock, Peter Lambert, Wesley De Neve, Rik Van de Walle
2011, IEEE Transactions on Circuits and Systems for Video Technology (Print)
The deblocking filter in the MPEG-4 AVC/H.264 standard is computationally complex because of its high content adaptivity, resulting in a significant number of data dependencies. These data dependencies interfere with parallel filtering of multiple macroblocks on massively-parallel architectures. In this paper, we introduce a novel macroblock partitioning scheme for concurrent deblocking in the MPEG-4 AVC/H.264 standard, based on our idea of Deblocking Filter Independency, a corrected version of the Limited Error Propagation Effect proposed in the literature. Our proposed scheme enables concurrent macroblock deblocking of luma samples with limited synchronization effort, independently of slice configuration, and is compliant with the MPEG-4 H.264/AVC standard. We implemented the method on the massively-parallel architecture of the Graphics Processing Unit (GPU). Experimental results show that our GPU implementation achieves faster-than real-time deblocking at 1309 frames per second for 1080p video pictures. Both software-based deblocking filters and state-of-the-art GPU-enabled algorithms are outperformed in terms of speed by factors up to 10.2 and 19.5 respectively for 1080p video pictures.

Index Terms: deblocking, GPU, MPEG-4 AVC/H.264, in-loop filtering, massively-parallel

I. INTRODUCTION

The in-loop deblocking filter in the MPEG-4 AVC/H.264 video coding standard [1] is designed to reduce blocking artifacts caused by quantization. The filter is highly content-adaptive, resulting in increased filter efficiency, but also in increased computational complexity [2]. This computational complexity is mainly due to the conditional processing of block edges and the interdependencies of successive filtering steps. Edge filtering modifies samples by complex filters using up to five taps. These can occur over slice and macroblock boundaries, introducing dependencies between filtered edges which interfere with parallel execution. Therefore, most deblocking algorithms proposed in the literature are aimed at pipelined [3] or serial [4]-[6] processing of macroblocks.
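
To make the "conditional processing of block edges" concrete, the following is a minimal sketch of the content-adaptive decision for one line of samples across a block edge. It is a simplified illustration and not the implementation described in this paper. It covers only the normal-strength update of the two samples adjacent to the edge and omits the strong filter and the updates of samples further from the edge. The function name `filter_edge_normal` is hypothetical, and the thresholds `alpha`, `beta`, and `tc` are passed in directly with illustrative values, whereas the standard derives them from the quantization parameter.

```python
def clip3(lo, hi, x):
    """Clamp x to the inclusive range [lo, hi]."""
    return max(lo, min(hi, x))


def filter_edge_normal(p1, p0, q0, q1, alpha, beta, tc):
    """Simplified normal-strength filtering of one sample line across an edge.

    p1, p0 lie on one side of the edge and q0, q1 on the other, with p0 and
    q0 adjacent to the edge. alpha, beta, and tc are assumed to be given;
    they are illustrative parameters here, not table lookups.
    Returns the (possibly unchanged) pair (p0, q0) for 8-bit samples.
    """
    # Content-adaptive decision: filter only when the step across the edge
    # and the gradients on both sides are small, so that a real image edge
    # is not mistaken for a blocking artifact.
    is_artifact = (
        abs(p0 - q0) < alpha
        and abs(p1 - p0) < beta
        and abs(q1 - q0) < beta
    )
    if not is_artifact:
        return p0, q0

    # Correction term, clipped to [-tc, tc] to limit the change per sample.
    delta = clip3(-tc, tc, (((q0 - p0) << 2) + (p1 - q1) + 4) >> 3)
    return clip3(0, 255, p0 + delta), clip3(0, 255, q0 - delta)


# A small step across the edge is treated as a blocking artifact and smoothed.
print(filter_edge_normal(100, 100, 108, 108, alpha=20, beta=6, tc=2))  # (102, 106)

# A large step is treated as real image content and left untouched.
print(filter_edge_normal(100, 100, 180, 180, alpha=20, beta=6, tc=2))  # (100, 180)
```

Because the outcome for each edge depends on the sample values themselves, and a filtered edge can change samples that a neighbouring edge later reads, the order in which edges are processed matters. This is the kind of dependency that complicates parallel execution.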
doi:10.1109/tcsvt.2011.2105553 fatcat:5n5fi54rx5expmleo3b3uklaz4