An optimized FPGA design of inverse quantization and transform for HEVC decoding blocks and validation in an SW/HW environment

2020 Turkish Journal of Electrical Engineering and Computer Sciences  
This paper presents an optimized hardware architecture of the Inverse Quantization and Inverse Transform (IQ/IT) for a High Efficiency Video Coding (HEVC) decoder. Our highly parallel and pipelined architecture was designed to support all HEVC Transform Unit (TU) sizes: 4x4, 8x8, 16x16 and 32x32. The IQ/IT was described in VHDL and synthesized for a Xilinx XC7Z020 FPGA and for the TSMC 180 nm standard-cell library. In the worst case, the throughput of the hardware architecture reached a processing rate of up to 1080p@33fps at 146 MHz when mapped to the FPGA and 1080p@25fps at 110 MHz when mapped to standard cells. The architecture was validated on the ZC702 platform using a Software/Hardware (SW/HW) environment in order to evaluate different implementation methods (SW and SW/HW) in terms of power consumption and run-time. The experimental results demonstrate that the SW/HW accelerations improve run-time speed by more than 70% relative to the SW solution. In addition, the power consumption of the SW/HW designs was reduced by nearly 60% compared with the SW case.

Key words: HEVC decoder, IDCT, inverse quantization, SW/HW environment, FPGA

1. Introduction

HEVC is one of the new video coding standards developed by the ITU [1] and the ISO/IEC [2], specifically designed to address the limitations of the H.264/AVC standard. Compared with AVC, HEVC [3] [4] provides better coding efficiency, halving the bitrate at the same picture quality at the cost of a threefold increase in coding block complexity. In this paper, we focus on the HEVC decoder layer, which has the same structure as the H.264/AVC decoder with some improvements in each coding step. The different decoding steps are illustrated in Figure 1.

Figure 1. HEVC decoder layer.

The HEVC coding structure replaces the traditional macroblock with a quadtree-based block partition based on the Coding Tree Block (CTB). The latter can be divided into variable units called Large Coding Units (LCUs) of sizes 16x16 to 64x64 instead of 16x16 only. In turn, each LCU can be split recursively into Coding Units (CUs), each of which represents the basic unit. At this level, a decision is made to fix whatever [...] operated at the level of Prediction Units (PUs), the sizes of which vary from 4x4 to 64x64.
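The recursive LCU-to-CU split described above can be sketched as a small quadtree walk. This is only an illustration: the function name `split_cu`, the `should_split` callback, the `min_size` of 8 and the demo rule are assumptions made for the example. In a real decoder the split decisions are read from the bitstream.

```python
def split_cu(x, y, size, should_split, min_size=8):
    """Yield (x, y, size) for every leaf CU of the quadtree rooted at (x, y)."""
    if size > min_size and should_split(x, y, size):
        half = size // 2
        # Visit the four quadrants: top-left, top-right, bottom-left, bottom-right.
        for dy in (0, half):
            for dx in (0, half):
                yield from split_cu(x + dx, y + dy, half, should_split, min_size)
    else:
        yield (x, y, size)


def demo_rule(x, y, size):
    """Hypothetical rule: split the 64x64 root, then only its top-left 32x32 child."""
    return size == 64 or (size == 32 and x == 0 and y == 0)


leaves = list(split_cu(0, 0, 64, demo_rule))
for leaf in leaves:
    print(leaf)
```

With this demo rule the 64x64 block ends up as four 16x16 CUs and three 32x32 CUs, which together cover the original area exactly.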
Depending on the PU partition adopted, inverse quantization and transform (IQ/IT) algorithms are performed at the level of Transform Units (TUs), whose size can be 4x4, 8x8, 16x16 or 32x32. This variable-size approach particularly helps to attain HD video resolutions. In fact, larger TU sizes achieve better energy compaction. Yet, they increase the computational complexity exponentially.

IQ/IT is heavily used in the HEVC decoder [5]: it accounts for 18% of the computational complexity of an HEVC video decoder. Since the standardization of the HEVC codec, researchers have worked continuously to reduce decoder complexity by adopting hardware acceleration as a solution. Martuza et al. [6] present a shared architecture that supports the 8x8 1D-IDCT for both HEVC and H.264/AVC using a new mapping technique. This approach is able to decode 1080p@67fps using CMOS 180 nm technology. Chiang et al. [7] exploit an optimized method for the 2D-IDCT which generates new coefficients from the already calculated coefficients. This architecture not only reduces the occupied area significantly but also meets a throughput of 3840x2160@30fps using CMOS 90 nm technology. Liang et al. [8] present an architecture that supports the 2D integer inverse discrete sine transform (2D-IDST) and the 2D integer inverse discrete cosine transform (2D-IDCT) using two 1D-IDCT/IDST units and a memory block. This architecture calculates 4 residual pixels in parallel in each clock cycle and can decode 7680x4320@30fps. Furthermore, Goebel et al. [9] illustrate an efficient multi-size hardware design for the DCT algorithm that supports all HEVC TU sizes: 4x4, 8x8, 16x16 and 32x32. The synthesis results using a Nangate 45 nm standard-cell library allow a processing rate of 1080p@30fps. In addition, Chen et al. [10] design a 2D inverse transform architecture that supports all TU sizes.
This architecture can compute two rows in parallel during the 1D-IDCT instead of only one. In this case, the maximum throughput achieved is about 4K@54fps on the Xilinx Zynq platform. In [11], Ercan et al. propose an energy reduction technique that lowers the computational complexity of the IDST and IDCT algorithms for all transform core sizes. In the worst case, this architecture can process 4K@48fps. Finally, in [12], Mohamed et al. provide a System-on-Chip FPGA platform based on Xilinx Zynq that integrates the DCT coding block as an accelerator. The proposed design is capable of coding 1080@30fps.
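To make the IQ/IT stage discussed above concrete, the following minimal software sketch dequantizes a 4x4 TU and applies the separable 2D inverse DCT as two 1D passes (columns, then rows). It is not the architecture proposed in this paper or in any of the cited works. The 4x4 matrix, the `levelScale` table and the shift values follow the HEVC specification for the flat scaling case as we understand it. The function names, the default 8-bit depth and the flat scaling factor `m=16` are assumptions made for the example.

```python
# HEVC 4x4 core transform matrix (integer approximation of the DCT basis).
DCT4 = [
    [64,  64,  64,  64],
    [83,  36, -36, -83],
    [64, -64, -64,  64],
    [36, -83,  83, -36],
]
LEVEL_SCALE = [40, 45, 51, 57, 64, 72]  # indexed by QP % 6


def clip16(v):
    """Clip to the 16-bit signed range used for intermediate coefficients."""
    return max(-32768, min(32767, v))


def dequantize(levels, qp, log2_size, bit_depth=8, m=16):
    """Inverse quantization with a flat scaling factor m (assumed, no scaling lists)."""
    bd_shift = bit_depth + log2_size - 5
    scale = (m * LEVEL_SCALE[qp % 6]) << (qp // 6)
    rnd = 1 << (bd_shift - 1)
    return [[clip16((lv * scale + rnd) >> bd_shift) for lv in row] for row in levels]


def idct4x4(coeffs, bit_depth=8):
    """2D inverse transform as two 1D passes: columns first, then rows."""
    n = 4
    tmp = [[0] * n for _ in range(n)]
    for x in range(n):          # first pass: each column, shift by 7
        for y in range(n):
            acc = sum(DCT4[k][y] * coeffs[k][x] for k in range(n))
            tmp[y][x] = clip16((acc + 64) >> 7)
    shift = 20 - bit_depth      # second pass: each row
    out = [[0] * n for _ in range(n)]
    for y in range(n):
        for x in range(n):
            acc = sum(DCT4[k][x] * tmp[y][k] for k in range(n))
            out[y][x] = (acc + (1 << (shift - 1))) >> shift
    return out


# Usage: a TU whose only non-zero level is the DC term.
levels = [[0] * 4 for _ in range(4)]
levels[0][0] = 2
coeffs = dequantize(levels, qp=4, log2_size=2)   # DC coefficient becomes 64
residual = idct4x4(coeffs)                       # every residual sample is 1
print(coeffs[0][0], residual)
```

The two nested passes in `idct4x4` are the part that hardware designs such as those surveyed above parallelize, for example by processing several rows or several output samples per clock cycle.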
doi:10.3906/elk-1910-122 fatcat:zsosqzkchjfubjuw7jmmxwlo6i