Memory Performance Optimizations For Real-Time Software HDTV Decoding

Han Chen, Kai Li, Bin Wei
2005 Journal of VLSI Signal Processing Systems for Signal, Image and Video Technology  
Pure software HDTV video decoding is still a challenging task on entry-level to mid-range desktop and notebook PCs, even with today's microprocessor frequencies measured in GHz. This paper shows that the performance bottleneck in a software MPEG-2 decoder has shifted to memory operations, as microprocessor technologies, including multimedia instruction extensions, have been improving at a fast rate in recent years. Our study exploits concurrencies at macroblock level to
... iate the performance bottleneck in a software MPEG-2 decoder. First, the paper introduces an interleaved block-order data layout to improve CPU cache performance. Second, the paper describes an algorithm to explicitly prefetch macroblocks for motion compensation. Finally, the paper presents an algorithm to schedule interleaved decoding and output at macroblock level. Our implementation and experiments show that these methods can effectively hide the latency of memory and frame buffer. The optimizations improve the performance of a multimedia-instruction-optimized software MPEG-2 decoder by a factor of about two. On a PC with a 933 MHz Pentium III CPU, the decoder can decode and display 1280 × 720-resolution HDTV streams at over 62 frames per second.

... has been improving at a much slower rate than the microprocessor during the past decades, the performance bottleneck of a software decoder has now been shifted to memory operations.

To understand the extent of the problem, we analyzed the distribution of Cycles-Per-Instruction (CPI) [6] of a software MPEG-2 decoder optimized by extensive use of MultiMedia eXtension (MMX) and Streaming SIMD Extensions (SSE) instructions [7]. We found that the stalling of memory operations increases the CPI significantly in memory-intensive functions. On a PC with a 933 MHz Pentium III CPU, the average CPI of motion compensation is 1.81 and that of display is 10.57. These are roughly 3 times and more than 18 times the average CPI of 0.57 for the computation-intensive IDCT functions.

Our approach to solving the memory performance bottleneck problem is to exploit the concurrency between the CPU and the memory sub-system in a modern computer. We first introduce a new frame buffer layout, called Interleaved Block-Order (IBO), for the software MPEG-2 decoder to improve the CPU's cache performance.
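The excerpt names the IBO layout but does not define it, so the following is only a sketch of one plausible reading, not the authors' definition. It assumes 4:2:0 video in which each macroblock's six 8 × 8 blocks (four luma, then Cb, then Cr) are stored back to back, and macroblocks follow one another in raster order. The function names, the block ordering, and the constants are all hypothetical. The point it illustrates is that, under this assumption, one macroblock occupies 384 contiguous bytes, whereas in a plain raster layout vertically adjacent samples of a 1280-pixel-wide frame sit 1280 bytes apart.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed geometry: 4:2:0 sampling, one byte per sample. */
enum {
    MB_DIM      = 16,            /* luma macroblock is 16 x 16 samples   */
    BLOCK_DIM   = 8,             /* each block is 8 x 8 samples          */
    BLOCK_BYTES = 64,            /* 8 * 8 bytes per block                */
    MB_BYTES    = 6 * 64         /* Y0 Y1 Y2 Y3 Cb Cr stored together    */
};

/* Byte offset of luma sample (x, y) in the ASSUMED block-order layout. */
size_t ibo_luma_offset(unsigned x, unsigned y, unsigned width_in_mbs)
{
    size_t mb_index = (size_t)(y / MB_DIM) * width_in_mbs + (x / MB_DIM);
    unsigned in_x = x % MB_DIM;
    unsigned in_y = y % MB_DIM;
    /* Luma blocks inside a macroblock: 0 = top-left, 1 = top-right,
       2 = bottom-left, 3 = bottom-right. */
    unsigned block = (in_y / BLOCK_DIM) * 2 + (in_x / BLOCK_DIM);
    return mb_index * MB_BYTES
         + (size_t)block * BLOCK_BYTES
         + (in_y % BLOCK_DIM) * BLOCK_DIM
         + (in_x % BLOCK_DIM);
}

/* Byte offset of chroma sample (x, y), given in chroma-plane coordinates.
   plane 0 selects Cb, plane 1 selects Cr (hypothetical convention). */
size_t ibo_chroma_offset(unsigned x, unsigned y, unsigned width_in_mbs,
                         int plane)
{
    assert(plane == 0 || plane == 1);
    size_t mb_index = (size_t)(y / BLOCK_DIM) * width_in_mbs + (x / BLOCK_DIM);
    return mb_index * MB_BYTES
         + (size_t)(4 + plane) * BLOCK_BYTES
         + (y % BLOCK_DIM) * BLOCK_DIM
         + (x % BLOCK_DIM);
}
```

For a 1280 × 720 frame, width_in_mbs would be 80. Under these assumptions the luma sample at (16, 0) starts the second macroblock at byte 384, and the Cb block of the first macroblock begins at byte 256.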
We then describe an algorithm to explicitly prefetch macroblocks for motion compensation. Finally, we present an algorithm to schedule interleaved decoding and output at macroblock level.

We implemented our proposed methods on a PC platform that has a 933 MHz Pentium III CPU. Our tests with several DVD and HDTV streams show that the optimizations improve the performance of a software decoder already extensively optimized with multimedia instructions by another factor of two. Our optimizations successfully reduce the CPIs of memory-intensive functions. The CPI of the motion compensation functions is reduced to 0.7 and the CPI of the display function is reduced to 1.07. As a result, the improved software decoder decodes and displays 720p (1280 × 720) format HDTV streams at over 62 frames per second.
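The prefetching and interleaving ideas described above can be illustrated with a toy loop. This is a rough sketch under stated assumptions, not the paper's algorithm: the excerpt gives no details, so the sketch assumes macroblocks stored as 384 contiguous bytes, whole-macroblock motion vectors given as an index array, a 32-byte prefetch stride, and that the frame being output is an already finished one. All names are hypothetical, reconstruct_mb is a stand-in for real motion compensation, and __builtin_prefetch is the GCC/Clang builtin (an SSE decoder might use a prefetch intrinsic instead).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum {
    MB_BYTES        = 384,  /* assumed: one macroblock stored contiguously */
    PREFETCH_STRIDE = 32    /* assumed cache-line size in bytes            */
};

/* Hint the cache to load one reference macroblock (read access). */
static void prefetch_mb(const uint8_t *mb)
{
    for (size_t off = 0; off < MB_BYTES; off += PREFETCH_STRIDE)
        __builtin_prefetch(mb + off, 0, 3);
}

/* Toy stand-in for motion compensation: prediction plus residual,
   with wrapping byte arithmetic instead of proper saturation. */
static void reconstruct_mb(uint8_t *dst, const uint8_t *ref,
                           const uint8_t *residual)
{
    for (size_t i = 0; i < MB_BYTES; i++)
        dst[i] = (uint8_t)(ref[i] + residual[i]);
}

/* For each macroblock i:
     1. prefetch the reference macroblock needed by macroblock i + 1,
     2. reconstruct macroblock i of the current frame,
     3. copy macroblock i of a finished frame to the display buffer.
   ref_index[i] is the macroblock in ref_frame that macroblock i
   predicts from (a simplification of real motion vectors). */
void decode_frame_interleaved(uint8_t *cur, const uint8_t *ref_frame,
                              const uint8_t *residual,
                              const size_t *ref_index, size_t mb_count,
                              uint8_t *display, const uint8_t *finished)
{
    assert(mb_count == 0 ||
           (cur && ref_frame && residual && ref_index && display && finished));

    if (mb_count > 0)
        prefetch_mb(ref_frame + ref_index[0] * MB_BYTES);

    for (size_t i = 0; i < mb_count; i++) {
        if (i + 1 < mb_count)
            prefetch_mb(ref_frame + ref_index[i + 1] * MB_BYTES);

        reconstruct_mb(cur + i * MB_BYTES,
                       ref_frame + ref_index[i] * MB_BYTES,
                       residual + i * MB_BYTES);

        memcpy(display + i * MB_BYTES, finished + i * MB_BYTES, MB_BYTES);
    }
}
```

The intent of this arrangement is that the memory fetch for the next macroblock and the write to the display buffer overlap with computation on the current macroblock, rather than stalling the CPU one after another. How far ahead to prefetch and how much output to interleave per macroblock are tuning choices that this sketch does not attempt to reproduce.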
doi:10.1007/s11265-005-6650-7 fatcat:ak3cwkplijc35k3ez6oukibfta