Stereoscopic video quality assessment based on 3D convolutional neural networks

Jiachen Yang, Yinghao Zhu, Chaofan Ma, Wen Lu, Qinggang Meng
2018 Neurocomputing  
Keywords: 3D convolutional neural networks Stereoscopic video quality assessment Quality score fusion a b s t r a c t The research of stereoscopic video quality assessment (SVQA) plays an important role for promoting the development of stereoscopic video system. Existing SVQA metrics rely on hand-crafted features, which is inaccurate and time-consuming because of the diversity and complexity of stereoscopic video distortion. This paper introduces a 3D convolutional neural networks (CNN) based
more » ... QA framework that can model not only local spatio-temporal information but also global temporal information with cubic difference video patches as input. First, instead of using hand-crafted features, we design a 3D CNN architecture to automatically and effectively capture local spatio-temporal features. Then we employ a quality score fusion strategy considering global temporal clues to obtain final video-level predicted score. Extensive experiments conducted on two public stereoscopic video quality datasets show that the proposed method correlates highly with human perception and outperforms state-of-the-art methods by a large margin. We also show that our 3D CNN features have more desirable property for SVQA than hand-crafted features in previous methods, and our 3D CNN features together with support vector regression (SVR) can further boost the performance. In addition, with no complex preprocessing and GPU acceleration, our proposed method is demonstrated computationally efficient and easy to use. (W. Lu). Unfortunately, considering that reference video is unavailable in most practical applications, only the NR methods have potential to satisfy the actual requirement. As a result, our work focuses on more appealing and challenging NR methods, and tries to propose a new general-purpose NR framework for SVQA. Considering the development of NR metrics [1,2] , we can find that most of them have similar frameworks, which can be generally divided into two steps: (1) extracts features that can reflect visual quality based on relatively perceptual models; (2) maps the obtained feature vectors to subjective quality scores by learning regression models. But in real-world scenarios, because of the diversity and complexity of video distortions, it is very difficult to identify what features are sensitive and robust to all sorts of distortions. Therefore, simply utilizing a group of artificially designed features to represent video quality results in inaccurate assessments and high computational costs. Recently, it is a well-known fact that deep learning models, especially convolutional neural networks (CNN), have achieved great success in many challenge computer vision tasks, such as image classification [3, 4] , object detection [5, 6] , video classification [7, 8] and video action recognition [9, 10] . CNN is a biologically inspired architecture consisting of a stack of convolutional layers
doi:10.1016/j.neucom.2018.04.072 fatcat:2axcnt7gd5d3no2urbvudgpec4