Visual saliency prediction based on deep learning

Bashir Ghariba
2020
The Human Visual System (HVS) has the ability to focus on specific parts of a scene rather than the whole image. Eye movement is likewise one of the primary mechanisms we use in daily life to understand our surroundings. This phenomenon is one of the most active research topics in computer vision and neuroscience. The success of neural network methods across a wide variety of tasks has highlighted their ability to predict visual saliency.
In particular, deep learning models have been used for visual saliency prediction.

In this thesis, a deep learning method based on a transfer learning strategy is proposed (Chapter 2), in which visual features extracted by the convolutional layers from raw images are used to predict visual saliency (i.e., a saliency map). Specifically, the proposed model uses the VGG-16 network (a pre-trained CNN model) for semantic segmentation. The proposed model is applied to several datasets, including TORONTO, MIT300, MIT1003, and DUT-OMRON, to illustrate its efficiency, and its results are compared quantitatively and qualitatively with classic and state-of-the-art deep learning models.

In Chapter 3, I investigate the performance of five state-of-the-art deep neural networks (VGG-16, ResNet-50, Xception, InceptionResNet-v2, and MobileNet-v2) for the task of visual saliency prediction. The five models were trained on the SALICON dataset and used to predict saliency maps on four standard datasets, namely TORONTO, MIT300, MIT1003, and DUT-OMRON. The results indicate that ResNet-50 outperforms the other four models and produces saliency maps very close to human performance.

In Chapter 4, a novel deep learning model based on a Fully Convolutional Network (FCN) architecture is proposed. The model is trained end-to-end and designed to predict visual saliency. It is based on an encoder-decoder structure and includes two types of modules. The first has three [...]
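To make the transfer-learning and encoder-decoder ideas above concrete, the following is a minimal sketch, not the thesis implementation: a pre-trained backbone is frozen and used as the encoder, and a small learned decoder upsamples its features into a single-channel saliency map. It assumes TensorFlow/Keras; the decoder layout, the loss, and the function name build_saliency_model are illustrative choices, not taken from the thesis.

import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16, ResNet50, MobileNetV2

def build_saliency_model(backbone_cls=VGG16, input_shape=(224, 224, 3)):
    # Encoder: ImageNet-pre-trained convolutional backbone, frozen so that
    # only the decoder is trained (the transfer-learning strategy).
    backbone = backbone_cls(include_top=False, weights="imagenet",
                            input_shape=input_shape)
    backbone.trainable = False

    # Decoder: five 2x upsampling stages take the 7x7 backbone output back
    # to the 224x224 input resolution; a 1x1 conv with sigmoid yields a
    # saliency map with values in [0, 1].
    x = backbone.output
    for filters in (256, 128, 64, 32, 16):
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.UpSampling2D(2)(x)
    saliency = layers.Conv2D(1, 1, activation="sigmoid",
                             name="saliency_map")(x)
    return models.Model(backbone.input, saliency)

# Swapping the backbone class reproduces the kind of comparison run in
# Chapter 3 (e.g., VGG-16 vs. ResNet-50 vs. MobileNet-v2), with training
# driven by ground-truth fixation maps such as those in SALICON.
model = build_saliency_model(ResNet50)
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()

Under these assumptions, the same decoder works across the listed backbones because each reduces a 224x224 input to a 7x7 feature map; backbones with different output strides would need a matching number of upsampling stages.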
doi:10.48336/bzj8-n176