Feature extraction using MPEG-CDVS and Deep Learning with application to robotic navigation and image classification

Pedro Porto Buarque De Gusmao, Enrico Magli
Finally, this work would not have been possible without the support received from my friends, my family and from my girlfriend, Elisabetta Bichiri. Thank you so much. Abstract The main contributions of this thesis are the evaluation of MPEG Compact Descriptor for Visual Search in the context of indoor robotic navigation and the introduction of a new method for training Convolutional Neural Networks with applications to object classification. The choice for image descriptor in a visual
more » ... a visual navigation system is not straightforward. Visual descriptors must be distinctive enough to allow for correct localization while still offering low matching complexity and short descriptor size for real-time applications. MPEG Compact Descriptor for Visual Search is a low complexity image descriptor that offers several levels of compromises between descriptor distinctiveness and size. In this work, we describe how these trade-offs can be used for efficient loop-detection in a typical indoor environment. We first describe a probabilistic approach to loop detection based on the standard's suggested similarity metric. We then evaluate the performance of CDVS compression modes in terms of matching speed, feature extraction, and storage requirements and compare them with the state of the art SIFT descriptor for five different types of indoor floors. During the second part of this thesis we focus on the new paradigm to machine learning and computer vision called Deep Learning. Under this paradigm visual features are no longer extracted using fine-grained, highly engineered feature extractor, but rather using a Convolutional Neural Networks (CNN) that extracts hierarchical features learned directly from data at the cost of long training periods. In this context, we propose a method for speeding up the training of Convolutional Neural Networks (CNN) by exploiting the spatial scaling property of convolutions. This is done by first training a pre-train CNN of smaller kernel resolutions for a few epochs, followed by properly rescaling its kernels to the target's original dimensions and continuing training at full resolution. We show that the overall training time of a target CNN architecture can be reduced by exploiting the spatial scaling property of vi convolutions during early stages of learning. Moreover, by rescaling the kernels at different epochs, we identify a trade-off between total training time and maximum obtainable accuracy. Finally, we propose a method for choosing when to rescale kernels and evaluate our approach on recent architectures showing savings in training times of nearly 20% while test set accuracy is preserved.
doi:10.6092/polito/porto/2665943 fatcat:xe6g2rue5fgphehdetemdhviva