Gauge-SURF descriptors

Pablo F. Alcantarilla, Luis M. Bergasa, Andrew J. Davison
2013 Image and Vision Computing  
Abstract

In this paper, we present a novel family of multiscale local feature descriptors, a theoretically and intuitively well justified variant of SURF which is straightforward to implement but which is nevertheless capable of demonstrably better performance with comparable computational cost. Our family of descriptors, called Gauge-SURF (G-SURF), is based on second-order multiscale gauge derivatives. While the standard derivatives used to build a SURF descriptor are all relative to a single chosen orientation, gauge derivatives are evaluated relative to the gradient direction at every pixel. Like standard SURF descriptors, G-SURF descriptors are fast to compute due to the use of integral images, but have extra matching robustness due to the extra invariance offered by gauge derivatives. We present extensive experimental image matching results on the Mikolajczyk and Schmid dataset which show the clear advantages of our family of descriptors over descriptors based on first-order local derivatives, such as SURF, Modified-SURF (M-SURF) and SIFT, in both standard and upright forms. In addition, we show experimental results on large-scale 3D Structure from Motion (SfM) and visual categorization applications.

1. Introduction

Given two images of the same scene, image matching is the problem of establishing correspondence, and it is a core component of all sorts of computer vision systems, particularly in classic problems such as Structure from Motion (SfM) [1], visual categorization [2] or object recognition [3]. There has been a wealth of work in particular on matching image keypoints, and the key advances have been in multiscale feature detectors and invariant descriptors which permit robust matching even under significant changes in viewing conditions.

We have studied the use of gauge coordinates [4] for image matching and SfM applications and incorporated them into a Speeded-Up Robust Features (SURF) [5] descriptor framework to produce a family of descriptors of different dimensions which we call Gauge-SURF (G-SURF) descriptors. With gauge coordinates, every pixel in the image is described in such a way that if two pixels have the same 2D local structure, the description of that structure is always the same, even if the image is rotated. This is possible because multiscale gauge derivatives are rotation and translation invariant. In addition, gauge derivatives play a key role in the formulation of non-linear diffusion processes, as will be explained in Section 3.1. By using gauge derivatives, we can make blurring locally adaptive to the image itself, without affecting image details.

The G-SURF descriptors are closely related to non-linear diffusion [6, 7] processes in image processing and computer vision. In the typical Gaussian scale-space [8] framework, details are blurred during evolution (i.e. the convolution of the original image with Gaussian kernels of increasing standard deviation). The advantage of blurring is the removal of noise, but relevant image structures like edges are blurred as well and drift away from their original locations during evolution. A good solution is therefore to make the blurring locally adaptive to the image, so that noise is blurred while details and edges are retained. Instead of local first-order spatial derivatives, G-SURF descriptors measure per-pixel information about image blurring and edge or detail enhancement, resulting in more discriminative descriptors.
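As a concrete illustration of locally adaptive blurring, the following minimal sketch runs Perona-Malik non-linear diffusion, a classical scheme in the family cited above [6]. The conductivity constant K, the step size dt and the iteration count are illustrative choices of ours, not parameters taken from the paper.

    import numpy as np

    def perona_malik(image, n_iter=20, K=0.05, dt=0.2):
        """Illustrative Perona-Malik non-linear diffusion.

        Smoothing is modulated by a conductivity g(.) that is close
        to 1 in flat regions (noise gets blurred) and close to 0 near
        strong edges (details are preserved).
        """
        L = image.astype(np.float64)  # K assumes intensities in [0, 1]
        g = lambda d: np.exp(-(d / K) ** 2)  # conductivity
        for _ in range(n_iter):
            # Differences to the four neighbours (np.roll wraps at the
            # borders, which is acceptable for a sketch).
            dN = np.roll(L, -1, axis=0) - L
            dS = np.roll(L, 1, axis=0) - L
            dE = np.roll(L, -1, axis=1) - L
            dW = np.roll(L, 1, axis=1) - L
            # Explicit update; dt <= 0.25 keeps the scheme stable.
            L += dt * (g(dN) * dN + g(dS) * dS + g(dE) * dE + g(dW) * dW)
        return L

In contrast to Gaussian blurring, which diffuses uniformly everywhere, this evolution slows down across strong gradients, which is the behaviour the G-SURF descriptors aim to exploit.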
We have obtained notable results in an extensive image matching evaluation using the standard evaluation framework of Mikolajczyk and Schmid [9]. In addition, we have tested our family of descriptors on large-scale 3D SfM datasets [10] and in visual categorization experiments [2] with satisfactory results. Our results show that G-SURF descriptors match or outperform state-of-the-art methods in accuracy while exhibiting low computational demands, making them suitable for real-time applications.

We are interested in robust multiscale feature descriptors that can reliably match two images in real time for visual odometry [11] and large-scale 3D SfM [10] applications. Image matching is a difficult task in these settings due to the large motion between frames and the high variability of camera movements. For this purpose, we need descriptors that are fast to compute and at the same time exhibit high performance.

In addition, we have created an open-source library called OpenGSURF that contains the full family of G-SURF descriptors, and we plan to make it publicly available. This family comprises several descriptors of different dimensions based on second-order multiscale gauge derivatives. Depending on the application, some descriptors may be preferred over others: for real-time applications a low-dimensional descriptor is preferable, whereas for image-matching applications involving severe image transformations one can expect higher recall from high-dimensional descriptors. To the best of our knowledge, this is the first open-source library that allows the user to choose between descriptors of different dimensions. Current open-source descriptor libraries [12, 13] only provide implementations of the standard SURF and Scale Invariant Feature Transform (SIFT) [14] descriptors at their default dimensions (64 and 128, respectively). This can be a limitation and a computational bottleneck for real-time applications that do not need those default descriptor dimensions.

The rest of the paper is organized as follows: related work is described in Section 2. Gauge coordinates are introduced in Section 3, and the importance of gauge derivatives in non-linear diffusion schemes is reviewed in Section 3.1. We then briefly discuss SURF-based descriptors in Section 4. The overall framework of our family of descriptors is explained in Section 5. Finally, we show extensive experimental results in image matching, large-scale 3D SfM and visual categorization applications in Section 6.

2. Related Work

The highly influential SIFT [14] features have been widely used in applications from mobile robotics to object recognition, but they are relatively expensive to compute and are not suitable for some applications with real-time demands. Inspired by SIFT, Bay et al. [5] proposed SURF features, which define both a detector and a descriptor. SURF features exhibit better results than previous schemes with respect to repeatability, distinctiveness and robustness, but at the same time can be computed much faster thanks to the use of integral images [15]. Recently, Agrawal et al. [16] proposed modifications of SURF in both the detection and description steps. They introduced Center Surround Extremas (CenSurE) features and showed that these outperform previous detectors and have better computational characteristics for real-time applications. Their variant of the SURF descriptor, Modified-SURF (M-SURF), efficiently handles the descriptor boundary problem and uses a more intelligent two-stage Gaussian weighting scheme, in contrast to the original implementation, which uses a single Gaussian weighting step.
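Since the speed of all these SURF-style methods rests on integral images [15], a minimal sketch of the trick may be helpful: after a single cumulative-sum pass over the image, the sum over any axis-aligned rectangle, and hence any box-filter response regardless of its size, costs only four array lookups. The function names below are ours, for illustration only.

    import numpy as np

    def integral_image(image):
        """One pass of cumulative sums over rows and columns."""
        return image.astype(np.float64).cumsum(axis=0).cumsum(axis=1)

    def box_sum(ii, r0, c0, r1, c1):
        """Sum of image[r0:r1, c0:c1] via the integral image `ii`,
        in at most four lookups, independent of the box size."""
        total = ii[r1 - 1, c1 - 1]
        if r0 > 0:
            total -= ii[r0 - 1, c1 - 1]
        if c0 > 0:
            total -= ii[r1 - 1, c0 - 1]
        if r0 > 0 and c0 > 0:
            total += ii[r0 - 1, c0 - 1]
        return total

This constant-cost evaluation is what lets SURF, and the G-SURF family after it, evaluate filter responses at any scale without growing the per-feature cost.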
All the mentioned approaches rely on the Gaussian scale-space [8] framework to extract features at different scales: an original image is blurred by convolution with Gaussian kernels of successively larger standard deviation to identify features at increasingly large scales. The main drawback of the Gaussian kernel and its set of partial derivatives is that both interesting details and noise are blurred away to the same degree. For feature description it seems more appropriate to make blurring locally adaptive to the image data, so that noise is blurred while details and edges remain unaffected. In this way, we can increase distinctiveness when describing an image region at different scale levels. In spirit, non-linear diffusion shares some similarities with the geometric blur proposed by Berg and Malik [17], where the amount of Gaussian blurring is proportional to the distance from the point of interest.

The use of local invariants has previously been studied in the literature. In [18], Schmid and Mohr used the family of local invariants known as the local jet [19] for image matching applications. Their descriptor vector contained 8 invariants up to third order for every point of interest in the image. This work represented a step forward over previous invariant recognition schemes [20]. In [9], Mikolajczyk and Schmid compared the performance of the local jet (with invariants up to third order) against other descriptors such as steerable filters [21], image moments [22] and SIFT. In their experiments the local jet exhibited poor performance compared to SIFT. We hypothesize that this poor performance is due to the fixed settings used in the experiments, such as a fixed image patch size and a fixed Gaussian derivative scale. In addition, invariants of high order are more sensitive to geometric and photometric distortions than first-order methods. In [23], the local jet was again used for matching applications, and the authors showed that even a descriptor vector of dimension 6 can outperform SIFT for small perspective changes. By suitable scaling and normalization, the authors obtained invariance to spatial zooming and intensity scaling. Although these results were encouraging, a more detailed comparison with other descriptors would have been desirable. Nevertheless, this work motivated us to incorporate gauge invariants into the SURF descriptor framework.

Brown et al. [10] proposed a framework for learning discriminative local dense image descriptors from training data. The training data was obtained from large-scale real 3D SfM scenarios, and accurate ground truth correspondences were generated by means of multi-view stereo matching techniques.

... features of different sizes. Raw image derivatives can only be computed in terms of the Cartesian coordinate frame x and y, so in order to obtain gauge derivatives every pixel must first be assigned a local coordinate frame aligned with its gradient direction.
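As a sketch of that change of frame, the snippet below evaluates the two second-order gauge derivatives Lww and Lvv at a single scale from Gaussian-smoothed Cartesian derivatives, using the standard identities Lww = (Lx^2 Lxx + 2 Lx Ly Lxy + Ly^2 Lyy) / (Lx^2 + Ly^2) and Lvv = (Ly^2 Lxx - 2 Lx Ly Lxy + Lx^2 Lyy) / (Lx^2 + Ly^2). The scale sigma and the regularizer eps are illustrative choices of ours, not the paper's parameters.

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def gauge_derivatives(image, sigma=2.0, eps=1e-12):
        """Second-order gauge derivatives Lww, Lvv at scale sigma.

        The gauge frame (v, w) is fixed per pixel: w points along the
        image gradient and v is perpendicular to it, which is what
        makes Lww and Lvv invariant to image rotation.
        """
        L = image.astype(np.float64)
        # Gaussian derivatives in the Cartesian frame; axis 0 is y,
        # axis 1 is x, so order=(0, 1) differentiates along x.
        Lx = gaussian_filter(L, sigma, order=(0, 1))
        Ly = gaussian_filter(L, sigma, order=(1, 0))
        Lxx = gaussian_filter(L, sigma, order=(0, 2))
        Lyy = gaussian_filter(L, sigma, order=(2, 0))
        Lxy = gaussian_filter(L, sigma, order=(1, 1))
        grad2 = Lx**2 + Ly**2 + eps  # regularizes flat regions
        # Rotate the second-order structure into the gauge frame.
        Lww = (Lx**2 * Lxx + 2 * Lx * Ly * Lxy + Ly**2 * Lyy) / grad2
        Lvv = (Ly**2 * Lxx - 2 * Lx * Ly * Lxy + Lx**2 * Lyy) / grad2
        return Lww, Lvv

A quick sanity check is that Lww + Lvv recovers the Laplacian Lxx + Lyy. Per-pixel responses of this kind are what the descriptor framework of Section 5 accumulates in place of the first-order responses used by SURF.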
doi:10.1016/j.imavis.2012.11.001