[Paper] Visual Instance Retrieval with Deep Convolutional Networks

Ali S. Razavian, Josephine Sullivan, Stefan Carlsson, Atsuto Maki
2016 ITE Transactions on Media Technology and Applications  
This work presents simple pipelines for visual instance retrieval that exploit image representations based on convolutional networks (ConvNets), and demonstrates that ConvNet image representations outperform other state-of-the-art image representations on six standard image retrieval datasets. ConvNet-based image features have increasingly permeated the field of computer vision and are replacing hand-crafted features in many established application domains. Much recent work has illuminated how to design and train ConvNets to maximize performance (Simonyan et al., 2013; Girshick et al., 2014; Zeiler & Fergus, 2014; Azizpour et al., 2014) and how to exploit learned ConvNet representations to solve visual recognition tasks (Oquab et al., 2014; Donahue et al., 2014; Fischer et al., 2014; Razavian et al., 2014). We have built on these findings to tackle visual instance retrieval.

Besides performance, another issue for visual instance retrieval is the dimensionality and memory requirement of the image representation. Two categories are usually considered, and we report results for both: small footprint representations, which encode each image with less than 1 kB, and medium footprint representations, whose dimensionality lies between 10k and 100k. The small regime is required when the number of images is huge and memory is a bottleneck, while the medium regime is more useful when the number of images is below 50k. In our pipeline for the small regime we extract features from 576×576 images; for the medium regime we combine those features with the spatial search method described in the paper. Furthermore, inspired by the recent work of Chatfield et al. (2014), we also report results for a tiny representation (Torralba et al., 2008; Jégou et al., 2012; Arandjelović & Zisserman, 2013; Jégou & Zisserman, 2014). We define a tiny image representation as one that takes 32 bytes or less to store and is learnt independently of the test dataset. Such a compressed representation would allow large-scale searches to be completed on mobile phones (Panda et al., 2013) or in the cloud (Quack et al., 2004).

RESULTS SUMMARY

To evaluate our approach, we used two networks. The first, which we refer to as AlexNet (Krizhevsky et al., 2012), is the publicly available network implemented in Caffe. The second, which we call OxfordNet, has the same structure as the network of Simonyan & Zisserman (2015), except that the network we trained has 256 kernels in the final convolutional layers as opposed to the 512 kernels of the Oxford paper. Among the available alternatives, we use the response of the last convolutional layer followed by a max-pooling operation as the basic representation for small-regime retrieval. For the tiny representation we quantize this basic representation, and for the medium regime we follow Azizpour et al. (2014) but optimize the parameters. The details of our pipeline are presented in Table 1 (we extract square patches of L different sizes in the search).
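As a concrete illustration of the basic small-regime representation (the max-pooled response of the last convolutional layer), here is a minimal sketch in Python. It assumes a recent torchvision and uses its pretrained AlexNet as a stand-in for the paper's Caffe model; the function name small_descriptor and the final L2-normalization step are illustrative additions, not details confirmed by the paper.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Stand-in for the paper's Caffe AlexNet: torchvision's pretrained AlexNet.
model = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1).eval()
conv_layers = model.features  # everything up to the last conv layer's output

preprocess = T.Compose([
    T.Resize((576, 576)),  # the paper's small-regime input size
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def small_descriptor(img: Image.Image) -> torch.Tensor:
    """Max-pool the last convolutional feature map over its spatial grid."""
    x = preprocess(img).unsqueeze(0)          # (1, 3, 576, 576)
    with torch.no_grad():
        fmap = conv_layers(x)                 # (1, C, H, W)
    v = fmap.amax(dim=(2, 3)).squeeze(0)      # global spatial max -> (C,)
    return torch.nn.functional.normalize(v, dim=0)  # L2-normalize (assumed)

query = small_descriptor(Image.open("query.jpg").convert("RGB"))
```

The descriptor length equals the number of kernels in the last convolutional layer, which is why the 256-kernel OxfordNet variant yields a compact 256-D vector.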
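The tiny representation quantizes this descriptor down to 32 bytes or less. The summary does not spell out the quantizer, so the sketch below uses simple sign binarization (1 bit per dimension, so a 256-D descriptor packs into exactly 32 bytes) purely as an illustrative assumption; matching then reduces to Hamming distance.

```python
import numpy as np

def tiny_descriptor(v) -> np.ndarray:
    """Binarize a 256-D descriptor: 1 bit per dimension -> 32 bytes.

    Thresholding at the descriptor mean is an illustrative choice; the
    paper's actual quantization scheme may differ.
    """
    v = np.asarray(v, dtype=np.float32)
    bits = (v > v.mean()).astype(np.uint8)
    return np.packbits(bits)   # 256 bits packed into 32 uint8 values

def hamming(a: np.ndarray, b: np.ndarray) -> int:
    """Hamming distance between two packed binary codes."""
    return int(np.unpackbits(a ^ b).sum())
```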
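For the medium regime, the spatial search extracts descriptors from square patches of L different sizes and cross-matches them between images. The grid layout and matching rule below (an l×l tiling per scale, and averaging each query patch's minimum distance to the reference patches) are plausible placeholders rather than the paper's exact, optimized settings; small_descriptor is the function from the first sketch.

```python
import torch
from PIL import Image

def patch_descriptors(path: str, num_scales: int = 3) -> torch.Tensor:
    """Descriptors of square patches at num_scales sizes (the paper's L).

    Scale l tiles a central square of the image with an l x l grid of
    square patches; the paper's exact patch layout may differ.
    """
    img = Image.open(path).convert("RGB")
    side = min(img.size)
    descs = []
    for l in range(1, num_scales + 1):
        p = side // l
        for i in range(l):
            for j in range(l):
                patch = img.crop((j * p, i * p, (j + 1) * p, (i + 1) * p))
                descs.append(small_descriptor(patch))
    return torch.stack(descs)   # (num_patches, C)

def image_distance(q: torch.Tensor, r: torch.Tensor) -> torch.Tensor:
    """Average over query patches of the minimum L2 distance to any
    reference patch -- one plausible cross-matching rule."""
    d = torch.cdist(q, r)       # (n_q, n_r) pairwise distances
    return d.min(dim=1).values.mean()
```

With a 256-D patch descriptor, adding scales quickly grows the stored representation toward the 10k-100k range the paper assigns to the medium regime.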
doi:10.3169/mta.4.251