A copy of this work was available on the public web and has been preserved in the Wayback Machine. The capture dates from 2020; you can also visit the original URL.
The file type is application/pdf
.
Dense Image Representation with Spatial Pyramid VLAD Coding of CNN for Locally Robust Captioning
[article]
2016
arXiv
pre-print
The workflow of extracting features from images using convolutional neural networks (CNN) and generating captions with recurrent neural networks (RNN) has become a de-facto standard for image captioning task. However, since CNN features are originally designed for classification task, it is mostly concerned with the main conspicuous element of the image, and often fails to correctly convey information on local, secondary elements. We propose to incorporate coding with vector of locally
arXiv:1603.09046v1
fatcat:sasstlz7jfcpdgfheasgelos2y