Fully Convolutional CaptionNet: Siamese Difference Captioning Attention Model

Ariyo Oluwasanmi, Enoch Frimpong, Muhammad Umar Aftab, Edward Y. Baagyere, Zhiguang Qin, Kifayat Ullah
2019 IEEE Access  
The generation of textual descriptions of the differences between images is a relatively new task that requires fusing computer vision and natural language techniques. In this paper, we present a novel Fully Convolutional CaptionNet (FCC) that employs an encoder-decoder framework to extract visual features, compute feature distances, and generate sentences describing the measured differences. After the image features are extracted, a contrastive function is used to compute their weighted L1 distance, which is learned and selectively attended to determine salient sections of the feature map at every time step. The attended feature region is matched to corresponding words iteratively until a sentence is completed. We propose applying an upsampling network to enlarge the features' field of view, which provides a robust pixel-based discrepancy computation. Our extensive experiments indicate that the FCC model outperforms other learning models on the benchmark Spot-the-Diff dataset by generating succinct and meaningful textual descriptions of image differences.

INDEX TERMS: Image captioning, deep learning, Siamese network, recurrent neural network, convolutional neural network, attention, fully convolutional networks.
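To make the core idea concrete, here is a minimal NumPy sketch of the Siamese difference step the abstract outlines: two feature maps from a shared encoder are compared via a weighted L1 distance, and a spatial softmax attends to the most salient difference regions to produce a context vector for one decoding step. The function name, the shapes, and the use of a fixed channel-weight vector `w` are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def weighted_l1_attention(feat_a, feat_b, w):
    """Sketch of a Siamese difference + attention step (assumed shapes).

    feat_a, feat_b: (H, W, C) feature maps from a shared (Siamese) encoder.
    w: (C,) channel weights, standing in for learned parameters.
    Returns the spatial attention map and the attended difference vector.
    """
    # Weighted L1 distance at every spatial location.
    diff = w * np.abs(feat_a - feat_b)            # (H, W, C)

    # Collapse channels into a per-location salience score.
    scores = diff.sum(axis=-1)                    # (H, W)

    # Spatial softmax: attention weights over all locations.
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                          # sums to 1

    # Attention-weighted sum of the difference features:
    # one context vector that a decoder could condition on per time step.
    context = (alpha[..., None] * diff).sum(axis=(0, 1))  # (C,)
    return alpha, context
```

In the full model, `w` would be learned, the attention would be recomputed at every word-generation step conditioned on the decoder state, and the feature maps would first be enlarged by the upsampling network; this sketch only shows the distance-and-attend computation in isolation.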
doi:10.1109/access.2019.2957513