Boosting Cross-modal Retrieval with MVSE++ and Reciprocal Neighbors

Wei Wei, Mengmeng Jiang, Xiangnan Zhang, Heng Liu, Chunna Tian
IEEE Access, 2020
In this paper, we propose to boost cross-modal retrieval by mutually aligning images and captions in terms of both features and relationships. First, we propose a multi-feature visual-semantic embedding (MVSE++) space in which to retrieve candidates from the other modality; it represents the visual content of both objects and scene context in images more comprehensively, giving us more potential to find an accurate and detailed caption for an image. However, captioning condenses the image content into a semantic description, so the cross-modal neighboring relationships starting from the visual side and from the semantic side are asymmetric. To retrieve better cross-modal neighbors, we propose to re-rank the initially retrieved candidates according to their k nearest reciprocal neighbors in the MVSE++ space. The method is evaluated on the MSCOCO and Flickr30K benchmark datasets with standard metrics, and achieves higher accuracy in both caption retrieval and image retrieval at R@1 and R@10.

INDEX TERMS Cross-modal retrieval, visual-semantic embedding, scene context, reciprocal neighbors, re-ranking method.
doi:10.1109/access.2020.2992187
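
To make the two-stage pipeline described in the abstract concrete, below is a minimal NumPy sketch, not the paper's implementation: the fusion function fuse_features, the projection weights w_obj and w_scene, and the choice of k are illustrative assumptions, and the actual MVSE++ embedding is learned rather than built from random projections.

import numpy as np

def fuse_features(obj_feat, scene_feat, w_obj, w_scene):
    # Hypothetical multi-feature fusion: project object features and
    # scene-context features into a joint space and sum them. The actual
    # MVSE++ architecture may combine the features differently.
    v = obj_feat @ w_obj + scene_feat @ w_scene
    return v / np.linalg.norm(v, axis=1, keepdims=True)

def top_k(sim, k):
    # Indices of the k highest-similarity columns for each row.
    return np.argsort(-sim, axis=1)[:, :k]

def reciprocal_rerank(img_emb, cap_emb, k=10):
    # L2-normalize so dot products are cosine similarities.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    cap = cap_emb / np.linalg.norm(cap_emb, axis=1, keepdims=True)
    sim = img @ cap.T                       # (n_images, n_captions)

    img_knn = top_k(sim, k)                 # image -> k caption candidates
    cap_knn = top_k(sim.T, k)               # caption -> k image candidates

    reranked = []
    for i, cands in enumerate(img_knn):
        # A caption is a reciprocal neighbor of image i when image i also
        # appears among that caption's own k nearest images.
        recip = [c for c in cands if i in cap_knn[c]]
        rest = [c for c in cands if i not in cap_knn[c]]
        reranked.append(recip + rest)       # promote reciprocal neighbors
    return reranked

# Toy usage with random features standing in for real image/caption encoders.
rng = np.random.default_rng(0)
w_o = rng.normal(size=(32, 64))             # object-feature projection
w_s = rng.normal(size=(16, 64))             # scene-context projection
imgs = fuse_features(rng.normal(size=(5, 32)), rng.normal(size=(5, 16)), w_o, w_s)
caps = rng.normal(size=(25, 64))
print(reciprocal_rerank(imgs, caps, k=5)[0])

Promoting only the reciprocal candidates makes the neighbor relation symmetric, which is one plausible way to address the asymmetry between the visual and semantic sides that the abstract notes.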