Joint Image-Text Representation by Gaussian Visual-Semantic Embedding

Zhou Ren, Hailin Jin, Zhe Lin, Chen Fang, Alan Yuille
Proceedings of the 2016 ACM on Multimedia Conference - MM '16
How to jointly represent images and text is important for tasks involving both modalities. Visual-semantic embedding models have recently been proposed and shown to be effective. The key idea is that by learning a mapping from images into a semantic text space, the algorithm is able to learn a compact and effective joint representation. However, existing approaches simply map each text concept to a single point in the semantic space. Mapping instead to a density distribution provides many interesting advantages, including better capturing the uncertainty about each text concept, and enabling better geometric interpretation of relations between concepts such as inclusion, intersection, etc. In this work, we present a novel Gaussian Visual-Semantic Embedding (GVSE) model, which leverages visual information to model text concepts as Gaussian distributions in semantic space. Experiments on two tasks, image classification and text-based image retrieval on the large-scale MIT Places205 dataset, demonstrate the superiority of our method over existing approaches, with higher accuracy and better robustness.
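The core idea (each text concept as a Gaussian in the semantic space, with an embedded image scored by its likelihood under each concept) can be sketched roughly as follows. This is a minimal illustration assuming diagonal covariances and a toy 3-D semantic space; the function and concept names are hypothetical, not the authors' implementation.

```python
import numpy as np

def gaussian_log_density(x, mean, var):
    """Log-density of vector x under a diagonal-covariance Gaussian."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def classify(image_embedding, concepts):
    """Assign the image to the concept whose Gaussian explains it best."""
    scores = {name: gaussian_log_density(image_embedding, mean, var)
              for name, (mean, var) in concepts.items()}
    return max(scores, key=scores.get)

# Two toy concepts in a 3-D semantic space: (mean, diagonal variance).
# A broad variance (as for "forest") expresses higher uncertainty
# about the concept, which a single point embedding cannot capture.
concepts = {
    "beach":  (np.array([1.0, 0.0, 0.0]), np.array([0.1, 0.1, 0.1])),
    "forest": (np.array([0.0, 1.0, 0.0]), np.array([0.5, 0.5, 0.5])),
}

image = np.array([0.9, 0.1, 0.0])   # image embedding near the "beach" mean
print(classify(image, concepts))    # -> beach
```

In a point-embedding model both concepts would be compared to the image by a single distance; the Gaussian view additionally lets the learned variance widen or narrow each concept's region of the space.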
doi:10.1145/2964284.2967212 dblp:conf/mm/RenJLFY16