Multimodal Disentanglement Variational AutoEncoders for Zero-Shot Cross-Modal Retrieval

Jialin Tian, Kai Wang, Xing Xu, Zuo Cao, Fumin Shen, Heng Tao Shen
2022 Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval  
Zero-Shot Cross-Modal Retrieval (ZS-CMR) has recently drawn increasing attention as it targets a practical retrieval scenario: the multimodal test set consists of unseen classes that are disjoint from the seen classes in the training set. Recently proposed methods typically adopt a generative model as the main framework to learn a joint latent embedding space that alleviates the modality gap. These methods largely rely on auxiliary semantic embeddings for knowledge transfer across classes and neglect the effect of the data reconstruction scheme in the adopted generative model. To address this issue, we propose a novel ZS-CMR model termed Multimodal Disentanglement Variational AutoEncoders (MD-VAE), which consists of two coupled disentanglement variational autoencoders (DVAEs) and a fusion-exchange VAE (FVAE). Specifically, the DVAE disentangles the original representations of each modality into modality-invariant and modality-specific features. The FVAE fuses and exchanges information across multimodal data through reconstruction and alignment, without pre-extracted semantic embeddings. Moreover, a counter-intuitive cross-reconstruction scheme is further proposed to enhance the informativeness and generalizability of the modality-invariant features for more effective knowledge transfer. Comprehensive experiments on four image-text retrieval datasets and two image-sketch retrieval datasets consistently demonstrate that our method establishes new state-of-the-art performance.
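The abstract's core ideas, disentangling each modality's representation into modality-invariant and modality-specific codes, then decoding one modality from the other modality's invariant code, can be sketched in plain NumPy. This is only an illustrative toy (the function names, dimensions, and the single shared decoder are assumptions, not the authors' architecture or losses):

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(x, w, b):
    return x @ w + b

def reparameterize(mu, logvar, rng):
    # Standard VAE reparameterization: z = mu + sigma * eps.
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def dvae_encode(x, params, rng):
    """Toy disentanglement encoder: split one modality's features into a
    modality-invariant code and a modality-specific code, each with its
    own Gaussian head (hypothetical parameterization)."""
    h = np.tanh(linear(x, params["w_h"], params["b_h"]))
    z_inv = reparameterize(linear(h, params["w_mu_inv"], params["b_mu_inv"]),
                           linear(h, params["w_lv_inv"], params["b_lv_inv"]), rng)
    z_spec = reparameterize(linear(h, params["w_mu_spec"], params["b_mu_spec"]),
                            linear(h, params["w_lv_spec"], params["b_lv_spec"]), rng)
    return z_inv, z_spec

def cross_reconstruct(z_inv_other, z_spec_own, w_dec, b_dec):
    # Decode from the OTHER modality's invariant code plus this modality's
    # specific code -- the cross-reconstruction idea in rough form.
    z = np.concatenate([z_inv_other, z_spec_own], axis=-1)
    return linear(z, w_dec, b_dec)

# Toy dimensions: 8-d inputs, 6-d hidden layer, 2-d invariant/specific latents.
d_in, d_h, d_z = 8, 6, 2

def init(rng):
    p = {"w_h": rng.standard_normal((d_in, d_h)) * 0.1, "b_h": np.zeros(d_h)}
    for head in ("mu_inv", "lv_inv", "mu_spec", "lv_spec"):
        p[f"w_{head}"] = rng.standard_normal((d_h, d_z)) * 0.1
        p[f"b_{head}"] = np.zeros(d_z)
    return p

img_params, txt_params = init(rng), init(rng)
w_dec = rng.standard_normal((2 * d_z, d_in)) * 0.1
b_dec = np.zeros(d_in)

x_img = rng.standard_normal((4, d_in))   # batch of image features
x_txt = rng.standard_normal((4, d_in))   # batch of text features

zi_img, zs_img = dvae_encode(x_img, img_params, rng)
zi_txt, zs_txt = dvae_encode(x_txt, txt_params, rng)

# Reconstruct image features from the text-side invariant code.
x_img_hat = cross_reconstruct(zi_txt, zs_img, w_dec, b_dec)
print(x_img_hat.shape)  # (4, 8)
```

A trained model would additionally impose reconstruction, KL, and alignment losses so that the invariant codes actually carry the shared semantics; none of that training machinery is shown here.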
doi:10.1145/3477495.3532028