
VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts [article]

Hangbo Bao, Wenhui Wang, Li Dong, Qiang Liu, Owais Khan Mohammed, Kriti Aggarwal, Subhojit Som, Furu Wei
2022 arXiv pre-print
We present a unified Vision-Language pretrained Model (VLMo) that jointly learns a dual encoder and a fusion encoder with a modular Transformer network. ... Specifically, we introduce Mixture-of-Modality-Experts (MoME) Transformer, where each block contains a pool of modality-specific experts and a shared self-attention layer. ... • We are going to explore to what extent vision-language pre-training can help each other modality, especially as the shared MoME backbone naturally blends in text and image representations. • We can extend ...
arXiv:2111.02358v2 fatcat:crzh75jj3rhgtglzf2wbp26c7i

mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections [article]

Chenliang Li, Haiyang Xu, Junfeng Tian, Wei Wang, Ming Yan, Bin Bi, Jiabo Ye, Hehong Chen, Guohai Xu, Zheng Cao, Ji Zhang, Songfang Huang (+3 others)
2022 arXiv pre-print
of layers for time-consuming full self-attention on the vision side. mPLUG is pre-trained end-to-end on large-scale image-text pairs with both discriminative and generative objectives. ... Most existing pre-trained models suffer from the problems of low computational efficiency and information asymmetry brought by the long visual sequence in cross-modal alignment. ... To combine the benefits of both categories of architectures, VLMo [20] further unifies the dual encoder and fusion encoder modules with shared mixture-of-modality-experts Transformer. ...
arXiv:2205.12005v2 fatcat:cck3km3syjdytc5so2gzglucni