7,681 Hits in 7.4 sec

Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers [article]

Zhicheng Huang, Zhaoyang Zeng, Bei Liu, Dongmei Fu, Jianlong Fu
2020 arXiv   pre-print
We propose Pixel-BERT to align image pixels with text by deep multi-modal transformers that jointly learn visual and language embedding in a unified end-to-end framework.  ...  , Natural Language for Visual Reasoning for Real (NLVR).  ...  Existing methods that use region-based visual features and language embedding as input of Transformer for cross-modality joint learning are limited to the visual semantics represented by the visual features  ... 
arXiv:2004.00849v2 fatcat:5ccgm6lrmfdn7kjkbvfp7tiq2m

Weakly Supervised Video Moment Retrieval From Text Queries [article]

Niluthpol Chowdhury Mithun, Sujoy Paul, Amit K. Roy-Chowdhury
2019 arXiv   pre-print
We propose a joint visual-semantic embedding based framework that learns the notion of relevant segments from video using only video-level sentence descriptions.  ...  In order to cope with this issue, in this work, we introduce the problem of learning from weak labels for the task of text to video moment retrieval.  ...  Most of the recent methods for the image-text retrieval task focus on learning joint visual-semantic embedding models [13, 15, 7, 36, 6, 24, 34, 23].  ... 
arXiv:1904.03282v2 fatcat:5qithwolavfwpofawe232b6pzi

Weakly Supervised Video Moment Retrieval From Text Queries

Niluthpol Chowdhury Mithun, Sujoy Paul, Amit K. Roy-Chowdhury
2019 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)  
We propose a joint visual-semantic embedding based framework that learns the notion of relevant segments from video using only video-level sentence descriptions.  ...  In order to cope with this issue, in this work, we introduce the problem of learning from weak labels for the task of text to video moment retrieval.  ...  Most of the recent methods for the image-text retrieval task focus on learning joint visual-semantic embedding models [13, 15, 7, 36, 6, 24, 34, 23].  ... 
doi:10.1109/cvpr.2019.01186 dblp:conf/cvpr/MithunPR19 fatcat:fv7y4dhnxrhvjdrf4c2w7mm5nm

A Joint Sequence Fusion Model for Video Question Answering and Retrieval [article]

Youngjae Yu, Jongseok Kim, Gunhee Kim
2018 arXiv   pre-print
We present an approach named JSFusion (Joint Sequence Fusion) that can measure semantic similarity between any pair of multimodal sequence data (e.g., a video clip and a language sentence).  ...  Our multimodal matching network consists of two key components. First, the Joint Semantic Tensor composes a dense pairwise representation of two sequences into a 3D tensor.  ...  We thank Jisung Kim and Antoine Miech for helpful comments about the model. This research was supported by the Brain Research Program of the National Research Foundation of Korea (NRF) (2017M3C7A1047860).  ... 
arXiv:1808.02559v1 fatcat:dvcj652bejckvfx7egrr5c4zmm

Saliency-Guided Attention Network for Image-Sentence Matching [article]

Zhong Ji, Haoran Wang, Jungong Han, Yanwei Pang
2021 arXiv   pre-print
attention modules to learn the fine-grained correlation intertwined between vision and language.  ...  This paper studies the task of matching image and sentence, where learning appropriate representations across the multi-modal data appears to be the main challenge.  ...  The core idea of most existing studies [7, 36, 22, 38, 27, 26, 46, 21, 8] for matching image and sentence can be boiled down to  ... 
arXiv:1904.09471v4 fatcat:zmukwaqu2ja5do7mgr3cxev3yi

Language Guided Networks for Cross-modal Moment Retrieval [article]

Kun Liu, Huadong Ma, Chuang Gan
2020 arXiv   pre-print
In this paper, we present Language Guided Networks (LGN), a new framework that leverages the sentence embedding to guide the whole process of moment retrieval.  ...  In the first feature extraction stage, we propose to jointly learn visual and language features to capture the powerful visual information which can cover the complex semantics in the sentence query.  ...  Figure 2: The proposed overall framework for cross-modal moment retrieval. We propose to leverage the sentence embedding to guide the whole process for retrieving moments.  ... 
arXiv:2006.10457v2 fatcat:eqnjjsvpvfc2darsrkayd3qvxm

Multi-modal Memory Enhancement Attention Network for Image-Text Matching

Zhong Ji, Zhigang Lin, Haoran Wang, Yuqing He
2020 IEEE Access  
Image-text matching is an attractive research topic in the community of vision and language.  ...  Furthermore, considering that the use of long-term contextual knowledge helps compensate for the detailed semantics concealed in rarely appearing image-text pairs, we propose to learn the joint representations  ...  that benefits the visual-semantic embedding.  ... 
doi:10.1109/access.2020.2975594 fatcat:ciiubythzzevpkw2ip5csnjwf4

Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning [article]

Zhicheng Huang, Zhaoyang Zeng, Yupan Huang, Bei Liu, Dongmei Fu, Jianlong Fu
2021 arXiv   pre-print
We study joint learning of Convolutional Neural Network (CNN) and Transformer for vision-language pre-training (VLPT) which aims to learn cross-modal alignments from millions of image-text pairs.  ...  As region-based visual features usually represent parts of an image, it is challenging for existing vision-language models to fully understand the semantics from paired natural languages.  ...  The dynamic updating mechanism for VD can capture text-guided semantics from the vision-language dataset.  ... 
arXiv:2104.03135v2 fatcat:ipergnpirzhblnwg2epmptmasa

Joint Visual-Textual Embedding for Multimodal Style Search [article]

Gil Sadeh, Lior Fritz, Gabi Shalev, Eduard Oks
2019 arXiv   pre-print
This joint visual-textual embedding space enables manipulating catalog images semantically, based on textual refinement requirements.  ...  We introduce a multimodal visual-textual search refinement method for fashion garments.  ...  A Mini-Batch Match Retrieval (MBMR) loss, L_MBMR, for the task of learning a joint embedding space, and a multi-label cross-entropy loss, L_a, for attribute extraction.  ... 
arXiv:1906.06620v1 fatcat:hma2wxq4rfaurov7s53zwovmay
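The snippet above names the two training objectives but is cut off before their definitions. For orientation only, here is a minimal PyTorch sketch of a generic in-batch match-retrieval loss combined with a multi-label attribute loss; the function names, tensor shapes, and temperature are illustrative assumptions and not the paper's implementation:

import torch
import torch.nn.functional as F

def match_retrieval_loss(img_emb, txt_emb, temperature=0.05):
    # Treat each image's paired text (the diagonal) as the positive and the
    # rest of the mini-batch as negatives, in both retrieval directions.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    sims = img_emb @ txt_emb.t() / temperature             # (B, B) similarity matrix
    targets = torch.arange(sims.size(0), device=sims.device)
    return 0.5 * (F.cross_entropy(sims, targets)           # image -> text
                  + F.cross_entropy(sims.t(), targets))    # text -> image

def attribute_loss(attr_logits, attr_labels):
    # Multi-label cross-entropy over attributes (one binary label per attribute).
    return F.binary_cross_entropy_with_logits(attr_logits, attr_labels)

# Toy usage with random tensors standing in for encoder outputs.
B, D, A = 32, 512, 100
img_emb, txt_emb = torch.randn(B, D), torch.randn(B, D)
attr_logits, attr_labels = torch.randn(B, A), torch.randint(0, 2, (B, A)).float()
total = match_retrieval_loss(img_emb, txt_emb) + attribute_loss(attr_logits, attr_labels)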

Object-aware Video-language Pre-training for Retrieval [article]

Alex Jinpeng Wang, Yixiao Ge, Guanyu Cai, Rui Yan, Xudong Lin, Ying Shan, Xiaohu Qie, Mike Zheng Shou
2022 arXiv   pre-print
Recently, by introducing large-scale datasets and strong transformer networks, video-language pre-training has shown great success, especially for retrieval.  ...  Yet, existing video-language transformer models do not explicitly perform fine-grained semantic alignment.  ...  Learning scalable video-text representations for retrieval requires the understanding of both visual and textual clues, as well as the semantic alignment between these two modalities.  ... 
arXiv:2112.00656v6 fatcat:dll3zlr4n5fl3loxw5pivvubgu

Dual Attention Networks for Multimodal Reasoning and Matching [article]

Hyeonseob Nam, Jung-Woo Ha, Jeonghee Kim
2017 arXiv   pre-print
Our extensive experiments validate the effectiveness of DANs in combining vision and language, achieving the state-of-the-art performance on public benchmarks for VQA and image-text matching.  ...  Based on this framework, we introduce two types of DANs for multimodal reasoning and matching, respectively.  ...  This approach eventually finds a joint embedding space which facilitates efficient cross-modal matching and retrieval.  ... 
arXiv:1611.00471v2 fatcat:kfctepmv5bccbkinko55js6kji
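As context for the "joint embedding space which facilitates efficient cross-modal matching and retrieval" mentioned above, a minimal PyTorch sketch follows: once both modalities are projected into a shared, L2-normalized space, retrieval reduces to a matrix of cosine similarities and a top-k lookup. The linear encoders and feature sizes are placeholders, not the DANs architecture:

import torch
import torch.nn.functional as F

# Placeholder projection heads standing in for real image/text encoders.
image_encoder = torch.nn.Linear(2048, 256)
text_encoder = torch.nn.Linear(300, 256)

def embed(encoder, feats):
    # Project into the joint space and L2-normalize so dot product = cosine similarity.
    return F.normalize(encoder(feats), dim=-1)

# Embed a gallery of 1,000 "images" once, then answer a text query with one matmul.
gallery = embed(image_encoder, torch.randn(1000, 2048))   # (1000, 256)
query = embed(text_encoder, torch.randn(1, 300))          # (1, 256)
scores = query @ gallery.t()                              # cosine similarities
top5 = scores.topk(k=5, dim=-1).indices                   # indices of best-matching images
print(top5)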

Webly Supervised Joint Embedding for Cross-Modal Image-Text Retrieval [article]

Niluthpol Chowdhury Mithun, Rameswar Panda, Evangelos E. Papalexakis, Amit K. Roy-Chowdhury
2018 arXiv   pre-print
Specifically, our main idea is to leverage web images and corresponding tags, along with fully annotated datasets, in training for learning the visual-semantic joint embedding.  ...  We propose a two-stage approach for the task that can augment a typical supervised pair-wise ranking loss based formulation with weakly-annotated web images to learn a more robust visual-semantic embedding  ...  We thank Sujoy Paul for helpful suggestions and Victor Hill for setting up the computing infrastructure used in this work.  ... 
arXiv:1808.07793v1 fatcat:fnhi4dwozjcrljclj46k7bviha
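The entry above trains a visual-semantic embedding with a supervised pair-wise ranking loss. Below is a minimal PyTorch sketch of the common hinge-based triplet ranking formulation with in-batch hardest negatives; the margin value and the hardest-negative mining are illustrative choices, not necessarily the paper's exact recipe:

import torch
import torch.nn.functional as F

def pairwise_ranking_loss(img_emb, txt_emb, margin=0.2):
    # Hinge-based ranking over a mini-batch: matched image-text pairs sit on
    # the diagonal of the similarity matrix; everything else is a negative.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    sims = img_emb @ txt_emb.t()                 # (B, B) cosine similarities
    pos = sims.diag().view(-1, 1)                # similarity of the true pairs

    cost_txt = (margin + sims - pos).clamp(min=0)      # image i vs. wrong captions
    cost_img = (margin + sims - pos.t()).clamp(min=0)  # caption j vs. wrong images

    # Zero out the positives, then keep only the hardest negative per row/column.
    mask = torch.eye(sims.size(0), dtype=torch.bool, device=sims.device)
    cost_txt = cost_txt.masked_fill(mask, 0).max(dim=1).values
    cost_img = cost_img.masked_fill(mask, 0).max(dim=0).values
    return (cost_txt + cost_img).mean()

# Toy usage with random features in place of image/sentence encoder outputs.
loss = pairwise_ranking_loss(torch.randn(32, 512), torch.randn(32, 512))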

ClawCraneNet: Leveraging Object-level Relation for Text-based Video Segmentation [article]

Chen Liang, Yu Wu, Yawei Luo, Yi Yang
2022 arXiv   pre-print
Three kinds of object-level relations are investigated for precise relationship understanding, i.e., positional relation, text-guided semantic relation, and temporal relation.  ...  It essentially requires semantic comprehension and fine-grained video understanding.  ...  We then utilize linguistic embedding to retrieve the final prediction.  ... 
arXiv:2103.10702v3 fatcat:nmkubjdazvfrtpzx6ldtmzveia

ViTAA: Visual-Textual Attributes Alignment in Person Search by Natural Language [article]

Zhe Wang, Zhiyuan Fang, Jun Wang, Yezhou Yang
2020 arXiv   pre-print
Person search by natural language aims at retrieving a specific person in a large-scale image pool that matches the given textual descriptions.  ...  It then aligns these visual features with the textual attributes parsed from the sentences by using a novel contrastive learning loss.  ...  Works in vision and language propagate the notion of visual-semantic embedding, with the goal of learning a joint feature space for both visual inputs and their corresponding textual annotations [10, 53].  ... 
arXiv:2005.07327v2 fatcat:6eww5ur4uzbvvhrmgny5jknusu

Enhancing Video Summarization via Vision-Language Embedding

Bryan A. Plummer, Matthew Brown, Svetlana Lazebnik
2017 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)  
objectives computed on features from a joint vision-language embedding space.  ...  Our experiments also show that the vision-language embedding need not be trained on domain-specific data, but can be learned from standard still-image vision-language datasets and transferred to video.  ...  Acknowledgements: We would like to thank Emily Fortuna and Aseem Agarwala for discussions and feedback on this work.  ... 
doi:10.1109/cvpr.2017.118 dblp:conf/cvpr/PlummerBL17 fatcat:m3pmjulzaradhknim4kdgh2bk4
Showing results 1 — 15 out of 7,681 results