Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers
[article]
2020
arXiv
pre-print
We propose Pixel-BERT to align image pixels with text by deep multi-modal transformers that jointly learn visual and language embedding in a unified end-to-end framework. ...
..., Natural Language for Visual Reasoning for Real (NLVR). ...
Existing methods that use region-based visual features and language embedding as input of Transformer for cross-modality joint learning are limited to the visual semantics represented by the visual features ...
arXiv:2004.00849v2
fatcat:5ccgm6lrmfdn7kjkbvfp7tiq2m
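The abstract above describes feeding pixel-level visual features and text tokens into a single Transformer instead of pre-extracted region features. Below is a minimal sketch of that idea, not the authors' code; the ResNet-18 backbone, embedding dimension, and vocabulary size are assumptions.

```python
# Minimal sketch (not the Pixel-BERT implementation): jointly encoding pixel-level
# visual features and text tokens with one Transformer. Backbone and sizes are assumed.
import torch
import torch.nn as nn
import torchvision.models as models

class PixelTextEncoder(nn.Module):
    def __init__(self, vocab_size=30522, dim=256, n_layers=4, n_heads=8):
        super().__init__()
        cnn = models.resnet18(weights=None)
        self.backbone = nn.Sequential(*list(cnn.children())[:-2])  # keep the spatial grid
        self.pix_proj = nn.Linear(512, dim)           # project CNN channels to model dim
        self.tok_emb = nn.Embedding(vocab_size, dim)
        self.type_emb = nn.Embedding(2, dim)          # 0 = text token, 1 = pixel feature
        layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, images, token_ids):
        feat = self.backbone(images)                        # (B, 512, H', W')
        feat = feat.flatten(2).transpose(1, 2)              # (B, H'*W', 512)
        vis = self.pix_proj(feat) + self.type_emb.weight[1]
        txt = self.tok_emb(token_ids) + self.type_emb.weight[0]
        return self.encoder(torch.cat([txt, vis], dim=1))   # one joint sequence

enc = PixelTextEncoder()
out = enc(torch.randn(2, 3, 224, 224), torch.randint(0, 30522, (2, 16)))
print(out.shape)  # (2, 16 + 49, 256)
```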
Weakly Supervised Video Moment Retrieval From Text Queries
[article]
2019
arXiv
pre-print
We propose a joint visual-semantic embedding based framework that learns the notion of relevant segments from video using only video-level sentence descriptions. ...
In order to cope with this issue, in this work, we introduce the problem of learning from weak labels for the task of text to video moment retrieval. ...
Most of the recent methods for image-text retrieval task focus on learning joint visual-semantic embedding models [13, 15, 7, 36, 6, 24, 34, 23 ]. ...
arXiv:1904.03282v2
fatcat:5qithwolavfwpofawe232b6pzi
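The entry above learns a joint visual-semantic embedding from video-level sentence descriptions only. Below is a generic sketch of such an embedding trained with a bidirectional max-margin ranking loss over (video, sentence) pairs in a batch; the feature dimensions and margin are assumptions, and the paper's actual weak-supervision mechanism for localizing segments is not reproduced here.

```python
# Illustrative sketch only: a joint visual-semantic embedding with a max-margin
# ranking loss over video-level pairs (no segment annotations). Sizes are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedding(nn.Module):
    def __init__(self, vid_dim=2048, txt_dim=768, dim=512):
        super().__init__()
        self.vid_fc = nn.Linear(vid_dim, dim)
        self.txt_fc = nn.Linear(txt_dim, dim)

    def forward(self, vid_feats, sent_feats):
        v = F.normalize(self.vid_fc(vid_feats), dim=-1)
        t = F.normalize(self.txt_fc(sent_feats), dim=-1)
        return v, t

def ranking_loss(v, t, margin=0.2):
    """Bidirectional max-margin loss; matched pairs lie on the diagonal."""
    sim = v @ t.t()                                   # (B, B) cosine similarities
    pos = sim.diag().unsqueeze(1)                     # (B, 1) positive scores
    cost_t = (margin + sim - pos).clamp(min=0)        # video -> wrong sentence
    cost_v = (margin + sim - pos.t()).clamp(min=0)    # sentence -> wrong video
    mask = torch.eye(sim.size(0), dtype=torch.bool)
    return cost_t.masked_fill(mask, 0).mean() + cost_v.masked_fill(mask, 0).mean()

model = JointEmbedding()
v, t = model(torch.randn(8, 2048), torch.randn(8, 768))
print(ranking_loss(v, t).item())
```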
Weakly Supervised Video Moment Retrieval From Text Queries
2019
2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
We propose a joint visual-semantic embedding based framework that learns the notion of relevant segments from video using only video-level sentence descriptions. ...
In order to cope with this issue, in this work, we introduce the problem of learning from weak labels for the task of text to video moment retrieval. ...
Most of the recent methods for image-text retrieval task focus on learning joint visual-semantic embedding models [13, 15, 7, 36, 6, 24, 34, 23 ]. ...
doi:10.1109/cvpr.2019.01186
dblp:conf/cvpr/MithunPR19
fatcat:fv7y4dhnxrhvjdrf4c2w7mm5nm
A Joint Sequence Fusion Model for Video Question Answering and Retrieval
[article]
2018
arXiv
pre-print
We present an approach named JSFusion (Joint Sequence Fusion) that can measure semantic similarity between any pairs of multimodal sequence data (e.g. a video clip and a language sentence). ...
Our multimodal matching network consists of two key components. First, the Joint Semantic Tensor composes a dense pairwise representation of two sequence data into a 3D tensor. ...
We thank Jisung Kim and Antoine Miech for helpful comments about the model. This research was supported by Brain Research Program by National Research Foundation of Korea (NRF) (2017M3C7A1047860). ...
arXiv:1808.02559v1
fatcat:dvcj652bejckvfx7egrr5c4zmm
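The Joint Semantic Tensor described above composes a dense pairwise representation of two sequences into a 3D tensor. A minimal sketch of one way to build such a tensor follows; the elementwise (Hadamard) composition and all dimensions are assumptions, not the JSFusion implementation.

```python
# Minimal sketch: a dense pairwise "joint semantic tensor" from a video sequence
# and a word sequence. The Hadamard composition is an assumption for illustration.
import torch
import torch.nn as nn

class PairwiseTensor(nn.Module):
    def __init__(self, vid_dim=1024, txt_dim=300, dim=256):
        super().__init__()
        self.vid_fc = nn.Linear(vid_dim, dim)
        self.txt_fc = nn.Linear(txt_dim, dim)

    def forward(self, frames, words):
        # frames: (B, T, vid_dim), words: (B, N, txt_dim)
        v = self.vid_fc(frames).unsqueeze(2)   # (B, T, 1, dim)
        w = self.txt_fc(words).unsqueeze(1)    # (B, 1, N, dim)
        return v * w                           # (B, T, N, dim) pairwise tensor

jst = PairwiseTensor()
tensor = jst(torch.randn(2, 40, 1024), torch.randn(2, 12, 300))
print(tensor.shape)  # torch.Size([2, 40, 12, 256])
```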
Saliency-Guided Attention Network for Image-Sentence Matching
[article]
2021
arXiv
pre-print
... attention modules to learn the fine-grained correlation intertwined between vision and language. ...
This paper studies the task of matching image and sentence, where learning appropriate representations across the multi-modal data appears to be the main challenge. ...
Related Work
Visual-semantic Embedding Based Image-Sentence Matching: The core idea of most existing studies [7, 36, 22, 38, 27, 26, 46, 21, 8] for matching image and sentence can be boiled down to ...
arXiv:1904.09471v4
fatcat:zmukwaqu2ja5do7mgr3cxev3yi
Language Guided Networks for Cross-modal Moment Retrieval
[article]
2020
arXiv
pre-print
In this paper, we present Language Guided Networks (LGN), a new framework that leverages the sentence embedding to guide the whole process of moment retrieval. ...
In the first feature extraction stage, we propose to jointly learn visual and language features to capture the powerful visual information which can cover the complex semantics in the sentence query. ...
We also demonstrate the ... Figure 2: The proposed overall framework for cross-modal moment retrieval. We propose to leverage the sentence embedding to guide the whole process for retrieving moments. ...
arXiv:2006.10457v2
fatcat:eqnjjsvpvfc2darsrkayd3qvxm
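The entry above leverages a sentence embedding to guide every stage of moment retrieval. One plausible, but assumed, realization of such guidance is channel-wise gating of clip features by the sentence embedding, sketched below; this is for illustration only and is not taken from the paper.

```python
# Hedged sketch: sentence-conditioned channel gating (FiLM-style modulation) as one
# possible way a query could "guide" visual feature extraction. Mechanism is assumed.
import torch
import torch.nn as nn

class SentenceGuidedGating(nn.Module):
    def __init__(self, vid_dim=512, sent_dim=512):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(sent_dim, vid_dim), nn.Sigmoid())

    def forward(self, clip_feats, sent_emb):
        # clip_feats: (B, T, vid_dim), sent_emb: (B, sent_dim)
        g = self.gate(sent_emb).unsqueeze(1)   # (B, 1, vid_dim) channel gates
        return clip_feats * g                  # emphasize query-relevant channels

mod = SentenceGuidedGating()
out = mod(torch.randn(2, 64, 512), torch.randn(2, 512))
print(out.shape)  # torch.Size([2, 64, 512])
```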
Multi-modal Memory Enhancement Attention Network for Image-Text Matching
2020
IEEE Access
Image-text matching is an attractive research topic in the community of vision and language. ...
Furthermore, considering that the usage of long-term contextual knowledge helps compensate for detailed semantics concealed in rarely appearing image-text pairs, we propose to learn joint representations that benefit the visual-semantic embedding. ...
doi:10.1109/access.2020.2975594
fatcat:ciiubythzzevpkw2ip5csnjwf4
Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning
[article]
2021
arXiv
pre-print
We study joint learning of Convolutional Neural Network (CNN) and Transformer for vision-language pre-training (VLPT) which aims to learn cross-modal alignments from millions of image-text pairs. ...
As region-based visual features usually represent parts of an image, it is challenging for existing vision-language models to fully understand the semantics from paired natural languages. ...
The dynamic updating mechanism for VD can capture text-guided semantics from the vision-language dataset. ...
arXiv:2104.03135v2
fatcat:ipergnpirzhblnwg2epmptmasa
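The abstract above mentions a visual dictionary (VD) with a dynamic updating mechanism. A rough sketch of a dictionary that assigns grid features to their nearest entries and updates those entries with a moving average is given below; the dictionary size, feature dimension, and momentum value are assumptions.

```python
# Rough sketch of a "visual dictionary" with a moving-average update: each grid
# feature is assigned to its nearest entry, and entries drift toward the features
# assigned to them. Not the paper's implementation; sizes and momentum are assumed.
import torch

class VisualDictionary:
    def __init__(self, num_entries=2048, dim=768, momentum=0.99):
        self.embed = torch.randn(num_entries, dim)
        self.m = momentum

    def assign_and_update(self, feats):
        # feats: (N, dim) grid features from the CNN
        dist = torch.cdist(feats, self.embed)        # (N, num_entries) distances
        idx = dist.argmin(dim=1)                     # nearest entry per feature
        for k in idx.unique():
            mean_feat = feats[idx == k].mean(dim=0)
            self.embed[k] = self.m * self.embed[k] + (1 - self.m) * mean_feat
        return self.embed[idx]                       # quantized features

vd = VisualDictionary()
q = vd.assign_and_update(torch.randn(49, 768))
print(q.shape)  # torch.Size([49, 768])
```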
Joint Visual-Textual Embedding for Multimodal Style Search
[article]
2019
arXiv
pre-print
This joint visual-textual embedding space enables manipulating catalog images semantically, based on textual refinement requirements. ...
We introduce a multimodal visual-textual search refinement method for fashion garments. ...
A Mini-Batch Match Retrieval (MBMR) loss, L_MBMR, for the task of learning a joint embedding space, and a multi-label cross-entropy loss, L_a, for attribute extraction. ...
arXiv:1906.06620v1
fatcat:hma2wxq4rfaurov7s53zwovmay
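The entry above trains with a Mini-Batch Match Retrieval loss L_MBMR plus a multi-label loss L_a for attributes. A hedged sketch follows, treating in-batch matching as a retrieval-style cross-entropy over pairwise similarities and using binary cross-entropy for the multi-label attribute term; the temperature, dimensions, and equal weighting are assumptions.

```python
# Sketch under assumptions: in-batch match retrieval as cross-entropy over pairwise
# similarities, plus a multi-label attribute loss. Not the paper's exact formulation.
import torch
import torch.nn.functional as F

def mbmr_loss(img_emb, txt_emb, temperature=0.1):
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature             # (B, B) similarity logits
    targets = torch.arange(img.size(0))              # matched pair i <-> i
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def attribute_loss(attr_logits, attr_labels):
    # multi-label objective: each garment can carry several attributes at once
    return F.binary_cross_entropy_with_logits(attr_logits, attr_labels)

loss = mbmr_loss(torch.randn(8, 256), torch.randn(8, 256)) \
       + attribute_loss(torch.randn(8, 100), torch.randint(0, 2, (8, 100)).float())
print(loss.item())
```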
Object-aware Video-language Pre-training for Retrieval
[article]
2022
arXiv
pre-print
Recently, by introducing large-scale dataset and strong transformer network, video-language pre-training has shown great success especially for retrieval. ...
Yet, existing video-language transformer models do not explicitly perform fine-grained semantic alignment. ...
Introduction: Learning scalable video-text representations for retrieval requires the understanding of both visual and textual clues, as well as the semantic alignment between these two modalities. ...
arXiv:2112.00656v6
fatcat:dll3zlr4n5fl3loxw5pivvubgu
Dual Attention Networks for Multimodal Reasoning and Matching
[article]
2017
arXiv
pre-print
Our extensive experiments validate the effectiveness of DANs in combining vision and language, achieving the state-of-the-art performance on public benchmarks for VQA and image-text matching. ...
Based on this framework, we introduce two types of DANs for multimodal reasoning and matching, respectively. ...
This approach eventually finds a joint embedding space which facilitates efficient cross-modal matching and retrieval. ...
arXiv:1611.00471v2
fatcat:kfctepmv5bccbkinko55js6kji
Webly Supervised Joint Embedding for Cross-Modal Image-Text Retrieval
[article]
2018
arXiv
pre-print
Specifically, our main idea is to leverage web images and corresponding tags, along with fully annotated datasets, in training for learning the visual-semantic joint embedding. ...
We propose a two-stage approach for the task that can augment a typical supervised pair-wise ranking loss based formulation with weakly-annotated web images to learn a more robust visual-semantic embedding ...
We thank Sujoy Paul for helpful suggestions and Victor Hill for setting up the computing infrastructure used in this work. ...
arXiv:1808.07793v1
fatcat:fnhi4dwozjcrljclj46k7bviha
ClawCraneNet: Leveraging Object-level Relation for Text-based Video Segmentation
[article]
2022
arXiv
pre-print
Three kinds of object-level relations are investigated for precise relationship understanding, i.e., positional relation, text-guided semantic relation, and temporal relation. ...
It essentially requires semantic comprehension and fine-grained video understanding. ...
..., i.e., positional relation, text-guided semantic relation, and temporal relation. We then utilize linguistic embedding to retrieve the final prediction. ...
arXiv:2103.10702v3
fatcat:nmkubjdazvfrtpzx6ldtmzveia
ViTAA: Visual-Textual Attributes Alignment in Person Search by Natural Language
[article]
2020
arXiv
pre-print
Person search by natural language aims at retrieving a specific person in a large-scale image pool that matches the given textual descriptions. ...
It then aligns these visual features with the textual attributes parsed from the sentences by using a novel contrastive learning loss. ...
Works in vision and language propagate the notion of visual semantic embedding, with a goal to learn joint feature space for both visual inputs and their correspondent textual annotations [10, 53] . ...
arXiv:2005.07327v2
fatcat:6eww5ur4uzbvvhrmgny5jknusu
Enhancing Video Summarization via Vision-Language Embedding
2017
2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
... objectives computed on features from a joint vision-language embedding space. ...
Our experiments also show that the vision-language embedding need not be trained on domain-specific data, but can be learned from standard still image vision-language datasets and transferred to video. ...
Acknowledgements: We would like to thank Emily Fortuna and Aseem Agarwala for discussions and feedback on this work. ...
doi:10.1109/cvpr.2017.118
dblp:conf/cvpr/PlummerBL17
fatcat:m3pmjulzaradhknim4kdgh2bk4
Showing results 1 — 15 out of 7,681 results