Zero-Shot Activity Recognition with Videos
[article]
2020
arXiv
pre-print
We introduce an auto-encoder based model to construct a multimodal joint embedding space between the visual and textual manifolds. ...
In this paper, we examined the zero-shot activity recognition task with the usage of videos. ...
Joint Embedding Space: We have two sources for multimodal understanding: (a) a video vector with C features that represent the most important spatio-temporal features of the activity, and (b) a word vector ...
arXiv:2002.02265v1
fatcat:umbgctxyzzbvhcfrkg7dgfnciq
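A minimal sketch of the joint-embedding idea this entry describes: project a video feature vector and class-label word vectors into a shared space, then pick the closest unseen class by cosine similarity. This is not the paper's auto-encoder model; the linear projections, feature sizes, and names (video_proj, word_proj, zero_shot_classify) are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

VIDEO_DIM, WORD_DIM, JOINT_DIM = 2048, 300, 512   # assumed feature sizes

video_proj = nn.Linear(VIDEO_DIM, JOINT_DIM)   # maps spatio-temporal video features
word_proj = nn.Linear(WORD_DIM, JOINT_DIM)     # maps class-label word vectors

def zero_shot_classify(video_feat, class_word_vecs):
    # Return the index of the unseen class whose word embedding lies closest
    # (by cosine similarity) to the video in the joint space.
    v = F.normalize(video_proj(video_feat), dim=-1)        # (JOINT_DIM,)
    w = F.normalize(word_proj(class_word_vecs), dim=-1)    # (num_classes, JOINT_DIM)
    return (w @ v).argmax().item()

# Usage with stand-in features for one video and 10 unseen classes:
pred = zero_shot_classify(torch.randn(VIDEO_DIM), torch.randn(10, WORD_DIM))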
Routing with Self-Attention for Multimodal Capsule Networks
[article]
2021
arXiv
pre-print
The task of multimodal learning has seen a growing interest recently as it allows for training neural architectures based on different modalities such as vision, text, and audio. ...
This allows not only for robust training with noisy video data, but also to scale up the size of the capsule network compared to traditional routing methods while still being computationally efficient. ...
Comparison with the state-of-the-art: We report the results on the text-to-video retrieval task for YouCook2 and MSR-VTT in Table 1 for two cases, zero-shot text-to-video retrieval (VT) and zero-shot ...
arXiv:2112.00775v1
fatcat:tyf5wf7bbveaxnbzqldjrok4hu
Editorial for the ICMR 2018 special issue
2019
International Journal of Multimedia Information Retrieval
The paper by Mithun et al., "Joint Embeddings with Multimodal Cues for Video-Text Retrieval" received the Best Paper Award at the conference. ...
The authors propose a multimodal model that computes audio-visual embeddings for video-text retrieval. ...
doi:10.1007/s13735-019-00168-9
fatcat:irnfud2fszbq7mx25pno73b2sm
Transcription-Enriched Joint Embeddings for Spoken Descriptions of Images and Videos
[article]
2020
arXiv
pre-print
The proposed methodology departs from a baseline system that learns an embedding space trained with only spoken narratives and image cues. ...
The triad of speech, image, and words allows for a better estimate of the point embedding and improves performance on tasks like image and speech retrieval, even when the third (text) modality ...
We gratefully acknowledge the support of NVIDIA Corporation with the donation of GPUs used in this work. ...
arXiv:2006.00785v1
fatcat:yee3epxdozhgdalg3tg3d7zxwq
A Joint Sequence Fusion Model for Video Question Answering and Retrieval
[article]
2018
arXiv
pre-print
Although the JSFusion is a universal model to be applicable to any multimodal sequence data, this work focuses on video-language tasks including multimodal retrieval and video QA. ...
We present an approach named JSFusion (Joint Sequence Fusion) that can measure semantic similarity between any pairs of multimodal sequence data (e.g. a video clip and a language sentence). ...
We thank Jisung Kim and Antoine Miech for helpful comments about the model. This research was supported by Brain Research Program by National Research Foundation of Korea (NRF) (2017M3C7A1047860). ...
arXiv:1808.02559v1
fatcat:dvcj652bejckvfx7egrr5c4zmm
Multiple Visual-Semantic Embedding for Video Retrieval from Query Sentence
2021
Applied Sciences
Visual-semantic embedding aims to learn a joint embedding space where related video and sentence instances are located close to each other. ...
Therefore, we can flexibly emphasize an embedding space. We conducted sentence-to-video retrieval experiments on a benchmark dataset. ...
Like image-text retrieval approaches, most video-to-text retrieval methods learn a joint embedding space [29]-[31]. ...
doi:10.3390/app11073214
fatcat:kslr4uyewrapbcus34awf7fdpq
Multiple Visual-Semantic Embedding for Video Retrieval from Query Sentence
[article]
2020
arXiv
pre-print
Visual-semantic embedding aims to learn a joint embedding space where related video and sentence instances are located close to each other. ...
Therefore, we can flexibly emphasize an embedding space. We conducted sentence-to-video retrieval experiments on a benchmark dataset. ...
Video and Sentence Embedding: Like image-text retrieval approaches, most video-to-text retrieval methods learn a joint embedding space [19]-[21]. ...
arXiv:2004.07967v1
fatcat:56smunbq65ct7okz2ccl3mlnne
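The two "Multiple Visual-Semantic Embedding" entries above both describe a joint embedding space in which matched video-sentence pairs are pulled close together. A common way to train such a space is a max-margin triplet ranking loss over in-batch negatives; the sketch below shows that generic loss only, with the margin and batch construction as assumptions rather than the paper's exact multi-space objective.

import torch
import torch.nn.functional as F

def triplet_ranking_loss(video_emb, sent_emb, margin=0.2):
    # video_emb, sent_emb: (B, D) L2-normalized embeddings of matched pairs.
    sim = sent_emb @ video_emb.t()            # (B, B) cosine similarities
    pos = sim.diag().unsqueeze(1)             # similarity of each matched pair
    cost_s = F.relu(margin + sim - pos)       # sentence paired with a wrong video
    cost_v = F.relu(margin + sim - pos.t())   # video paired with a wrong sentence
    mask = torch.eye(sim.size(0), dtype=torch.bool)   # ignore the matched pairs themselves
    return cost_s.masked_fill(mask, 0).sum() + cost_v.masked_fill(mask, 0).sum()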
X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval
[article]
2022
arXiv
pre-print
Our findings thereby highlight the importance of joint text-video reasoning to extract important visual cues according to text. ...
Therefore, for a given text, a retrieval model should focus on the text's most semantically similar video sub-regions to make a more relevant comparison. ...
Our goal is to bootstrap from a pre-trained joint text-image model and extend it towards a joint text-video model for the task of text-video retrieval. Text-Video Retrieval. ...
arXiv:2203.15086v1
fatcat:cde3cco37jhhrpojotudm33ihu
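The X-Pool snippet argues that, for a given text, the model should focus on the most text-relevant video sub-regions. Below is a simplified sketch of such text-conditioned pooling: the text embedding acts as the attention query over per-frame embeddings. Projection sizes and the single-head formulation are assumptions, not the paper's exact module.

import math
import torch
import torch.nn as nn

class TextConditionedPool(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.q = nn.Linear(dim, dim)   # projects the text embedding (query)
        self.k = nn.Linear(dim, dim)   # projects frame embeddings (keys)
        self.v = nn.Linear(dim, dim)   # projects frame embeddings (values)

    def forward(self, text_emb, frame_embs):
        # text_emb: (B, D); frame_embs: (B, T, D) -> pooled video (B, D)
        q = self.q(text_emb).unsqueeze(1)                 # (B, 1, D)
        k, v = self.k(frame_embs), self.v(frame_embs)     # (B, T, D)
        attn = torch.softmax(q @ k.transpose(1, 2) / math.sqrt(k.size(-1)), dim=-1)
        return (attn @ v).squeeze(1)                      # frames weighted by relevance to the text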
Multimodal Conversational AI: A Survey of Datasets and Approaches
[article]
2022
arXiv
pre-print
Finally, we identify multimodal co-learning as a promising direction for multimodal conversational AI research. ...
Multimodal expressions are central to conversations; a rich set of modalities amplify and often compensate for each other. ...
Srivastava and Salakhutdinov (2014) developed a multimodal Deep Boltzmann Machine for image-text retrieval and ASR using videos. ...
arXiv:2205.06907v1
fatcat:u6kehgeeq5aefdlvv5bpbwsvsa
ActBERT: Learning Global-Local Video-Text Representations
[article]
2020
arXiv
pre-print
In this paper, we introduce ActBERT for self-supervised learning of joint video-text representations from unlabeled data. ...
It uncovers global and local visual clues from paired video sequences and text descriptions for detailed visual and text relation modeling. ...
With the guidance from the action features, we focus on video-text joint representation learning. ...
arXiv:2011.07231v1
fatcat:xh6lvxh4cfhylffq6ewlynftlq
Sign Language Video Retrieval with Free-Form Textual Queries
[article]
2022
arXiv
pre-print
We validate the effectiveness of SPOT-ALIGN for learning a robust sign video embedding through improvements in both sign recognition and the proposed video retrieval task. ...
To address this gap, in this work we introduce the task of sign language retrieval with free-form textual queries: given a written query (e.g., a sentence) and a large collection of sign language videos ...
In this work, we address the task of sign language video retrieval with free-form textual queries by learning a joint embedding space between text and video, as illustrated in Fig. 1. ...
arXiv:2201.02495v1
fatcat:sqapj2bkvvdetktmhcmsylgq2m
End-to-end Generative Pretraining for Multimodal Video Captioning
[article]
2022
arXiv
pre-print
Our model achieves state-of-the-art performance for multimodal video captioning on four standard benchmarks, as well as for other video understanding tasks such as VideoQA, video retrieval and action classification ...
We present Multimodal Video Generative Pretraining (MV-GPT), a new pretraining framework for learning from unlabelled videos which can be effectively used for generative tasks such as multimodal video ...
Video Retrieval: The common practice for retrieval is to train a video-text joint embedding using discriminative losses only, typically in the form of a standard NCE loss [14], where each video clip ...
arXiv:2201.08264v2
fatcat:zj3fkpcnfzhyjmfgtvhkr3ljwy
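The "Video Retrieval" snippet above refers to the standard NCE objective used to train video-text joint embeddings. A short sketch of the symmetric InfoNCE form commonly used in practice follows; the temperature and the two-direction averaging are common-practice assumptions, not MV-GPT specifics.

import torch
import torch.nn.functional as F

def info_nce(video_emb, text_emb, temperature=0.05):
    # video_emb, text_emb: (B, D) embeddings of matched clip-caption pairs;
    # the diagonal of the similarity matrix holds the positives.
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.t() / temperature
    labels = torch.arange(logits.size(0))
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))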
Deep Multimodal Learning for Affective Analysis and Retrieval
2015
IEEE transactions on multimedia
More importantly, the joint representation enables emotion-oriented cross-modal retrieval, for example, retrieval of videos using the text query "crazy cat". ...
Few attempts at combined analysis of multiple media have been made, even though emotion can be viewed as an expression of multimodal experience. ...
Video Retrieval Three sets of experiments are conducted by using the text, video and multimodal (text+video) queries. For each query, a joint representation is extracted using the proposed E-MDBM. ...
doi:10.1109/tmm.2015.2482228
fatcat:7tozmatnhvbj7hjjohkofngecq
Multimedia Semantic Integrity Assessment Using Joint Embedding Of Images And Text
2017
Proceedings of the 2017 ACM on Multimedia Conference - MM '17
We construct a joint embedding of images and captions with deep multimodal representation learning on the reference dataset in a framework that also provides image-caption consistency scores (ICCSs). ...
Real world multimedia data is often composed of multiple modalities such as an image or a video with associated text (e.g. captions, user comments, etc.) and metadata. ...
Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright notation thereon. ...
doi:10.1145/3123266.3123385
dblp:conf/mm/JaiswalSAN17
fatcat:aq2sifpg6ncy3os42i5uvlp43a
Multimodal Research in Vision and Language: A Review of Current and Emerging Trends
[article]
2020
arXiv
pre-print
We also address task-specific trends, along with their evaluation strategies and upcoming challenges. ...
Deep Learning and its applications have cascaded impactful research and development with a diverse range of modalities present in the real-world data. ...
Along similar lines, a joint vision-and-text embedding space was learnt using large-scale pre-training [133] for a variety of multiplex multimodal tasks, including VCR. ...
arXiv:2010.09522v2
fatcat:l4npstkoqndhzn6hznr7eeys4u
Showing results 1 — 15 out of 1,657 results