
Zero-Shot Activity Recognition with Videos [article]

Evin Pinar Ornek
2020 arXiv   pre-print
We introduce an auto-encoder based model to construct a multimodal joint embedding space between the visual and textual manifolds.  ...  In this paper, we examine the zero-shot activity recognition task using videos.  ...  Joint Embedding Space We have two sources for multimodal understanding: (a) a video vector with C features that represent the most important spatio-temporal features of the activity, and (b) a word vector  ... 
arXiv:2002.02265v1 fatcat:umbgctxyzzbvhcfrkg7dgfnciq
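The snippet above describes building a joint embedding from a video vector with C features and a word vector. As a rough illustration of the general recipe (not the paper's actual architecture; all dimensions and weight matrices below are hypothetical placeholders for learned parameters), each modality can be linearly projected into a shared space and compared by cosine similarity:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: C video features, 300-d word vectors, 256-d joint space.
C, WORD_DIM, JOINT_DIM = 512, 300, 256
W_video = rng.standard_normal((C, JOINT_DIM)) * 0.01   # learned in practice
W_word = rng.standard_normal((WORD_DIM, JOINT_DIM)) * 0.01

def embed(x, W):
    """Linear projection followed by L2 normalization onto the joint space."""
    z = x @ W
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

video_vec = rng.standard_normal((4, C))        # batch of 4 video descriptors
word_vec = rng.standard_normal((4, WORD_DIM))  # matching label embeddings

v, w = embed(video_vec, W_video), embed(word_vec, W_word)
similarity = v @ w.T  # cosine similarities; diagonal entries are matched pairs
```

In a trained model the projection matrices would be optimized so that matched video-word pairs receive the highest similarities.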

Routing with Self-Attention for Multimodal Capsule Networks [article]

Kevin Duarte, Brian Chen, Nina Shvetsova, Andrew Rouditchenko, Samuel Thomas, Alexander Liu, David Harwath, James Glass, Hilde Kuehne, Mubarak Shah
2021 arXiv   pre-print
The task of multimodal learning has seen growing interest recently, as it allows for training neural architectures based on different modalities such as vision, text, and audio.  ...  This allows not only for robust training with noisy video data, but also for scaling up the size of the capsule network compared to traditional routing methods while remaining computationally efficient.  ...  Comparison with the state-of-the-art We report results on the text-to-video retrieval task for YouCook2 and MSR-VTT in Table 1 for two cases, zero-shot text-to-video retrieval (VT) and zero-shot  ... 
arXiv:2112.00775v1 fatcat:tyf5wf7bbveaxnbzqldjrok4hu

Editorial for the ICMR 2018 special issue

Benoit Huet, Qi Tian, Keiji Yanai
2019 International Journal of Multimedia Information Retrieval  
The paper by Mithun et al., "Joint Embeddings with Multimodal Cues for Video-Text Retrieval" received the Best Paper Award at the conference.  ...  The authors propose a multimodal model that computes audio-visual embeddings for video-text retrieval.  ... 
doi:10.1007/s13735-019-00168-9 fatcat:irnfud2fszbq7mx25pno73b2sm

Transcription-Enriched Joint Embeddings for Spoken Descriptions of Images and Videos [article]

Benet Oriol, Jordi Luque, Ferran Diego, Xavier Giro-i-Nieto
2020 arXiv   pre-print
The proposed methodology departs from a baseline system that spawns an embedding space trained with only spoken narratives and image cues.  ...  The triad of speech, image, and words allows for a better estimate of the point embedding and shows improved performance on tasks like image and speech retrieval, even when the text third modality  ...  We gratefully acknowledge the support of NVIDIA Corporation with the donation of GPUs used in this work.  ... 
arXiv:2006.00785v1 fatcat:yee3epxdozhgdalg3tg3d7zxwq

A Joint Sequence Fusion Model for Video Question Answering and Retrieval [article]

Youngjae Yu, Jongseok Kim, Gunhee Kim
2018 arXiv   pre-print
Although the JSFusion is a universal model to be applicable to any multimodal sequence data, this work focuses on video-language tasks including multimodal retrieval and video QA.  ...  We present an approach named JSFusion (Joint Sequence Fusion) that can measure semantic similarity between any pairs of multimodal sequence data (e.g. a video clip and a language sentence).  ...  We thank Jisung Kim and Antoine Miech for helpful comments about the model. This research was supported by Brain Research Program by National Research Foundation of Korea (NRF) (2017M3C7A1047860).  ... 
arXiv:1808.02559v1 fatcat:dvcj652bejckvfx7egrr5c4zmm

Multiple Visual-Semantic Embedding for Video Retrieval from Query Sentence

Huy Manh Nguyen, Tomo Miyazaki, Yoshihiro Sugaya, Shinichiro Omachi
2021 Applied Sciences  
Visual-semantic embedding aims to learn a joint embedding space where related video and sentence instances are located close to each other.  ...  Therefore, we can flexibly emphasize an embedding space. We conducted sentence-to-video retrieval experiments on a benchmark dataset.  ...  Like image-text retrieval approaches, most video-to-text retrieval methods learn a joint embedding space [29][30][31].  ... 
doi:10.3390/app11073214 fatcat:kslr4uyewrapbcus34awf7fdpq
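Visual-semantic embeddings of this kind are commonly trained with a hinge-based triplet ranking loss that pulls matched video-sentence pairs together and pushes mismatched pairs at least a margin apart. Below is a minimal sketch of that generic formulation (not necessarily the exact loss used in this paper; the margin value is illustrative):

```python
import numpy as np

def triplet_ranking_loss(v, s, margin=0.2):
    """Hinge-based ranking loss over a batch of matched video (v) and
    sentence (s) embeddings, assumed L2-normalized; row i of v matches row i of s."""
    sim = v @ s.T                 # cosine similarity matrix, shape (n, n)
    pos = np.diag(sim)            # similarities of the matched pairs
    # Penalize any negative that comes within `margin` of the positive,
    # in both the video->sentence and sentence->video directions.
    cost_s = np.maximum(0.0, margin + sim - pos[:, None])
    cost_v = np.maximum(0.0, margin + sim - pos[None, :])
    n = len(pos)
    mask = 1.0 - np.eye(n)        # exclude the positives themselves
    return ((cost_s + cost_v) * mask).sum() / n
```

When every matched pair is already more similar than every mismatched pair by at least the margin, the loss is exactly zero.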

Multiple Visual-Semantic Embedding for Video Retrieval from Query Sentence [article]

Huy Manh Nguyen, Tomo Miyazaki, Yoshihiro Sugaya, Shinichiro Omachi
2020 arXiv   pre-print
Visual-semantic embedding aims to learn a joint embedding space where related video and sentence instances are located close to each other.  ...  Therefore, we can flexibly emphasize an embedding space. We conducted sentence-to-video retrieval experiments on a benchmark dataset.  ...  Video and Sentence Embedding As with image-text retrieval approaches, most video-to-text retrieval methods learn a joint embedding space [19]–[21].  ... 
arXiv:2004.07967v1 fatcat:56smunbq65ct7okz2ccl3mlnne

X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval [article]

Satya Krishna Gorti, Noel Vouitsis, Junwei Ma, Keyvan Golestan, Maksims Volkovs, Animesh Garg, Guangwei Yu
2022 arXiv   pre-print
Our findings thereby highlight the importance of joint text-video reasoning to extract important visual cues according to text.  ...  Therefore, for a given text, a retrieval model should focus on the text's most semantically similar video sub-regions to make a more relevant comparison.  ...  Our goal is to bootstrap from a pre-trained joint text-image model and extend it towards a joint text-video model for the task of text-video retrieval. Text-Video Retrieval.  ... 
arXiv:2203.15086v1 fatcat:cde3cco37jhhrpojotudm33ihu
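The idea of focusing on a text's most semantically similar video sub-regions can be illustrated with text-conditioned attention pooling over frame embeddings. The sketch below is a loose simplification under assumed shapes, not X-Pool's actual attention module:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def text_conditioned_pool(text_emb, frame_embs, temperature=0.1):
    """Pool frame embeddings with attention weights given by each frame's
    similarity to the text query, so text-relevant frames dominate the result.
    text_emb: (d,) query embedding; frame_embs: (num_frames, d)."""
    scores = frame_embs @ text_emb / temperature   # (num_frames,)
    weights = softmax(scores)
    return weights @ frame_embs                    # weighted average of frames
```

A frame that closely matches the text query receives nearly all of the attention mass, so the pooled video representation is steered toward the text-relevant content.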

Multimodal Conversational AI: A Survey of Datasets and Approaches [article]

Anirudh Sundar, Larry Heck
2022 arXiv   pre-print
Finally, we identify multimodal co-learning as a promising direction for multimodal conversational AI research.  ...  Multimodal expressions are central to conversations; a rich set of modalities amplify and often compensate for each other.  ...  Srivastava and Salakhutdinov (2014) developed a multimodal Deep Boltzmann Machine for image-text retrieval and ASR using videos.  ... 
arXiv:2205.06907v1 fatcat:u6kehgeeq5aefdlvv5bpbwsvsa

ActBERT: Learning Global-Local Video-Text Representations [article]

Linchao Zhu, Yi Yang
2020 arXiv   pre-print
In this paper, we introduce ActBERT for self-supervised learning of joint video-text representations from unlabeled data.  ...  It uncovers global and local visual clues from paired video sequences and text descriptions for detailed visual and text relation modeling.  ...  With the guidance from the action features, we focus on video-text joint representation learning.  ... 
arXiv:2011.07231v1 fatcat:xh6lvxh4cfhylffq6ewlynftlq

Sign Language Video Retrieval with Free-Form Textual Queries [article]

Amanda Duarte, Samuel Albanie, Xavier Giró-i-Nieto, Gül Varol
2022 arXiv   pre-print
We validate the effectiveness of SPOT-ALIGN for learning a robust sign video embedding through improvements in both sign recognition and the proposed video retrieval task.  ...  To address this gap, in this work we introduce the task of sign language retrieval with free-form textual queries: given a written query (e.g., a sentence) and a large collection of sign language videos  ...  In this work, we address the task of sign language video retrieval with free-form textual queries by learning a joint embedding space between text and video as illustrated in Fig. 1 .  ... 
arXiv:2201.02495v1 fatcat:sqapj2bkvvdetktmhcmsylgq2m

End-to-end Generative Pretraining for Multimodal Video Captioning [article]

Paul Hongsuck Seo, Arsha Nagrani, Anurag Arnab, Cordelia Schmid
2022 arXiv   pre-print
Our model achieves state-of-the-art performance for multimodal video captioning on four standard benchmarks, as well as for other video understanding tasks such as VideoQA, video retrieval and action classification  ...  We present Multimodal Video Generative Pretraining (MV-GPT), a new pretraining framework for learning from unlabelled videos which can be effectively used for generative tasks such as multimodal video  ...  Video Retrieval: The common practice for retrieval is to train a video-text joint embedding using discriminative losses only, typically in the form of a standard NCE loss [14] , where each video clip  ... 
arXiv:2201.08264v2 fatcat:zj3fkpcnfzhyjmfgtvhkr3ljwy
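The standard NCE-style contrastive objective mentioned in the snippet treats each matched clip-caption pair in a batch as a positive and all other pairings as negatives. A minimal symmetric InfoNCE sketch (the temperature and shapes are illustrative assumptions):

```python
import numpy as np

def info_nce(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch; row i of video_emb and text_emb
    form a matched clip/caption pair, all other rows act as negatives."""
    logits = (video_emb @ text_emb.T) / temperature

    def xent(l):
        # Cross-entropy with the diagonal (matched pair) as the target class.
        l = l - l.max(axis=1, keepdims=True)               # numerical stability
        log_p = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_p))

    # Average the video->text and text->video directions.
    return 0.5 * (xent(logits) + xent(logits.T))
```

Perfectly aligned pairs drive the loss toward zero, while shuffled (mismatched) pairs produce a large loss, which is what pushes the two modalities into a shared embedding space during training.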

Deep Multimodal Learning for Affective Analysis and Retrieval

Lei Pang, Shiai Zhu, Chong-Wah Ngo
2015 IEEE transactions on multimedia  
More importantly, the joint representation enables emotion-oriented cross-modal retrieval, for example, retrieval of videos using the text query "crazy cat".  ...  Few attempts at combined analysis of multiple media have been made, despite the fact that emotion can be viewed as an expression of multimodal experience.  ...  Video Retrieval Three sets of experiments are conducted using text, video, and multimodal (text+video) queries. For each query, a joint representation is extracted using the proposed E-MDBM.  ... 
doi:10.1109/tmm.2015.2482228 fatcat:7tozmatnhvbj7hjjohkofngecq

Multimedia Semantic Integrity Assessment Using Joint Embedding Of Images And Text

Ayush Jaiswal, Ekraam Sabir, Wael AbdAlmageed, Premkumar Natarajan
2017 Proceedings of the 2017 ACM on Multimedia Conference - MM '17  
We construct a joint embedding of images and captions with deep multimodal representation learning on the reference dataset in a framework that also provides image-caption consistency scores (ICCSs).  ...  Real world multimedia data is often composed of multiple modalities such as an image or a video with associated text (e.g. captions, user comments, etc.) and metadata.  ...  Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright notation thereon.  ... 
doi:10.1145/3123266.3123385 dblp:conf/mm/JaiswalSAN17 fatcat:aq2sifpg6ncy3os42i5uvlp43a

Multimodal Research in Vision and Language: A Review of Current and Emerging Trends [article]

Shagun Uppal, Sarthak Bhagat, Devamanyu Hazarika, Navonil Majumdar, Soujanya Poria, Roger Zimmermann, Amir Zadeh
2020 arXiv   pre-print
We also address task-specific trends, along with their evaluation strategies and upcoming challenges.  ...  Deep Learning and its applications have driven impactful research and development across the diverse range of modalities present in real-world data.  ...  Along similar lines, a joint vision and text embedding space was learned using large-scale pre-training [133] for a variety of multimodal tasks, including VCR.  ... 
arXiv:2010.09522v2 fatcat:l4npstkoqndhzn6hznr7eeys4u