
Modeling Motion with Multi-Modal Features for Text-Based Video Segmentation [article]

Wangbo Zhao, Kai Wang, Xiangxiang Chu, Fuzhao Xue, Xinchao Wang, Yang You
2022 arXiv   pre-print
Text-based video segmentation aims to segment the target object in a video based on a describing sentence.  ...  Specifically, we propose a multi-modal video transformer, which can fuse and aggregate multi-modal and temporal features between frames.  ...  Acknowledgments: We thank Google TFRC for providing access to Cloud TPUs.  ... 
arXiv:2204.02547v1 fatcat:5t2grxhb55hxbdvqh3ibmionuy

End-to-end Multi-modal Video Temporal Grounding [article]

Yi-Wen Chen, Yi-Hsuan Tsai, Ming-Hsuan Yang
2021 arXiv   pre-print
Furthermore, we apply intra-modal self-supervised learning to enhance feature representations across videos for each modality, which also facilitates multi-modal learning.  ...  We address the problem of text-guided video temporal grounding, which aims to identify the time interval of a certain event based on a natural language description.  ...  Since the multi-modal feature contains information from the whole video, we only consider features that contain the action by extracting the corresponding video segment.  ... 
arXiv:2107.05624v2 fatcat:z4lsnvdfibb5xn5ynao62ud4vu

A Multimodal Framework for Video Ads Understanding [article]

Zejia Weng, Lingchen Meng, Rui Wang, Zuxuan Wu, Yu-Gang Jiang
2021 arXiv   pre-print
In multi-modal tagging, we first compute clip-level visual features by aggregating frame-level features with NeXt-SoftDBoF.  ...  In our framework, we break down the video structuring analysis problem into two tasks, i.e., scene segmentation and multi-modal tagging.  ...  We further use focal loss [9] for improved performance. Multi-Modal Tagging for Scenes Video ads are multi-modal in nature.  ... 
arXiv:2108.12868v1 fatcat:2632kdpoozb67j7ql4kbdkce2m
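The entry above computes clip-level visual features by aggregating frame-level features. The paper uses NeXt-SoftDBoF for this aggregation; the sketch below substitutes plain mean pooling as a minimal stand-in (an assumption for illustration, not the actual method).

```python
import numpy as np

def aggregate_frames(frame_feats):
    """Mean-pool frame-level features into a clip-level feature.

    frame_feats: (num_frames, dim) array-like of frame-level features.
    Returns a (dim,) clip-level feature vector.
    """
    return np.asarray(frame_feats, dtype=float).mean(axis=0)

# Two frames with 2-dimensional features -> one clip-level vector.
clip_feat = aggregate_frames([[1.0, 2.0], [3.0, 4.0]])
# clip_feat == array([2., 3.])
```

Mean pooling discards temporal ordering, which is why learned aggregators such as SoftDBoF variants are used in practice.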

Story boundary detection in large broadcast news video archives

Tat-Seng Chua, Shih-Fu Chang, Lekha Chaisorn, Winston Hsu
2004 Proceedings of the 12th annual ACM international conference on Multimedia - MULTIMEDIA '04  
The general results indicate that the judicious use of multi-modality features coupled with rigorous machine learning models can produce effective solutions.  ...  The techniques employed range from a simple anchor-person detector to sophisticated machine learning models based on HMM and Maximum Entropy (ME) approaches.  ...  To improve the accuracy, we must utilize multi-modality features, effective structure models, and knowledge of the news video domain.  ... 
doi:10.1145/1027527.1027679 dblp:conf/mm/ChuaCCH04 fatcat:f2rbucutkngljftuk5l2wutybu

A Multi-level Alignment Training Scheme for Video-and-Language Grounding [article]

Yubo Zhang, Feiyang Niu, Qing Ping, Govind Thattai
2022 arXiv   pre-print
To solve video-and-language grounding tasks, the key is for the network to understand the connection between the two modalities.  ...  Global and segment levels of video-language alignment pairs were designed, based on the information similarity ranging from high-level context to fine-grained semantics.  ...  In these approaches, both visual and textual features are fed into a transformer-based model usually pre-trained with multiple losses.  ... 
arXiv:2204.10938v2 fatcat:bp2kuuhecbhfhd3qse6rcgh3gm

Content-based video indexing for sports applications using integrated multi-modal approach

Dian Tjondronegoro, Yi-Ping Phoebe Chen, Binh Pham
2005 Proceedings of the 13th annual ACM international conference on Multimedia - MULTIMEDIA '05  
This doctoral work presents research on an integrated multi-modal approach for sports video indexing and retrieval.  ...  To sustain the ongoing rapid growth of video information, there is an emerging demand for a sophisticated content-based video indexing system.  ... 
doi:10.1145/1101149.1101362 dblp:conf/mm/TjondronegoroCP05 fatcat:r7flpxbh4fbfrbtnepauzo26du

From Representation to Reasoning: Towards both Evidence and Commonsense Reasoning for Video Question-Answering [article]

Jiangtong Li, Li Niu, Liqing Zhang
2022 arXiv   pre-print
For commonsense reasoning, we set up a two-step solution by answering the question and providing a proper reason.  ...  Video understanding has achieved great success in representation learning, such as video caption, video object grounding, and video descriptive question-answer.  ...  We thank all annotators for their remarkable work in data annotation.  ... 
arXiv:2205.14895v1 fatcat:uwhnm2jp7ranhkl6ewttpnbt3a

Enhanced video analytics for sentiment analysis based on fusing textual, auditory and visual information

Sadam Al-Azani, El-Sayed M. El-Alfy
2020 IEEE Access  
Moreover, an enhanced approach is presented for video analytics to predict the speaker's sentiment of multi-dialect Arabic through the integration of textual, auditory and visual modalities.  ...  Then, the effectiveness of various combinations of modalities is verified using multi-level fusion (feature, score and decision).  ...  Sadam Al-Azani also acknowledges the Scholarship provided by Thamar University, Yemen, for his higher studies.  ... 
doi:10.1109/access.2020.3011977 fatcat:2cqvyszmb5a3levrnekmnkokce
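The entry above verifies combinations of modalities using multi-level fusion at the feature, score, and decision levels. The sketch below illustrates what each level typically means (illustrative stand-ins, not the paper's code; concatenation, averaging, and majority vote are assumed choices).

```python
def feature_fusion(feat_text, feat_audio, feat_visual):
    # Feature-level: concatenate modality feature vectors, then feed
    # the joint vector to a single classifier.
    return feat_text + feat_audio + feat_visual  # list concatenation

def score_fusion(score_text, score_audio, score_visual):
    # Score-level: combine per-modality classifier confidences,
    # here by simple averaging.
    return (score_text + score_audio + score_visual) / 3.0

def decision_fusion(dec_text, dec_audio, dec_visual):
    # Decision-level: majority vote over per-modality hard labels.
    votes = [dec_text, dec_audio, dec_visual]
    return max(set(votes), key=votes.count)

joint = feature_fusion([0.1], [0.2], [0.3])   # [0.1, 0.2, 0.3]
avg = score_fusion(1.0, 0.5, 0.0)             # 0.5
label = decision_fusion("pos", "neg", "pos")  # "pos"
```

The three levels trade off differently: feature fusion lets a classifier learn cross-modal interactions, while score and decision fusion keep per-modality models independent and are robust to a missing modality.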

Fusion methods for multi-modal indexing of web data

Usman Niaz, Bernard Merialdo
2013 2013 14th International Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS)  
issues faced with fusing several modalities having different properties in the context of semantic indexing.  ...  We browse through the literature, examining state-of-the-art multi-modal fusion techniques ranging from naive combination of modalities to more complex machine learning methods, and discuss various  ...  [16] divides a video into segments and, to identify person P, selects the maximum score from different classifiers on multiple modalities as the decision for each segment.  ... 
doi:10.1109/wiamis.2013.6616129 dblp:conf/wiamis/NiazM13 fatcat:t65jw3zkbrbqdfe3iac7qrjx6q
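The per-segment max-score rule described in the snippet above can be sketched as follows (names and data are illustrative, not from the paper): for each video segment, the highest classifier score across modalities becomes that segment's fused score.

```python
def max_score_fusion(scores_by_modality):
    """scores_by_modality: dict of modality -> per-segment score list
    (all lists the same length). Returns one fused score per segment,
    taking the maximum over modalities."""
    per_segment = zip(*scores_by_modality.values())
    return [max(seg) for seg in per_segment]

fused = max_score_fusion({
    "visual": [0.2, 0.9, 0.4],
    "audio":  [0.6, 0.1, 0.3],
    "text":   [0.5, 0.7, 0.8],
})
# fused == [0.6, 0.9, 0.8]
```

Taking the maximum trusts whichever modality is most confident per segment, which works when false positives are rare but can be fragile if one modality's classifier is poorly calibrated.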

Overview of multimodal techniques for the characterization of sport programs

Nicola Adami, Riccardo Leonardi, Pierangelo Migliorati, Touradj Ebrahimi, Thomas Sikora
2003 Visual Communications and Image Processing 2003  
First we consider the techniques based on visual information, then the methods based on audio information, and finally the algorithms based on audio-visual cues, used in a multi-modal fashion.  ...  We focus this analysis on the typology of the signal (audio, video, text captions, ...) from which the low-level features are extracted.  ...  The analysis focuses on the typology of the signal (audio, video, text, multi-modal, ...) from which the low-level features are extracted. The paper is organized as follows.  ... 
doi:10.1117/12.510136 fatcat:6dscomwombhengno2znt5sbl6a

The Segmentation and Classification of Story Boundaries in News Video [chapter]

Lekha Chaisorn, Tat-Seng Chua
2002 Visual and Multimedia Information Management  
The segmentation and classification of news video into single-story semantic units is a challenging problem. This research proposes a two-level, multi-modal framework to tackle this problem.  ...  The video is analyzed at the shot and story-unit (or scene) levels using a variety of features and techniques.  ...  The authors would also like to thank Chin-Hui Lee, Rudy Setiono and Wee-Kheng Leow for their comments and fruitful discussions on this research.  ... 
doi:10.1007/978-0-387-35592-4_8 fatcat:euyz3qqpr5gxdlsntlkdlumwba

Multi-Modal Multiple-Instance Learning and Attribute Discovery with the Application to the Web Violent Video Detection [chapter]

Shuai Hao, Ou Wu, Weiming Hu, Jinfeng Yang
2013 Lecture Notes in Computer Science  
In this paper, we classify videos as violent or nonviolent using a Multi-Modal Multiple-Instance Learning and Attribute Discovery approach that combines audio-video with text information for web video  ...  Therefore, violent video recognition is becoming important for web content filtering.  ...  To do so, we merge attribute words based on text analysis, for example merging attributes with high co-occurrence or matching words.  ... 
doi:10.1007/978-3-642-42057-3_57 fatcat:axlmteqrdba4plkpah443vvzx4

CLIP2Video: Mastering Video-Text Retrieval via Image CLIP [article]

Han Fang, Pengfei Xiong, Luhui Xu, Yu Chen
2021 arXiv   pre-print
Leading approaches in the domain of video-and-language learning try to distill the spatio-temporal video features and multi-modal interaction between videos and languages from a large-scale video-text  ...  Specifically, based on the spatial semantics captured by Contrastive Language-Image Pretraining (CLIP) model, our model involves a Temporal Difference Block to capture motions at fine temporal video frames  ...  for semantic and motion modalities.  ... 
arXiv:2106.11097v1 fatcat:rsy5ezan6nfljajefczxp5pejy

Multi-source Multi-modal Activity Recognition in Aerial Video Surveillance

Riad I. Hammoud, Cem S. Sahin, Erik P. Blasch, Bradley J. Rhodes
2014 2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops  
We present a multi-source multi-modal activity/event recognition system for surveillance applications, consisting of: (1) detecting and tracking multiple dynamic targets from a moving platform, (2) representing  ...  In the context of this research, we deal with two unsynchronized data sources collected in real-world operating scenarios: full-motion videos (FMV) and analyst call-outs (ACO) in the form of chat messages  ...  The authors would like to thank Adnan Bubalo (AFRL), Robert Biehl, Brad Galego, Helen Webb and Michael Schneider (BAE Systems) for their support.  ... 
doi:10.1109/cvprw.2014.44 dblp:conf/cvpr/HammoudSBR14 fatcat:p7pqnfbzefdplpm5eztcgk2lge

Multi-Modal Emotion recognition on IEMOCAP Dataset using Deep Learning [article]

Samarth Tripathi, Sarthak Tripathi, Homayoon Beigi
2019 arXiv   pre-print
As technology advances, so does our understanding of emotions, and there is a growing need for automatic emotion recognition systems.  ...  In this paper we attempt to exploit the effectiveness of neural networks to perform multimodal emotion recognition on the IEMOCAP dataset using speech, text, and motion-capture data  ...  Text Model2 with stacked LSTMs and GloVe word embeddings is chosen for the text modality, Speech Model4 with two stacked bidirectional LSTMs with attention for the speech modality, and the combined Mocap Model1  ... 
arXiv:1804.05788v3 fatcat:5bxu2yszsjcjbli5ec3ju3lwky