
Multi-modal information fusion for news story segmentation in broadcast video

Bailan Feng, Peng Ding, Jiansong Chen, Jinfeng Bai, Su Xu, Bo Xu
2012 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)  
In this paper, we propose a novel news story segmentation scheme which can segment broadcast video into story units with a multi-modal information fusion (MMIF) strategy.  ...  Parallel to this, we make use of a multi-modal information fusion strategy for news story boundary characterization by joining these visual, audio and textual cues.  ...  CONCLUSIONS In this paper, we have presented a novel multi-modal information fusion scheme for news story segmentation.  ... 
doi:10.1109/icassp.2012.6288156 dblp:conf/icassp/FengDCBXX12 fatcat:g5g7fohlhre7fetvclum5ybsje

Everything at Once – Multi-modal Fusion Transformer for Video Retrieval [article]

Nina Shvetsova, Brian Chen, Andrew Rouditchenko, Samuel Thomas, Brian Kingsbury, Rogerio Feris, David Harwath, James Glass, Hilde Kuehne
2022 arXiv   pre-print
In this work, we present a multi-modal, modality agnostic fusion transformer approach that learns to exchange information between multiple modalities, such as video, audio, and text, and integrate them  ...  into a joint multi-modal representation to obtain an embedding that aggregates multi-modal temporal information.  ...  Limitations and Conclusion In this work, we propose a multi-modal, modality agnostic fusion transformer approach that learns to exchange information between multiple modalities, such as video, audio, and  ... 
arXiv:2112.04446v2 fatcat:dm3b2kcbdzazffcjyjrxc7fixa
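The modality-agnostic fusion described in this entry — tokens from video, audio, and text attending across modalities in one shared layer, then pooled into a joint embedding — can be sketched minimally as follows. The random projection matrices stand in for learned weights, and the mean-pooling choice is an assumption of this sketch, not the authors' implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse_modalities(tokens_by_modality, rng=np.random.default_rng(0)):
    """Concatenate token sequences from any subset of modalities and run one
    shared self-attention layer, so every token can attend across modalities;
    mean-pool the result into a single joint embedding."""
    x = np.concatenate(tokens_by_modality, axis=0)      # (n_tokens, d)
    d = x.shape[1]
    # Randomly initialized shared projections stand in for learned weights.
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    attn = softmax(q @ k.T / np.sqrt(d))                # cross-modal attention
    return (attn @ v).mean(axis=0)                      # joint embedding (d,)

# Toy token sequences of different lengths for three modalities.
video, audio, text = np.ones((4, 8)), np.zeros((2, 8)), np.full((3, 8), 0.5)
emb = fuse_modalities([video, audio, text])
print(emb.shape)  # (8,)
```

Because the attention layer is shared and operates on the concatenated sequence, the same function accepts any subset of modalities (e.g. only `[video, text]`), which is the "modality agnostic" property the abstract emphasizes.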

Multi-modal characteristics analysis and fusion for TV commercial detection

Nan Liu, Yao Zhao, Zhenfeng Zhu, Hanqing Lu
2010 2010 IEEE International Conference on Multimedia and Expo  
To boost discrimination between commercials and general programs in the multi-modal representation space, Tri-AdaBoost, a self-learning method that operates interactively across multiple modalities, is  ...  In this paper, a multi-modal (i.e. visual, audio and textual modalities) commercial digesting scheme is proposed to alleviate two challenges in commercial detection, which are the generation of mid-level  ...  For instance, it is desirable to filter out annoying commercials for TV viewers, who are likely to use digital TV set-top boxes to record the TV programs they are interested in.  ... 
doi:10.1109/icme.2010.5583867 dblp:conf/icmcs/LiuZZL10 fatcat:kklntbnil5hctkctr74wmny7pe

Multimodal Interaction Recognition Mechanism by Using Midas Featured By Data-Level and Decision-Level Fusion

Muhammad Habib, Noor ul Qamar
2017 Lahore Garrison University research journal of computer science and information technology  
programming abstractions.  ...  Research on data-level fusion models requires more focus on the fusion of multiple degradation-based sensor data.  ...  Criteria for Expressing Multi-Level: In this dissertation we focus on a fusion framework with both high-level programming language and architectural support.  ... 
doi:10.54692/lgurjcsit.2017.010227 fatcat:cqvkqnfafzf6fkz2bkmjcfybwq

Two-stage Visual Cues Enhancement Network for Referring Image Segmentation [article]

Yang Jiao, Zequn Jie, Weixin Luo, Jingjing Chen, Yu-Gang Jiang, Xiaolin Wei, Lin Ma
2021 arXiv   pre-print
and Enrichment Scheme (RES) and an Adaptive Multi-resolution feature Fusion (AMF) module are proposed.  ...  And the insufficient visual cues issue cannot be handled by the cross-modal fusion mechanisms as done in previous work.  ...  Cross-modal Fusion RIS model takes visual and textual features as input, and outputs a predicted segmentation mask.  ... 
arXiv:2110.04435v1 fatcat:zt23iztwbjbdhlxaxsnvfmuyei

Bi-modal biometric authentication on mobile phones in challenging conditions

Elie Khoury, Laurent El Shafey, Christopher McCool, Manuel Günther, Sébastien Marcel
2014 Image and Vision Computing  
It is also shown that multi-algorithm fusion provides a consistent performance improvement for face, speaker and bi-modal authentication.  ...  Using this bi-modal multi-algorithm system we derive a state-of-the-art authentication system that obtains a half total error rate of 6.3% and 1.9% for Female and Male trials, respectively.  ...  NICTA is funded by the Australian Government as represented by the Department of Broadband, Communications and the Digital Economy and the Australian Research Council through the ICT Centre of Excellence program  ... 
doi:10.1016/j.imavis.2013.10.001 fatcat:mphxii3br5gnrd7ryjieyilrda

Enhanced video analytics for sentiment analysis based on fusing textual, auditory and visual information

Sadam Al-Azani, El-Sayed M. El-Alfy
2020 IEEE Access  
Then, the effectiveness of various combinations of modalities is verified using multi-level fusion (feature, score and decision).  ...  INDEX TERMS Affective states, information fusion, natural language processing, sentiment analysis, social media, web mining, video analytics.  ...  level (TV) then TV was fused with text and visual modalities individually at score or decision level (TV-T-V). • Audio and visual modalities were fused at feature level (AV) then AV was fused with audio  ... 
doi:10.1109/access.2020.3011977 fatcat:2cqvyszmb5a3levrnekmnkokce
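The three fusion levels this entry verifies — feature, score, and decision — can be sketched generically. The concrete combiners below (concatenation, weighted averaging, majority voting) are common choices and an assumption of this sketch, not necessarily the exact ones used in the paper:

```python
import numpy as np

def feature_fusion(feats):
    """Feature level: concatenate per-modality features into one vector,
    which a single downstream classifier would then consume."""
    return np.concatenate(feats)

def score_fusion(scores, weights=None):
    """Score level: weighted average of per-modality posterior scores."""
    scores = np.asarray(scores, dtype=float)
    w = np.ones(len(scores)) if weights is None else np.asarray(weights, float)
    return float(w @ scores / w.sum())

def decision_fusion(decisions):
    """Decision level: majority vote over per-modality hard labels."""
    values, counts = np.unique(decisions, return_counts=True)
    return values[counts.argmax()]

# Hypothetical per-modality outputs for one video clip.
text_feat, audio_feat, visual_feat = np.ones(3), np.zeros(2), np.full(4, 2.0)
print(feature_fusion([text_feat, audio_feat, visual_feat]).shape)  # (9,)
print(score_fusion([0.9, 0.4, 0.7]))           # ~0.667
print(decision_fusion(["pos", "neg", "pos"]))  # pos
```

The snippet's TV / TV-T-V / AV notation then reads as compositions of these primitives: e.g. fuse text and visual at the feature level first, and fuse the resulting model's score with the remaining modalities at the score or decision level.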

Audiovisual speaker diarization of TV series

Xavier Bost, Georges Linares, Serigne Gueye
2015 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)  
In this paper, we propose to perform speaker diarization within dialogue scenes of TV series by combining the audio and video modalities: speaker diarization is first performed by using each modality,  ...  The results obtained by applying such a multi-modal approach to fictional films turn out to outperform those obtained by relying on a single modality.  ...  In [3] , the authors make use of an intermediate fusion approach to guide speaker diarization in TV broadcast by adding to the set of speech turns new instances originating in other sources of information  ... 
doi:10.1109/icassp.2015.7178882 dblp:conf/icassp/BostLG15 fatcat:wlzajdxw45gdhglevhlnzlucra

QCompere @ REPERE 2013

Hervé Bredin, Johann Poignant, Guillaume Fortier, Makarand Tapaswi, Viet Bac Le, Anindya Roy, Claude Barras, Sophie Rosset, Achintya Kumar Sarkar, Qian Yang, Hua Gao, Alexis Mignon (+5 others)
2013 Conference of the International Speech Communication Association  
Then, depending on the target modality (speaker or face recognition) and on the task (supervised or unsupervised recognition), four different fusion techniques are introduced: they can be summarized as  ...  challenge aims at gathering four communities (face recognition, speaker identification, optical character recognition and named entity detection) towards the same goal: multimodal person recognition in TV  ...  However, other sources of information are available in TV broadcast and can be used to achieve unsupervised person recognition.  ... 
dblp:conf/interspeech/Bredin13 fatcat:n6adpmw2oje6xlll6focosmwr4

Unsupervised Segmentation Methods of TV Contents

Elie El-Khoury, Christine Sénac, Philippe Joly
2010 International Journal of Digital Multimedia Broadcasting  
We present a generic algorithm to address various temporal segmentation topics of audiovisual contents such as speaker diarization, shot, or program segmentation.  ...  used in the field of content-based indexing.  ...  The method for segmenting a TV stream which is built on the detection of nonprogram segments (such as commercial breaks) uses two kinds of independent information.  ... 
doi:10.1155/2010/539796 fatcat:3oa7un2hyrhdtpwabdxwmcarjm

Fusion of Speech, Faces and Text for Person Identification in TV Broadcast [chapter]

Hervé Bredin, Johann Poignant, Makarand Tapaswi, Guillaume Fortier, Viet Bac Le, Thibault Napoleon, Hua Gao, Claude Barras, Sophie Rosset, Laurent Besacier, Jakob Verbeek, Georges Quénot (+2 others)
2012 Lecture Notes in Computer Science  
The Repere challenge is a project aiming at the evaluation of systems for supervised and unsupervised multimodal recognition of people in TV broadcast.  ...  Acknowledgment This work was partly realized as part of the Quaero Program and the QCompere project, respectively funded by OSEO (French State agency for innovation) and ANR (French national research agency  ...  One of the most interesting contributions of this paper is the improvement brought by multi-modal fusion of the written modality with speaker and head ones: around 20% absolute EGER decrease for both of  ... 
doi:10.1007/978-3-642-33885-4_39 fatcat:k4tsmhlqgbhb3n276byfpjaewm

UPC System for the 2015 MediaEval Multimodal Person Discovery in Broadcast TV task

Miquel India, David Varas, Verónica Vilaplana, Josep Ramon Morros, Javier Hernando
2015 MediaEval Benchmarking Initiative for Multimedia Evaluation  
These technologies are combined using a linear programming approach where some restrictions are imposed.  ...  The system outputs the identity of people that appear, talk and can be identified by using information appearing in the show (in our case, text with person names).  ...  The architecture which combines video and audio modalities after the fusion with the text stream has provided the best results.  ... 
dblp:conf/mediaeval/IndiaVVMH15 fatcat:bi7kbzim2vb2npszyzn4uoei7q
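The "linear programming approach where some restrictions are imposed" can be read as a constrained assignment problem: attach overlaid person names to audiovisual clusters so total affinity is maximal, with each name used at most once. The sketch below substitutes exhaustive search for an LP solver to keep it self-contained; the scores and names are hypothetical, not from the paper:

```python
from itertools import permutations

def assign_names(score, names):
    """Assign one on-screen name to each audiovisual cluster so that the
    total name-cluster affinity is maximal, under the restriction that each
    name is used at most once (brute-force stand-in for the LP)."""
    best, best_total = None, float("-inf")
    for perm in permutations(names, len(score)):
        total = sum(score[c][n] for c, n in enumerate(perm))
        if total > best_total:
            best, best_total = list(perm), total
    return best, best_total

# Hypothetical affinity of 2 clusters to 3 names read from on-screen text.
score = [{"Alice": 0.9, "Bob": 0.2, "Carol": 0.1},
         {"Alice": 0.8, "Bob": 0.7, "Carol": 0.3}]
names, total = assign_names(score, ["Alice", "Bob", "Carol"])
print(names, round(total, 1))  # ['Alice', 'Bob'] 1.6
```

A real system would encode the same objective and uniqueness restriction as an integer/linear program, which scales far better than enumeration when there are many clusters and candidate names.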

OS-MSL: One Stage Multimodal Sequential Link Framework for Scene Segmentation and Classification [article]

Ye Liu, Lingfeng Qiao, Di Yin, Zhuoxuan Jiang, Xinghua Jiang, Deqiang Jiang, Bo Ren
2022 arXiv   pre-print
Intuitively, joint learning of these two tasks can promote each other by sharing common information.  ...  However, scene segmentation focuses more on the local difference between adjacent shots, while classification needs the global representation of scene segments, which probably leads to the model dominated  ...  Early fusion means the features of multiple modalities are fused before the LSTM in LGSS. The method is denoted as LGSS-early.  ... 
arXiv:2207.01241v1 fatcat:y5rszvfnebapnhu6xf2gjbihvu
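The early-vs-late distinction drawn in this snippet (fusing modality features before the sequence model, versus combining per-modality outputs afterwards) can be illustrated with a minimal sketch; `mean_score` is a hypothetical stand-in for a trained per-modality scorer, not anything from the paper:

```python
import numpy as np

def early_fusion(feats, classify):
    """Early fusion: concatenate per-modality features into one vector
    before the sequence model / classifier ever sees them."""
    return classify(np.concatenate(feats))

def late_fusion(feats, classifiers, weights):
    """Late fusion: score each modality independently, then combine the
    per-modality scores with a weighted average."""
    scores = [clf(f) for clf, f in zip(classifiers, feats)]
    return float(np.dot(weights, scores) / np.sum(weights))

mean_score = lambda f: float(np.mean(f))   # toy per-modality scorer
visual, audio = np.array([0.2, 0.4]), np.array([0.8, 0.6])
print(early_fusion([visual, audio], mean_score))                  # 0.5
print(late_fusion([visual, audio], [mean_score, mean_score], [1, 3]))  # 0.6
```

The design trade-off the abstract hints at is visible even here: early fusion lets one model see cross-modal interactions at the feature level, while late fusion keeps modalities independent and only tunes their relative weights.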

MSRA-USTC-SJTU at TRECVID 2007: High-Level Feature Extraction and Search

Tao Mei, Xian-Sheng Hua, Wei Lai, Linjun Yang, Zheng-Jun Zha, Yuan Liu, Zhiwei Gu, Guo-Jun Qi, Meng Wang, Jinhui Tang, Xun Yuan, Zheng Lu (+1 others)
2007 TREC Video Retrieval Evaluation  
For automatic search, we fuse text, visual example, and concept-based models while using temporal consistency and face information for re-ranking and result refinement.  ...  well as the correlations between concepts by correlative multi-label learning.  ...  fusion of SVM-TV-EF-CV3, SVM-TVS-EF-CV3, and SVM-TVS-LF-CV3, (3) MLMIK-EF, MLMIK-EF-C+C-, and MLMIK-LF, and (4) CML-LF. • A MSRA USTC SJTU HLF 4: linearly weighted fusion of the top 5 methods for each  ... 
dblp:conf/trecvid/MeiHLYZLGQWTYLL07 fatcat:hkseihmeoncp5bltxvq4z5soqy

SLNSpeech: solving extended speech separation problem by the help of sign language [article]

Jiasong Wu, Taotao Li, Youyong Kong, Guanyu Yang, Lotfi Senhadji, Huazhong Shu
2020 arXiv   pre-print
Then, we design a general deep learning network for the self-supervised learning of three modalities, particularly, using sign language embeddings together with audio or audio-visual information for better  ...  Experimental results show that, besides the visual modality, the sign language modality can also be used alone to supervise the speech separation task.  ...  Fig. 5 is a pie chart representing the proportion of the number of segments per TV program; the same breakdown is also given in Table IV.  ... 
arXiv:2007.10629v1 fatcat:umggllkdofcjtc5ub6ugc2wpkq
Showing results 1 — 15 out of 2,618 results