
Audio-visual speaker localization via weighted clustering

Israel D. Gebru, Xavier Alameda-Pineda, Radu Horaud, Florence Forbes
2014 · 2014 IEEE International Workshop on Machine Learning for Signal Processing (MLSP)
The clustering algorithm is applied to the problem of detecting and localizing a speaker over time using both visual and auditory observations gathered with a single camera and two microphones.  ...  We propose a novel weighted clustering method based on a finite mixture model which explores the idea of non-uniform weighting of observations.  ...  A number of authors addressed speaker localization based on audio-visual fusion.  ... 
doi:10.1109/mlsp.2014.6958874 dblp:conf/mlsp/GebruAHF14 fatcat:fpdroldve5bupgqrbzwidaa76i

Portable meeting recorder

Dar-Shyang Lee, Berna Erol, Jamey Graham, Jonathan J. Hull, Norihiko Murata
2002 Proceedings of the tenth ACM international conference on Multimedia - MULTIMEDIA '02  
Composed of an omni-directional video camera with four-channel audio capture, the system saves a view of all the activity in a meeting and the directions from which people spoke.  ...  Subsequent analysis computes metadata that includes video activity analysis of the compressed data stream and audio processing that helps locate events that occurred during the meeting.  ...  Our experiments showed that basing speaker segmentation on the results of sound localization performed much better than using audio features for speaker clustering.  ... 
doi:10.1145/641007.641111 dblp:conf/mm/LeeEGHM02 fatcat:uzgit4hzercd3g6nai7md3nz3e

Visual model structures and synchrony constraints for audio-visual speech recognition

T.J. Hazen
2006 IEEE Transactions on Audio, Speech, and Language Processing  
This paper presents the design and evaluation of a speaker-independent audio-visual speech recognition (AVSR) system that utilizes a segment-based modeling strategy.  ...  between the audio and visual streams.  ...  The decision fusion between the audio and visual models is performed via a weighted linear combination of the segment-level scores generated from each model.  ... 
doi:10.1109/tsa.2005.857572 fatcat:htu747fqsvbwflee4tsppocpzm
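The decision-fusion idea quoted in this entry — a weighted linear combination of segment-level audio and visual scores — can be sketched as follows. This is a minimal illustration, not the paper's implementation; the weight value and score lists are invented, and the function name is hypothetical.

```python
def fuse_segment_scores(audio_scores, visual_scores, lambda_av=0.7):
    """Per-segment decision fusion:
    fused = lambda * audio_score + (1 - lambda) * visual_score.

    lambda_av controls the relative trust in the audio stream;
    the value 0.7 here is purely illustrative.
    """
    if len(audio_scores) != len(visual_scores):
        raise ValueError("score lists must align per segment")
    return [lambda_av * a + (1.0 - lambda_av) * v
            for a, v in zip(audio_scores, visual_scores)]
```

In practice the scores would be log-likelihoods from the audio and visual models, and the weight could itself be tuned on held-out data.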

EM Algorithms for Weighted-Data Clustering with Application to Audio-Visual Scene Analysis

Israel Dejene Gebru, Xavier Alameda-Pineda, Florence Forbes, Radu Horaud
2016 IEEE Transactions on Pattern Analysis and Machine Intelligence  
We also demonstrate the effectiveness and robustness of the proposed clustering technique in the presence of heterogeneous data, namely audio-visual scene analysis.  ...  art parametric and non-parametric clustering techniques.  ...  localize active speakers in complex audio-visual scenes.  ... 
doi:10.1109/tpami.2016.2522425 pmid:27824582 fatcat:5sbwwa7hgnggppectm7hskif2u
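The weighted-data clustering idea in this entry can be sketched as one EM iteration for a 1-D, two-component Gaussian mixture in which each observation carries a weight that scales its influence on the M-step. This is a deliberately simplified illustration with fixed, known weights (the paper treats weights more generally); all names and numbers are invented.

```python
import math

def gauss(x, mu, var):
    """Univariate Gaussian density."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def weighted_em_step(xs, ws, mus, variances, priors):
    """One EM iteration for a 2-component mixture with
    per-observation weights ws (a simplified sketch)."""
    # E-step: responsibilities r[i][k] from current parameters.
    resp = []
    for x in xs:
        p = [priors[k] * gauss(x, mus[k], variances[k]) for k in range(2)]
        s = sum(p)
        resp.append([pk / s for pk in p])
    # M-step: means re-estimated with observation weight * responsibility.
    new_mus = []
    for k in range(2):
        num = sum(w * r[k] * x for x, w, r in zip(xs, ws, resp))
        den = sum(w * r[k] for w, r in zip(ws, resp))
        new_mus.append(num / den)
    return new_mus, resp
```

Down-weighting an observation (small `w_i`) shrinks its contribution to the component means, which is the mechanism that makes such models robust to unreliable audio or visual observations.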

Multiple active speaker localization based on audio-visual fusion in two stages

Zhao Li, Thorsten Herfet, Martin Grochulla, Thorsten Thormahlen
2012 · 2012 IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems (MFI)
In the first stage, speaker activity is detected based on the audio-visual fusion which can handle false lip movements.  ...  The audio modality alone has problems with localization accuracy while the video modality alone has problems with false speaker activity detections.  ...  Audio-visual Results Integration Based on the audio-visual fusion, the speaker activity is plausible and the location of each active speaker in view can be used as the final localization result.  ... 
doi:10.1109/mfi.2012.6343015 dblp:conf/mfi/0001HGT12 fatcat:nidn2crarndrllcmiksktniux4

Tracking the Active Speaker Based on a Joint Audio-Visual Observation Model

Israel D. Gebru, Sileye Ba, Georgios Evangelidis, Radu Horaud
2015 · 2015 IEEE International Conference on Computer Vision Workshop (ICCVW)
The modules that translate raw audio and visual data into on-image observations are also described in detail.  ...  Both visual and auditory observations are explained by a recently proposed weighted-data mixture model, while several options for the speaking turns dynamics are fulfilled by a multi-case transition model  ...  The proposed audio-visual tracker associates people detected in the image sequence with these sound directions via audio-visual clustering that is combined with an active-speaker transition model.  ... 
doi:10.1109/iccvw.2015.96 dblp:conf/iccvw/GebruBEH15 fatcat:lruasrz6sfgn7imwdwdd7gne2y

Stream weight estimation for multistream audio–visual speech recognition in a multispeaker environment

Xu Shao, Jon Barker
2008 Speech Communication  
The paper considers the problem of audio-visual speech recognition in a simultaneous (target/masker) speaker environment.  ...  The paper follows a conventional multistream approach and examines the specific problem of estimating reliable time-varying audio and visual stream weights.  ...  The audio-visual recognition accuracy is plotted against global SNR and is shown for both a constant stream weight (estimated from the global SNR) and a time-varying stream weight (estimated from the local  ... 
doi:10.1016/j.specom.2007.11.002 fatcat:65nnhdkq45f5ha7vi6jazqypae

Data Fusion for Audiovisual Speaker Localization: Extending Dynamic Stream Weights to the Spatial Domain [article]

Julio Wissing, Benedikt Boenninghoff, Dorothea Kolossa, Tsubasa Ochiai, Marc Delcroix, Keisuke Kinoshita, Tomohiro Nakatani, Shoko Araki, Christopher Schymura
2021 arXiv pre-print
This paper proposes a novel audiovisual data fusion framework for speaker localization by assigning individual dynamic stream weights to specific regions in the localization space.  ...  This fusion is achieved via a neural network, which combines the predictions of individual audio and video trackers based on their time- and location-dependent reliability.  ...  speaker localization and tracking [16, 17] .  ... 
arXiv:2102.11588v2 fatcat:5lfluv74kzejponlebqgfgb7v4
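The spatial extension of dynamic stream weights described in this entry can be sketched over a discretized localization grid: each cell gets its own audio-stream weight, and the two modalities are fused log-linearly per cell. The grid values and weights below are invented for illustration; the actual framework estimates the weights with a neural network.

```python
def fuse_grid(p_audio, p_video, weights):
    """Location-dependent log-linear fusion over grid cells:
    fused[i] is proportional to p_audio[i]**w[i] * p_video[i]**(1 - w[i]),
    then normalized over the grid. w[i] = 1 trusts audio fully in
    cell i; w[i] = 0 trusts video fully."""
    raw = [pa ** w * pv ** (1.0 - w)
           for pa, pv, w in zip(p_audio, p_video, weights)]
    total = sum(raw)
    return [r / total for r in raw]
```

Per-cell weights let the fusion trust audio in regions where the video tracker is occluded, and vice versa, rather than applying one global weight everywhere.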

Multimodal Speaker Diarization Using a Pre-Trained Audio-Visual Synchronization Model

Ahmad, Zubair, Alquhayz, Ditta
2019 Sensors  
In this paper, we propose a novel multimodal speaker diarization technique, which finds the active speaker through audio-visual synchronization model for diarization.  ...  This method helps in generating speaker specific clusters with high probability.  ...  This is achieved by sound source localization in the audio and multiple person visual tracking in the video which are fused via a supervised technique.  ... 
doi:10.3390/s19235163 pmid:31775385 pmcid:PMC6929047 fatcat:5uncqvakpzbqzdsakafilsgbbu

Audio-Visual Perception System for a Humanoid Robotic Head

Raquel Viciana-Abad, Rebeca Marfil, Jose Perez-Lorenzo, Juan Bandera, Adrian Romero-Garces, Pedro Reche-Lopez
2014 Sensors  
Different approaches follow bio-inspired mechanisms, merging audio and visual cues to localize a person using multiple sensors.  ...  With the goal of demonstrating the benefit of fusing sensory information with a Bayes inference for interactive robotics, this paper presents a system for localizing a person by processing visual and audio  ...  In particular, the approach followed to implement this sensor consists of fusing visual and audio evidence about the presence of a speaker via using a Bayes network.  ... 
doi:10.3390/s140609522 pmid:24878593 pmcid:PMC4118331 fatcat:e6lggdjimrayjemvere46mkzwy
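The Bayes-network fusion mentioned in this entry can be sketched as naive-Bayes combination of audio and visual likelihoods over a discrete set of candidate speaker positions, assuming the two modalities are conditionally independent given the position. The likelihood values below are made up for illustration.

```python
def bayes_fuse(prior, lik_audio, lik_visual):
    """Posterior over candidate positions:
    post[i] proportional to prior[i] * p(audio | i) * p(visual | i),
    normalized to sum to 1 (conditional-independence assumption)."""
    post = [p * la * lv for p, la, lv in zip(prior, lik_audio, lik_visual)]
    z = sum(post)
    return [p / z for p in post]
```

When both modalities agree on a position, the product sharpens the posterior there; when they conflict, the posterior spreads, which is the behavior such a fusion exploits for robust localization.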

Spotting Audio-Visual Inconsistencies (SAVI) in Manipulated Video

Robert Bolles, J. Brian Burns, Martin Graciarena, Andreas Kathol, Aaron Lawson, Mitchell McLaren, Thomas Mensink
2017 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)  
The speaker identity inconsistency process was challenged by the complexity of comparing face tracks and audio speech clusters, requiring a novel method of fusing these two sources.  ...  Here, we focus on inconsistencies between the type of scenes detected in the audio and visual modalities (e.g., audio indoor, small room versus visual outdoor, urban), and inconsistencies in speaker identity  ...  audio speaker clusters.  ... 
doi:10.1109/cvprw.2017.238 dblp:conf/cvpr/BollesBGKLMM17 fatcat:ihsgx3d2xrgcjinipty4mjexwa

AVA-AVD: Audio-Visual Speaker Diarization in the Wild [article]

Eric Zhongcong Xu, Zeyang Song, Satoshi Tsutsui, Chao Feng, Mang Ye, Mike Zheng Shou
2022 arXiv pre-print
Audio-visual speaker diarization aims at detecting "who spoke when" using both auditory and visual signals.  ...  To develop diarization methods for these challenging videos, we create the AVA Audio-Visual Diarization (AVA-AVD) dataset.  ...  Given a video with multiple speakers (Figure 3 ), our goal is to localize the audible utterances and label them with video level speaker identity leveraging audio-visual cues.  ... 
arXiv:2111.14448v4 fatcat:kz33equ5xbhs3asencw4uchou4

Online Diarization of Streaming Audio-Visual Data for Smart Environments

Joerg Schmalenstroeer, Reinhold Haeb-Umbach
2010 IEEE Journal on Selected Topics in Signal Processing  
In this paper, a system for joint temporal segmentation, speaker localization, and identification is presented, which is supported by face identification from video data obtained from a steerable camera  ...  to steer the camera towards the speaker during ambient communication.  ...  Thus, in a typical setup the wireless RFID localization system tracks room changes of the user to redirect the audio-visual data to the entered room, while the speaker diarization locates the speaker within  ... 
doi:10.1109/jstsp.2010.2050519 fatcat:w2r5ddnxkvaj3fwfc7pqr2eb6y

A segment-based audio-visual speech recognizer

Timothy J. Hazen, Kate Saenko, Chia-Hao La, James R. Glass
2004 Proceedings of the 6th international conference on Multimodal interfaces - ICMI '04  
This paper presents the development and evaluation of a speaker-independent audio-visual speech recognition (AVSR) system that utilizes a segment-based modeling strategy.  ...  To support this research, we have collected a new video corpus, called Audio-Visual TIMIT (AV-TIMIT), which consists of 4 total hours of read speech collected from 223 different speakers.  ...  Audio-Visual Asynchrony There is an inherent asynchrony between the visual and audio cues of speech. Speech is produced via the closely coordinated movement of several articulators.  ... 
doi:10.1145/1027933.1027972 dblp:conf/icmi/HazenSLG04 fatcat:nwun67u7kfhono46ob6bo5tucy