Tracking the Active Speaker Based on a Joint Audio-Visual Observation Model

Israel D. Gebru, Sileye Ba, Georgios Evangelidis, Radu Horaud
2015 2015 IEEE International Conference on Computer Vision Workshop (ICCVW)  
A probabilistic tracker exploits the on-image (spatial) coincidence of visual and auditory observations and infers a single latent variable which represents the identity of the active speaker.  ...  We here cast the diarization problem into a tracking formulation whereby the active speaker is detected and tracked over time.  ...  We propose a generative observation model, based on the recently proposed weighted-data Gaussian mixture [6], that evaluates the posterior probability that an observed person is the active speaker.  ...
doi:10.1109/iccvw.2015.96 dblp:conf/iccvw/GebruBEH15 fatcat:lruasrz6sfgn7imwdwdd7gne2y
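
As a rough illustration of the idea in this entry (not the paper's actual weighted-data Gaussian mixture), the sketch below scores each tracked face by how well sound-source localizations coincide with it on the image plane, then normalizes the scores into a posterior over speaker identity. The isotropic Gaussian model and the sigma value are assumptions.

```python
import numpy as np

def active_speaker_posterior(face_positions, audio_localizations, sigma=30.0):
    """Score each tracked face by the spatial coincidence of audio
    localizations, then normalize into a posterior over speaker identity.

    face_positions: (K, 2) image coordinates of K tracked faces.
    audio_localizations: (M, 2) image coordinates of M sound-source estimates.
    sigma: isotropic std-dev in pixels (an assumed value, not the paper's).
    """
    # Squared distances between every face and every audio observation.
    d2 = ((face_positions[:, None, :] - audio_localizations[None, :, :]) ** 2).sum(-1)
    # Per-face log-likelihood under an isotropic Gaussian observation model.
    loglik = (-0.5 * d2 / sigma**2).sum(axis=1)
    # Posterior over which face is the active speaker (uniform prior).
    w = np.exp(loglik - loglik.max())
    return w / w.sum()

faces = np.array([[120.0, 80.0], [400.0, 90.0]])
audio = np.array([[395.0, 95.0], [410.0, 88.0]])
print(active_speaker_posterior(faces, audio))  # mass concentrates on face 2
```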

Audiovisual Probabilistic Tracking of Multiple Speakers in Meetings

Daniel Gatica-Perez, Guillaume Lathoud, Jean-Marc Odobez, Iain McCowan
2007 IEEE Transactions on Audio, Speech, and Language Processing  
Visual observations are based on models of the shape and spatial structure of human heads.  ...  The model integrates audiovisual (AV) data through a novel observation model. Audio observations are derived from a source localization algorithm.  ...  ACKNOWLEDGMENT The authors would like to thank K. Smith (IDIAP Research Institute) for discussions, and all the participants in the meeting sequences for their time.  ... 
doi:10.1109/tasl.2006.881678 fatcat:2pmgi6psnjcnfahlapkiifwcmm
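
A minimal sketch of the multiplicative audio-visual observation model this entry describes, assuming conditional independence of the two cues; the Gaussian azimuth term, the edge-based head-shape score, and all parameter values are illustrative assumptions rather than the paper's model.

```python
import numpy as np

def av_likelihood(head_state, audio_azimuth, edge_score, sigma_az=0.15):
    """Toy observation likelihood combining an audio cue and a visual cue
    for one candidate head state (x, y, azimuth)."""
    x, y, azimuth = head_state
    # Audio term: Gaussian agreement between the state's azimuth and the
    # azimuth reported by a source-localization front end.
    p_audio = np.exp(-0.5 * ((azimuth - audio_azimuth) / sigma_az) ** 2)
    # Visual term: a head-shape score in [0, 1], e.g. from elliptical
    # edge matching at (x, y); here it is simply passed in.
    p_visual = edge_score
    # Assuming conditional independence, the joint likelihood is the product.
    return p_audio * p_visual
```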

Improving hands-free speech recognition in a car through audio-visual voice activity detection

Friedrich Faubel, Munir Georges, Kenichi Kumatani, Andres Bruhn, Dietrich Klakow
2011 2011 Joint Workshop on Hands-free Speech Communication and Microphone Arrays  
Audio-visual voice activity detection has the advantage of being more robust in acoustically demanding environments.  ...  In this work, we show how the speech recognition performance in a noisy car environment can be improved by combining audio-visual voice activity detection (VAD) with microphone array processing techniques  ...  AUDIO-VISUAL FEATURE EXTRACTION As mentioned before, audio-visual voice activity detection on the AVICAR [6] corpus poses a serious challenge.  ... 
doi:10.1109/hscma.2011.5942412 fatcat:tyowwiystvff5oidev4qgvne2m
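
The snippet below sketches one plausible form of audio-visual voice activity detection: a weighted geometric mean of an audio energy score and a lip-motion score. The feature choices, weights, and thresholds are assumptions for illustration, not the paper's front end.

```python
import numpy as np

def av_vad(frame_energy_db, lip_motion, w_audio=0.6, w_video=0.4,
           energy_thresh_db=-30.0, motion_thresh=0.02):
    """Minimal audio-visual VAD: log-linear fusion of two per-frame scores.
    All thresholds and weights are illustrative assumptions."""
    # Audio score: sigmoid of frame energy above a noise-floor threshold.
    s_audio = 1.0 / (1.0 + np.exp(-(frame_energy_db - energy_thresh_db)))
    # Video score: sigmoid of lip-region motion magnitude.
    s_video = 1.0 / (1.0 + np.exp(-(lip_motion - motion_thresh) / 0.01))
    # Weighted geometric mean of the two modality scores.
    score = s_audio ** w_audio * s_video ** w_video
    return score > 0.5
```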

Multimodal multispeaker probabilistic tracking in meetings

Daniel Gatica-Perez, Guillaume Lathoud, Jean-Marc Odobez, Iain McCowan
2005 Proceedings of the 7th international conference on Multimodal interfaces - ICMI '05  
The model integrates audio-visual (AV) data through a novel observation model. Audio observations are derived from a source localization algorithm.  ...  Visual observations are based on models of the shape and spatial structure of human heads.  ...  Acknowledgements This work was supported by the Swiss National Center of Competence in Research on Interactive Multimodal Information Management (IM2), and the EC projects Multimodal Meeting Manager (M4  ... 
doi:10.1145/1088463.1088496 dblp:conf/icmi/Gatica-PerezLOM05 fatcat:7enim62zkbchtkwtpqje2o6k2q

End-to-end multi-talker audio-visual ASR using an active speaker attention module [article]

Richard Rose, Olivier Siohan
2022 arXiv   pre-print
This is implemented as a transformer-transducer based end-to-end model and evaluated using a two speaker audio-visual overlapping speech dataset created from YouTube videos.  ...  The approach, referred to here as the visual context attention model (VCAM), is important because it uses the available video information to assign decoded text to one of multiple visible faces.  ...  The authors would like to thank Takaki Makino and Hank Liao for their contributions to A/V speech corpus development, Otavio Braga for work on efficient spatiotemporal convolutions for video analysis,  ... 
arXiv:2204.00652v1 fatcat:pvgbpsawmnhn3d7pnfkje4zfq4
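
A toy rendering of the core idea, attending over visible faces to decide which face track a decoded segment should be assigned to. The dot-product attention, shapes, and names are assumptions; the actual VCAM is a learned transformer module.

```python
import numpy as np

def face_attention(audio_query, face_embeddings):
    """Attend over visible faces: an audio-derived query selects which
    face track the decoded text is assigned to.

    audio_query: (d,) embedding of the current decoded segment.
    face_embeddings: (F, d) one embedding per visible face track.
    """
    # Scaled dot-product scores between the query and each face embedding.
    scores = face_embeddings @ audio_query / np.sqrt(audio_query.shape[0])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()          # softmax attention over faces
    return weights.argmax(), weights  # assigned face and soft weights
```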

Learning cross-modal appearance models with application to tracking

J.W. Fisher, T. Darrell
2003 2003 International Conference on Multimedia and Expo. ICME '03. Proceedings (Cat. No.03TH8698)  
We consider the problem of simultaneously learning an audio and visual appearance model of a moving subject.  ...  We present a method which successfully learns such a model without benefit of hand initialization, using only the associated audio signal to "decide" which object to model and track.  ...  In [1] we presented an approach for learning joint audio-visual statistical models based on a nonparametric estimate of mutual information  ...
doi:10.1109/icme.2003.1221541 dblp:conf/icmcs/FisherD03 fatcat:bvabrdic4rg7rmjdivssl7jeoe
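
As a crude stand-in for the paper's nonparametric mutual-information estimate, the sketch below uses the closed-form Gaussian MI of two 1-D projections to test which image region's motion is statistically dependent on the audio; everything here is an illustrative simplification.

```python
import numpy as np

def gaussian_mi(a, v):
    """Under a joint-Gaussian assumption, MI between two 1-D signals
    reduces to -0.5 * log(1 - rho^2). Useful for picking which image
    region's motion is statistically dependent on the audio envelope."""
    rho = np.corrcoef(a, v)[0, 1]
    return -0.5 * np.log(1.0 - rho**2 + 1e-12)

rng = np.random.default_rng(0)
audio_env = rng.standard_normal(500)
moving_region = 0.8 * audio_env + 0.2 * rng.standard_normal(500)
static_region = rng.standard_normal(500)
print(gaussian_mi(audio_env, moving_region))  # high: region tracks audio
print(gaussian_mi(audio_env, static_region))  # near zero
```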

Deep Metric Learning-Assisted 3D Audio-Visual Speaker Tracking via Two-Layer Particle Filter

Yidi Li, Hong Liu, Bing Yang, Runwei Ding, Yang Chen
2020 Complexity  
The current challenges are focused on the construction of a stable observation model.  ...  To this end, we propose a 3D audio-visual speaker tracker assisted by deep metric learning on the two-layer particle filter framework.  ...  A joint observation model is proposed in [11], which fuses audio, shape, and structure observations derived from audio and video in a multiplicative likelihood. The visual observation model [12] is derived  ...
doi:10.1155/2020/3764309 fatcat:ggnkyo2z2nbublaodrmf4bx35i
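
The sketch below shows one weight-update/resample step of a generic particle filter with the multiplicative audio-visual likelihood fusion mentioned in the snippet; the proposal noise, resampling trigger, and likelihood interfaces are assumptions, not the paper's two-layer design.

```python
import numpy as np

def pf_step(particles, weights, audio_lik, visual_lik, rng):
    """One update/resample step of a toy audio-visual particle filter.
    audio_lik and visual_lik map an (N, d) particle array to per-particle
    likelihoods; multiplying them mirrors multiplicative AV fusion."""
    # Diffuse particles with a random-walk proposal (assumed noise scale).
    particles = particles + rng.normal(scale=0.05, size=particles.shape)
    # Multiplicative audio-visual likelihood fusion.
    weights = weights * audio_lik(particles) * visual_lik(particles)
    weights /= weights.sum()
    # Resample when the effective sample size collapses below half of N.
    if 1.0 / (weights**2).sum() < 0.5 * len(weights):
        idx = rng.choice(len(weights), size=len(weights), p=weights)
        particles = particles[idx]
        weights = np.full(len(weights), 1.0 / len(weights))
    return particles, weights
```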

Variational Bayesian Inference for Audio-Visual Tracking of Multiple Speakers

Yutong Ban, Xavier Alameda-Pineda, Laurent Girin, Radu Horaud
2019 IEEE Transactions on Pattern Analysis and Machine Intelligence  
We propose to cast the problem at hand into a generative audio-visual fusion (or association) model formulated as a latent-variable temporal graphical model.  ...  In this article, we address the problem of tracking multiple speakers via the fusion of visual and auditory information.  ...  based on observing the audio activity over time.  ... 
doi:10.1109/tpami.2019.2953020 pmid:31751223 fatcat:vghfsjsecbahbfn3d3lps3f3rm
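
To make the association idea concrete, here is a hedged sketch of the E-step of a simple latent-assignment model: the posterior responsibility of each speaker for each audio or visual observation under per-speaker Gaussian observation models (an illustrative choice, not the paper's full variational scheme).

```python
import numpy as np

def assignment_responsibilities(obs, means, covs, priors):
    """Posterior probability that each observation was generated by each
    speaker, under per-speaker Gaussian observation models.

    obs: (M, d) audio or visual observations;
    means: (K, d); covs: (K, d, d); priors: (K,).
    """
    M, K = obs.shape[0], means.shape[0]
    logp = np.zeros((M, K))
    for k in range(K):
        diff = obs - means[k]
        inv = np.linalg.inv(covs[k])
        _, logdet = np.linalg.slogdet(covs[k])
        # Log prior plus Gaussian log-likelihood (constant terms cancel).
        logp[:, k] = (np.log(priors[k]) - 0.5 * logdet
                      - 0.5 * np.einsum('md,de,me->m', diff, inv, diff))
    logp -= logp.max(axis=1, keepdims=True)
    r = np.exp(logp)
    return r / r.sum(axis=1, keepdims=True)
```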

Variational Bayesian Inference for Audio-Visual Tracking of Multiple Speakers [article]

Yutong Ban, Xavier Alameda-Pineda, Laurent Girin, Radu Horaud
2019 arXiv   pre-print
We propose to cast the problem at hand into a generative audio-visual fusion (or association) model formulated as a latent-variable temporal graphical model.  ...  In this paper we address the problem of tracking multiple speakers via the fusion of visual and auditory information.  ...  based on observing the audio activity over time.  ... 
arXiv:1809.10961v2 fatcat:4gnillmunrahdcgwpg7aqjool4

AVA-ActiveSpeaker: An Audio-Visual Dataset for Active Speaker Detection [article]

Joseph Roth, Sourish Chaudhuri, Ondrej Klejch, Radhika Marvin, Andrew Gallagher, Liat Kaver, Sharadh Ramaswamy, Arkadiusz Stopczynski, Cordelia Schmid, Zhonghua Xi, Caroline Pantofaru
2019 arXiv   pre-print
We also present a new audio-visual approach for active speaker detection, and analyze its performance, demonstrating both its strength and the contributions of the dataset.  ...  The absence of a large, carefully labeled audio-visual dataset for this task has constrained algorithm evaluations with respect to data diversity, environments, and accuracy.  ...  We also present a joint audiovisual modeling approach for the active speaker detection task, which reduces the errors in visual-only approaches by 36%, and present an analysis of model performance across  ... 
arXiv:1901.01342v2 fatcat:46azdafq4jf4tar5o2umrn3mey

Audio constrained particle filter based visual tracking

Volkan Kilic, Mark Barnard, Wenwu Wang, Josef Kittler
2013 2013 IEEE International Conference on Acoustics, Speech and Signal Processing  
In this paper, we propose a new method of fusing audio into particle filter (PF) based visual tracking, for joint audio-visual (AV) tracking.  ...
doi:10.1109/icassp.2013.6638334 dblp:conf/icassp/KilicBWK13 fatcat:lqgiqqlviva4dn4ghlbai3uvfe
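
One common way to let audio constrain a visual particle filter, sketched below, is to relocate a fraction of the particles around the current audio source estimate so the tracker can recover from occlusion or rapid motion; the fraction and spread here are assumed values, not the paper's settings.

```python
import numpy as np

def audio_constrain_particles(particles, audio_pos, frac=0.25, spread=10.0,
                              rng=None):
    """Relocate a fraction of the particles around the audio source
    estimate, leaving the rest to the visual proposal."""
    if rng is None:
        rng = np.random.default_rng()
    n_move = int(frac * len(particles))
    idx = rng.choice(len(particles), size=n_move, replace=False)
    particles[idx] = audio_pos + rng.normal(
        scale=spread, size=(n_move, particles.shape[1]))
    return particles
```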

Voxel-based Viterbi Active Speaker Tracking (V-VAST) with best view selection for video lecture post-production

Damien Kelly, Anil Kokaram, Frank Boland
2011 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)  
The Viterbi algorithm is then used to estimate a track of the active speaker which maximizes the observed speech activity.  ...  This novel approach is termed Voxel-based Viterbi Active Speaker Tracking (V-VAST) and is shown to track speakers with an accuracy of 0.23m.  ...  The Viterbi algorithm is then used to estimate an active speaker track in 3D which maximizes the observed speech activity.  ... 
doi:10.1109/icassp.2011.5946941 dblp:conf/icassp/KellyKB11 fatcat:4hkechvwm5cfdl2xvn3crkohla
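
A minimal sketch of the Viterbi decoding this entry describes: pick, per frame, the discretized position that maximizes accumulated speech activity minus a movement cost. The quadratic index-distance penalty stands in for a real spatial transition cost.

```python
import numpy as np

def viterbi_track(activity, move_penalty=0.5):
    """Viterbi decoding of the active-speaker position over time.
    activity[t, v] is the measured speech activity at discretized
    position v and frame t. Returns the position sequence maximizing
    total activity minus movement cost."""
    T, V = activity.shape
    cost = move_penalty * (np.arange(V)[:, None] - np.arange(V)[None, :]) ** 2
    score = activity[0].copy()
    back = np.zeros((T, V), dtype=int)
    for t in range(1, T):
        cand = score[:, None] - cost      # cand[u, v]: move from u to v
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + activity[t]
    # Backtrace the maximizing path.
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```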

Probabilistic integration of sparse audio-visual cues for identity tracking

Keni Bernardin, Rainer Stiefelhagen, Alex Waibel
2008 Proceeding of the 16th ACM international conference on Multimedia - MM '08  
A probabilistic model is used to keep track of identified persons and update the belief in their identities whenever new observations can be made.  ...  The technique has been systematically evaluated on the CLEAR Interactive Seminar database, a large audio-visual corpus of realistic meeting scenarios captured in a variety of smart rooms.  ...  The authors wish to thank Hazim Ekenel, Tobias Gehrig and Qin Jin for their invaluable contributions to this work.  ... 
doi:10.1145/1459359.1459380 dblp:conf/mm/BernardinSW08 fatcat:m436blauoveippt4543jlj6rhy
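
The belief update this entry describes can be sketched as a discounted Bayesian filter over enrolled identities, applied whenever a sparse face-ID or voice-ID cue fires; the reliability discount and the scores below are illustrative assumptions.

```python
import numpy as np

def update_identity_belief(belief, id_scores, reliability=0.9):
    """Bayesian update of the belief over a tracked person's identity.
    id_scores is a normalized likelihood over the K enrolled identities;
    reliability discounts the cue toward a uniform distribution.
    Between observations the belief is simply carried over."""
    likelihood = reliability * id_scores + (1 - reliability) / len(belief)
    belief = belief * likelihood
    return belief / belief.sum()

belief = np.full(3, 1 / 3)  # three enrolled identities, uniform prior
belief = update_identity_belief(belief, np.array([0.7, 0.2, 0.1]))
belief = update_identity_belief(belief, np.array([0.8, 0.1, 0.1]))
print(belief)  # belief concentrates on identity 0
```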

Exploiting the Complementarity of Audio and Visual Data in Multi-speaker Tracking

Yutong Ban, Laurent Girin, Xavier Alameda-Pineda, Radu Horaud
2017 2017 IEEE International Conference on Computer Vision Workshops (ICCVW)  
In this paper we propose a probabilistic generative model that tracks multiple speakers by jointly exploiting auditory and visual features in their own representation spaces.  ...  Importantly, the method is robust to missing data and is therefore able to track even when observations from one of the modalities are absent.  ...  In this paper we propose a novel multi-speaker tracking method inspired from previous research on "instantaneous" audio-visual fusion [11, 12] .  ... 
doi:10.1109/iccvw.2017.60 dblp:conf/iccvw/BanGAH17 fatcat:izqulwjbp5bfvgndsri5a4vrku

Look Who's Talking: Active Speaker Detection in the Wild [article]

You Jin Kim, Hee-Soo Heo, Soyeon Choe, Soo-Whan Chung, Yoohwan Kwon, Bong-Jin Lee, Youngki Kwon, Joon Son Chung
2021 arXiv   pre-print
Face tracks are extracted from the videos and active segments are annotated based on the timestamps of VoxConverse in a semi-automatic way.  ...  In this work, we present a novel audio-visual dataset for active speaker detection in the wild. A speaker is considered active when his or her face is visible and the voice is audible simultaneously.  ...  It predicts whether or not the visible speaker is speaking based on the correlation between the audio and the video embeddings.  ... 
arXiv:2108.07640v1 fatcat:nhcapueoljb3fk5dhjdgtmq6ce
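
As described in the snippet, the speaking decision reduces to measuring agreement between synchronized audio and video embeddings. The sketch below uses cosine similarity with a fixed threshold purely for illustration; the actual model learns this correlation end to end.

```python
import numpy as np

def is_speaking(audio_emb, video_emb, threshold=0.5):
    """Decide whether the visible face is speaking from the agreement of
    synchronized audio and video embeddings (cosine similarity and the
    threshold are illustrative assumptions)."""
    a = audio_emb / (np.linalg.norm(audio_emb) + 1e-8)
    v = video_emb / (np.linalg.norm(video_emb) + 1e-8)
    return float(a @ v) > threshold
```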