Audio-Visual Clustering for 3D Speaker Localization [chapter]

Vasil Khalidov, Florence Forbes, Miles Hansard, Elise Arnaud, Radu Horaud
Lecture Notes in Computer Science  
We show that the identification and localization problem can be recast as the task of clustering the audio-visual observations into coherent groups.  ...  A microphone array can provide an estimate 3D location of each audio source.  ...  Research Group (Department of Computer Science, University of Sheffield) for helpful discussions and comments.  ... 
doi:10.1007/978-3-540-85853-9_8 fatcat:n4xa6watsve3zjg2yah23rq4dm

Vision-guided robot hearing

Xavier Alameda-Pineda, Radu Horaud
2014 The international journal of robotics research  
In this context, the detection and localisation of speakers plays a key role since it is the pillar on which several tasks (e.g., speech recognition and speaker tracking) rely.  ...  Indeed, the deterministic component allows us to map the visual information into the auditory space.  ...  The 3D visual features are mapped into the auditory space A through the audio-visual mapping (A ∘ V⁻¹).  ... 
doi:10.1177/0278364914548050 fatcat:onjyr7y2jzfhxiytcgjoaeicei

Multimodal Speaker Diarization Utilizing Face Clustering Information [chapter]

Ioannis Kapsouras, Anastasios Tefas, Nikos Nikolaidis, Ioannis Pitas
2015 Lecture Notes in Computer Science  
In this paper, we use visual information to aid speaker clustering.  ...  Multimodal clustering/diarization tries to answer the question "who spoke when" by using audio and visual information.  ...  The European Union is not liable for any use that may be made of the information contained therein.  ... 
doi:10.1007/978-3-319-21963-9_50 fatcat:hrl66z7gdncapa7qaqykvq52oa

Detection and localization of 3d audio-visual objects using unsupervised clustering

Vasil Khalidov, Florence Forbes, Miles Hansard, Elise Arnaud, Radu Horaud
2008 Proceedings of the 10th international conference on Multimodal interfaces - ICMI '08  
It is shown that the detection and localization problem can be recast as the task of clustering the audio-visual observations into coherent groups.  ...  This model maps the data into a common audio-visual 3D representation via a pair of mixture models.  ...  We use this in particular to determine active speakers using the auditory observations assignments η_k's. For every person we can derive the speaking state by the number of associated observations.  ... 
doi:10.1145/1452392.1452438 dblp:conf/icmi/KhalidovFHAH08 fatcat:rkdyghmti5efpjf5ahclil5hge

Motion Features from Lip Movement for Person Authentication

M.I. Faraj, J. Bigun
2006 18th International Conference on Pattern Recognition (ICPR'06)  
This paper describes a new motion based feature extraction technique for speaker identification using orientation estimation in 2D manifolds.  ...  By projecting the 3D spatiotemporal data to 2-D planes we obtain projection coefficients which we use to evaluate the 3-D orientations of brightness patterns in TV-like image sequences.  ...  Speaker verification based on audio and visual images from lip-movement gives 98% correct classification, which is 3-4% better than audio based speaker verification.  ... 
doi:10.1109/icpr.2006.814 dblp:conf/icpr/FarajB06 fatcat:ytjcun2rrbcsdpvrvwfkyn3s3q

Finding audio-visual events in informal social gatherings

Xavier Alameda-Pineda, Vasil Khalidov, Radu Horaud, Florence Forbes
2011 Proceedings of the 13th international conference on multimodal interfaces - ICMI '11  
To this end, we fully exploit the geometric and physical properties of an audio-visual sensor based on binocular vision and binaural hearing.  ...  We propose a new multimodal clustering algorithm based on a Gaussian mixture model, where one of the modalities (visual data) is used to supervise the clustering process.  ...  visual scene, or a speaker is occluded by another speaker/sound source.  ... 
doi:10.1145/2070481.2070527 dblp:conf/icmi/Alameda-PinedaKHF11 fatcat:m746sm43mrbyzkkjalimhys2g4

Dialocalization

Gerald Friedland, Chuohao Yeo, Hayley Hung
2010 ACM Transactions on Multimedia Computing, Communications, and Applications (TOMCCAP)  
The following article presents a novel audio-visual approach for unsupervised speaker localization in both time and space and systematically analyzes its unique properties.  ...  The proposed system is able to exploit audio-visual integration to not only improve the accuracy of a state-of-the-art (audio-only) speaker diarization, but also adds visual speaker localization at little  ...  ACKNOWLEDGMENTS We thank Adam Janin and Mary Knox for very helpful input on this article and Bao-Lan Huynh for the baseline experiments with the OpenCV face detector.  ... 
doi:10.1145/1865106.1865111 fatcat:tilcqsv3sjgcxolvngnhnhi6oa

Audio–visual person authentication using lip-motion from orientation maps

Maycel-Isaac Faraj, Josef Bigun
2007 Pattern Recognition Letters  
The XM2VTS database was used for performance quantification as it is currently the largest publicly available database (≈300 persons) containing both lip-motion and speech.  ...  Since the velocities are computed without extracting the speaker's lip-contours, more robust visual features can be obtained in comparison to motion features extracted from lip-contours.  ...  Fig. 6. The suggested joint audio-visual speaker verification system.  ... 
doi:10.1016/j.patrec.2007.02.017 fatcat:bpuqxo57mzbqphqybzdbv3k4tu

Deep Audio-visual Learning: A Survey

Hao Zhu, Man-Di Luo, Rui Wang, Ai-Hua Zheng, Ran He
2021 International Journal of Automation and Computing  
We divide the current audio-visual learning tasks into four different subfields: audio-visual separation and localization, audio-visual correspondence learning, audio-visual generation, and audio-visual  ...  In this paper, we provide a comprehensive survey of recent audio-visual learning development.  ...  For the audio stream, the researchers applied a neural network model to detect speech for clustering and subsequently assigned a frame cluster to the given audio cluster according to the majority principle  ... 
doi:10.1007/s11633-021-1293-0 fatcat:an5lfyf4m5fh7mlngmdcbx7joy

Audio Segmentation and Speaker Localization in Meeting Videos

H. Vajaria, T. Islam, S. Sarkar, R. Sankar, R. Kasturi
2006 18th International Conference on Pattern Recognition (ICPR'06)  
We compare our results with audio based segmentation method and our localization technique with the commonly used mutual information.  ...  In this effort, given a meeting room video, we attempt to segment individual person's speech and localize them in the video, based on data from a single audio and video source.  ...  Localization Once clusters for individual speakers are obtained, the next step is to localize the speaker in the corresponding video frames.  ... 
doi:10.1109/icpr.2006.283 dblp:conf/icpr/VajariaISSK06 fatcat:ubjxzo2ao5ev3on47ltxtw7hty

Cyberspatial audio technology

Michael Cohen, Jens Herder, William L. Martens
1999 Journal of the Acoustical Society of Japan (E)  
for such speaker array systems assume only rough speaker-placement guidelines.  ...  The red translucent cones visualize localization errors used by a clustering algorithm 17) to decide which sources can be coalesced.  ...  ments like nuclear power plants, fires, toxic waste dumps, and deep mining  ...  Besides the interest in spatial audio manifested by this paper, Cohen has research interests in telecommunication semiotics and hypermedia; Herder has interests in computer graphics, software engineering  ... 
doi:10.1250/ast.20.389 fatcat:37wpewb45jgl3a3xlr6tfn47ae

Speaker Detection and Applications to Cross-Modal Analysis of Planning Meetings

Bing Fang, Yingen Xiong, Francis Quek
2009 2009 11th IEEE International Symposium on Multimedia  
In this paper, we present an approach of speaker localization using combination of visual and audio information in multimodal meeting analysis.  ...  By computing correlation of audio signals, mouth movements, and hand motion, we detect a talking person both spatially and temporally. Three kinds of features are extracted for speaker localization.  ...  In this paper, we present our visual audio-based techniques to perform speaker localization in our meeting room.  ... 
doi:10.1109/ism.2009.66 dblp:conf/ism/FangXQ09 fatcat:dhk4spssorf2xcbdq7gbosxhzy

Blind Audiovisual Source Separation Based on Sparse Redundant Representations

Anna Llagostera Casanovas, Gianluca Monaci, Pierre Vandergheynst, Rémi Gribonval
2010 IEEE transactions on multimedia  
Results show that the proposed method is able to successfully detect, localize, separate and reconstruct present audio-visual sources.  ...  Based on this co-occurrence measure, audio-visual sources are counted and located in the image using a robust clustering algorithm that groups video structures exhibiting strong correlations with the audio  ...  Video atoms synchronous with the audio track and that are spatially close are grouped together using a clustering algorithm that counts and localizes on the image plane audio-visual sources.  ... 
doi:10.1109/tmm.2010.2050650 fatcat:rusd73kyvjfbre6lgup4i366xe

Deep Audio-Visual Learning: A Survey [article]

Hao Zhu, Mandi Luo, Rui Wang, Aihua Zheng, Ran He
2020 arXiv   pre-print
We divide the current audio-visual learning tasks into four different subfields: audio-visual separation and localization, audio-visual correspondence learning, audio-visual generation, and audio-visual  ...  Audio-visual learning, aimed at exploiting the relationship between audio and visual modalities, has drawn considerable attention since deep learning started to be used successfully.  ...  For the audio stream, the researchers applied a neural network model to detect speech for clustering and subsequently assigned a frame cluster to the given audio cluster according to the majority principle  ... 
arXiv:2001.04758v1 fatcat:p6ph5cujl5do3pzlpvcce35nvi

AVA-AVD: Audio-visual Speaker Diarization in the Wild [article]

Eric Zhongcong Xu, Zeyang Song, Chao Feng, Mang Ye, Mike Zheng Shou
2021 arXiv   pre-print
Audio-visual speaker diarization aims at detecting "who spoke when" using both auditory and visual signals.  ...  To overcome it, we propose a novel Audio-Visual Relation Network (AVR-Net) which introduces an effective modality mask to capture discriminative information based on visibility.  ...  Ava active speaker: An audio-visual modal speaker clustering in full length movies. Multimedia dataset for active speaker detection.  ... 
arXiv:2111.14448v3 fatcat:b6ayj24h4jb4hn5t2h5tsghk4e