
Audio-Visual Speaker Diarization Based on Spatiotemporal Bayesian Fusion

Israel D. Gebru, Sileye Ba, Xiaofei Li, Radu Horaud
2018 IEEE Transactions on Pattern Analysis and Machine Intelligence  
Speaker diarization consists of assigning speech signals to people engaged in a dialogue. An audio-visual spatiotemporal diarization model is proposed.  ...  The diarization itself is cast into a latent-variable temporal graphical model that infers speaker identities and speech turns, based on the output of an audio-visual association process, executed at each  ...  Fig. 1: The Bayesian spatiotemporal fusion model used for audio-visual speaker diarization. Shaded nodes represent the observed variables, while unshaded nodes represent latent variables.  ... 
doi:10.1109/tpami.2017.2648793 pmid:28103192 fatcat:cn6tcdf5n5dp7leyrrrevtxln4
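The formulation in this entry, a latent-variable temporal model inferring speaker identities from per-frame audio-visual association scores, can be illustrated with a minimal sketch. This is not the authors' model: the two-speaker setup, the "sticky" transition matrix, and the observation likelihoods below are all invented for illustration, and a plain HMM forward pass stands in for their full Bayesian fusion.

```python
import numpy as np

def forward_diarization(obs_lik, trans, init):
    """Forward pass: posterior P(speaker_t | observations_1..t) per frame.

    obs_lik : (T, K) per-frame likelihoods, standing in for the output of
              an audio-visual association step (hypothetical values here).
    trans   : (K, K) speaker-turn transition probabilities.
    init    : (K,) initial speaker distribution.
    """
    T, K = obs_lik.shape
    alpha = np.zeros((T, K))
    alpha[0] = init * obs_lik[0]
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):
        # predict with the turn-taking prior, then weight by the evidence
        alpha[t] = obs_lik[t] * (alpha[t - 1] @ trans)
        alpha[t] /= alpha[t].sum()  # normalise to a posterior
    return alpha

# Two speakers; sticky transitions encode that speech turns tend to persist.
trans = np.array([[0.9, 0.1], [0.1, 0.9]])
init = np.array([0.5, 0.5])
obs = np.array([[0.8, 0.2], [0.7, 0.3], [0.2, 0.8], [0.1, 0.9]])
post = forward_diarization(obs, trans, init)
labels = post.argmax(axis=1)  # frame-level speaker decisions
```

The temporal prior is what distinguishes this from per-frame classification: a brief dip in the association evidence need not flip the inferred speaker.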

Multimodal Speaker Diarization Using a Pre-Trained Audio-Visual Synchronization Model

Rehan Ahmad, Syed Zubair, Hani Alquhayz, Allah Ditta
2019 Sensors  
In this paper, we propose a novel multimodal speaker diarization technique, which finds the active speaker through an audio-visual synchronization model for diarization.  ...  A significant improvement is noticed with the proposed method in terms of DER when compared to conventional and fully supervised audio-based speaker diarization.  ...  [33] proposed multimodal speaker diarization based on spatiotemporal Bayesian fusion, where a supervised localization technique is used to map audio features onto the image.  ... 
doi:10.3390/s19235163 pmid:31775385 pmcid:PMC6929047 fatcat:5uncqvakpzbqzdsakafilsgbbu
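The core idea in this entry, selecting the active speaker as the face track best synchronised with the audio, can be sketched minimally. The embeddings and the cosine-similarity scoring below are stand-ins for the pre-trained audio-visual synchronization model the paper actually uses; all values are invented for illustration.

```python
import numpy as np

def active_speaker(audio_emb, face_embs):
    """Return the index of the face track best synchronised with the audio,
    plus the per-track scores. Cosine similarity stands in for the learned
    synchronisation score."""
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    scores = [cosine(audio_emb, f) for f in face_embs]
    return int(np.argmax(scores)), scores

# Hypothetical per-segment embeddings for one audio track and two face tracks.
audio = np.array([1.0, 0.2, 0.1])
faces = [np.array([0.9, 0.3, 0.2]),   # well-synchronised track
         np.array([-0.2, 1.0, 0.5])]  # poorly synchronised track
idx, scores = active_speaker(audio, faces)
```

In the full pipeline the winning face track's identity then labels the corresponding speech segment, which is what turns active-speaker detection into diarization.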

An efficient audiovisual saliency model to predict eye positions when looking at conversations

Antoine Coutrot, Nathalie Guyader
2015 23rd European Signal Processing Conference (EUSIPCO)  
Classic models of visual attention dramatically fail at predicting eye positions on visual scenes involving faces.  ...  This model includes a speaker diarization algorithm which automatically modulates the saliency of conversation partners' faces and bodies according to their speaking-or-not status.  ...  Based on these results, we proposed an audiovisual saliency model including a speaker diarization algorithm able to automatically spot "who speaks when" [10] .  ... 
doi:10.1109/eusipco.2015.7362640 dblp:conf/eusipco/CoutrotG15 fatcat:k4pma5eskrcrrknzwidi4llpzu
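The mechanism described in this entry, modulating the saliency of each face according to speaking-or-not status, can be sketched as a weighted sum of face-centred blobs. This is only an illustration of the weighting idea: the Gaussian face "blobs", the weight values, and the map dimensions are all invented, not the authors' model.

```python
import numpy as np

def audiovisual_saliency(shape, faces, speaking, w_speak=3.0, w_silent=1.0):
    """Build a toy saliency map where speaking faces get a higher weight.

    faces    : list of (row, col, sigma) face centres (hypothetical).
    speaking : one bool per face, e.g. from a speaker diarization step.
    """
    H, W = shape
    rows, cols = np.mgrid[0:H, 0:W]
    sal = np.zeros(shape)
    for (r, c, s), spk in zip(faces, speaking):
        blob = np.exp(-((rows - r) ** 2 + (cols - c) ** 2) / (2 * s ** 2))
        sal += (w_speak if spk else w_silent) * blob
    return sal / sal.max()  # normalise to [0, 1]

# Two faces; only the left one is currently speaking.
sal = audiovisual_saliency((60, 80), [(30, 20, 5), (30, 60, 5)], [True, False])
```

Driving the weights from a diarization algorithm rather than from manual annotation is what makes the model automatic, which is the point of the entry above.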

Guest Editors' Introduction to the Special Section on Learning with Shared Information for Computer Vision and Multimedia Analysis

Trevor Darrell, Christoph Lampert, Nicu Sebe, Ying Wu, Yan Yan
2018 IEEE Transactions on Pattern Analysis and Machine Intelligence  
The paper "Audio-Visual Speaker Diarization Based on Spatiotemporal Bayesian Fusion" by I. D. Gebru, S. Ba, X. Li, and R. Horaud.  ...  The authors propose to combine multiple-person visual tracking with multiple speech-source localization in a principled spatiotemporal Bayesian fusion model.  ... 
doi:10.1109/tpami.2018.2804998 fatcat:urin3tvgy5f7ng5djfvlm4mop4

Ego4D: Around the World in 3,000 Hours of Egocentric Video [article]

Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, Miguel Martin, Tushar Nagarajan (+73 others)
2022 arXiv preprint  
, audio-visual conversation, and social interactions), and future (forecasting activities).  ...  Portions of the video are accompanied by audio, 3D meshes of the environment, eye gaze, stereo, and/or synchronized videos from multiple egocentric cameras at the same event.  ...  Thank you to the Common Visual Data Foundation (CVDF) for hosting the Ego4D dataset.  ... 
arXiv:2110.07058v3 fatcat:lgh27km63nhcdcpkvbr2qarsru


2021 National Conference on Communications (NCC)  
From 1983-2003, he was on the faculty of the EE-Systems Department at USC. Since 2003, he has been on the faculty of IISc, Bengaluru.  ...  Finally, we will touch on certain security issues for these applications.  ...  Speaker: Prof. Preeti Rao is on the faculty of Electrical Engineering at I.I.T. Bombay, in the area of signal processing for speech and audio.  ... 
doi:10.1109/ncc52529.2021.9530194 fatcat:ahdw5ezvtrh4nb47l2qeos3dwq

Social Interactions Analysis through Deep Visual Nonverbal Features

Our method shows improved results even when compared to multimodal nonverbal features extracted from the audio and visual modalities.  ...  From the computing perspective, we propose visual activity-based nonverbal feature extraction from video streams by applying a deep learning approach along with feature encoding for low dimensional  ...  These audio-visual methods rely on joint modeling of speech, facial, and body cues, or are based on speaker diarization, while video is mainly used to track/localize the person to  ... 
doi:10.15167/shahid-muhammad_phd2021-03-05 fatcat:dejp3ms6ujdvndzf7svgkl6h4a

Processing systems for audiovisual resources (Media Assets)

Νικόλαος K. Τσίπας
of audio.  ...  A new method for audiovisual content segmentation based on speech/music discrimination is developed.  ...  "Audio-visual speaker diarization based on spatiotemporal Bayesian fusion".  ... 
doi:10.26262/ fatcat:gq44joju5nan5nhl6eilc4wx7q

Multimodal Visual Sensing: Automated Estimation of Engagement [article]

Ömer Sümer, Universität Tübingen; Enkelejda Kasneci (Prof. Dr.)
This dissertation presents a general framework based on multimodal visual sensing to analyze engagement and related tasks from visual modalities.  ...  Beneficial uses include classroom analytics to measure teaching quality and the development of interventions to improve teaching based on these analytics, as well as presentation analysis to help students  ...  From eye-tracking data, they found that human observers mainly concentrate on speakers' faces when viewing audio-visual recordings, but concentrate on speakers' bodies and gestures when viewing visual-only  ... 
doi:10.15496/publikation-55003 fatcat:jcki2sjwjzfbvm5vdanxsy7ohy