Audio-Visual Speaker Diarization Based on Spatiotemporal Bayesian Fusion
2018
IEEE Transactions on Pattern Analysis and Machine Intelligence
Speaker diarization consists of assigning speech signals to people engaged in a dialogue. An audio-visual spatiotemporal diarization model is proposed. ...
The diarization itself is cast into a latent-variable temporal graphical model that infers speaker identities and speech turns, based on the output of an audio-visual association process, executed at each ...
Fig. 1: The Bayesian spatiotemporal fusion model used for audio-visual speaker diarization. Shaded nodes represent the observed variables, while unshaded nodes represent latent variables. ...
doi:10.1109/tpami.2017.2648793
pmid:28103192
fatcat:cn6tcdf5n5dp7leyrrrevtxln4
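The model described in this entry infers speaker identities and speech turns from the per-frame output of an audio-visual association process. As a minimal illustration of that temporal inference (a sketch under simplifying assumptions, not the paper's actual Bayesian model), the Python snippet below runs a sticky-transition Viterbi decode over per-frame association probabilities; the stay_prob prior and the uniform initial distribution are assumptions of this sketch.

import numpy as np

def viterbi_speaker_turns(assoc_probs, stay_prob=0.95):
    """Decode the most likely per-frame speaker sequence from
    audio-visual association probabilities.

    assoc_probs: (T, K) array with assoc_probs[t, k] ~ p(speaker k | frame t),
                 assumed to come from an upstream audio-visual association
                 step (K >= 2 speakers).
    stay_prob:   prior probability that the active speaker is unchanged
                 between consecutive frames; values near 1 encourage long,
                 contiguous speech turns.
    """
    T, K = assoc_probs.shape
    # Sticky transition matrix: high self-transition, uniform elsewhere.
    trans = np.full((K, K), (1.0 - stay_prob) / (K - 1))
    np.fill_diagonal(trans, stay_prob)

    log_emit = np.log(assoc_probs + 1e-12)
    log_trans = np.log(trans)

    # Standard Viterbi recursion in log space.
    delta = np.zeros((T, K))
    back = np.zeros((T, K), dtype=int)
    delta[0] = np.log(1.0 / K) + log_emit[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans   # (prev K, next K)
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_emit[t]

    # Backtrack the best state path: one speaker label per frame.
    path = np.zeros(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = back[t + 1][path[t + 1]]
    return path

The high self-transition probability is what turns noisy per-frame association into contiguous speech turns.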
Multimodal Speaker Diarization Using a Pre-Trained Audio-Visual Synchronization Model
2019
Sensors
In this paper, we propose a novel multimodal speaker diarization technique, which finds the active speaker through an audio-visual synchronization model for diarization. ...
A significant improvement is noticed with the proposed method in terms of diarization error rate (DER) when compared to conventional and fully supervised audio-based speaker diarization. ...
[33] proposed multimodal speaker diarization based on spatiotemporal Bayesian fusion, where a supervised localization technique is used to map audio features onto the image. ...
doi:10.3390/s19235163
pmid:31775385
pmcid:PMC6929047
fatcat:5uncqvakpzbqzdsakafilsgbbu
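This entry reports gains in DER, so a rough illustration of the metric may help. The sketch below computes a simplified frame-level DER; standard scoring (e.g. the NIST md-eval tool) operates on timed segments with a forgiveness collar, and the silence-label convention and the scipy-based optimal speaker mapping here are assumptions of the sketch.

import numpy as np
from scipy.optimize import linear_sum_assignment

def frame_level_der(ref, hyp, silence=-1):
    """Simplified frame-level diarization error rate (DER):
    (missed speech + false alarm + speaker confusion) / total speech.

    ref, hyp: integer arrays of per-frame speaker labels, with `silence`
              marking non-speech frames.
    """
    ref = np.asarray(ref)
    hyp = np.asarray(hyp)
    speech = ref != silence

    miss = np.sum(speech & (hyp == silence))    # speech labelled as silence
    fa = np.sum(~speech & (hyp != silence))     # silence labelled as speech

    # Best one-to-one mapping between reference and hypothesis speakers
    # (maximising frame overlap) before counting confusion errors.
    ref_ids = np.unique(ref[speech])
    hyp_ids = np.unique(hyp[hyp != silence])
    overlap = np.zeros((len(ref_ids), len(hyp_ids)))
    for i, r in enumerate(ref_ids):
        for j, h in enumerate(hyp_ids):
            overlap[i, j] = np.sum((ref == r) & (hyp == h))
    rows, cols = linear_sum_assignment(-overlap)  # negate to maximise
    matched = overlap[rows, cols].sum()

    both_speech = np.sum(speech & (hyp != silence))
    confusion = both_speech - matched           # wrongly attributed frames

    total_speech = np.sum(speech)
    return (miss + fa + confusion) / max(total_speech, 1)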
An efficient audiovisual saliency model to predict eye positions when looking at conversations
2015
2015 23rd European Signal Processing Conference (EUSIPCO)
Classic models of visual attention dramatically fail at predicting eye positions on visual scenes involving faces. ...
This model includes a speaker diarization algorithm which automatically modulates the saliency of conversation partners' faces and bodies according to their speaking-or-not status. ...
Based on these results, we proposed an audiovisual saliency model including a speaker diarization algorithm able to automatically spot "who speaks when" [10]. ...
doi:10.1109/eusipco.2015.7362640
dblp:conf/eusipco/CoutrotG15
fatcat:k4pma5eskrcrrknzwidi4llpzu
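The core idea in this entry is to modulate face saliency according to speaking-or-not status. The sketch below is a minimal stand-in for that mechanism, not the paper's model: it boosts the saliency of face regions of currently speaking persons and renormalises the map. The box format and the boost gain are hypothetical.

import numpy as np

def modulate_saliency(saliency, face_boxes, speaking, boost=3.0):
    """Re-weight a visual saliency map using diarization output.

    saliency:   (H, W) base saliency map from any visual attention model.
    face_boxes: list of (x0, y0, x1, y1) pixel boxes, one per person.
    speaking:   list of bools from a diarization step, True if that
                person is currently speaking.
    boost:      multiplicative gain applied to speaking faces
                (a free parameter of this sketch).
    """
    out = saliency.astype(float)
    for (x0, y0, x1, y1), is_speaking in zip(face_boxes, speaking):
        if is_speaking:
            out[y0:y1, x0:x1] *= boost
    # Renormalise so the map stays a probability-like distribution.
    total = out.sum()
    return out / total if total > 0 else out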
Guest Editors' Introduction to the Special Section on Learning with Shared Information for Computer Vision and Multimedia Analysis
2018
IEEE Transactions on Pattern Analysis and Machine Intelligence
The paper "Audio-Visual Speaker Diarization Based on Spatiotemporal Bayesian Fusion" by I. D. Gebru, S. Ba, X. Li, and R. ...
The authors propose to combine multiple-person visual tracking with multiple speech-source localization in a principled spatiotemporal Bayesian fusion model. ...
doi:10.1109/tpami.2018.2804998
fatcat:urin3tvgy5f7ng5djfvlm4mop4
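The fusion summarized above associates localized speech sources with visually tracked persons. A far simpler probabilistic stand-in for that association step is a Gaussian soft assignment of a localized source to the current track positions; the isotropic noise scale sigma and the uniform prior over persons are assumptions of this sketch. Its output is exactly the kind of per-frame association probability consumed by the Viterbi sketch shown after the first entry.

import numpy as np

def speaker_posteriors(source_xy, track_positions, sigma=30.0):
    """Posterior over tracked persons for one localized speech source,
    under an isotropic Gaussian observation model with a uniform prior
    (a simplification, not the paper's full spatiotemporal model).

    source_xy:       (2,) image-plane position of the speech source.
    track_positions: (N, 2) current positions of tracked persons.
    sigma:           localization noise scale in pixels (hypothetical).
    """
    track_positions = np.asarray(track_positions, dtype=float)
    d2 = np.sum((track_positions - np.asarray(source_xy)) ** 2, axis=1)
    log_p = -d2 / (2.0 * sigma ** 2)
    log_p -= log_p.max()              # numerical stability
    p = np.exp(log_p)
    return p / p.sum()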
Ego4D: Around the World in 3,000 Hours of Egocentric Video
[article]
2022
arXiv pre-print
... audio-visual conversation, and social interactions), and future (forecasting activities). ...
Portions of the video are accompanied by audio, 3D meshes of the environment, eye gaze, stereo, and/or synchronized videos from multiple egocentric cameras at the same event. ...
Thank you to the Common Visual Data Foundation (CVDF) for hosting the Ego4D dataset. ...
arXiv:2110.07058v3
fatcat:lgh27km63nhcdcpkvbr2qarsru
Program
2021
2021 National Conference on Communications (NCC)
From 1983-2003, he was on the faculty of the EE-Systems Department at USC. Since 2003, he has been on the faculty of IISc, Bengaluru. ...
Finally, we will touch on certain security issues for these applications. ...
Speaker: Prof. Preeti Rao is on the faculty of Electrical Engineering at I.I.T. Bombay, in the area of signal processing for speech and audio. ...
doi:10.1109/ncc52529.2021.9530194
fatcat:ahdw5ezvtrh4nb47l2qeos3dwq
Social Interactions Analysis through Deep Visual Nonverbal Features
2021
Our method shows improved results even when compared to multimodal nonverbal features extracted from the audio and visual modalities. ...
From the computing perspective, we propose visual activity-based nonverbal feature extraction from video streams by applying a deep learning approach along with feature encoding for low-dimensional ...
These audio-visual methods rely on joint modeling of speech, facial, and body cues, or are based on speaker diarization, where video is mainly used to track/localize the person to ...
doi:10.15167/shahid-muhammad_phd2021-03-05
fatcat:dejp3ms6ujdvndzf7svgkl6h4a
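The thesis extracts visual activity-based nonverbal features with a deep network and encodes them into low-dimensional descriptors. The sketch below only illustrates the encode-then-pool pattern: it substitutes crude frame differencing for the deep features, and both that substitution and the statistics-plus-histogram encoding are assumptions of the sketch.

import numpy as np

def visual_activity_descriptor(frames, n_bins=16):
    """Encode a video clip as one fixed-length nonverbal activity vector.

    frames: (T, H, W) grayscale frames, T >= 2. Per-frame visual activity
            is approximated by mean absolute frame differencing (a crude
            stand-in for learned deep features); the activity time series
            is then summarised by simple statistics plus a histogram.
    """
    frames = np.asarray(frames, dtype=float)
    activity = np.abs(np.diff(frames, axis=0)).mean(axis=(1, 2))  # (T-1,)
    hist, _ = np.histogram(activity, bins=n_bins, density=True)
    stats = np.array([activity.mean(), activity.std(),
                      activity.min(), activity.max()])
    return np.concatenate([stats, hist])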
Processing systems for audiovisual resources (Media Assets)
2018
A new method for audiovisual content segmentation based on speech / music discrimination is developed. ...
«Audio-visual speaker diarization based on spatiotemporal Bayesian fusion». ...
doi:10.26262/heal.auth.ir.301464
fatcat:gq44joju5nan5nhl6eilc4wx7q
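The segmentation method in this entry is built on speech/music discrimination. A classic feature pair for that task is zero-crossing-rate variance and spectral flux, which the sketch below extracts; these are textbook features, not necessarily the thesis's, and the frame/hop sizes are arbitrary assumptions.

import numpy as np

def speech_music_features(signal, frame_len=1024, hop=512):
    """Two classic low-level features for speech/music discrimination.

    Speech tends to show higher zero-crossing-rate (ZCR) variance
    (voiced/unvoiced alternation) and burstier spectral flux than music.
    A classifier or threshold on these values would follow.
    """
    window = np.hanning(frame_len)
    zcr, flux, prev_mag = [], [], None
    for i in range(0, len(signal) - frame_len, hop):
        f = signal[i:i + frame_len]
        # Fraction of consecutive samples whose sign changes.
        zcr.append(np.mean(np.diff(np.sign(f)) != 0))
        mag = np.abs(np.fft.rfft(f * window))
        if prev_mag is not None:
            flux.append(np.sum((mag - prev_mag) ** 2))
        prev_mag = mag
    return np.var(zcr), (np.mean(flux) if flux else 0.0)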
Multimodal Visual Sensing: Automated Estimation of Engagement
[article]
2021
This dissertation presents a general framework based on multimodal visual sensing to analyze engagement and related tasks from visual modalities. ...
Beneficial uses include classroom analytics to measure teaching quality and the development of interventions to improve teaching based on these analytics, as well as presentation analysis to help students ...
From eye-tracking data, they found that human observers mainly concentrate on speakers' faces when viewing audio-visual recordings, but concentrate on speakers' bodies and gestures when viewing visual-only ...
doi:10.15496/publikation-55003
fatcat:jcki2sjwjzfbvm5vdanxsy7ohy