A Simple Baseline for Audio-Visual Scene-Aware Dialog

Idan Schwartz, Alexander G. Schwing, Tamir Hazan
2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
The recently proposed audio-visual scene-aware dialog task paves the way to a more data-driven way of learning virtual assistants, smart speakers and car navigation systems. However, very little is known to date about how to effectively extract meaningful information from a plethora of sensors that pound the computational engine of those devices. Therefore, in this paper, we provide and carefully analyze a simple baseline for audio-visual scene-aware dialog which is trained end-to-end. Our ... differentiates in a data-driven manner useful signals from distracting ones using an attention mechanism. We evaluate the proposed approach on the recently introduced and challenging audio-visual scene-aware dataset, and demonstrate the key features that allow it to outperform the current state of the art by more than 20% on CIDEr.

Recent work on audio-visual scene-aware dialog [2, 25] partly addresses this shortcoming and proposes a novel ...

Example dialog (figure panels labeled MultiModal-Attention):
Question: what color is the rag? Answer: it appears to be white.
Question: where is the video taking place? Answer: the video starts with a man in the kitchen.
Question: does he speak at all? Answer: no he does not speak.
Question: do they get up from the chair? Answer: no, they stay sitting in the chair.
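The abstract's idea of using attention to separate useful signals from distracting ones can be illustrated with a generic dot-product attention over a few input signals. This is only a minimal sketch under stated assumptions, not the authors' architecture: the function names `softmax` and `attend`, the signal names, and the two-dimensional toy embeddings are all hypothetical and chosen for illustration.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, features):
    """Generic dot-product attention (illustrative, not the paper's model).

    query: list of floats, e.g. a question embedding.
    features: dict mapping a signal name to a vector of the same length.
    Returns (weights by signal name, attention-weighted sum of the vectors).
    """
    names = list(features)
    scores = [sum(q * f for q, f in zip(query, features[n])) for n in names]
    weights = softmax(scores)
    pooled = [0.0] * len(query)
    for w, n in zip(weights, names):
        for i, value in enumerate(features[n]):
            pooled[i] += w * value
    return dict(zip(names, weights)), pooled


# Hypothetical toy embeddings: the question aligns with the video signal.
question = [1.0, 0.0]
signals = {"video": [2.0, 0.0], "audio": [0.0, 2.0], "history": [0.5, 0.5]}
weights, pooled = attend(question, signals)
print(weights)
```

In this toy setup the signal that best matches the question receives the largest weight, so it dominates the pooled representation while the less relevant signals are down-weighted rather than discarded.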
doi:10.1109/cvpr.2019.01283 dblp:conf/cvpr/SchwartzSH19 fatcat:ghprtqitdbggtfxpejg4yfcjyq