A copy of this work was available on the public web and has been preserved in the Wayback Machine. The capture dates from 2020.
The file type is application/pdf.
Unified Multisensory Perception: Weakly-Supervised Audio-Visual Video Parsing
[article], 2020, arXiv pre-print
In this paper, we introduce a new problem, named audio-visual video parsing, which aims to parse a video into temporal event segments and label them as either audible, visible, or both. Such a problem is essential for a complete understanding of the scene depicted inside a video. To facilitate exploration, we collect a Look, Listen, and Parse (LLP) dataset to investigate audio-visual video parsing in a weakly-supervised manner. This task can be naturally formulated as a Multimodal Multiple Instance Learning (MMIL) problem.
arXiv:2007.10558v1
fatcat:kcexne6cpbe2tfyimttmercbka