The Conversation: Deep Audio-Visual Speech Enhancement [article]

Triantafyllos Afouras, Joon Son Chung, Andrew Zisserman
<span title="2018-06-19">2018</span> <i>arXiv</i> <span class="release-stage">pre-print</span>
Our goal is to isolate individual speakers from multi-talker simultaneous speech in videos. Existing works in this area have focussed on trying to separate utterances from known speakers in controlled environments. In this paper, we propose a deep audio-visual speech enhancement network that is able to separate a speaker's voice given lip regions in the corresponding video, by predicting both the magnitude and the phase of the target signal. The method is applicable to speakers unheard and unseen during training, and for unconstrained environments. We demonstrate strong quantitative and qualitative results, isolating extremely challenging real-world examples.
<span class="external-identifiers"> <a target="_blank" rel="external noopener" href="https://arxiv.org/abs/1804.04121v2">arXiv:1804.04121v2</a> <a target="_blank" rel="external noopener" href="https://fatcat.wiki/release/bci7qoekrjfafchckb2xc55zqe">fatcat:bci7qoekrjfafchckb2xc55zqe</a> </span>