As a key characteristic of audio-visual speech recognition (AVSR), relating linguistic information observed across visual and audio data has been a challenge; addressing it benefits not only audio/visual speech recognition (ASR/VSR) but also the manipulation of data within and across modalities. In this paper, we present a feature disentanglement-based framework for jointly addressing the above tasks. By advancing cross-modal mutual learning strategies, our model is able to convert visual or audio-based …

doi:10.1609/aaai.v36i3.20210
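The abstract describes a feature-disentanglement framework trained with cross-modal mutual learning. As an illustration of how such a design can be wired up, here is a minimal, hypothetical PyTorch sketch; every module, dimension, and loss term below is an assumption for illustration, not the paper's actual architecture. The idea sketched: each modality encoder splits its input into a shared linguistic code and a modality-specific code, and alignment plus cross-reconstruction losses push the linguistic content to transfer across modalities.

```python
# Hypothetical sketch of cross-modal feature disentanglement.
# All module names, dimensions, and losses are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityEncoder(nn.Module):
    """Splits an input feature into a linguistic code and a modality code."""
    def __init__(self, in_dim: int, ling_dim: int, mod_dim: int):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU())
        self.to_linguistic = nn.Linear(256, ling_dim)  # shared content
        self.to_modality = nn.Linear(256, mod_dim)     # modality-specific

    def forward(self, x):
        h = self.backbone(x)
        return self.to_linguistic(h), self.to_modality(h)

class Decoder(nn.Module):
    """Reconstructs one modality's features from (linguistic, modality) codes."""
    def __init__(self, ling_dim: int, mod_dim: int, out_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(ling_dim + mod_dim, 256), nn.ReLU(),
            nn.Linear(256, out_dim),
        )

    def forward(self, ling, mod):
        return self.net(torch.cat([ling, mod], dim=-1))

def mutual_learning_losses(audio_x, visual_x, enc_a, enc_v, dec_a, dec_v):
    ling_a, mod_a = enc_a(audio_x)
    ling_v, mod_v = enc_v(visual_x)
    # Align linguistic codes across modalities (one possible mutual-learning signal).
    align = F.mse_loss(ling_a, ling_v)
    # Self-reconstruction keeps each pair of codes informative.
    rec = (F.mse_loss(dec_a(ling_a, mod_a), audio_x) +
           F.mse_loss(dec_v(ling_v, mod_v), visual_x))
    # Cross-modal conversion: swap linguistic codes while keeping modality codes,
    # so content extracted from one modality can be rendered in the other.
    cross = (F.mse_loss(dec_a(ling_v, mod_a), audio_x) +
             F.mse_loss(dec_v(ling_a, mod_v), visual_x))
    return align + rec + cross
```

In this sketch, the swap in the `cross` term is what would enable converting visual-based content into the audio domain and vice versa, corresponding to the within/across-modality manipulation the abstract mentions; the actual losses and encoders used in the paper may differ.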