A copy of this work was available on the public web and has been preserved in the Wayback Machine. The capture dates from 2021; you can also visit the original URL.
The file type is
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Leveraging large-scale unlabeled web videos such as instructional videos for pre-training followed by task-specific finetuning has become the de facto approach for many videoand-language tasks. However, these instructional videos are very noisy, the accompanying ASR narrations are often incomplete, and can be irrelevant to or temporally misaligned with the visual content, limiting the performance of the models trained on such data. To address these issues, we propose an improveddoi:10.18653/v1/2021.naacl-main.193 fatcat:vs4ilns45bcyxhfjwjal3j2puq