A copy of this work was available on the public web and has been preserved in the Wayback Machine; the capture dates from 2022.
The file type is application/pdf.
Learning Video-Text Aligned Representations for Video Captioning
2022
ACM Transactions on Multimedia Computing, Communications, and Applications (TOMCCAP)
Video captioning requires that a model be capable of video understanding, video-text alignment, and text generation. Because of the semantic gap between vision and language, video-text alignment, which maps representations from the visual domain into the language domain, is a crucial step in reducing that gap. However, existing methods often overlook this step, so the decoder has to take the visual representations directly as input, which increases the decoder's workload and
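The alignment step described above can be illustrated with a minimal sketch: a learned linear projection maps pooled video features into the text embedding space, and an in-batch contrastive (InfoNCE-style) objective pulls matched video-caption pairs together. This is an assumption-laden illustration of the general technique, not the paper's specific method; all names, dimensions, and the use of a single linear projection are hypothetical.

```python
import numpy as np

# Illustrative sketch only (not the paper's exact model): project video
# features into the text embedding space and score alignment with an
# in-batch contrastive loss. Dimensions and names are assumptions.

rng = np.random.default_rng(0)

def l2_normalize(x, axis=-1, eps=1e-8):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def align_videos_to_text(video_feats, W):
    """Map pooled video features (B, Dv) into the text space (B, Dt)."""
    return l2_normalize(video_feats @ W)

def contrastive_alignment_loss(video_emb, text_emb, temperature=0.07):
    """InfoNCE over in-batch pairs: the (i, i) video-caption pairs are positives."""
    logits = (video_emb @ text_emb.T) / temperature        # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)            # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                    # cross-entropy on matched pairs

B, Dv, Dt = 4, 16, 8                                       # batch, video dim, text dim
video_feats = rng.normal(size=(B, Dv))                     # stand-in pooled video features
text_emb = l2_normalize(rng.normal(size=(B, Dt)))          # stand-in caption embeddings
W = rng.normal(size=(Dv, Dt)) * 0.1                        # learnable projection (random here)

video_emb = align_videos_to_text(video_feats, W)
loss = contrastive_alignment_loss(video_emb, text_emb)
print(video_emb.shape, float(loss))
```

In a full captioning pipeline the decoder would then consume `video_emb` (already in the language domain) instead of raw visual features, which is the workload reduction the abstract alludes to.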
doi:10.1145/3546828
fatcat:3whtdi2aajh25ossrv2kz4vdzm