Retrieval Augmented Convolutional Encoder-Decoder Networks for Video Captioning
ACM Transactions on Multimedia Computing, Communications, and Applications (TOMCCAP)
Video captioning has emerged as an active research topic in computer vision; it aims to generate a natural-language sentence that correctly reflects the visual content of a video. The well-established approach relies on the encoder-decoder paradigm, learning to encode the input video and decode the variable-length output sentence in a sequence-to-sequence manner. Nevertheless, such approaches often fail to produce sentences as complex and descriptive as those written by humans, since the models cannot memorize all of the visual content and syntactic structures in the human-annotated video-sentence pairs. In this paper, we introduce a Retrieval Augmentation Mechanism (RAM) that enables explicit reference to existing video-sentence pairs within any encoder-decoder captioning model. Specifically, for each query video, a video-sentence retrieval model first fetches semantically relevant sentences from the training sentence pool, together with the corresponding training videos. RAM then writes these relevant video-sentence pairs into memory and reads the memorized visual content and syntactic structures back from memory to facilitate word prediction at each timestep. Furthermore, we present Retrieval Augmented Convolutional Encoder-Decoder Networks (R-ConvED), which integrate RAM into a convolutional encoder-decoder structure to boost video captioning. Extensive experiments on the MSVD, MSR-VTT, ActivityNet Captions, and VATEX datasets validate the superiority of our proposals and demonstrate quantitatively compelling results.
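The memory read described above can be pictured as an attention-weighted lookup over the retrieved pairs. The following is a minimal illustrative sketch, not the paper's actual implementation: the function names (`ram_read`, `softmax`), the dot-product scoring, and the toy dimensions are all assumptions chosen for clarity.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over a 1-D score vector
    e = np.exp(x - x.max())
    return e / e.sum()

def ram_read(query, keys, values):
    """Hypothetical attention-style read over memorized retrieval pairs.

    query:  (d,)   decoder hidden state at the current timestep
    keys:   (k, d) embeddings of the k retrieved video-sentence pairs
    values: (k, d) content vectors written into memory for those pairs
    Returns a (d,) readout vector used to facilitate word prediction.
    """
    scores = keys @ query            # dot-product relevance, shape (k,)
    weights = softmax(scores)        # normalized attention weights
    return weights @ values          # weighted sum of memory contents

# toy usage: 3 retrieved pairs, 4-dimensional embeddings
rng = np.random.default_rng(0)
q = rng.standard_normal(4)
K = rng.standard_normal((3, 4))
V = rng.standard_normal((3, 4))
readout = ram_read(q, K, V)
print(readout.shape)  # (4,)
```

At each decoding timestep the decoder would combine such a readout with its own hidden state before predicting the next word; the scoring function and memory contents in R-ConvED are learned rather than fixed as here.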