Deep Learning for Video Captioning: A Review

Shaoxiang Chen, Ting Yao, Yu-Gang Jiang
2019 Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence  
Deep learning has recently achieved great success in solving specific artificial intelligence problems, with substantial progress in Computer Vision (CV) and Natural Language Processing (NLP). As a connection between the two worlds of vision and language, video captioning is the task of producing a natural-language utterance (usually a sentence) that describes the visual content of a video. The task naturally decomposes into two sub-tasks. One is to encode a video via a thorough understanding and learn a visual representation. The other is caption generation, which decodes the learned representation into a sentence, word by word. In this survey, we first formulate the problem of video captioning, then review state-of-the-art methods categorized by their emphasis on vision or language, followed by a summary of standard datasets and representative approaches. Finally, we highlight challenges in this task that are not yet fully understood and present future research directions.
doi:10.24963/ijcai.2019/877 dblp:conf/ijcai/ChenYJ19 fatcat:3xxssrzqjjd5jbvtgkkp5lw7xa