Natural Language Descriptions for Human Activities in Video Streams

Nouf Alharbi, Yoshihiko Gotoh
Proceedings of the 10th International Conference on Natural Language Generation, 2017
Abstract
There has been continuous growth in the volume and ubiquity of video material, and it has become essential to define video semantics in order to aid the searchability and retrieval of this data. We present a framework that produces textual descriptions of video based on its visual semantic content. Detected action classes are rendered as verbs, participant objects are converted to noun phrases, visual properties of detected objects are rendered as adjectives, and spatial relations between objects are rendered as prepositions. Further, in cases of zero-shot action recognition, a language model is used to infer the missing verb, aided by the detection of objects and scene settings. These extracted features are converted into textual descriptions using a template-based approach. The proposed video description framework is evaluated on the NLDHA dataset using ROUGE scores and human judgment.
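As a minimal sketch of how such a template-based conversion might look, assuming hypothetical detector outputs (the class names, attribute values, and template structure below are illustrative and not the authors' implementation):

from dataclasses import dataclass
from typing import Optional

@dataclass
class DetectedObject:
    noun: str            # detected object class, rendered as a noun
    adjective: str = ""  # visual property (e.g. colour), rendered as an adjective

def noun_phrase(obj: DetectedObject) -> str:
    # Combine the adjective (if any) with the noun and pick an indefinite article.
    np = f"{obj.adjective} {obj.noun}".strip()
    article = "an" if np[0] in "aeiou" else "a"
    return f"{article} {np}"

def describe(subject: DetectedObject, verb: str,
             obj: Optional[DetectedObject] = None,
             preposition: Optional[str] = None,
             landmark: Optional[DetectedObject] = None) -> str:
    # Fill a subject-verb-object-preposition template from the detections.
    parts = [noun_phrase(subject), verb]
    if obj is not None:
        parts.append(noun_phrase(obj))
    if preposition is not None and landmark is not None:
        parts.append(f"{preposition} {noun_phrase(landmark)}")
    sentence = " ".join(parts)
    return sentence[0].upper() + sentence[1:] + "."

print(describe(DetectedObject("woman"), "rides",
               DetectedObject("horse", "brown"),
               "in", DetectedObject("field", "green")))
# -> A woman rides a brown horse in a green field.

Here the action class supplies the verb, object detections the noun phrases, visual attributes the adjectives, and the spatial relation the preposition, mirroring the roles described in the abstract.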
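The zero-shot case can be sketched in the same spirit: when no action class is detected, rank candidate verbs by how well they fit the detected objects and scene setting. The co-occurrence and scene-prior tables below stand in for a real language model, and every score is invented for illustration:

# Toy statistics standing in for a language model; all values are made up.
COOCCURRENCE = {
    ("person", "horse"): {"rides": 0.8, "feeds": 0.5, "eats": 0.1},
    ("person", "sandwich"): {"eats": 0.9, "holds": 0.6, "rides": 0.01},
}
SCENE_PRIOR = {
    "field": {"rides": 1.2, "eats": 0.9},
    "kitchen": {"eats": 1.3, "rides": 0.2},
}

def infer_verb(subject, obj, scene, candidates):
    # Score each candidate verb by object co-occurrence weighted by a scene prior.
    def score(verb):
        co = COOCCURRENCE.get((subject, obj), {}).get(verb, 0.05)
        prior = SCENE_PRIOR.get(scene, {}).get(verb, 1.0)
        return co * prior
    return max(candidates, key=score)

print(infer_verb("person", "horse", "field", ["rides", "eats", "feeds"]))
# -> rides

In this sketch the detected objects and scene jointly select the most plausible verb, which can then be slotted into the same sentence template as a recognized action.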
doi:10.18653/v1/w17-3512 dblp:conf/inlg/HarbiG17