Object Referring in Videos with Language and Human Gaze

Arun Balajee Vasudevan, Dengxin Dai, Luc Van Gool
2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
We investigate the problem of object referring (OR), i.e., localizing a target object in a visual scene given a language description. Humans perceive the world more as continuous video snippets than as static images, and describe objects not only by their appearance, but also by their spatio-temporal context and motion features. Humans also gaze at the object when they issue a referring expression. Existing works on OR mostly focus on static images only, which fall short in providing many such cues. This paper addresses OR in videos with language and human gaze. To that end, we present a new video dataset for OR, with 30,000 objects over 5,000 stereo video sequences annotated with their descriptions and gaze. We further propose a novel network model for OR in videos, integrating appearance, motion, gaze, and spatio-temporal context into one network. Experimental results show that our method effectively utilizes motion cues, human gaze, and spatio-temporal context, and outperforms previous OR methods. For the dataset and code, please refer to https://people.ee.ethz.ch/~arunv/ORGaze.html.
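The abstract describes fusing appearance, motion, gaze, and spatio-temporal context cues with the language description to score candidate objects. Below is a minimal, hypothetical sketch of such multi-cue fusion; the stream dimensions, the concatenation-plus-MLP fusion, and all module names are assumptions for illustration only and are not the paper's actual architecture.

```python
# Hypothetical sketch of multi-cue fusion for object referring (not the
# authors' implementation): each candidate object is scored by fusing its
# appearance, motion, gaze, and spatio-temporal context features with an
# embedding of the referring expression.
import torch
import torch.nn as nn


class MultiCueReferringScorer(nn.Module):
    """Scores candidate objects against a language embedding using four cues."""

    def __init__(self, appear_dim=2048, motion_dim=1024, gaze_dim=64,
                 context_dim=512, lang_dim=300, hidden_dim=512):
        super().__init__()
        fused_dim = appear_dim + motion_dim + gaze_dim + context_dim + lang_dim
        self.scorer = nn.Sequential(
            nn.Linear(fused_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, 1),  # one matching score per candidate
        )

    def forward(self, appear, motion, gaze, context, lang):
        # Cue tensors: (num_candidates, cue_dim); lang: (1, lang_dim)
        lang = lang.expand(appear.size(0), -1)
        fused = torch.cat([appear, motion, gaze, context, lang], dim=1)
        return self.scorer(fused).squeeze(1)  # (num_candidates,)


# Usage: score 8 candidate objects for one referring expression and pick
# the highest-scoring candidate as the referred object.
scorer = MultiCueReferringScorer()
scores = scorer(torch.randn(8, 2048), torch.randn(8, 1024),
                torch.randn(8, 64), torch.randn(8, 512),
                torch.randn(1, 300))
target_idx = scores.argmax()
```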
doi:10.1109/cvpr.2018.00434 dblp:conf/cvpr/VasudevanDG18