Learning to segment moving objects in videos

Katerina Fragkiadaki, Pablo Arbelaez, Panna Felsen, Jitendra Malik
2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
We present a method that segments moving objects in videos by ranking spatio-temporal region proposals according to "moving objectness": how likely they are to contain a moving object. Region proposal generation followed by ranking with an object detector is currently the dominant paradigm for object detection in the static image domain [7]. It has shown excellent performance against sliding window classifiers and Markov Random Field based pixel classification, which cannot distinguish close-by instances of the same object class [4].
In this paper, we propose a similar paradigm for detecting moving objects in videos and present large quantitative advances over previous multiscale segmentation and trajectory clustering methods.

[Figure 1: Per-frame moving object proposals. Panels: video frame, optical flow, flow boundaries, best moving object proposal, best static object proposal, image boundaries. Static segment proposals fail to capture the dancer as a whole due to internal clothing contours. Flow boundaries suffer less from albedo or shading edges in object interiors; segmentation on them correctly delineates the dancer.]

In each video frame, we compute segment proposals using multiple figure-ground segmentations on per-frame motion boundaries. We extract motion boundaries by applying the learning-based boundary detector of [3] to the magnitude of the optical flow. The extracted boundaries establish pixel affinities for multiple figure-ground segmentations [9] that generate a pool of segment proposals, which we call per-frame Moving Object Proposals (MOPs). Our per-frame MOPs increase the object detection rate by up to 7% over previous state-of-the-art static proposals, demonstrating the value of motion for object detection in videos.

Objects, however, are not constantly in motion. In frames where they are static, there are no optical flow boundaries and MOPs miss them. We therefore extend MOPs to spatio-temporal tubes using random walkers on dense point trajectory motion affinities.

We rank per-frame segment and spatio-temporal tube proposals with a Moving Objectness Detector (MOD) that learns to detect moving objects from a set of training examples. Our MOD has a dual-pathway CNN architecture that operates on both RGB and flow fields. It outperforms hand-coded center-surround saliency and other competitive multilayer objectness baselines [8].

Our method bridges the gap between motion segmentation and tracking methods. Previous motion segmenters [11, 12] operate bottom-up: they exploit color or motion cues without using a training set of objects. Previous trackers [1, 6] use object detectors (e.g., a car or pedestrian detector) to cast attention to the relevant parts of the scene. We do use a training set for learning the concept of a moving object, yet remain agnostic to the exact object classes present in the video.
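To make the per-frame MOP step concrete, below is a minimal sketch, not the paper's implementation: it assumes Farneback optical flow and OpenCV's structured edge detector (opencv-contrib) as stand-ins for the paper's flow and the boundary detector of [3], and a seeded watershed over the boundary map in place of the multiple figure-ground segmentations of [9]. The edge-model path is a hypothetical placeholder.

```python
# Sketch of per-frame Moving Object Proposal (MOP) generation.
# Stand-ins (assumptions, not the paper's components): Farneback flow,
# OpenCV's structured edge detector, and a seeded watershed in place
# of the multiple figure-ground segmentations of [9].
import cv2
import numpy as np
from scipy import ndimage as ndi
from skimage.feature import peak_local_max
from skimage.segmentation import watershed

def motion_boundaries(prev_gray, gray, edge_model="model.yml.gz"):
    """Detect boundaries on the optical-flow magnitude (edge_model path is hypothetical)."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag = np.hypot(flow[..., 0], flow[..., 1]).astype(np.float32)
    mag /= mag.max() + 1e-6
    # The structured edge detector expects a 3-channel float image in [0, 1];
    # we replicate the flow magnitude across channels.
    detector = cv2.ximgproc.createStructuredEdgeDetection(edge_model)
    return detector.detectEdges(np.dstack([mag] * 3))

def moving_object_proposals(boundaries, n_seeds=50):
    """Figure-ground segmentations seeded away from the motion boundaries."""
    dist = ndi.distance_transform_edt(boundaries < 0.1)
    seeds = peak_local_max(dist, min_distance=20, num_peaks=n_seeds)
    markers = np.zeros(boundaries.shape, dtype=int)
    markers[tuple(seeds.T)] = np.arange(1, len(seeds) + 1)
    labels = watershed(boundaries, markers)
    return [labels == k for k in range(1, len(seeds) + 1)]
```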
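The tube-extension step can be illustrated as a random-walker (Dirichlet) problem on a trajectory graph. The sketch below assumes a precomputed symmetric trajectory affinity matrix W and known seed indices (foreground trajectories covered by a MOP, background trajectories outside it); dense trajectory extraction and affinity construction are not reproduced here.

```python
# Minimal random walker over dense-trajectory affinities: solve the
# Dirichlet problem L_uu x_u = -L_us x_s, with x_s = 1 on foreground
# seeds and 0 on background seeds (standard random-walker formulation;
# W, fg_seeds, bg_seeds are assumed inputs).
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import laplacian
from scipy.sparse.linalg import spsolve

def random_walker_tube(W, fg_seeds, bg_seeds):
    """Probability that each trajectory belongs to the moving object."""
    n = W.shape[0]
    L = laplacian(csr_matrix(W)).tocsr()
    seeds = np.concatenate([fg_seeds, bg_seeds])
    unseeded = np.setdiff1d(np.arange(n), seeds)
    x_s = np.zeros(len(seeds))
    x_s[:len(fg_seeds)] = 1.0
    L_uu = L[unseeded][:, unseeded]
    L_us = L[unseeded][:, seeds]
    x_u = spsolve(L_uu.tocsc(), -(L_us @ x_s))
    prob = np.zeros(n)
    prob[seeds] = x_s
    prob[unseeded] = x_u
    return prob  # threshold to decide which trajectories join the tube
```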
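The MOD's dual-pathway design can likewise be sketched as two convolutional streams, one on the RGB crop and one on the flow field, fused before a single moving-objectness score. Layer widths and input sizes below are illustrative assumptions, not the paper's architecture; PyTorch is used purely for concision.

```python
# Illustrative dual-pathway moving-objectness scorer (not the paper's
# exact network): an appearance stream on RGB, a motion stream on the
# (u, v) flow field, fused into one binary moving-vs-static logit.
import torch
import torch.nn as nn

class MovingObjectnessDetector(nn.Module):
    def __init__(self):
        super().__init__()
        def stream(in_ch):
            return nn.Sequential(
                nn.Conv2d(in_ch, 32, 5, stride=2, padding=2), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(4), nn.Flatten())
        self.rgb_stream = stream(3)    # appearance pathway
        self.flow_stream = stream(2)   # motion pathway (flow u, v)
        self.head = nn.Sequential(
            nn.Linear(2 * 64 * 4 * 4, 256), nn.ReLU(),
            nn.Linear(256, 1))         # moving-objectness logit

    def forward(self, rgb, flow):
        feats = torch.cat([self.rgb_stream(rgb),
                           self.flow_stream(flow)], dim=1)
        return self.head(feats)

# Usage: score a batch of proposal crops with their flow fields.
mod = MovingObjectnessDetector()
scores = mod(torch.randn(8, 3, 64, 64), torch.randn(8, 2, 64, 64))
```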
This is an extended abstract; the full paper is available at the Computer Vision Foundation webpage.
doi:10.1109/cvpr.2015.7299035 dblp:conf/cvpr/FragkiadakiAFM15