Multi-Class Image Labeling with Top-Down Segmentation and Generalized Robust $P^N$ Potentials

Georgios Floros, Konstantinos Rematas, Bastian Leibe
Proceedings of the British Machine Vision Conference (BMVC) 2011
Recently, there has been increased interest in combining object class detection ("things") and texture segmentation ("stuff") for scene understanding. Such a combination benefits both sides: object detectors can be improved by context from "stuff", and, in return, segmentation can be improved by the semantic information the object detector provides. Ladicky et al. [5] propose to obtain the support region for a detected object by applying GrabCut [7] to the detector bounding box.
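For concreteness, the following is a minimal sketch of that baseline step (the approach of [5], not our method): a hard support region is carved out of a detector bounding box with OpenCV's GrabCut. The image path and box coordinates are placeholders.

```python
import cv2
import numpy as np

img = cv2.imread("scene.jpg")                    # placeholder test image (BGR)
rect = (50, 40, 120, 220)                        # hypothetical detector box (x, y, w, h)
mask = np.zeros(img.shape[:2], dtype=np.uint8)   # per-pixel GrabCut state
bgd_model = np.zeros((1, 65), dtype=np.float64)  # internal GMM buffers
fgd_model = np.zeros((1, 65), dtype=np.float64)

# Initialize from the rectangle and run a few iterations of GrabCut.
cv2.grabCut(img, mask, rect, bgd_model, fgd_model, 5, cv2.GC_INIT_WITH_RECT)

# Pixels marked (probably-)foreground form the object's support region.
support = np.isin(mask, (cv2.GC_FGD, cv2.GC_PR_FGD))
```

Note that this yields a hard, class-agnostic mask based purely on color, which is exactly the limitation discussed next.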
This GrabCut segmentation introduces an additional, separate CRF segmentation step prior to the final image-level CRF segmentation, even though both decisions are based on the same color potentials. We argue that there should be only one segmentation decision, made as the result of joint inference. Furthermore, the GrabCut segmentation step ignores any specific information about the detected object class; in particular, it does not take into account how important a given pixel was for the initial detection decision. We propose to bring in this information by feeding soft, class-specific top-down segmentations from the object detector back into a single CRF, where they are integrated as generalized robust higher-order potentials [4]. These potentials make it possible to specify a per-pixel weight expressing how important a pixel is for preserving object consistency.

The energy function E(y) of the higher-order CRF consists of unary (ψ_i), pairwise (ψ_ij), and robust $P^N$ (ψ_c) potentials:

$E(\mathbf{y}) = \sum_{i \in V} \psi_i(y_i) + \sum_{(i,j) \in E} \psi_{ij}(y_i, y_j) + \sum_{c \in S} \psi_c(\mathbf{y}_c)$

This formulation has shown state-of-the-art performance on the multi-class image labeling problem [4]. While the unary and pairwise potentials are defined at the pixel level, the robust $P^N$ potentials are defined over a set of segments S; in [4], these segments are created by an unsupervised multi-level mean-shift segmentation [2]. The $P^N$ potentials introduce a cost for assigning different class labels to pixels belonging to the same segment, while taking into account the quality of the entire segment. Generalized robust $P^N$ potentials provide a structured framework for incorporating the class-specific information delivered by an object detector: as introduced in [4], their per-pixel weights offer a natural interface for expressing the importance of each object pixel in preserving object consistency.
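A hedged sketch of evaluating such a potential for a single segment follows, using the weighted, truncated-linear form of the robust $P^N$ Potts model from [4]. The variable names (weights, gamma_max, Q) and the exact parameterization are our own simplification; the paper's formulation may differ in detail.

```python
import numpy as np

def robust_pn_cost(labels, weights, num_classes, gamma_max, Q):
    """Cost of a joint labeling of one segment's pixels.

    labels  : int array, one class label per pixel in the segment
    weights : per-pixel importance (uniform in [4]; in our extension,
              derived from class-specific top-down figure probabilities)
    """
    # Weighted amount of disagreement with the best candidate dominant label.
    disagreement = min(
        weights[labels != k].sum() for k in range(num_classes)
    )
    # Cost grows linearly with the weighted inconsistency and is truncated
    # at gamma_max, so a few outlier pixels do not dominate the segment cost.
    return min(gamma_max * disagreement / Q, gamma_max)
```

With uniform weights this reduces to the standard robust $P^N$ potential; non-uniform weights let pixels that were important for the detection penalize label inconsistency more strongly.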
Top-down segmentations express the output of an object detector as soft decisions on whether each image pixel belongs to a specific object or not. We obtain them from an extended version of the Hough Forest detector [3]. The idea behind Hough Forests is to store, for each leaf node, the spatial occurrence distribution (relative to the object center) of all patches assigned to that node. During testing, those stored locations cast probabilistic votes for the object center in a Generalized Hough Transform. As shown in [6], the votes corresponding to a local maximum in the Hough space can then be backprojected to the image in order to propagate top-down information to the patches from which they originated. We extend the Hough Forest classifier with this top-down segmentation formalism, using figure-ground labels learned from annotated training examples. Each vote v_j contributing to a Hough space maximum h is backprojected to its originating patch P, augmented with a local figure-ground label Seg(v_j). We can then obtain the figure and ground probabilities for each pixel p by averaging over all patches P_i containing this pixel and summing the backprojected figure-ground labels, weighted by the weight w_{v_j} of the corresponding vote:

$p(p = \mathrm{fig} \mid h) = \frac{1}{\sum_{v_j \in h} w_{v_j}} \sum_{P_i(p)} \sum_{v_j \in h} w_{v_j} \, Seg(v_j)(p)$

Figure 1: Top-down segmentations improve multi-class image labeling. (a) Test image with object detections. (b) Ground truth labeled image. Our algorithm uses top-down segmentations (c) to produce segmentation results (d). (Best viewed in color.)
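The accumulation in the equation above can be sketched as follows. The data structures are our own: each vote supporting the maximum h is assumed to carry its weight, the location of its originating patch, and its patch-local figure-ground mask Seg(v_j).

```python
import numpy as np

def figure_probability(votes, image_shape):
    """votes: list of (weight, patch_rect, seg_mask), one entry per v_j in h,
    where patch_rect = (x, y, w, h) locates the originating patch and
    seg_mask is the patch-local figure-ground label Seg(v_j) in [0, 1]."""
    acc = np.zeros(image_shape, dtype=np.float64)  # weighted figure evidence
    total_weight = sum(w for w, _, _ in votes)     # normalizer: sum_j w_{v_j}
    for w, (x, y, pw, ph), seg in votes:
        # Backproject the vote: add its weighted figure-ground mask over
        # the pixels covered by the originating patch.
        acc[y:y+ph, x:x+pw] += w * seg
    return acc / total_weight                      # soft map p(p = fig | h)
```

The resulting soft figure map is what supplies the per-pixel weights for the generalized robust $P^N$ potentials in the joint CRF.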
doi:10.5244/c.25.79