Low-Rank Spatio-Temporal Video Segmentation

Alasdair Newson, Mariano Tepper, Guillermo Sapiro
2015 Proceedings of the British Machine Vision Conference 2015
Recently, a great deal of interest has been generated by the technique known as Robust Principal Component Analysis (RPCA) of Candès et al. [1], which addresses the problem of separating a matrix into a low-rank and a sparse component. This very general formulation can be used for tasks such as background estimation in videos and face recognition. In the case of background estimation, the low-rank matrix models the background, and the sparse matrix corresponds to the foreground. A considerable drawback of this approach is its poor robustness to local lighting conditions. If lighting conditions vary locally, one of two things may happen: either the method incorporates the lighting variation into the foreground, which is clearly undesirable, or the rank of the background model is allowed to increase. Unfortunately, this second option means that the true foreground is likely to be absorbed into the background, especially for objects which are static for a short while.
Here, we propose to model the background as a piece-wise low-rank matrix. In this manner, it becomes possible to extract several localised models which correspond to coherent lighting conditions. However, this requires segmenting the input video into such coherent regions. We refer to this problem as low-rank spatio-temporal video segmentation. We present an algorithm to address this segmentation problem, based on region merging and spectral clustering techniques. We show that by carrying out a local RPCA in each region, the results of foreground/background separation are greatly improved, in comparison with both the standard RPCA and several other well-known background estimation techniques.
Let $X \in \mathbb{R}^{m \times n}$ represent an input video in matrix form. Each frame contains $m$ pixels, and there are a total of $n$ frames in the video. The goal of RPCA is to decompose $X$ as $X \approx L + S$, where $L$ is the low-rank matrix and $S$ is the sparse matrix. Unfortunately, the rank of a matrix is a non-convex function, so a convex surrogate, the nuclear norm, is used instead. Thus, the background/foreground separation problem may be formulated as follows:
$$\min_{L,S} \; \|X - L - S\|_F^2 + \lambda_* \|L\|_* + \lambda \|S\|_1 \qquad (1)$$
where $\|L\|_* = \sum_i \sigma_i(L)$ is the nuclear norm of $L$ and $\sigma_i(L)$ is the $i$-th singular value of $L$. The scalars $\lambda_*$ and $\lambda$ are optimisation parameters, $\|\cdot\|_F$ is the Frobenius matrix norm and $\|\cdot\|_1$ is the $\ell_1$ matrix norm, which induces sparsity in the foreground matrix.
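As an illustration of the relaxed formulation, the following is a minimal sketch and not the authors' solver. It assumes the objective has the form $\|X - L - S\|_F^2 + \lambda_* \|L\|_* + \lambda \|S\|_1$ (the constant scaling of the data term is an assumption) and minimises it by alternating two exact block updates: singular value thresholding for $L$ and entrywise soft-thresholding for $S$. The function names and the iteration count are hypothetical choices made for this example.

```python
import numpy as np


def soft_threshold(A, tau):
    """Entrywise soft-thresholding: proximal operator of tau * ||.||_1."""
    return np.sign(A) * np.maximum(np.abs(A) - tau, 0.0)


def singular_value_threshold(A, tau):
    """Soft-threshold the singular values: proximal operator of tau * ||.||_*."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt


def objective(X, L, S, lam_star, lam):
    """Assumed objective: squared Frobenius error + nuclear norm + l1 norm."""
    return (np.linalg.norm(X - L - S, "fro") ** 2
            + lam_star * np.linalg.norm(L, "nuc")
            + lam * np.abs(S).sum())


def relaxed_rpca(X, lam_star, lam, n_iter=50):
    """Alternating minimisation over L (background) and S (foreground)."""
    X = np.asarray(X, dtype=float)
    L = np.zeros_like(X)
    S = np.zeros_like(X)
    for _ in range(n_iter):
        # The data term is not halved, so each threshold is half the weight.
        L = singular_value_threshold(X - S, lam_star / 2.0)
        S = soft_threshold(X - L, lam / 2.0)
    return L, S
```

Each update minimises the objective exactly with respect to one block, so the objective value cannot increase from one iteration to the next. Here `X` is expected as a pixels-by-frames matrix, matching the notation above.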
To segment $X$ into regions where the low-rank requirement is respected, we start by creating a regular 3D grid, denoted $\Omega$, on the video domain. Each $\Omega_i$ corresponds to a rectangular cuboid of video information. We then create an undirected, weighted graph in which each node represents a region $\Omega_i$ and is connected to the regions around it with 6-connectivity. Our goal is to cluster this graph using spectral clustering techniques. The main challenge is to design a cost function which measures how "coherent" two regions are in terms of their low-rank background representation.
More formally, consider two candidate regions to merge, $\Omega_i$ and $\Omega_j$. We wish to determine whether it is better to decompose the regions separately or jointly. The decomposition of $\Omega_i$ will be $\Omega_i \approx L_i + S_i$, and similarly for $\Omega_j$. Our first observation is that it is easier to compare the coherence of the decompositions resulting from a rank-constrained version of Equation (1):
$$\min_{L,S} \; \|X - L - S\|_F^2 + \lambda \|S\|_1 \quad \text{subject to } \operatorname{rank}(L) \le r.$$
The comparisons are made clearer because the $\lambda_*$ parameter is removed and replaced with one which is more easily interpretable: the maximum rank of each local model, $r$. Once the decompositions of $\Omega_i$, $\Omega_j$ and $\Omega_{i \cup j}$ are obtained in this manner, we can calculate the cost of merging the two regions. Let $e_i = \|X_i - L_i - S_i\|_F^2$ be the quadratic error of the decomposition of $\Omega_i$, and similarly for $\Omega_j$ and $\Omega_{i \cup j}$. Our cost function is: [equation omitted in this extract] where $\phi_{i \cup j}$ is a scalar. Once we have established the cost of merging two regions, we convert it into a similarity cost, and cluster the resulting graph using robust spectral clustering techniques [5].
[Figure 1: Illustration and results of the algorithm. Panels: two frames from a video with locally varying lighting; foreground detection using standard RPCA (foreground in green); graph creation and graph clustering along the time axis; segmentation into regions with locally low-rank background; foreground detection using the (proposed) local RPCA.]
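The graph construction and the merge errors can be sketched as follows. This is a hedged illustration under several assumptions, not the authors' implementation: the video is stored as a (rows, cols, frames) array, each cuboid is flattened to a pixels-by-frames matrix, the joint matrix of two neighbouring cuboids is formed by stacking pixels (spatial neighbours) or concatenating frames (temporal neighbours), and the rank-constrained problem is handled by a simple alternating heuristic. All function names are hypothetical. The sketch stops at the three errors $e_i$, $e_j$ and $e_{i \cup j}$; the merge cost itself and its conversion to a similarity are not reproduced.

```python
import itertools
import numpy as np


def grid_edges(shape):
    """Edges of a 6-connected graph on a 3D grid of regions (rows, cols, time)."""
    edges = []
    for idx in itertools.product(*(range(s) for s in shape)):
        for axis in range(3):
            nb = list(idx)
            nb[axis] += 1  # forward neighbour only, so each edge appears once
            if nb[axis] < shape[axis]:
                edges.append((idx, tuple(nb)))
    return edges


def rank_constrained_decomposition(X, r, lam, n_iter=30):
    """Heuristic for min ||X-L-S||_F^2 + lam*||S||_1 subject to rank(L) <= r."""
    X = np.asarray(X, dtype=float)
    S = np.zeros_like(X)
    for _ in range(n_iter):
        # Best rank-r approximation of X - S (truncated SVD).
        U, s, Vt = np.linalg.svd(X - S, full_matrices=False)
        L = (U[:, :r] * s[:r]) @ Vt[:r]
        # Entrywise soft-thresholding of the residual.
        R = X - L
        S = np.sign(R) * np.maximum(np.abs(R) - lam / 2.0, 0.0)
    return L, S


def decomposition_error(X, r, lam):
    """Quadratic error e = ||X - L - S||_F^2 of the decomposition."""
    L, S = rank_constrained_decomposition(X, r, lam)
    return np.linalg.norm(X - L - S, "fro") ** 2


def region_matrix(video, idx, block):
    """Cuboid `idx` of a (rows, cols, frames) video as a pixels-by-frames matrix."""
    sl = tuple(slice(i * b, (i + 1) * b) for i, b in zip(idx, block))
    cube = video[sl]
    return cube.reshape(-1, cube.shape[2])


def joint_matrix(video, idx_i, idx_j, block):
    """Matrix of the union of two neighbouring cuboids (assumed layout)."""
    Xi = region_matrix(video, idx_i, block)
    Xj = region_matrix(video, idx_j, block)
    if idx_i[2] != idx_j[2]:  # temporal neighbours: same pixels, more frames
        return np.hstack([Xi, Xj])
    return np.vstack([Xi, Xj])  # spatial neighbours: more pixels, same frames


def merge_errors(video, block, r, lam):
    """Errors (e_i, e_j, e_union) for every edge of the region graph."""
    shape = tuple(n // b for n, b in zip(video.shape, block))
    errors = {}
    for i, j in grid_edges(shape):
        errors[(i, j)] = (
            decomposition_error(region_matrix(video, i, block), r, lam),
            decomposition_error(region_matrix(video, j, block), r, lam),
            decomposition_error(joint_matrix(video, i, j, block), r, lam),
        )
    return errors
```

The dictionary returned by `merge_errors` holds, for each pair of neighbouring cuboids, the quantities from which a merge cost would be computed. A practical version would cache the per-region errors instead of recomputing them for every edge.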
Figure 1 illustrates the problems caused by locally varying lighting conditions: either the foreground is merged into the background (second row, left), or the global (standard) RPCA is not able to represent local lighting changes (second row, right). This is corrected by segmenting the video and carrying out a local RPCA in each region. We compare our algorithm qualitatively and quantitatively with several algorithms from the literature [2, 3, 4] and find greatly improved performance in challenging situations.
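The idea of running a separate decomposition in each region can be sketched as below. This is a simplified illustration, not the authors' pipeline: it assumes, for brevity, that each region is a fixed set of pixels over all frames (the regions in the text are spatio-temporal), that a hypothetical `labels` array with one region label per pixel is already available, and that each region is handled by the same alternating rank-constrained heuristic, here named `rank_r_rpca`.

```python
import numpy as np


def rank_r_rpca(X, r, lam, n_iter=30):
    """Heuristic for min ||X-L-S||_F^2 + lam*||S||_1 subject to rank(L) <= r."""
    X = np.asarray(X, dtype=float)
    S = np.zeros_like(X)
    for _ in range(n_iter):
        # Best rank-r approximation of X - S (truncated SVD).
        U, s, Vt = np.linalg.svd(X - S, full_matrices=False)
        L = (U[:, :r] * s[:r]) @ Vt[:r]
        # Entrywise soft-thresholding of the residual.
        R = X - L
        S = np.sign(R) * np.maximum(np.abs(R) - lam / 2.0, 0.0)
    return L, S


def local_rpca(X, labels, r, lam):
    """Decompose each region separately.

    X      : pixels-by-frames matrix.
    labels : one region label per pixel (hypothetical segmentation output).
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    L = np.zeros_like(X)
    S = np.zeros_like(X)
    for region in np.unique(labels):
        rows = np.flatnonzero(labels == region)
        L[rows], S[rows] = rank_r_rpca(X[rows], r, lam)
    return L, S
```

For a toy video whose two halves follow different lighting curves, each half is rank one on its own while the full matrix is rank two, so a local rank-one model can represent the background of each half without pushing the lighting variation into `S`.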
doi:10.5244/c.29.103 dblp:conf/bmvc/NewsonTS15 fatcat:r2omuf5fmvcqlgkck36dat225u