Cross-Domain Matching with Squared-Loss Mutual Information
IEEE Transactions on Pattern Analysis and Machine Intelligence
The goal of cross-domain matching (CDM) is to find correspondences between two sets of objects in different domains in an unsupervised way. CDM has various interesting applications, including photo album summarization, where photos are automatically aligned into a designed frame expressed in the Cartesian coordinate system, and temporal alignment, which aligns sequences such as videos that are potentially expressed using different features. In this paper, we propose an information-theoretic CDM
framework based on squared-loss mutual information (SMI). The proposed approach can directly handle non-linearly related objects/sequences with different dimensions, and its hyper-parameters can be objectively optimized by cross-validation. We apply the proposed method to several real-world problems including image matching, unpaired voice conversion, photo album summarization, cross-feature video and cross-domain video-to-mocap alignment, and Kinect-based action recognition, and experimentally demonstrate that the proposed method is a promising alternative to state-of-the-art CDM methods.

Keywords: Cross-Domain Object Matching, Cross-Domain Temporal Alignment, Squared-Loss Mutual Information.

Introduction

Matching/alignment of objects/time-series from different domains is an important task in the machine learning, data mining, and computer vision communities. Applications include photo album summarization, cross-feature video and cross-domain video-to-mocap alignment, activity recognition, temporal segmentation, and curve matching [1, 2, 3, 4, 5, 6]. In this paper, we propose a general information-theoretic cross-domain matching (CDM) framework based on squared-loss mutual information.

In particular, we address two CDM problems: cross-domain object matching and cross-domain temporal alignment. The difference between the two CDM problems is subtle. In object matching, the relative ordering within the sets does not matter, whereas in temporal alignment, the relative ordering within each set must be preserved.

Cross-Domain Object Matching (CDOM): The objective of cross-domain object matching (CDOM) is to match two sets of objects in different domains. For instance, in photo album summarization, photos are automatically assigned into a designed frame expressed in the Cartesian coordinate system (see Figure 5 (a)).
A typical approach to CDOM is to find a mapping from objects in one domain (photos) to objects in the other domain (frame) so that the pairwise dependency is maximized. In this scenario, accurately evaluating the dependence between objects is the key challenge. Kernelized sorting (KS) tries to find a mapping between two domains that maximizes mutual information (MI) under the Gaussian assumption. However, since the Gaussian assumption may not be fulfilled in practice, this method (which we refer to as KS-MI) tends to perform poorly. To overcome the limitation of KS-MI, Quadrianto et al. proposed using the kernel-based dependence measure called the Hilbert-Schmidt independence criterion (HSIC) for KS. Since HSIC is a distribution-free independence measure, KS with HSIC (which we refer to as KS-HSIC) is more flexible than KS-MI. However, HSIC includes the Gaussian kernel width as a tuning parameter, and its choice is crucial in obtaining the desired performance.

In this paper, we propose an alternative CDOM method that can naturally address the model selection problem. The proposed method, called least-squares object matching (LSOM), employs squared-loss mutual information (SMI) as the dependence measure. An advantage of LSOM is that cross-validation (CV) with respect to the SMI criterion is possible. Thus, all the tuning parameters, such as the Gaussian kernel width and the regularization parameter, can be objectively determined by CV. Through experiments on image matching, unpaired voice conversion, and photo album summarization tasks, LSOM is shown to be a promising alternative to existing CDOM methods, outperforming competing methods.
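To make the dependence-maximization view of CDOM concrete, the following is a minimal sketch of the KS-HSIC objective discussed above, not of the proposed LSOM. It assumes the standard biased empirical HSIC estimate tr(KHLH)/n^2 (up to the normalization constant) with Gaussian kernels, and it replaces the iterative optimization used in kernelized sorting with a brute-force search over permutations, which is only feasible for a handful of objects. The function names and the width arguments are illustrative choices, not taken from the paper.

```python
import itertools
import numpy as np

def gaussian_kernel(z, width):
    """Gaussian kernel matrix of the samples in z (one sample per row)."""
    z = np.asarray(z, dtype=float).reshape(len(z), -1)
    sq_dist = ((z[:, None, :] - z[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / (2.0 * width ** 2))

def hsic(K, L):
    """Biased empirical HSIC between two kernel matrices of the same size."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    return np.trace(K @ H @ L @ H) / n ** 2

def match_by_hsic(x, y, width_x, width_y):
    """Pair x[i] with y[perm[i]] so that HSIC of the paired sets is maximal.

    Assumption: exhaustive search stands in for the optimizer of
    kernelized sorting; use it only for very small sets.
    """
    K = gaussian_kernel(x, width_x)
    L = gaussian_kernel(y, width_y)
    best_val, best_perm = -np.inf, None
    for perm in itertools.permutations(range(len(x))):
        p = list(perm)
        val = hsic(K, L[np.ix_(p, p)])  # kernel matrix of the reordered y
        if val > best_val:
            best_val, best_perm = val, p
    return best_perm, best_val
```

Note that `width_x` and `width_y` must be supplied by hand here; the HSIC value itself gives no objective way of choosing them, which is the model selection issue that motivates the SMI-based approach.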
Cross-Domain Temporal Alignment (CDTA): Temporal alignment of sequences is an important problem with many practical applications such as speech recognition [11, 12], activity recognition, temporal segmentation, curve matching, chromatographic and micro-array data analysis, synthesis of human motion, and temporal alignment of human motion [3, 15].

Dynamic time warping (DTW) is a classical temporal alignment method that aligns two sequences by minimizing the pairwise distance between samples [11, 12] (e.g., under the Euclidean, squared Euclidean, or Manhattan distance measures). An advantage of DTW is that the minimization can be efficiently carried out by dynamic programming (DP). However, because DTW typically relies on a fixed sample-wise notion of distance, it may not be able to find a good alignment when two signals are related in complex ways (e.g., a video and its negative are perceptually similar but would result in a large sample-to-sample distance and DTW score). Moreover, DTW cannot handle sequences with different dimensions (e.g., video to audio alignment), which limits the range of applications significantly. Even if the dimensionality is the same, it is not clear which distance measure is the most appropriate for a given application.

To overcome the weaknesses of DTW, canonical time warping (CTW) was introduced. CTW performs sequence alignment in a common latent space found by canonical correlation analysis (CCA). Thus, CTW can naturally handle sequences with different dimensions. However, CTW can only deal with linear subspace projections, and it is difficult to optimize model parameters, such as the regularization parameter used in CCA and the dimensionality of the common latent space. To handle non-linearity, dynamic manifold temporal warping (DMTW) was recently proposed.
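Since DTW serves as the baseline for the alignment methods discussed here, a minimal sketch of the classical DTW recursion may be helpful. It assumes the squared Euclidean distance between samples and the usual three step directions (diagonal, vertical, horizontal); the function name `dtw` and its return format are illustrative choices. Both sequences must have the same dimensionality, which is exactly the restriction noted above.

```python
import numpy as np

def dtw(x, y):
    """Classical DTW between sequences x (n samples) and y (m samples).

    Returns the accumulated squared Euclidean cost of the best alignment
    and the alignment path as a list of (index in x, index in y) pairs.
    """
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    y = np.asarray(y, dtype=float).reshape(len(y), -1)
    n, m = len(x), len(y)
    # Pairwise squared Euclidean distances between all samples.
    cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=2)

    # acc[i, j]: minimal cost of aligning x[:i] with y[:j].
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]
            )

    # Backtrack from the end to recover the alignment path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin((acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    path.reverse()
    return acc[n, m], path
```

For example, `dtw([0, 1, 2], [0, 0, 1, 2])` yields a total cost of 0 with the path `[(0, 0), (0, 1), (1, 2), (2, 3)]`, because the repeated first sample of the second sequence is absorbed by a horizontal step. In contrast, `dtw([0, 1, 2], [0, -1, -2])` gives a strictly positive cost even though the two sequences are perfectly dependent, which illustrates the limitation of a fixed sample-wise distance.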
DMTW first projects the original data onto a one-dimensional non-linear manifold and then finds an alignment on this manifold using DTW. Although DMTW is highly flexible by construction, its performance depends heavily on the choice of the non-linear transformation and, moreover, it implicitly assumes the smoothness of sequences.

In this paper, we propose a novel information-theoretic CDTA method based on dependence maximization. Our method, which we call least-squares dynamic time warping (LSDTW), employs SMI as the dependence measure. Our method can naturally deal with non-linearity and non-Gaussianity in data, and CV is available for model selection. Furthermore, LSDTW does not require strong assumptions on the topology of the latent manifold (e.g., smoothness). Thus, LSDTW is expected to perform well in a broader range of applications. Through experiments on synthetic data, video sequence alignment, and Kinect action recognition tasks, LSDTW is shown to be a promising alternative to existing temporal alignment methods.

A preliminary version of this work, which focused only on SMI-based CDOM, appeared previously. In this journal version, we further explore SMI-based CDTA and provide a more extensive experimental evaluation.

2 Squared-Loss Mutual Information

We first review squared-loss mutual information (SMI).