Online Dictionary Learning for Kernel LMS

2014 IEEE Transactions on Signal Processing  
Adaptive filtering algorithms operating in reproducing kernel Hilbert spaces have demonstrated superiority over their linear counterparts for nonlinear system identification. An undesirable characteristic of these methods, however, is that the order of the filters grows linearly with the number of input data. This dramatically increases the computational burden and memory requirement. A variety of strategies based on dictionary learning have been proposed to overcome this severe drawback.
In the literature, however, there is no theoretical work that strictly analyzes the problem of updating the dictionary in a time-varying environment. In this paper, we present an analytical study of the convergence behavior of the Gaussian least-mean-square algorithm in the case where the statistics of the dictionary elements only partially match the statistics of the input data. This theoretical analysis highlights the need for updating the dictionary in an online way, by discarding the obsolete elements and adding appropriate ones. We introduce a kernel least-mean-square algorithm with ℓ1-norm regularization to automatically perform this task. The stability in the mean of this method is analyzed, and the improvement of performance due to this dictionary adaptation is confirmed by simulations.

where κ_j is the (N × 1) vector with i-th entry κ(u_i, u_j). Online processing of time series data raises the question of how to process an increasing number N of observations as new data is collected. Indeed, an undesirable characteristic of problem (1)-(2) is that the order of the filters grows linearly with the number of input data. This dramatically increases the computational burden and memory requirement of nonlinear system identification methods. To overcome this drawback, several authors have focused on fixed-size models of the form

ψ(·) = Σ_{j=1}^{M} α_j κ(·, u_{ω_j})    (3)

We call D = {κ(·, u_{ω_j})}_{j=1}^{M} the dictionary, which has to be learnt from input data, and M the order of the kernel expansion, by analogy with linear transversal filters. The subscript ω_j allows us to clearly distinguish the dictionary elements u_{ω_1}, …, u_{ω_M} from the input data u_n.

Online identification of kernel-based models generally relies on a two-step process at each iteration: a model-order control step that updates the dictionary, and a parameter update step. This two-step process is the essence of most adaptive filtering techniques with kernels [16]. Based on this scheme, several state-of-the-art linear methods were reconsidered to derive powerful nonlinear generalizations operating in high-dimensional RKHS [17], [18]. On the one hand, the kernel recursive least-squares algorithm (KRLS) was introduced in [19]. It can be seen as a kernel-based counterpart of the RLS algorithm, and it is characterized by a fast convergence speed at the expense of a computational complexity that is quadratic in M. The sliding-window KRLS and extended KRLS algorithms were successively derived in [20], [21] to improve the tracking ability of the KRLS algorithm. More recently, the KRLS tracker algorithm was introduced in [22], with the ability to forget past information using forgetting strategies. This allows the algorithm to track non-stationary input signals, based on the idea of the exponentially-weighted KRLS algorithm [16]. On the other hand, the kernel affine projection algorithm (KAPA) and, as a particular case, the kernel normalized LMS algorithm (KNLMS) were independently introduced in [23]-[26]. The kernel least-mean-square algorithm (KLMS) was presented in [27], [28], and has attracted substantial research interest because of its linear computational complexity in M, superior tracking ability and robustness. However, it converges more slowly than the KRLS algorithm. The KAPA algorithm has intermediate characteristics between the KRLS and KLMS algorithms in terms of convergence speed, computational complexity and tracking ability.
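As an informal illustration of this two-step scheme, the following minimal sketch (not taken from any of the cited works) implements a Gaussian KLMS filter whose dictionary grows under a coherence criterion; the step size mu, kernel bandwidth sigma, coherence threshold mu0 and all variable names are illustrative assumptions.

import numpy as np

def gaussian_kernel(u, v, sigma):
    # Gaussian kernel kappa(u, v) = exp(-||u - v||^2 / (2 sigma^2)).
    return np.exp(-np.sum((np.asarray(u) - np.asarray(v)) ** 2) / (2.0 * sigma ** 2))

class CoherenceKLMS:
    def __init__(self, mu=0.1, mu0=0.5, sigma=0.5):
        self.mu, self.mu0, self.sigma = mu, mu0, sigma
        self.dictionary = []          # dictionary elements u_{omega_j}
        self.alpha = np.zeros(0)      # coefficients of the kernel expansion (3)

    def predict(self, u):
        if not self.dictionary:
            return 0.0
        k = np.array([gaussian_kernel(u, c, self.sigma) for c in self.dictionary])
        return float(self.alpha @ k)

    def update(self, u, d):
        k = np.array([gaussian_kernel(u, c, self.sigma) for c in self.dictionary])
        y = float(self.alpha @ k) if self.dictionary else 0.0
        err = d - y                   # a priori estimation error
        # Model-order control step: insert kappa(., u_n) into the dictionary only
        # if its coherence with the current dictionary does not exceed mu0.
        if not self.dictionary or np.max(k) <= self.mu0:
            self.dictionary.append(np.asarray(u, dtype=float))
            self.alpha = np.append(self.alpha, 0.0)
            k = np.append(k, 1.0)     # kappa(u_n, u_n) = 1 for the Gaussian kernel
        # Parameter update step: stochastic-gradient (LMS) correction of alpha.
        self.alpha = self.alpha + self.mu * err * k
        return err

A filter of this kind is typically run sample by sample, calling update(u_n, d_n) at each time instant n.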
A very detailed analysis of the stochastic behavior of the KLMS algorithm with Gaussian kernel was provided in [29], and a closed-form condition for convergence was recently introduced in [30]. The quantized KLMS algorithm (QKLMS) was proposed in [31], and the QKLMS algorithm with ℓ1-norm regularization was introduced in [32]. Note that the latter uses the ℓ1-norm in order to sparsify the parameter vector α in the kernel expansion (3). A subgradient approach was considered to accomplish this task, which contrasts with the more efficient forward-backward splitting algorithm recommended in [33], [34]. A recent trend within the area of adaptive filtering with kernels consists of extending these algorithms to give them the ability to process complex input signals [35], [36]. The convergence analysis of the complex KLMS algorithm with Gaussian kernel presented in [37] is a direct application of the derivations in [29]. Finally, the quaternion kernel least-squares algorithm was recently introduced in [38].

All the above-mentioned methods use different learning strategies to decide, at each time instant n, whether κ(·, u_n) deserves to be inserted into the dictionary or not. One of the most informative criteria uses the so-called approximate linear dependency (ALD) condition. To ensure the novelty of a candidate for becoming a new dictionary element, this criterion checks that it cannot be well approximated as a linear combination of the samples κ(·, u_{ω_j}) that are already in the dictionary [19]. Other well-known criteria include the novelty criterion [39], the coherence criterion [24], the surprise criterion [40], and the closed-ball sparsification criterion [41]. Without loss of generality, we focus on the KLMS algorithm with coherence criterion due to its simplicity and effectiveness, and because its performance is well described and understood by the theoretical models [29], [30] that are exploited here. However, the dictionary update procedure studied in this paper can be adapted to the above-mentioned filtering algorithms and sparsification criteria without too much effort.

Except for the above-mentioned works [32], [33], most of the existing strategies for dictionary update are only able to incorporate new elements into the dictionary, and to possibly forget old ones using a forgetting factor. This means that they cannot automatically discard obsolete kernel functions, which may be a severe drawback within the context of a time-varying environment. Recently, sparsity-promoting regularization has been considered within the context of linear adaptive filtering [42]-[44]. These works propose to use either the ℓ1-norm of the vector of filter coefficients as a regularization term, or other related regularizers that limit the bias relative to the unconstrained solution. The optimization procedures consist of subgradient descent [42], projection onto the ℓ1-ball [43], or online forward-backward splitting [44]. Surprisingly, this idea has been little used within the context of kernel-based adaptive filtering. To the best of our knowledge, only [33] uses projection for least-squares minimization with weighted block ℓ1-norm regularization, within the context of multi-kernel adaptive filtering.
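To illustrate the forward-backward splitting strategy mentioned above, the sketch below (a hypothetical example, not the procedure of [33] or [44]) performs one online iteration on the coefficient vector α of the kernel expansion (3): a gradient step on the instantaneous squared error, followed by the soft-thresholding proximal step associated with the ℓ1 penalty. The step size mu and regularization weight lam are illustrative values.

import numpy as np

def soft_threshold(x, t):
    # Proximal operator of t * ||.||_1 (component-wise soft-thresholding).
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def forward_backward_step(alpha, k, d, mu=0.1, lam=1e-3):
    # Forward step: LMS gradient descent on the instantaneous squared error,
    # where k is the vector of kernel values kappa(u_n, u_{omega_j}).
    err = d - alpha @ k
    alpha_half = alpha + mu * err * k
    # Backward step: proximal mapping of the l1 penalty, which can set the
    # coefficients of uninformative dictionary elements exactly to zero.
    return soft_threshold(alpha_half, mu * lam)

Because soft-thresholding can zero out coefficients exactly, it offers a natural mechanism for identifying dictionary elements that no longer contribute to the model.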
There is no theoretical work that analyzes the necessity of updating the dictionary in a time-varying environment. In this paper, we present an analytical study of the convergence behavior of the Gaussian least-mean-square algorithm in the case where the statistics of the dictionary elements only partially match the statistics of the input data. This analysis highlights the need for updating the dictionary in an online way, by discarding the obsolete elements and adding appropriate ones. Thus, we introduce a KLMS algorithm with ℓ1-norm regularization in order to automatically perform this task. The stability of this method is analyzed and, finally, it is tested with experiments.

II. BEHAVIOR ANALYSIS OF GAUSSIAN KLMS ALGORITHM WITH PARTIALLY MATCHING DICTIONARY

Signal reconstruction from a redundant dictionary has been extensively addressed during the last decade [45], both theoretically and experimentally. In order to represent a signal with a minimum number of dictionary elements, an efficient approach is to incorporate a sparsity-inducing regularization term such as the ℓ1-norm in order to select the most informative patterns. On the other hand, a classical result of adaptive filtering theory states that, as the length of LMS adaptive filters increases, their mean-square estimation error increases and their convergence speed decreases [18]. This suggests discarding obsolete dictionary elements of KLMS adaptive filters in order to improve their performance in non-stationary environments. To check this property formally, we shall now analyze the behavior of the KLMS algorithm with Gaussian kernel described in [29] in the case where a given proportion of the dictionary elements has stochastic properties distinct from those of the input samples. No theoretical work has been reported on this problem so far.
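As a rough sketch of what such an online dictionary adaptation can look like in practice (an assumption-laden illustration, not the algorithm derived in this paper), coefficients driven to zero by an ℓ1-regularized update can be used to discard the corresponding dictionary elements. The tolerance tol is an arbitrary illustrative value.

import numpy as np

def prune_dictionary(dictionary, alpha, tol=1e-8):
    # Discard dictionary elements whose coefficients were zeroed by the
    # soft-thresholding step, shrinking the order M of the kernel expansion.
    keep = np.abs(alpha) > tol
    return [c for c, flag in zip(dictionary, keep) if flag], alpha[keep]

Combined with a coherence-type insertion rule, this yields a dictionary that can both grow and shrink as the environment changes.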
doi:10.1109/tsp.2014.2318132