Incremental Learning of Concept Drift from Streaming Imbalanced Data
IEEE Transactions on Knowledge and Data Engineering
Learning in nonstationary environments, also known as learning concept drift, is concerned with learning from data whose statistical characteristics change over time. Concept drift is further complicated if the dataset is class-imbalanced. While these two issues have been independently addressed, their joint treatment has been mostly underexplored. We describe two ensemble-based approaches for learning concept drift from imbalanced data. Our first approach is a logical combination of our previously introduced Learn++.NSE algorithm for concept drift with the well-established SMOTE for learning from imbalanced data. Our second approach makes two major modifications to the Learn++.NSE-SMOTE integration: replacing SMOTE with a sub-ensemble that makes strategic use of minority class data, and replacing Learn++.NSE and its class-independent error weighting mechanism with a penalty constraint that forces the algorithm to balance accuracy on all classes. The primary novelty of this approach is in determining the voting weights for combining ensemble members, based on each classifier's time- and imbalance-adjusted accuracy on current and past environments. Favorable results in comparison to other approaches indicate that both approaches are able to address this challenging problem, each with its own specific areas of strength. We also release all experimental data as a resource and benchmark for future research.

Index Terms: incremental learning, concept drift, class imbalance, multiple classifier systems.

--------------------

INTRODUCTION

Computational models of learning are typically developed for a particular problem domain, and optimized for specific conditions within that domain. These conditions usually dictate or restrict the amount and nature of the available data for training, the distributions from which such data are drawn, or the mechanism by which data become available, any of which can make it difficult to address multiple problem domains concurrently. The two problem domains featured in this paper, namely learning concept drift (i.e., learning in nonstationary environments) and learning from imbalanced data (i.e., with very few positive and many negative instances), are good examples, as there are well-established approaches for each. Many recent efforts, by us as well as other researchers, have separately focused on concept drift and class imbalance.
A more general learning framework for accommodating the joint problem, that is, learning from a drifting (nonstationary) environment that also provides severely imbalanced data, is largely underexplored. Given the omnipresence of real-world applications, such as climate monitoring, spam filtering, or fraud detection, the importance of developing such a framework can hardly be overstated. For example, in a spam identification problem, an official work-related e-mail address may receive many legitimate and few spam e-mails; the goal is then to identify the minority class (spam) so that those messages can be removed. Conversely, a personal e-mail address may receive a large number of spam e-mails but few work-related ones, in which case the goal is to identify the minority class (work-related) e-mails so that they can be saved. Both cases are also concept drift problems, as the characteristics of both spam and legitimate e-mails change over time, in part due to increasingly creative techniques used by spammers, and in part due to changing trends in user interest. Hence, this is an example of the joint problem of incremental learning of concept drift from class-imbalanced data.
Combining the definition of incremental learning, as suggested by several authors [1-3], with Kuncheva's and Bifet's desiderata for nonstationary learning algorithms [4], [5], we obtain the following desired properties of a general framework for learning concept drift from imbalanced data: (i) learning new knowledge: building upon the current model using new data to learn novel knowledge in a wide spectrum of nonstationary environments; (ii) preserving previous knowledge: determining what previous knowledge is still relevant (and hence should be preserved) and what is no longer relevant (and hence should be discarded / forgotten), with the added ability to recall discarded information if the drift or change follows a cyclical nature; (iii) one-pass (incremental) learning: learning one instance or one batch at a time without requiring access to previously seen data; and (iv) balanced minority/majority class performance: maintaining high accuracy (recall) on the minority class without sacrificing majority class performance. This paper describes such a framework, comprising two related ensemble-based incremental learning approaches, namely Learn++.CDS and Learn++.NIE, neither of which places any restrictions on how slow, fast, abrupt, gradual, or cyclical the change in distributions may be. Both approaches are also designed to handle class imbalance, and are able to learn from new data that become available in batches, without requiring access to data seen in previous batches. The streaming and nonstationary nature of the data strictly requires incremental learning, which raises the so-called stability-plasticity dilemma, where "stability" describes retaining existing knowledge (for learning stationary subspaces or remembering recurring nonstationary distributions) and "plasticity" refers to learning new knowledge.
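The one-pass requirement in property (iii) can be illustrated with a minimal sketch of a growing, weighted ensemble: each batch trains one new member and is then discarded, so only the models themselves persist. The `CentroidClassifier` base learner and the current-batch accuracy weighting are our own simplifications for illustration; they are not the actual Learn++ machinery, which uses time-adjusted errors averaged over recent environments.

```python
import math

class CentroidClassifier:
    """Toy base learner: predicts the class whose training mean (centroid)
    is nearest to the query point."""
    def fit(self, X, y):
        self.centroids = {}
        for label in set(y):
            pts = [x for x, lab in zip(X, y) if lab == label]
            self.centroids[label] = [sum(col) / len(pts) for col in zip(*pts)]
        return self

    def predict(self, X):
        return [min(self.centroids,
                    key=lambda c: math.dist(x, self.centroids[c]))
                for x in X]

class IncrementalEnsemble:
    """One-pass batch learning: each new batch trains one new member, the raw
    batch is then discarded, and members vote with weights set by their
    accuracy on the most recent batch only (a deliberate simplification of
    Learn++.NSE's time-adjusted error averaging)."""
    def __init__(self):
        self.members, self.weights = [], []

    def partial_fit(self, X, y):
        self.members.append(CentroidClassifier().fit(X, y))
        # Re-weight every member on the current batch, the only data in hand;
        # members that no longer match the drifted distribution lose weight.
        self.weights = [
            sum(p == t for p, t in zip(m.predict(X), y)) / len(y)
            for m in self.members
        ]

    def predict(self, X):
        labels = []
        for x in X:
            votes = {}
            for m, w in zip(self.members, self.weights):
                pred = m.predict([x])[0]
                votes[pred] = votes.get(pred, 0.0) + w
            labels.append(max(votes, key=votes.get))
        return labels
```

Note that older members are never deleted here; they are merely down-weighted, which is what allows recall of discarded knowledge if a cyclical drift returns to an earlier distribution.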
We show that an ensemble-of-classifiers learning model that uses carefully selected instances, with strategically and dynamically assigned weights for combining member classifiers, can indeed learn in such a nonstationary environment and achieve a meaningful balance of stability and plasticity, even in the presence of class imbalance. The primary contribution of this paper is such a general framework for learning from a stream of class-imbalanced data whose underlying distributions may be changing over time. This work complements our prior work on the Learn++.NSE (incremental learning for NonStationary Environments) algorithm for learning concept drift. Learn++.NSE trains a new classifier for each new batch of data and combines them using dynamically weighted majority voting, where voting weights are based on classifiers' time-adjusted errors averaged over recent environments. However, Learn++.NSE, like other algorithms not specifically designed to accommodate class imbalance, becomes biased toward the majority class in cases of severe class imbalance. Two approaches are presented in this paper to develop a model that can learn concept drift from imbalanced data. The first is a natural combination of Learn++.NSE with the Synthetic Minority Over-sampling TEchnique (SMOTE), a well-established over-sampling approach that generates strategically positioned synthetic minority data. The second approach replaces Learn++.NSE and its class-independent raw classification error with a new penalty constraint that forces the algorithm to balance accuracy on all classes.
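SMOTE's core idea, generating synthetic minority instances by interpolating between a minority instance and one of its k nearest minority-class neighbors, can be sketched in a few lines. This is a toy illustration, not the paper's implementation; the function name and parameters are ours.

```python
import math
import random

def smote(minority, n_synthetic, k=3, seed=0):
    """Toy SMOTE sketch: each synthetic sample lies on the line segment
    between a randomly chosen minority instance and one of its k nearest
    minority-class neighbours, so new points stay inside the minority region."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        x = rng.choice(minority)
        # k nearest minority neighbours of x (identity check excludes x itself)
        neighbours = sorted((p for p in minority if p is not x),
                            key=lambda p: math.dist(x, p))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([xi + gap * (ni - xi) for xi, ni in zip(x, nb)])
    return synthetic
```

In a combined setting of the kind described above, such over-sampling would be applied to each incoming batch before a new ensemble member is trained, so the base learner sees a more balanced class distribution without any change to the ensemble machinery itself.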