Dynamic integration of classifiers for handling concept drift

Alexey Tsymbal, Mykola Pechenizkiy, Pádraig Cunningham, Seppo Puuronen
2008 Information Fusion  
In the real world, concepts are often not stable but change with time. A typical example of this in the biomedical context is antibiotic resistance, where pathogen sensitivity may change over time as new pathogen strains develop resistance to antibiotics that were previously effective. This problem, known as concept drift, complicates the task of learning a model from data and requires special approaches, different from commonly used techniques that treat arriving instances as equally important contributors to the final concept. The underlying data distribution may change as well, making previously built models useless. This is known as virtual concept drift. Both types of concept drift make regular updates of the model necessary. Among the most popular and effective approaches to handle concept drift is ensemble learning, where a set of models built over different time periods is maintained and the best model is selected or the predictions of models are combined, usually according to their expertise level regarding the current concept. In this paper we propose the use of an ensemble integration technique that would help to better handle concept drift at an instance level. In dynamic integration of classifiers, each base classifier is given a weight proportional to its local accuracy with regard to the instance tested, and the best base classifier is selected, or the classifiers are integrated using weighted voting. Our experiments with synthetic data sets simulating abrupt and gradual concept drifts and with a real-world antibiotic resistance data set demonstrate that dynamic integration of classifiers built over small time intervals or fixed-sized data blocks can be significantly better than majority voting and weighted voting, which are currently the most commonly used integration techniques for handling concept drift with ensembles.

Introduction
The problem of concept drift is of increasing importance to machine learning and data mining as more and more data is organized in the form of data streams rather than static databases, and it is rather unusual that concepts and data distributions stay stable over a long period of time [23, 30]. Ensemble learning is among the most popular and effective approaches to handle concept drift, in which a set of concept descriptions built over different time intervals is maintained, predictions of which are combined using a form of voting, or the most relevant description is selected [13, 20, 21, 28].
However, current ensemble approaches have a problem: they are not able to deal with local concept drift, which is a common case with real-world data. For example, only particular bacteria may develop resistance to certain antibiotics, while resistance to the others remains the same; or the data distribution can change for particular bacteria depending on the season. At present, the most common integration approach with ensembles for handling concept drift is weighted voting, where each base classifier receives a weight proportional to its relevance to the current concept [13, 20, 21, 28]. With weighted voting, lower weights can be assigned to predictions from base classifiers simply because their global accuracy on the current block of data falls, even if they are still good experts in the stable parts of the data. In this paper, we consider one solution to this problem: replacing the integration (combination) function of the ensemble. To improve the treatment of local concept drifts, dynamic integration of classifiers can be used, which integrates base classifiers at an instance level. In dynamic integration, each base classifier receives a weight proportional to its local accuracy in the neighbourhood of the current test instance, instead of the global classification accuracy used in normal weighted voting. We consider ensemble learning with dynamic integration for handling concept drifts in the rotating hyperplane and SEA concepts data sets, representing simulated gradual and abrupt concept drifts respectively [6, 21]. In addition, we apply dynamic integration to ensembles of classifiers built in the domain of antibiotic resistance in nosocomial infections in order to better handle concept drift.
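The contrast between global weighted voting and dynamic integration can be illustrated with a minimal sketch. It is not the implementation evaluated in the paper: the function names, the toy one-dimensional data, the two constant base classifiers, the Euclidean neighbourhood and the choice k = 3 are all assumptions made for illustration. An "old" model is still an expert in the stable region, while a "new" model is correct only where the concept has drifted.

```python
import math
from collections import defaultdict


def accuracy(clf, data):
    """Fraction of (x, y) pairs in `data` that `clf` labels correctly."""
    return sum(clf(x) == y for x, y in data) / len(data)


def neighbourhood(x, validation, k):
    """The k validation pairs closest to x (Euclidean distance, an assumption)."""
    return sorted(validation, key=lambda pair: math.dist(pair[0], x))[:k]


def weighted_vote(classifiers, weights, x):
    """Return the label with the largest total weight among the predictions."""
    votes = defaultdict(float)
    for clf, w in zip(classifiers, weights):
        votes[clf(x)] += w
    return max(votes, key=votes.get)


def global_weighted_voting(classifiers, validation, x):
    """Weight each classifier by its accuracy on the whole validation block."""
    weights = [accuracy(c, validation) for c in classifiers]
    return weighted_vote(classifiers, weights, x)


def dynamic_voting(classifiers, validation, x, k=3):
    """Weight each classifier by its accuracy in the neighbourhood of x."""
    local = neighbourhood(x, validation, k)
    weights = [accuracy(c, local) for c in classifiers]
    return weighted_vote(classifiers, weights, x)


def dynamic_selection(classifiers, validation, x, k=3):
    """Use only the classifier with the best local accuracy around x."""
    local = neighbourhood(x, validation, k)
    best = max(classifiers, key=lambda c: accuracy(c, local))
    return best(x)


# Hypothetical toy data: the current concept is 1 for x >= 0.6, else 0.
validation = [((v,), 1 if v >= 0.6 else 0)
              for v in ((i + 0.5) / 10 for i in range(10))]


def old_model(x):
    """Hypothetical pre-drift model: still correct in the stable region."""
    return 0


def new_model(x):
    """Hypothetical post-drift model: correct only in the drifted region."""
    return 1


classifiers = [old_model, new_model]
query = (0.9,)  # lies in the drifted region, so the true label is 1

print(global_weighted_voting(classifiers, validation, query))  # prints 0
print(dynamic_voting(classifiers, validation, query))          # prints 1
print(dynamic_selection(classifiers, validation, query))       # prints 1
```

In this toy setting the old model has the higher global accuracy (0.6 against 0.4), so global weighted voting follows it everywhere and misclassifies the drifted region, while the two local schemes switch to the new model near the query. A practical implementation would need a principled way to estimate local accuracy, for example on held-out or cross-validated predictions, which this sketch does not address.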
Our experiments demonstrate that dynamic integration often achieves better classification accuracy than commonly used integration techniques, such as majority voting and weighted voting, on the synthetic data sets and on the problem of antibiotic resistance prediction, supporting our hypothesis that it can be a better technique for handling concept drift. We introduced the idea of using dynamic integration for handling concept drift in [26], with experiments focusing on the problem of antibiotic resistance in nosocomial infections; in this paper we present the dynamic integration approach at the level of detail necessary for possible implementation. We introduce the notion of local concept drift with a simple example, we consider more extensive experiments with data sets representing different types of concept drift, and we discuss possible improvements to the dynamic integration techniques considered by comparing them to other related dynamic integration techniques. This paper is organized as follows: in Section 2 we consider the general problem of concept drift, in Section 3 we introduce the notion of local concept drift, and in Section 4 we review approaches to ensemble integration with a focus on dynamic integration. In Section 5 we consider the basic characteristics of the data sets used for analysis, in Section 6 we present the results of our experiments with the use of different ensemble integration techniques with synthetic and real-world data, in Section 7 we discuss the three dynamic integration techniques considered and possible alternative techniques, and in Section 8 we conclude with a brief summary and a consideration of further research directions.
doi:10.1016/j.inffus.2006.11.002 fatcat:hruij4647bgo7gmxt34togcv4m