Training Distributed GP Ensemble With a Selective Algorithm Based on Clustering and Pruning for Pattern Classification

G. Folino, C. Pizzuti, G. Spezzano
2008 IEEE Transactions on Evolutionary Computation  
A boosting algorithm based on cellular genetic programming to build an ensemble of predictors is proposed. The method evolves a population of trees for a fixed number of rounds and, after each round, chooses the predictors to include in the ensemble by applying a clustering algorithm to the population of classifiers. Clustering the population allows the selection of the most diverse and fittest trees, those that best contribute to improving classification accuracy. The proposed method runs on a distributed hybrid environment that combines the island and cellular models of parallel genetic programming. The combination of the two models provides an efficient implementation of distributed GP and, at the same time, generates small, accurate decision trees. The large amount of memory required to store the ensemble, however, makes the method costly to deploy. The paper shows that, by applying suitable pruning strategies, it is possible to select a subset of the classifiers without increasing misclassification errors; indeed, for some data sets, with up to 30% pruning, ensemble accuracy actually increases. Experimental results show that the combination of clustering and pruning enhances the classification accuracy of the ensemble approach.

In this paper a distributed boosting cellular genetic programming classifier to build the ensemble of predictors is proposed. The algorithm, named ClustBoostCGPC (Clustering Boost Cellular Genetic Programming Classifier), runs on a distributed environment based on a hybrid model [2] that combines the island model with the cellular model. The island model enables an efficient implementation of distributed GP; the cellular model, on the other hand, allows the generation of classifiers with better accuracy and reduced tree size. Each node of the network is considered an island containing a learning algorithm, based on cellular genetic programming, whose aim is to generate decision-tree predictors trained on the data stored locally in the node. Each genetic program, though isolated, cooperates with the learning components located on the neighboring nodes of the network and takes advantage of the cellular model by asynchronously exchanging the outermost individuals of the population. ClustBoostCGPC constructs an ensemble of accurate and diverse classifiers by applying a clustering strategy to each subpopulation located on the nodes of the network.
The strategy, at each boosting round, finds groups of individuals that are similar with respect to a similarity measure, and then takes from each cluster the individual with the best fitness. This allows the selection, from each subpopulation, of the most dissimilar and fittest trees. The main drawback of the proposed approach is that the size of the ensemble grows as the number of clusters and the number of nodes in the network increase. Thus we may ask whether it is possible to discard some of these predictors and still obtain comparable accuracy. The paper shows that, by applying suitable pruning strategies, it is possible to select a subset of the classifiers without increasing misclassification errors; indeed, with up to 30% pruning, ensemble accuracy increases. The main contributions of the paper can be summarized as follows. ClustBoostCGPC is a distributed ensemble method that combines a supervised classification method with an unsupervised clustering method to build an ensemble of predictors. Clustering the population of classifiers proved a successful approach: the misclassification error rate of the ensemble decreases noticeably when the ensemble is made up of the best individuals of the clustered populations.
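The per-round selection step described above can be sketched roughly as follows. This is an illustrative approximation, not the paper's actual algorithm: the one-pass medoid-style grouping, the use of prediction agreement on a shared validation set as the similarity measure, and all names are assumptions introduced for the example.

```python
import random

def cluster_and_select(predictions, fitness, k, seed=0):
    """One boosting-round selection step (illustrative sketch):
    group classifiers by the similarity of their predictions on a
    shared validation set, then keep the fittest member of each group.

    predictions: list of prediction tuples, one per classifier
    fitness:     list of fitness values (higher is better)
    k:           number of clusters (= classifiers kept this round)
    Returns the sorted indices of the selected classifiers.
    """
    rng = random.Random(seed)
    n = len(predictions)
    k = min(k, n)

    def agreement(a, b):
        # Similarity = fraction of validation points where two
        # classifiers predict the same label.
        return sum(x == y for x, y in zip(a, b)) / len(a)

    # Pick k medoids at random and assign every classifier to its
    # most similar medoid (a crude one-pass clustering, for
    # illustration only).
    medoids = rng.sample(range(n), k)
    clusters = {m: [] for m in medoids}
    for i in range(n):
        nearest = max(medoids,
                      key=lambda m: agreement(predictions[i], predictions[m]))
        clusters[nearest].append(i)

    # From each non-empty cluster, keep the fittest individual:
    # the selected set is both diverse (one per cluster) and accurate.
    return sorted(max(c, key=lambda i: fitness[i])
                  for c in clusters.values() if c)
```

With one cluster the step degenerates to picking the single fittest classifier; with more clusters it keeps one representative per group of mutually similar trees, which is the diversity-preserving effect the paper attributes to clustering the population.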
doi:10.1109/tevc.2007.906658