Generating highly accurate prediction hypotheses through collaborative ensemble learning

Nino Arsov, Martin Pavlovski, Lasko Basnarkov, Ljupco Kocarev
Scientific Reports, 2017
Ensemble generation is a natural and convenient way of achieving better generalization performance from learning algorithms by gathering their predictive capabilities. Here, we nurture the idea of ensemble-based learning by combining bagging and boosting for the purpose of binary classification. Since the former improves stability through variance reduction, while the latter ameliorates overfitting, a multi-model that combines both strives toward a comprehensive net-balancing of the
bias-variance trade-off. To further improve this, we alter the bagged-boosting scheme by introducing collaboration between the multi-model's constituent learners at various levels. This novel stability-guided classification scheme is delivered in two flavours: collaboration during or after the boosting process. Applied among a crowd of Gentle Boost ensembles, the ability of the two suggested algorithms to generalize is inspected by comparing them against Subbagging and Gentle Boost on various real-world datasets. In both cases, our models obtained a 40% decrease in generalization error. Their true ability to capture details in data, however, was revealed through their application to protein detection in texture analysis of gel electrophoresis images, where they achieved an AUROC of approximately 0.9773, compared to the 0.9574 obtained by an SVM based on recursive feature elimination.

Machine learning has been transforming the world by improving our understanding of artificial intelligence [1-3] and by providing solutions to outstanding problems such as multi-modal parcellation of the human cerebral cortex [4] and materials discovery [5]. A learning algorithm generalizes if, given access to some training set, it returns a hypothesis whose empirical error is close to its true error [6]. There are three main approaches to establishing generalization guarantees: (1) bounding various notions of functional-space capacity, most notably via the VC-dimension [7]; (2) establishing connections between the stability of a learning algorithm and its ability to generalize [8-10]; and (3) the compression-scheme method [11]. Here we describe an effective way to fuse boosting and bagging ensembles in which algorithmic stability directs a novel process of collaboration between the resulting ensemble's weak/strong components, outperforming best-case boosting/bagging across a broad range of applications and under a variety of scenarios.
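To make the bagged-boosting idea concrete, the following is a minimal toy sketch, not the authors' code: boosted decision stumps (plain AdaBoost here, as a stand-in for the Gentle Boost ensembles used in the paper) are each trained on a random subsample without replacement (Subbagging) and combined by majority vote, with AUROC computed by the rank-based formula. All function names and the toy data are illustrative assumptions.

```python
import math
import random

def stump_predict(feature, threshold, polarity, x):
    # Predict +1/-1 from a single-feature threshold rule.
    return polarity if x[feature] >= threshold else -polarity

def train_stump(X, y, w):
    # Exhaustively pick the stump minimizing weighted training error.
    best, best_err = None, float("inf")
    for f in range(len(X[0])):
        for t in sorted({x[f] for x in X}):
            for pol in (+1, -1):
                err = sum(wi for xi, yi, wi in zip(X, y, w)
                          if stump_predict(f, t, pol, xi) != yi)
                if err < best_err:
                    best_err, best = err, (f, t, pol)
    return best, best_err

def boost(X, y, rounds=5):
    # Plain AdaBoost over stumps (stand-in for Gentle Boost).
    n = len(X)
    w, model = [1.0 / n] * n, []
    for _ in range(rounds):
        (f, t, pol), err = train_stump(X, y, w)
        err = min(max(err, 1e-12), 1 - 1e-12)  # clamp for log stability
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, f, t, pol))
        # Reweight: increase weight on misclassified points.
        w = [wi * math.exp(-alpha * yi * stump_predict(f, t, pol, xi))
             for wi, xi, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model

def boost_predict(model, x):
    score = sum(a * stump_predict(f, t, pol, x) for a, f, t, pol in model)
    return 1 if score >= 0 else -1

def subbagged_boost(X, y, n_bags=5, frac=0.8, seed=0):
    # Subbagging: each boosted ensemble sees a subsample drawn
    # without replacement.
    rng = random.Random(seed)
    m = max(1, int(frac * len(X)))
    bags = []
    for _ in range(n_bags):
        idx = rng.sample(range(len(X)), m)
        bags.append(boost([X[i] for i in idx], [y[i] for i in idx]))
    return bags

def ensemble_predict(bags, x):
    # Majority vote across the boosted ensembles.
    return 1 if sum(boost_predict(b, x) for b in bags) >= 0 else -1

def auroc(scores, labels):
    # AUROC as the Mann-Whitney probability that a random positive
    # outranks a random negative, ties counted as half a win.
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l != 1]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

The paper's collaborative variants additionally let the constituent learners exchange information during or after boosting under a stability criterion; the sketch above covers only the baseline bagged-boosting skeleton they build on.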
The algorithms were assessed on various realistic datasets and showed improved performance in all cases, on average by slightly below 40%, compared to their best-case boosting/bagging counterparts. Furthermore, in a medical setting for protein detection in texture analysis of gel electrophoresis images [12], our approach achieves superior performance of approximately 0.9773 area under the ROC curve (AUROC), compared to three machine-learning feature-selection approaches: Multiple Kernel Learning, Recursive Feature Elimination with different classifiers, and a Genetic Algorithm-based approach with Support Vector Machines (SVMs) as decision functions, all of which attain AUROCs of 0.9574 or less. Moreover, when collaboration is effectuated with weak components, our algorithm runs more than five times faster than the underlying boosting algorithm. We anticipate our approach to be a starting point for more sophisticated stability-guided collaborative learning schemes, not necessarily limited to boosting.

Ensemble techniques [13-15] improve the accuracy of predictive analytics and data mining applications. In a typical ensemble method, the base inducers and diversity generators are responsible for generating diverse classifiers that represent the generalized relationship between the input and the target attributes. A strong classifier
doi:10.1038/srep44649 pmid:28304378 pmcid:PMC5356335