Correcting the hub occurrence prediction bias in many dimensions

Nenad Tomasev, Krisztian Buza, Dunja Mladenic
2016 Computer Science and Information Systems  
Data reduction is a common pre-processing step for k-nearest neighbor classification (kNN). The existing prototype selection methods implement different criteria for selecting relevant points to use in classification, which constitutes a selection bias. This study examines the nature of the instance selection bias in intrinsically high-dimensional data. In high-dimensional feature spaces, hubs are known to emerge as centers of influence in kNN classification. These points dominate most kNN sets and are often detrimental to classification performance. Our experiments reveal that different instance selection strategies bias the predictions of the behavior of hub-points in high-dimensional data in different ways. We propose to introduce an intermediate un-biasing step when training the neighbor occurrence models and we demonstrate promising improvements in various hubness-aware classification methods, on a wide selection of high-dimensional synthetic and real-world datasets.
doi:10.2298/csis140929039t fatcat:c5veelea4bb2ho7lgw6btmzu4q
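
The abstract's notion of hubs, points that appear in many kNN sets, can be illustrated with a minimal sketch. It is not the authors' method: the function names `k_occurrences` and `skewness`, the Euclidean metric, the Gaussian toy data, and the use of skewness as a summary of hubness are all assumptions made for illustration.

```python
import numpy as np

def k_occurrences(X, k):
    """Count how often each row of X appears in the k-nearest-neighbor
    sets of the other rows (Euclidean distance; an assumed metric)."""
    sq = np.sum(X * X, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)  # squared distances
    np.fill_diagonal(d2, np.inf)  # a point is not its own neighbor
    nn = np.argpartition(d2, k - 1, axis=1)[:, :k]  # k nearest per row
    return np.bincount(nn.ravel(), minlength=X.shape[0])

def skewness(counts):
    """Third standardized moment of the occurrence counts."""
    c = counts - counts.mean()
    return float(np.mean(c ** 3) / np.mean(c ** 2) ** 1.5)

# Toy comparison on synthetic Gaussian data (hypothetical setup).
rng = np.random.default_rng(0)
for dim in (3, 100):
    counts = k_occurrences(rng.standard_normal((500, dim)), k=5)
    print(dim, counts.max(), round(skewness(counts), 2))
```

A strongly right-skewed count distribution means a few points occur in very many kNN sets; those points are the hubs the abstract refers to.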