SuRF: Identification of Interesting Data Regions with Surrogate Models
2020 IEEE 36th International Conference on Data Engineering (ICDE)
Several data mining tasks focus on repeatedly inspecting multidimensional data regions summarized by a statistic. The value of this statistic (e.g., region-population sizes, order moments) is used to classify the region's interestingness. These regions can be naively extracted from the entire dataspace; however, this is extremely time-consuming and compute-resource demanding. This paper studies the reverse problem: analysts provide a cut-off value for a statistic of interest, and in turn our
... sed framework efficiently identifies multidimensional regions whose statistic exceeds (or is below) the given cut-off value (according to the user's needs). However, as data dimensions and size increase, such a task inevitably becomes laborious and costly. To alleviate this cost, our solution, coined SuRF (SUrrogate Region Finder), leverages historical region evaluations to train surrogate models that learn to approximate the distribution of the statistic of interest. It then uses evolutionary multimodal optimization to effectively and efficiently identify regions of interest regardless of data size and dimensionality. The accuracy, efficiency, and scalability of our approach are demonstrated with experiments on synthetic and real-world datasets and compared with other methods.
Index Terms: Surrogate model estimation, statistical learning, swarm intelligence, evolutionary multimodal optimization.
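The two-stage idea described above (fit a surrogate on past region evaluations, then search the surrogate instead of the data) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the paper's implementation: regions are axis-aligned boxes encoded as centers followed by half-lengths, the statistic is a point count, a k-nearest-neighbour average stands in for the surrogate model, and a simple parallel mutate-and-replace scheme stands in for the evolutionary multimodal optimizer. The function names, the half-length cap of 0.2, and the cut-off of 100 are all hypothetical.

```python
import random


def count_in_region(points, center, half):
    """True statistic (assumed): number of points inside an axis-aligned box."""
    return sum(
        all(abs(p[d] - center[d]) <= half[d] for d in range(len(center)))
        for p in points
    )


def make_knn_surrogate(history, k=5):
    """Build a predictor from past evaluations.

    history: list of (region_vector, statistic) pairs.
    The k-NN average is a stand-in for whatever surrogate model is trained.
    """
    def predict(vec):
        nearest = sorted(
            history,
            key=lambda h: sum((a - b) ** 2 for a, b in zip(h[0], vec)),
        )[:k]
        return sum(y for _, y in nearest) / len(nearest)
    return predict


def find_regions(predict, cutoff, dims, rng, pop_size=40, generations=30, step=0.05):
    """Toy search: each individual competes only with its own mutated child,
    so different individuals can settle on different regions."""
    pop = [
        [rng.random() for _ in range(dims)]
        + [rng.uniform(0.02, 0.2) for _ in range(dims)]
        for _ in range(pop_size)
    ]
    for _ in range(generations):
        for i, parent in enumerate(pop):
            child = []
            for j, v in enumerate(parent):
                # Centers stay in [0, 1]; half-lengths are capped (assumed bound)
                # so regions cannot grow to cover the whole dataspace.
                lo, hi = (0.0, 1.0) if j < dims else (0.02, 0.2)
                child.append(min(hi, max(lo, v + rng.gauss(0, step))))
            if predict(child) >= predict(parent):
                pop[i] = child
    # Only the surrogate is queried here; the data itself is never scanned.
    return [ind for ind in pop if predict(ind) > cutoff]


rng = random.Random(0)
# Synthetic 2-D data: one dense cluster plus uniform background noise.
points = [(rng.gauss(0.3, 0.05), rng.gauss(0.7, 0.05)) for _ in range(500)]
points += [(rng.random(), rng.random()) for _ in range(500)]

# "Historical region evaluations": random boxes with their true statistic.
history = []
for _ in range(400):
    center = [rng.random(), rng.random()]
    half = [rng.uniform(0.02, 0.2), rng.uniform(0.02, 0.2)]
    history.append((center + half, count_in_region(points, center, half)))

predict = make_knn_surrogate(history)
cutoff = 100
found = find_regions(predict, cutoff, dims=2, rng=rng)
```

In this sketch the expensive statistic is computed only while building `history`; the search afterwards touches only `predict`, which is the property that lets the cost of finding regions stay independent of the data size. The regions in `found` exceed the cut-off according to the surrogate, so in practice they would still need to be verified against the real data.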