Design and analysis of a multi-dimensional data sampling service for large scale data analysis applications

Xi Zhang, T. Kurc, J. Saltz, S. Parthasarathy
2006 Proceedings 20th IEEE International Parallel & Distributed Processing Symposium  
Sampling is a widely used technique to increase efficiency in database and data mining applications operating on large datasets. In this paper we present a scalable sampling implementation that supports efficient, multi-dimensional spatio-temporal sample generation on dynamic, large scale datasets stored on a storage cluster. The proposed algorithm leverages Hilbert space-filling curves to provide an approximate linear order of multidimensional data while maintaining spatial locality. This new implementation is then bootstrapped on top of our previous implementation, which efficiently samples large datasets along a single dimension (e.g., time), thereby realizing a service for spatio-temporal sampling. We evaluate the performance of our approach by comparing it with the popular R-tree based technique. The experimental results show that our approach achieves up to an order of magnitude higher efficiency and scalability.
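The abstract's central idea, mapping multidimensional points onto a Hilbert curve so that a one-dimensional order approximately preserves spatial locality, can be illustrated with a minimal two-dimensional sketch. This is not the authors' implementation: the function names `hilbert_index` and `hilbert_sort`, the restriction to a square grid whose side is a power of two, and the in-memory sort are all assumptions made for illustration. The index computation follows the standard bitwise Hilbert mapping.

```python
def hilbert_index(n, x, y):
    """Map grid cell (x, y) to its position along a 2-D Hilbert curve.

    Assumption: n is the grid side length and a power of two,
    with 0 <= x < n and 0 <= y < n.
    """
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        # Each quadrant contributes s*s cells to the curve position.
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve is oriented correctly.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d


def hilbert_sort(points, n):
    """Return points ordered by Hilbert index (hypothetical helper)."""
    return sorted(points, key=lambda p: hilbert_index(n, p[0], p[1]))


if __name__ == "__main__":
    grid = [(x, y) for x in range(4) for y in range(4)]
    ordered = hilbert_sort(grid, 4)
    # Taking every 4th point along the curve yields a spatially spread sample.
    print(ordered[::4])
```

Once points are in this linear order, a sampler built for a single dimension can in principle be applied to the Hilbert index in place of, for example, a timestamp. How the paper combines the two is not detailed in the abstract.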
doi:10.1109/ipdps.2006.1639315 dblp:conf/ipps/ZhangKSP06 fatcat:emeykbxpijdtxlqbwdfsid36s4