A Scalable Constant-Memory Sampling Algorithm for Pattern Discovery in Large Databases [chapter]

Tobias Scheffer, Stefan Wrobel
2002 Lecture Notes in Computer Science  
Many data mining tasks can be seen as an instance of the problem of finding the most interesting (according to some utility function) patterns in a large database. In recent years, significant progress has been achieved in scaling algorithms for this task to very large databases through the use of sequential sampling techniques. However, except for sampling-based greedy algorithms which cannot give absolute quality guarantees, the scalability of existing approaches to this problem is only with respect to the data, not with respect to the size of the pattern space: it is universally assumed that the entire hypothesis space fits in main memory. In this paper, we describe how this class of algorithms can be extended to hypothesis spaces that do not fit in memory while maintaining the algorithms' precise ε-δ quality guarantees. We present a constant memory algorithm for this task and prove that it possesses the required properties. In an empirical comparison, we compare variable memory and constant memory sampling.
doi:10.1007/3-540-45681-3_33 fatcat:4mbvu2za3jec7l4wm4cqtypmma