Efficiently Mining Interesting Emerging Patterns
Lecture Notes in Computer Science
Knowledge Discovery in Databases (KDD), or Data Mining, is used to discover interesting or useful patterns and relationships in data, with an emphasis on large volumes of observational data. Among the many types of knowledge that can be discovered in data, patterns expressed in terms of features are popular because they can be understood and used directly by people. The recently proposed Emerging Pattern (EP) is one such kind of pattern. Emerging Patterns are sets of items (conjunctions of attribute values) whose frequency changes significantly from one dataset to another. They are useful as a means of discovering distinctions inherently present among a collection of datasets and have been shown to be a powerful basis for constructing accurate classifiers. In this doctoral dissertation, we study the following three major problems involved in the discovery of Emerging Patterns and the construction of classification systems based on them:

1. How can the complete set of Emerging Patterns between two classes of data be discovered efficiently?
2. Which Emerging Patterns are interesting, that is, novel, useful and non-trivial?
3. Which Emerging Patterns are useful for classification purposes, and how can they be used to build efficient and accurate classifiers?

We propose a special type of Emerging Pattern, called the Essential Jumping Emerging Pattern (EJEP). The set of EJEPs is the subset of the Jumping Emerging Patterns (JEPs) that remains after removing those JEPs that potentially contain noise and redundant information. We show that a relatively small set of EJEPs, rather than a large set of JEPs, is sufficient for building accurate classifiers. We generalize the "interestingness" measures for Emerging Patterns, including the minimum support, the minimum growth rate, the subset relationship between EPs, and correlation based on common statistical measures such as the chi-squared value. We show that these "interesting" Emerging Patterns (called Chi EPs) not only capture the essential knowledge for distinguishing two classes of data, but are also excellent candidates for building accurate classifiers. The task of mining Emerging Patterns is computationally difficult for large, dense and high-dimensional datasets due to the "curse of dimensionality". We develop new tree-based pattern fragment growth methods for efficiently mining EJEPs and Chi EPs.
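The notions above can be made concrete with a small sketch. The following Python code (an illustration only, with invented toy data; it is not the dissertation's mining algorithm) computes the support of an itemset in each class, derives its growth rate, flags Jumping EPs (patterns whose growth rate is infinite because they never occur in the reference class), and evaluates the Pearson chi-squared statistic on the 2x2 contingency table of pattern occurrence versus class, the statistical measure underlying Chi EPs:

```python
from math import inf

def support(pattern, dataset):
    """Fraction of transactions in `dataset` that contain every item of `pattern`."""
    pattern = set(pattern)
    return sum(1 for t in dataset if pattern <= set(t)) / len(dataset)

def growth_rate(pattern, d_from, d_to):
    """Growth rate of `pattern` from d_from to d_to; inf marks a Jumping EP."""
    s_from, s_to = support(pattern, d_from), support(pattern, d_to)
    if s_from == 0:
        return inf if s_to > 0 else 0.0
    return s_to / s_from

def chi_squared(n11, n12, n21, n22):
    """Pearson chi-squared for the 2x2 table: rows = pattern present/absent,
    columns = class 1 / class 2."""
    n = n11 + n12 + n21 + n22
    chi2 = 0.0
    for obs, row, col in ((n11, n11 + n12, n11 + n21), (n12, n11 + n12, n12 + n22),
                          (n21, n21 + n22, n11 + n21), (n22, n21 + n22, n12 + n22)):
        exp = row * col / n  # expected count under independence
        chi2 += (obs - exp) ** 2 / exp
    return chi2

# Toy transactions for two classes (hypothetical data).
class1 = [{"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
class2 = [{"a", "b"}, {"a", "b", "d"}, {"b", "d"}, {"a", "d"}]

gr = growth_rate({"a", "b"}, class1, class2)    # support 0.5 in both classes -> 1.0
jep = growth_rate({"d"}, class1, class2) == inf  # "d" never occurs in class1 -> JEP
```

A mining algorithm would apply thresholds on these quantities (minimum support, minimum growth rate, a chi-squared cutoff) to filter candidate patterns; the tree-based pattern fragment growth methods avoid enumerating candidates exhaustively.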
We propose a novel approach that uses Emerging Patterns as the basic means of classification, called Bayesian Classification by Emerging Patterns (BCEP). As a hybrid of EP-based classification and the Naive Bayes (NB) classifier, BCEP offers the following advantages: (1) it is based on a theoretically well-founded mathematical model, as in Large Bayes (LB); (2) it relaxes the strong attribute-independence assumption of NB; (3) it is easy to interpret, because after pruning typically only a small number of Emerging Patterns are used in classification. Real-world classification problems always contain noise, and a reliable classifier should tolerate a reasonable level of it. Our study of the noise tolerance of BCEP shows that it generally handles noise better than other state-of-the-art classifiers. We conduct extensive empirical studies on benchmark datasets from the UCI Machine Learning Repository to show that our EP mining algorithms are efficient and that our EP-based classifiers are superior to other state-of-the-art classification methods in terms of overall predictive accuracy.
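For contrast, the attribute-independence assumption that BCEP relaxes can be seen in a minimal Naive Bayes sketch. This is plain NB on invented toy data, not BCEP itself: NB approximates P(x | c) by the product of per-attribute probabilities P(x_i | c), whereas BCEP lets multi-attribute Emerging Patterns contribute jointly.

```python
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Count class priors and per-attribute conditional counts."""
    priors = Counter(labels)
    cond = defaultdict(Counter)  # (class, attribute index) -> value counts
    for row, c in zip(rows, labels):
        for i, v in enumerate(row):
            cond[(c, i)][v] += 1
    return priors, cond

def predict_nb(priors, cond, row):
    """Pick the class maximizing P(c) * prod_i P(x_i | c)."""
    total = sum(priors.values())
    best, best_p = None, -1.0
    for c, n_c in priors.items():
        p = n_c / total
        for i, v in enumerate(row):
            p *= cond[(c, i)][v] / n_c  # independence assumption applied here
        if p > best_p:
            best, best_p = c, p
    return best

# Hypothetical training data: (outlook, temperature) -> play?
rows = [("sunny", "hot"), ("sunny", "mild"), ("rain", "mild"), ("rain", "hot")]
labels = ["no", "no", "yes", "yes"]
priors, cond = train_nb(rows, labels)
pred = predict_nb(priors, cond, ("rain", "hot"))  # "rain" occurs only under "yes"
```

An EP-based classifier replaces the per-attribute factors with contributions from mined patterns, so that correlated attribute combinations are scored together rather than independently.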