Classifying High-Dimensional Text and Web Data Using Very Short Patterns

Hassan H. Malik, John R. Kender
2008, 2008 Eighth IEEE International Conference on Data Mining
In this paper, we propose the "Democratic Classifier", a simple, democracy-inspired pattern-based classification algorithm that uses very short patterns for classification and does not rely on the minimum support threshold. Borrowing ideas from democracy, our training phase allows each training instance to vote for an equal number of candidate size-2 patterns. Similar to the usual democratic election process, where voters select candidates by considering their qualifications, prior ... at the constituency and territory levels, as well as their own perception of candidates, the training instances select patterns by effectively balancing the local, class, and global significance of patterns. In addition, we respect "each voter's opinion" by simultaneously adding shared patterns to all applicable classes, and then apply a novel power-law-based weighting scheme instead of making binary decisions on these patterns. Results of experiments performed on 121 common text and web datasets show that our algorithm almost always outperforms state-of-the-art classification algorithms, without requiring any dataset-specific parameter tuning. On 100 real-life, noisy web datasets, the average absolute classification accuracy improvement was as great as 9.4% over SVM, Harmony, C4.5 and KNN. Also, our algorithm ran about 3.5 times faster than the fastest existing pattern-based classification algorithm.

Three levels of pattern significance

A number of existing rule-based classification algorithms [2, 3, 4] follow a greedy rule-induction process to discover classification rules. In these
doi:10.1109/icdm.2008.139 dblp:conf/icdm/MalikK08 fatcat:riql2nfacvckdcv3zvhw3oh2bu