Scalable Rule-Based Gene Expression Data Classification

Mark A. Iwen, Willis Lang, Jignesh M. Patel
2008 IEEE 24th International Conference on Data Engineering, 2008
Current state-of-the-art association rule-based classifiers for gene expression data operate in two phases: (i) association rule mining from training data, followed by (ii) classification of query data using the mined rules. In the worst case, these methods require an exponential search over the subset space of the training data set's samples and/or genes during at least one of these two phases. Hence, existing association rule-based techniques are prohibitively computationally expensive on large gene expression datasets. Our main result is the development of a heuristic rule-based gene expression data classifier called Boolean Structure Table Classification (BSTC). BSTC is explicitly related to association rule-based methods, but is guaranteed to be polynomial space/time. Extensive cross validation studies on several real gene expression datasets demonstrate that BSTC retains the classification accuracy of current association rule-based methods while being orders of magnitude faster than the leading classifier RCBT on large datasets. As a result, BSTC is able to finish table generation and classification on large datasets for which current association rule-based methods become computationally infeasible. BSTC also enjoys two other advantages over association rule-based classifiers: (i) BSTC is easy to use (requires no parameter tuning), and (ii) BSTC can easily handle datasets with any number of class types. Furthermore, in the process of developing BSTC we introduce a novel class of boolean association rules which have potential applications to other data mining problems.
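To make the two-phase scheme and its worst-case cost concrete, the following is a minimal sketch of a generic brute-force association rule classifier. It is not BSTC and not RCBT; it only illustrates why enumerating gene subsets is exponential. The function names `mine_rules` and `classify`, the `min_support` parameter, the support-weighted voting, and the toy data are all assumptions made for illustration, and expression values are assumed to be already discretized to "expressed / not expressed".

```python
from itertools import combinations

def mine_rules(samples, labels, min_support=2):
    """Phase (i): brute-force mining of rules 'gene set -> class'.

    samples: list of sets of expressed genes (boolean discretization assumed).
    Every non-empty subset of the n genes is tested, i.e. 2^n - 1 candidates.
    """
    genes = sorted(set().union(*samples))
    rules = []
    for size in range(1, len(genes) + 1):
        for itemset in combinations(genes, size):
            items = set(itemset)
            # class labels of the training samples that contain the whole gene set
            covered = [lab for s, lab in zip(samples, labels) if items <= s]
            # keep the rule only if enough samples match and they share one class
            if len(covered) >= min_support and len(set(covered)) == 1:
                rules.append((frozenset(items), covered[0], len(covered)))
    return rules

def classify(rules, query, default=None):
    """Phase (ii): each rule whose gene set the query contains votes with its support."""
    votes = {}
    for items, label, support in rules:
        if items <= query:
            votes[label] = votes.get(label, 0) + support
    return max(votes, key=votes.get) if votes else default

# Toy data (hypothetical): four samples over four genes, two classes.
samples = [{"g1", "g2"}, {"g1", "g2", "g3"}, {"g3", "g4"}, {"g4"}]
labels = ["tumor", "tumor", "normal", "normal"]
rules = mine_rules(samples, labels)
prediction = classify(rules, {"g1", "g2"})
```

With four genes the loop inspects only 15 candidate gene sets, but the count doubles with every additional gene, which is the blow-up the abstract refers to on datasets with thousands of genes.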
doi:10.1109/icde.2008.4497515 dblp:conf/icde/IwenLP08 fatcat:qfchwupjijgqfda3n3e2dv7s4q