Detection of Underrepresented Biological Sequences Using Class-Conditional Distribution Models [chapter]

Slobodan Vucetic, Dragoljub Pokrajac, Hongbo Xie, Zoran Obradovic
2003 Proceedings of the 2003 SIAM International Conference on Data Mining  
A labeled sequence data set related to a given biological property is often biased and therefore does not fully capture the diversity of that property in nature. To reduce this sampling bias, a data mining procedure is proposed for detecting underrepresented relevant sequences. The procedure is aimed at helping domain experts achieve a cost-effective qualitative enlargement of knowledge through an in-depth study of a small number of statistically underrepresented and functionally interesting ... sequences. Our procedure consists of: (i) learning a class-conditional distribution model on each class of labeled data; (ii) applying the models to select statistically underrepresented unlabeled sequences; and (iii) automatically evaluating their interestingness. The approach is illustrated on the important problem of enlarging the data set of confirmed disordered proteins. The results demonstrate the promise of the proposed approach for efficiently reducing sampling bias in biological databases.
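Steps (i) and (ii) of the procedure can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which distribution model is used, so the sketch assumes a smoothed amino-acid composition model per class, and it treats a sequence as underrepresented when even its best-fitting class model assigns it a low average log-likelihood. The function names, the pseudocount, and the toy sequences are all hypothetical. Step (iii) is omitted because the abstract does not describe how interestingness is evaluated.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def fit_composition_model(sequences, pseudocount=1.0):
    """Step (i), assumed form: smoothed residue log-probabilities for one class."""
    counts = Counter()
    for seq in sequences:
        counts.update(ch for ch in seq if ch in ALPHABET)
    total = sum(counts.values()) + pseudocount * len(ALPHABET)
    return {a: math.log((counts[a] + pseudocount) / total) for a in ALPHABET}


def avg_log_likelihood(model, seq):
    """Average per-residue log-likelihood of seq under one class model."""
    residues = [ch for ch in seq if ch in model]
    if not residues:
        return float("-inf")
    return sum(model[ch] for ch in residues) / len(residues)


def select_underrepresented(models, unlabeled, top_k=1):
    """Step (ii), assumed criterion: lowest best-class score first."""
    scored = [
        (max(avg_log_likelihood(m, seq) for m in models.values()), seq)
        for seq in unlabeled
    ]
    scored.sort(key=lambda pair: pair[0])
    return [seq for _, seq in scored[:top_k]]


# Toy data, invented for illustration only.
labeled = {
    "ordered": ["ACACAC", "AACCAA"],
    "disordered": ["EEKKEE", "KEKEKE"],
}
models = {label: fit_composition_model(seqs) for label, seqs in labeled.items()}
candidates = select_underrepresented(models, ["ACCA", "KEEK", "WWWW"], top_k=1)
```

In this toy setting the tryptophan-only sequence is ranked first, since neither class model has seen that residue. A realistic model would need richer features than residue composition, and the selection threshold would have to be chosen with the domain expert's study budget in mind.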
doi:10.1137/1.9781611972733.30 dblp:conf/sdm/VuceticPXO03 fatcat:xdetkknnafbulexarogp7pozpi