New techniques for extracting features from protein sequences

J. T. L. Wang, Q. Ma, D. Shasha, C. H. Wu
2001 IBM Systems Journal  
In this paper we propose new techniques to extract features from protein sequences. We then use the features as inputs for a Bayesian neural network (BNN) and apply the BNN to classifying protein sequences obtained from the PIR (Protein Information Resource) database maintained at the National Biomedical Research Foundation. To evaluate the performance of the proposed approach, we compare it with other protein classifiers built based on sequence alignment and machine learning methods.
Experimental results show the high precision of the proposed classifier and the complementarity of the bioinformatics tools studied in the paper.

As a result of the Human Genome Project and related efforts, DNA (deoxyribonucleic acid), RNA (ribonucleic acid), and protein data accumulate at an accelerating rate. Mining these biological data to extract useful knowledge is essential in genome processing. This subject has recently gained significant attention in the bioinformatics community [1-6]. We present here a case study in extracting features from protein sequences and using them together with a Bayesian neural network to classify the sequences.

The problem studied here can be stated formally as follows: given an unlabeled protein sequence S and a known superfamily F, we want to determine whether or not S belongs to F. (We refer to F as the target class and the set of sequences not in F as the nontarget class.) In general, a superfamily is a group of proteins that share similarity in structure and function. If the unlabeled sequence S is determined to belong to F, then one can infer the structure and function of S. This process is important in many aspects of bioinformatics and computational biology [7-9]. For example, in drug discovery, if sequence S is obtained from some disease X and it is determined that S belongs to the superfamily F, then one may try a combination of the existing drugs for F to treat the disease X.

There are several approaches available for protein sequence classification. One approach is to compare the unlabeled sequence S with the sequences in the target class and the sequences in the nontarget class using an alignment tool such as BLAST [10]. One then assigns S to the class containing the sequence best matching S.

The second approach to protein sequence classification is based on hidden Markov models (HMMs) [11]. The HMM method (e.g., SAM [12] and HMMer [13]) employs a machine-learning algorithm that uses probabilistic graphical models to describe time-series and sequence data. It was originally applied to speech recognition [14] and is now also applied to modeling and analyzing protein superfamilies. It is a generalization of the position-specific scoring matrix to include insertion and deletion states. Often, an HMM is built for each (super)family. One then scores the unlabeled sequence S with respect to the HMM of a (super)family [15]. If the score is more significant than a cut-off value, then S is regarded as a member of the (super)family.
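
The score-and-threshold step described above can be illustrated with a minimal sketch. It is not the procedure used by SAM or HMMer: it uses a toy two-state HMM over a four-letter alphabet rather than a profile HMM with match, insertion, and deletion states. All names, probabilities, the uniform background model, and the cut-off of 0 are assumptions made for illustration only. The score is the log-odds of the sequence under the family model (computed with the forward algorithm) against the background.

```python
import math

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    if m == float("-inf"):
        return m
    return m + math.log(sum(math.exp(v - m) for v in values))

def forward_log_likelihood(seq, start, trans, emit):
    """Return log P(seq | HMM) using the forward algorithm in log space."""
    if not seq:
        raise ValueError("sequence must be non-empty")
    states = list(start)
    # alpha[s] = log P(symbols seen so far, current state = s)
    alpha = {s: math.log(start[s]) + math.log(emit[s][seq[0]]) for s in states}
    for symbol in seq[1:]:
        alpha = {
            t: log_sum_exp([alpha[s] + math.log(trans[s][t]) for s in states])
            + math.log(emit[t][symbol])
            for t in states
        }
    return log_sum_exp(list(alpha.values()))

def log_odds_score(seq, model, background):
    """Log-odds of seq under the family HMM versus an i.i.d. background."""
    null = sum(math.log(background[c]) for c in seq)
    return forward_log_likelihood(seq, *model) - null

def is_member(seq, model, background, cutoff=0.0):
    """Regard seq as a family member if its score exceeds the cut-off."""
    return log_odds_score(seq, model, background) > cutoff

# Hypothetical toy family model: a "core" state that favors A and L,
# and a "loop" state that emits all four letters uniformly.
TOY_MODEL = (
    {"core": 0.8, "loop": 0.2},                      # start probabilities
    {"core": {"core": 0.9, "loop": 0.1},             # transition probabilities
     "loop": {"core": 0.3, "loop": 0.7}},
    {"core": {"A": 0.45, "L": 0.45, "G": 0.05, "K": 0.05},  # emissions
     "loop": {"A": 0.25, "L": 0.25, "G": 0.25, "K": 0.25}},
)
BACKGROUND = {"A": 0.25, "L": 0.25, "G": 0.25, "K": 0.25}

if __name__ == "__main__":
    for s in ("ALLALA", "GKGKKG"):
        print(s, round(log_odds_score(s, TOY_MODEL, BACKGROUND), 3),
              is_member(s, TOY_MODEL, BACKGROUND))
```

In practice the cut-off would be calibrated on sequences of known membership, and a separate model would be built per (super)family, as the text notes.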
doi:10.1147/sj.402.0426 fatcat:xv2veybu7jc6feudndhbttbeza