Using evolutionary and structural information to predict DNA-binding sites on DNA-binding proteins

Igor B. Kuznetsov, Zhenkun Gou, Run Li, Seungwoo Hwang
2006 Proteins: Structure, Function, and Bioinformatics  
Proteins that interact with DNA are involved in a number of fundamental biological activities such as DNA replication, transcription, and repair. Reliable identification of DNA-binding sites in DNA-binding proteins is important for functional annotation, site-directed mutagenesis, and modeling protein-DNA interactions. We apply Support Vector Machine (SVM), a supervised pattern recognition method, to predict DNA-binding sites in DNA-binding proteins using the following features: amino acid sequence, profile of evolutionary conservation of sequence positions, and low-resolution structural information. We use a rigorous statistical approach to study the performance of predictors that utilize different combinations of features and how this performance is affected by structural and sequence properties of proteins. Our results indicate that an SVM predictor based on a properly scaled profile of evolutionary conservation in the form of a position-specific scoring matrix (PSSM) significantly outperforms a PSSM-based neural network predictor. The highest accuracy is achieved by an SVM predictor that combines the profile of evolutionary conservation with low-resolution structural information. Our results also show that knowledge-based predictors of DNA-binding sites perform significantly better on proteins from the mainly-α structural class and that the performance of these predictors is significantly correlated with certain structural and sequence properties of proteins. These observations suggest that it may be possible to assign a reliability index to the overall accuracy of the prediction of DNA-binding sites in any given protein using its sequence and structural properties. A web-server implementation of the predictors is freely available online at dp-bind/. Proteins 2006;64:19-27.

Abbreviations: Percentage α-helix, percentage of positions in α-helical conformation (H, according to the DSSP assignment); percentage β-strand, percentage of positions in β-strand conformation (E, according to the DSSP assignment); percentage coil, percentage of positions in coil conformation (everything other than helix and sheet); percentage abrupt turn, percentage of positions in abrupt turns (S, according to the DSSP assignment). Each cell shows the Spearman rank correlation between a particular property given by the row name and the accuracy of a particular SVM predictor given by the column name.
A correlation is considered to be highly significant if its p value is less than 0.05/7 (0.0071), a Bonferroni-adjusted α-level for seven tests. This adjustment takes into account the fact that for the same SVM predictor we compute correlations with seven different properties. A correlation is considered to be not significant if the p value is greater than 0.05. Highly significant differences are shown in boldface type. All other notation is the same as in Table II. Results for balanced test datasets.
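
The significance thresholds in the note above can be illustrated with a short sketch. This is not the authors' code: the function names are hypothetical, the "intermediate" label for p values between the two stated thresholds is an assumption (the note does not name that range), and the Spearman coefficient is computed in the standard way, as the Pearson correlation of average ranks.

```python
def bonferroni_alpha(alpha, n_tests):
    """Per-test significance level after Bonferroni adjustment."""
    return alpha / n_tests


def classify_p_value(p, alpha=0.05, n_tests=7):
    """Label a p value using the two thresholds stated in the table note."""
    if p < bonferroni_alpha(alpha, n_tests):
        return "highly significant"
    if p > alpha:
        return "not significant"
    # Assumption: the note gives no name for p values in between.
    return "intermediate"


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

With the default arguments, `bonferroni_alpha(0.05, 7)` gives the 0.0071 threshold quoted in the note. The sketch stops at the coefficient itself; obtaining a p value for it would require a statistics library or a permutation test.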
doi:10.1002/prot.20977 pmid:16568445 fatcat:dhcxoldce5agtgmtvnx53qhzda