Accuracy of structure-based sequence alignment of automatic methods

Changhoon Kim, Byungkook Lee
2007 BMC Bioinformatics  
Accurate sequence alignments are essential for homology searches and for building three-dimensional structural models of proteins. Since structure is better conserved than sequence, structure alignments have been used to guide sequence alignments and are commonly used as the gold standard for sequence alignment evaluation. Nonetheless, as far as we know, there is no report of a systematic evaluation of pairwise structure alignment programs in terms of sequence alignment accuracy.
Results In this study, we evaluate CE, DaliLite, FAST, LOCK2, MATRAS, SHEBA and VAST in terms of the accuracy of the sequence alignments they produce, using sequence alignments from NCBI's human-curated Conserved Domain Database (CDD) as the standard of truth. We find that, on average, 4 to 9% of the residues are either not aligned or aligned with more than 8 residues of shift error, and that an additional 6 to 14% of residues are misaligned by 1-8 residues, depending on the program and the data set used. The fraction of correctly aligned residues generally decreases as the sequence similarity decreases or as the RMSD between the Cα positions of the two structures increases. This fraction varies significantly across CDD superfamilies, whether or not shift error is allowed. Also, alignments with different shift errors occur between proteins within the same CDD superfamily, leading to inconsistent alignments between superfamily members. In general, residue pairs that are more than 3.0 Å apart in the reference alignment are heavily (>=25% on average) misaligned in the test alignments. In addition, each method shows a different pattern of relative weaknesses for different SCOP classes. CE gives relatively poor results for β-sheet-containing structures (all-β, α/β, and α+β classes), DaliLite for the "others" class, in which all but the four major classes are combined, and LOCK2 and VAST for the all-β and "others" classes.
Conclusions When the sequence similarity is low, structure-based methods produce better sequence alignments than methods that use sequence similarity alone. However, current structure-based methods still misalign 11-19% of the conserved core residues when compared to the human-curated CDD alignments. The alignment quality of each program depends on the protein structural type and similarity, with DaliLite showing the most agreement with CDD on average.
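The three categories reported above (correctly aligned, misaligned by 1-8 residues, and unaligned or shifted by more than 8 residues) can be illustrated with a minimal Python sketch. It is not the authors' evaluation code: the function name `classify_shift_errors`, the dictionary representation of an alignment, and the definition of shift error as the absolute difference in partner residue index are all assumptions made for illustration.

```python
def classify_shift_errors(reference, test, max_shift=8):
    """Compare a test alignment with a reference alignment.

    Both alignments are dicts mapping a residue index in protein A to
    the index of its aligned partner in protein B (an assumed
    representation). Returns the fraction of reference pairs in each
    category.
    """
    if not reference:
        raise ValueError("reference alignment is empty")

    counts = {"correct": 0, "shifted": 0, "missed": 0}
    for res_a, ref_partner in reference.items():
        test_partner = test.get(res_a)
        if test_partner is None:
            # Residue is not aligned at all in the test alignment.
            counts["missed"] += 1
            continue
        # Assumed definition: shift error is the offset, in residues,
        # between the test partner and the reference partner.
        shift = abs(test_partner - ref_partner)
        if shift == 0:
            counts["correct"] += 1
        elif shift <= max_shift:
            counts["shifted"] += 1
        else:
            counts["missed"] += 1

    total = len(reference)
    return {name: n / total for name, n in counts.items()}


# Hypothetical toy data: four reference pairs.
reference = {1: 1, 2: 2, 3: 3, 4: 4}
test = {1: 1, 2: 3, 3: 13}  # residue 4 is left unaligned
fractions = classify_shift_errors(reference, test)
print(fractions)
```

In this toy case one pair is correct, one is shifted by a single residue, and two count as missed (one shifted by 10 residues, one unaligned). Evaluating only over reference pairs mirrors the use of a curated alignment as the standard of truth, although the handling of residues outside the reference is a design choice this sketch does not address.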
Background Accurate sequence alignments for homologous proteins are essential for constructing accurate motifs and profiles, which are used in motif- or profile-based protein function search models [1-3] and in building homology models [4, 5]. When sequence similarity is low, however, it is difficult to obtain the correct sequence alignment based on sequence similarity alone [3, 4]. Since it is well known that proteins can have similar structures even in the absence of any detectable sequence similarity, structural alignments have been used to guide sequence alignments and are used as the gold standard for sequence alignment evaluation [5, 6]. Many pairwise structure alignment programs have been developed, but their performance has often been measured by how well the programs reproduce an expert-curated structure classification, such as SCOP or CATH [7, 8]. It has been shown that some programs do not produce high-quality individual alignments, as measured by geometric match measures such as SAS or GSAS, even when they perform well in classification tests [9]. It is also known that structure-based sequence alignments produced by different programs can differ even when the superimposed structures are similar [4, 5, 10-12]. Nonetheless, as far as we know, there is no report of a systematic evaluation of commonly used structural alignment programs in terms of sequence alignment accuracy, perhaps because it has been difficult to find a fully human-curated and reasonably difficult reference alignment set [13, 14].
Abbreviations Program names: CE, Combinatorial Extension; DaliLite, standalone version of DALI (Distance mAtrix ALIgnment); DSSP, Definition of Secondary Structure of Proteins given a set of 3D coordinates; FAST, Recursive acronym for FAST Alignment and Search Tool; FASTA3, DNA and Protein sequence alignment software package; LOCK2, Improvements over LOCK (Hierarchical protein structure superposition); MATRAS, MArkovian TRAnsition of protein Structure; SHEBA, Structural Homology by Environment-Based
doi:10.1186/1471-2105-8-355 pmid:17883866 pmcid:PMC2039753 fatcat:bt5mfakv4jdhjjtis6c263n7mi