Drosophila Genomic Sequence Annotation Using the BLOCKS+ Database

J. G. Henikoff
2000 Genome Research  
A simple and general homology-based method for gene finding was applied to the 2.9-Mb Drosophila melanogaster Adh region, the target sequence of the Genome Annotation Assessment Project (GASP). Each strand of the entire sequence was used as a query of the BLOCKS+ database of conserved regions of proteins. This led to functional assignments for more than one-third of the genes and two-thirds of the transposons. Considering the enormous size of the query, the fact that only two false-positive ... s were reported emphasizes the high selectivity of protein family-based methods for gene finding. We used the search results to improve BLOCKS+ by identifying compositionally biased blocks. Our results confirm that protein family databases can be used effectively in automated sequence annotation efforts.
Sequence similarity searches for detecting protein relationships have become so popular that one method for doing this is now familiarly described by the verb "to blast." Detecting a hit in a sequence data bank is frequently the best clue as to the function of a gene, so that sequence similarity searching is de rigueur for any genomic annotation effort. A routine annotation strategy is to first arrive at a gene model (Fields and Soderlund 1990), translate it into protein, then use the predicted protein as a query of sequence data banks (Pearson and Lipman 1988; Altschul et al. 1990). Most entrants in the GASP (Genome Annotation Assessment Project) study attempted to find accurate gene models, and their success in doing this is the basis for assessment of their performance (Reese et al. 2000a). Some methods used sequence similarity searches of cDNA databases to aid in predicting accurate gene models. Another method (GeneWise) screened gene models against a protein family database. Our method differs in that we dispensed entirely with the gene modeling step, using the full genomic segment to query a protein family database. The rationale is that protein sequence is so rich in information that even this simple approach will be sufficiently sensitive to find the genes and assign functions to them.
Our method is nearly a decade old. Using protein queries to search DNA databases translated in all six frames, which was introduced 12 years ago (Henikoff and Wallace 1988; Pearson and Lipman 1988), has since become a standard procedure, especially for searching EST databases (Adams et al. 1991).
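As an illustration of what translation "in all six frames" involves, the following minimal Python sketch translates a nucleotide sequence in three reading frames on each strand using the standard genetic code. The function names (`reverse_complement`, `translate`, `six_frame_translations`) are hypothetical and chosen for this example; this is not the code used in the searches described here.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Map (strand, frame offset) to the translated protein string."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[(strand, offset)] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for key, protein in six_frame_translations("ATGGCCTAA").items():
        print(key, protein)
```

Stop codons are kept as `*` rather than used to cut the translation, which matches the point that a translated search makes no assumption about the presence of ORFs.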
Alternatively, a DNA query can be translated for searching protein sequence or protein family databases, such as the BLOCKS database (Henikoff and Henikoff 1991). Entries in the BLOCKS database are ungapped multiple alignments of conserved regions of proteins, averaging four blocks per protein family. In a search, detections of multiple blocks representing a family are combined into a hit. In a translated search, blocks are combined into a hit even when they are in different frames on the same strand. Earlier, we reported the detection of a Pseudomonas cepacia regulatory gene (dgdR) and protein family homology for dgdA within a 4-kb genomic segment used as query (Henikoff and Henikoff 1991); both had been missed because of frameshift sequencing errors. This example emphasized the fact that translated searching allows for gene detection and family assignment without requiring assumptions as to the presence of ORFs or the accuracy and completeness of the sequence used as query.
With the release of the first complete chromosome sequence, Saccharomyces cerevisiae chromosome III (Oliver et al. 1992), we applied this fully automated method to a >300-kb genomic segment (Henikoff and Henikoff 1994). Each frame of the entire sequence was used to search a 1992 version of the BLOCKS database, and the results for each strand were combined to make gene predictions. We found 37 significant hits, of which 34 were genes discovered by others, 1 was a new gene not detected by others, and 2 were judged to be false positives. This number of hits represented only 40% of what could be found using pairwise approaches, an expected result considering the low coverage of the 1992 BLOCKS database relative to what was available in sequence data banks. When we repeated the search on a 1993 version of the BLOCKS database, 10 more genes were found, a consequence of expansion of the BLOCKS database from 504 to 619 protein families (Henikoff and Henikoff 1994).
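The rule that blocks from one family are combined into a hit even when they lie in different frames on the same strand can be sketched as follows. This is a simplified illustration under stated assumptions, not the scoring procedure of the BLOCKS searcher: the `BlockHit` record, the `combine_hits` function, the `min_blocks` threshold, and the greedy ordering check are all hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class BlockHit(NamedTuple):
    """One block detected in a translated query (hypothetical record layout)."""
    family: str   # protein family the block belongs to
    block: int    # order of the block within its family (1, 2, 3, ...)
    strand: str   # "+" or "-"
    frame: int    # reading-frame offset 0, 1 or 2 on that strand
    start: int    # nucleotide position along the strand
    score: float  # block alignment score


def combine_hits(hits, min_blocks=2):
    """Group block hits by (family, strand), ignoring frame.

    Within each group, hits are sorted by position and kept greedily only if
    their block order increases along the strand. Groups with fewer than
    `min_blocks` ordered blocks are dropped (an assumed threshold).
    """
    grouped = defaultdict(list)
    for hit in hits:
        grouped[(hit.family, hit.strand)].append(hit)

    combined = {}
    for key, group in grouped.items():
        group.sort(key=lambda h: h.start)
        chain = []
        for hit in group:
            if not chain or hit.block > chain[-1].block:
                chain.append(hit)
        if len(chain) >= min_blocks:
            combined[key] = chain
    return combined


if __name__ == "__main__":
    example = [
        BlockHit("kinase", 1, "+", 0, 100, 9.0),
        BlockHit("kinase", 2, "+", 2, 400, 8.5),  # shifted frame, same strand
        BlockHit("zinc", 1, "-", 1, 50, 7.0),     # lone block, not combined
    ]
    for key, chain in combine_hits(example).items():
        print(key, [(h.block, h.frame) for h in chain])
```

Because the frame is not part of the grouping key, a frameshift between two blocks (from a sequencing error or an intron) does not prevent them from supporting the same family assignment.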
At the time of the GASP study (June 1999), the BLOCKS database had increased to >2000 protein families. Most of the increase is due to supplementation of the original BLOCKS database, which is based on fami-
1 Corresponding author. E-MAIL steveh@fhcrc.org; FAX (206) 667-5889.
doi:10.1101/gr.10.4.543 pmid:10779495 pmcid:PMC310867 fatcat:wfdq6bhrkrcj7acyzaobfvjlai