Identification of repeat structure in large genomes using repeat probability clouds

Wanjun Gu, Todd A. Castoe, Dale J. Hedges, Mark A. Batzer, David D. Pollock
2008 Analytical Biochemistry  
The identification of repeat structure in eukaryotic genomes can be time-consuming and difficult because of the large amount of information (~3 × 10^9 bp) that needs to be processed and compared. We introduce a new approach based on exact word counts to evaluate, de novo, the repeat structure present within large eukaryotic genomes. This approach avoids sequence alignment and similarity search, two of the most time-consuming components of traditional methods for repeat identification.
... were implemented to efficiently calculate exact counts for any length oligonucleotide in large genomes. Based on these oligonucleotide counts, oligonucleotide excess probability clouds, or "P-clouds," were constructed. P-clouds are composed of clusters of related oligonucleotides that occur, as a group, more often than expected by chance. After construction, P-clouds were mapped back onto the genome, and regions of high P-cloud density were identified as repetitive regions based on a sliding window approach. This efficient method is capable of analyzing the repeat content of the entire human genome on a single desktop computer in less than half a day, at least 10-fold faster than current approaches. The predicted repetitive regions strongly overlap with known repeat elements as well as other repetitive regions such as gene families, pseudogenes, and segmental duplicons. This method should be extremely useful as a tool for use in de novo identification of repeat structure in large newly sequenced genomes.

© 2008 Elsevier Inc. All rights reserved.

Eukaryotic genomes contain many repetitive sequences, and understanding genome structure depends crucially on their identification [1-3]. The predominant repeat annotation approach, implemented in RepeatMasker [4], focuses on the identification of repeat element sequences based on their alignment with consensus sequences and relies on a curated library of known repeat families provided by Repbase [5]. This approach is presumably most effective for the human genome, which has attracted the greatest interest and the longest curation history, whereas the necessary libraries for more recently sequenced genomes may be substantially less complete or nonexistent. It is unknown how effective this common approach is overall, however, because there is no "gold standard" to determine the proportion of true repeats that have been identified, and this approach has simply been implemented on an ad hoc basis.
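The counting and sliding-window steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it skips P-cloud construction entirely and simply treats any oligo whose exact count reaches a threshold as over-represented. The function names and the parameters `min_count`, `window`, and `min_density` are hypothetical, chosen only for illustration.

```python
from collections import Counter


def count_oligos(seq, k):
    """Return exact counts of every overlapping oligo of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def repeat_regions(seq, k, min_count, window, min_density):
    """Flag regions where over-represented oligos are dense.

    Assumption: an oligo counts as "repeated" when it occurs at least
    min_count times. This stands in for membership in a P-cloud, which
    the paper builds from clusters of related oligos.
    Returns merged half-open (start, end) intervals in base coordinates.
    """
    counts = count_oligos(seq, k)
    n = len(seq) - k + 1  # number of oligo start positions
    # hits[i] is 1 when the oligo starting at position i is over-represented
    hits = [1 if counts[seq[i:i + k]] >= min_count else 0 for i in range(n)]

    regions = []
    running = sum(hits[:window])  # hit count in the first window
    for start in range(n - window + 1):
        if start > 0:
            # slide the window one position to the right
            running += hits[start + window - 1] - hits[start - 1]
        if running / window >= min_density:
            lo, hi = start, start + window + k - 1
            if regions and lo <= regions[-1][1]:
                # overlaps or touches the previous region: extend it
                regions[-1] = (regions[-1][0], hi)
            else:
                regions.append((lo, hi))
    return regions
```

For example, `repeat_regions("ACGT" * 6 + "TTGCAAGC", k=4, min_count=2, window=5, min_density=1.0)` returns `[(0, 24)]`, covering the tandem `ACGT` run and excluding the unique tail. A real genome-scale tool would need a far more memory-efficient counting structure than an in-memory `Counter`.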
Methods for the de novo analysis of repeat structure have also been developed to annotate repeat elements in newly sequenced genomes independent of an a priori established repeat library. Such approaches have been implemented in RepeatFinder [6], RECON [7], RepeatScout [8], and PILER [9]. These methods essentially construct a repeat library by assembling genome alignments and use sequence similarity searches to annotate repeat elements in the genome (analogous to RepeatMasker). All require extensive computational effort and/or capability that limit the ability of individual genomic researchers to extensively investigate repeat structure, particularly for mammalian and other large genomes [10]. Repeat structure in large genomes has been analyzed without first constructing consensus repeat family sequences [11, 12], including the use of oligonucleotide (hereafter "oligo") or l-mer similarity, rather than sequence similarity [13, 14], and analytical counting methods such as RAP [15] and the method of Healy and coworkers [16]. There has been some statistical evaluation of oligo-based repeat region identification using these methods [15, 16], but no comprehensive genomic annotation approaches have been developed for oligo-based repeat analysis.

Here we describe the implementation of a new approach for the identification of repetitive regions of large genomes using oligo frequencies. Our goal was to develop a fast algorithm for de novo identification of repeated structures applicable to entire eukaryotic genomes that could be reasonably implemented using existing desktop computers. The resulting approach is computationally efficient for analyzing large genomes and is effective at identifying repeat elements. The principal novelty behind our approach arises from the realization that repetitive elements are likely to have given rise to clusters of similar oligos and that it may be statistically
doi:10.1016/j.ab.2008.05.015 pmid:18541131 pmcid:PMC2533575 fatcat:6nm2pu7y3nb6nhenrpu6jk2yfy