EasyCluster: a fast and efficient gene-oriented clustering tool for large-scale transcriptome data
ESTs and full-length cDNAs represent an invaluable source of evidence for inferring reliable gene structures and discovering potential alternative splicing events. In newly sequenced genomes, these tasks may not be practicable owing to the lack of appropriate training sets. However, when expression data are available, they can be used to build EST clusters related to specific genomic transcribed loci. Common strategies recently employed to this end are based on sequence similarity between
... ripts and can lead, in specific conditions, to inconsistent and erroneous clustering. To improve cluster building and facilitate all downstream annotation analyses, we developed a simple genome-based methodology that generates gene-oriented clusters of ESTs when a genomic sequence and a pool of related expressed sequences are provided. Our procedure has been implemented in the software EasyCluster and takes into account the spliced nature of ESTs after an ad hoc genomic mapping.

Methods: EasyCluster uses the well-known GMAP program to perform a very quick EST-to-genome mapping and to detect reliable splice sites. Given a genomic sequence and a pool of ESTs/FL-cDNAs, EasyCluster first builds local genomic and EST databases and then runs GMAP. Subsequently, it parses the results and creates an initial collection of pseudo-clusters by grouping ESTs according to the overlap of their genomic coordinates on the same strand. In the final step, EasyCluster refines the clustering by running GMAP again on each pseudo-cluster and grouping together ESTs that share at least one splice site.

Results: The higher accuracy of EasyCluster with respect to other clustering tools has been verified by means of a manually curated benchmark of human EST clusters. Additional datasets, including the Unigene cluster Hs.122986 and ESTs related to the human HOXA gene family, have also been used to demonstrate the better clustering capability of EasyCluster over current genome-based web service tools such as ASmodeler and BIPASS. EasyCluster has also been used to provide a first compilation of gene-oriented clusters in the oilseed plant Ricinus communis, for which no Unigene clusters are yet available, as well as an evaluation of alternative splicing in this plant species.
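
The two clustering steps described in the Methods (pseudo-clusters from overlapping same-strand coordinates, then refinement by shared splice sites) can be illustrated with a minimal Python sketch. It is not EasyCluster's actual code. It assumes the GMAP alignments have already been parsed into records; the `MappedEST` record, the function names, and the representation of a splice site as a single genomic position are hypothetical choices made for this example. Two further assumptions go beyond the text: sharing is treated as transitive (if A shares a site with B, and B with C, all three are grouped), and ESTs with no splice site are left as singletons.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class MappedEST:
    """Hypothetical record for one EST after genomic mapping."""
    name: str
    chrom: str
    strand: str                            # "+" or "-"
    start: int                             # leftmost genomic coordinate
    end: int                               # rightmost genomic coordinate
    splice_sites: frozenset = frozenset()  # genomic positions of splice sites


def pseudo_clusters(ests):
    """Step 1: group ESTs whose genomic spans overlap on the same strand."""
    by_locus = defaultdict(list)
    for est in ests:
        by_locus[(est.chrom, est.strand)].append(est)

    clusters = []
    for key in sorted(by_locus):
        current, current_end = [], None
        # Sweep left to right, extending the cluster while spans overlap.
        for est in sorted(by_locus[key], key=lambda e: (e.start, e.end)):
            if current and est.start <= current_end:
                current.append(est)
                current_end = max(current_end, est.end)
            else:
                if current:
                    clusters.append(current)
                current, current_end = [est], est.end
        if current:
            clusters.append(current)
    return clusters


def refine_by_splice_site(cluster):
    """Step 2: within one pseudo-cluster, merge ESTs sharing a splice site."""
    parent = {est.name: est.name for est in cluster}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_seen = {}  # splice site -> name of the first EST that carries it
    for est in cluster:
        for site in est.splice_sites:
            if site in first_seen:
                parent[find(est.name)] = find(first_seen[site])
            else:
                first_seen[site] = est.name

    groups = defaultdict(list)
    for est in cluster:
        groups[find(est.name)].append(est.name)
    return sorted(sorted(names) for names in groups.values())


if __name__ == "__main__":
    # Toy input (invented coordinates, for illustration only).
    ests = [
        MappedEST("est1", "chr1", "+", 100, 500, frozenset({200, 300})),
        MappedEST("est2", "chr1", "+", 250, 700, frozenset({300, 600})),
        MappedEST("est3", "chr1", "+", 450, 900, frozenset({800})),
        MappedEST("est4", "chr1", "-", 150, 400, frozenset({200})),
    ]
    for cluster in pseudo_clusters(ests):
        print(refine_by_splice_site(cluster))
```

In the toy input, `est3` overlaps `est1` and `est2` on the same strand and therefore joins their pseudo-clusters, but the refinement step separates it because it shares no splice site with them. This is the kind of case where coordinate overlap alone would merge transcripts that the splice-site criterion keeps apart.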