The PEDANT genome database

D. Frishman
2003 Nucleic Acids Research  
The PEDANT genome database (http://pedant.gsf.de) provides exhaustive automatic analysis of genomic sequences by a large variety of established bioinformatics tools through a comprehensive Web-based user interface. One hundred and seventy-seven completely sequenced and unfinished genomes have been processed so far, including recently published large eukaryotic genomes (mouse, human). In this contribution, we describe the current status of the PEDANT database and novel analytical features added
to the PEDANT server in 2002. These include: (i) integration with the BioRS™ data retrieval system, which allows fast text queries; (ii) pre-computed sequence clusters in each complete genome; (iii) a comprehensive set of tools for genome comparison, including genome comparison tables and protein function prediction based on genomic context; and (iv) computation and visualization of protein-protein interaction (PPI) networks based on experimental data. The availability of functional and structural predictions for 650 000 genomic proteins in well-organized form makes PEDANT a useful resource for both functional and structural genomics.

OVERVIEW AND STATUS OF THE PEDANT DATABASE IN 2003

When the first version of the PEDANT genome database was launched in 1996 (1), it provided a computational analysis of the first five completely sequenced genomes available at that time, using a limited set of algorithms and with results stored as static HTML pages. In the past seven years, the PEDANT genome analysis software has matured (2): it is now based on an efficient relational database schema compatible with both the MySQL™ and Oracle™ database management systems, employs a broad range of modern bioinformatics methods to analyze sequence data, and offers an extensive user interface. In parallel, the database content has grown explosively, following the fast pace of genome sequencing projects.

However, the main concept of the database has not changed since the first day of its existence. Since in-depth manual annotation of all genomic sequences pouring into the databases is virtually impossible, our goal has been to provide exhaustive functional and structural characterization of publicly available genomes by automatic means in a timely fashion.
Being fully aware of the pitfalls of automatic sequence analysis (3), we use reasonably stringent recognition parameters to avoid excessive false-positive rates. At the same time, we not only provide search and prediction results in digested form but also store the raw output of the bioinformatics methods, enabling the annotator or biologist using the database to make their own judgement on the significance of the results presented. At the time of writing, a total of 177 genomes are available on-line. The database consists of three major sections:

1. Genomes which undergo careful in-depth analysis by the MIPS biologists using the subsystem for manual annotation available in the PEDANT software suite. This section currently includes Neurospora crassa, Thermoplasma acidophilum, and Arabidopsis thaliana.
2. Completely sequenced and published genomes. The main source of sequence data for this section, including DNA contigs and ORF nomenclature, is the genomes division of GenBank (4), although in some cases we obtain data directly from sequencing centres. Whenever possible we use data manually curated by NCBI staff (ftp://ftp.ncbi.nih.gov/genomes/Bacteria). If a curated version is not available, the original data as submitted by the authors (ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria) are processed. This section contains 5 eukaryotic, 84 eubacterial, and 16 archaebacterial datasets.
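
The policy described above, showing only hits that pass a stringent threshold while keeping the raw program output for later inspection, can be illustrated with a minimal sketch. This is not the PEDANT schema: the table name `hit`, its columns, the helper functions, and the E-value cutoff of 1e-5 are all hypothetical, and Python's built-in `sqlite3` module stands in for the MySQL or Oracle back ends mentioned in the text.

```python
import sqlite3

# Hypothetical cutoff; the text only says "reasonably stringent".
EVALUE_CUTOFF = 1e-5


def create_schema(conn):
    """Create one illustrative table holding a hit and its raw output."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS hit (
            protein_id TEXT NOT NULL,  -- ORF or protein identifier
            method     TEXT NOT NULL,  -- tool that produced the hit
            target     TEXT NOT NULL,  -- matched database entry
            evalue     REAL NOT NULL,  -- significance reported by the tool
            raw_output TEXT NOT NULL   -- unprocessed tool output, always kept
        )
        """
    )


def store_hit(conn, protein_id, method, target, evalue, raw_output):
    """Store every hit, significant or not, together with its raw output."""
    conn.execute(
        "INSERT INTO hit VALUES (?, ?, ?, ?, ?)",
        (protein_id, method, target, evalue, raw_output),
    )


def digested_hits(conn, protein_id, cutoff=EVALUE_CUTOFF):
    """Return only the hits that pass the cutoff, best hit first."""
    cursor = conn.execute(
        "SELECT method, target, evalue FROM hit "
        "WHERE protein_id = ? AND evalue <= ? ORDER BY evalue",
        (protein_id, cutoff),
    )
    return cursor.fetchall()


def raw_outputs(conn, protein_id):
    """Return the raw output of all hits, regardless of the cutoff."""
    cursor = conn.execute(
        "SELECT raw_output FROM hit WHERE protein_id = ? ORDER BY evalue",
        (protein_id,),
    )
    return [row[0] for row in cursor]
```

In this sketch the cutoff is applied when reading rather than when writing, so a user who doubts the threshold can still retrieve the weaker hits through `raw_outputs` and judge them directly.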
doi:10.1093/nar/gkg005 pmid:12519983 pmcid:PMC165452 fatcat:iimwfhvp7ncwppddguseu4z7km