Ten Years of Maintaining and Expanding a Microbial Genome and Metagenome Analysis System

Victor M. Markowitz, I-Min A. Chen, Ken Chu, Amrita Pati, Natalia N. Ivanova, Nikos C. Kyrpides
2015 Trends in Microbiology  
Launched in March 2005, the Integrated Microbial Genomes (IMG) system is a comprehensive data management system that supports multi-dimensional comparative analysis of genomic data. At the core of the IMG system is a data warehouse that contains genome and metagenome datasets sequenced at the Joint Genome Institute or provided by scientific users, as well as public genome datasets available at the National Center for Biotechnology Information GenBank sequence data archive.
Genomes and metagenome datasets are processed using IMG's microbial genome and metagenome sequence data processing pipelines and are integrated into the data warehouse using IMG's data integration toolkits. Microbial genome and metagenome application-specific data marts and user interfaces provide access to different subsets of IMG's data and analysis toolkits. This paper revisits IMG's original aims, highlights key milestones reached by the system during the past ten years, and discusses the main challenges faced by a rapidly expanding system, in particular the complexity of maintaining such a system in an academic setting with limited budgets and computing and data management infrastructure.

© 2015. This manuscript version is made available under the Elsevier user license http://www.elsevier.com/open-access/userlicense/1.0/

Rationale for developing IMG

The U.S. Department of Energy's Joint Genome Institute (JGI) sequences genomes of isolate microbial organisms and aggregate genomes of microbial communities, also known as metagenomes (see Glossary). Similar to other sequencing centers, such as the J. Craig Venter Institute, the Wellcome Trust Sanger Institute, the Broad Institute, and Washington University in St. Louis, JGI employs its native structural and functional annotation pipelines for processing the 'raw' genome sequences before they are deposited into public genomic data archives such as GenBank at the National Center for Biotechnology Information (NCBI) and the European Nucleotide Archive (ENA) at the European Bioinformatics Institute (EBI).

The development of the Integrated Microbial Genomes (IMG) system started in May 2004 with the goal of providing a data management system supporting the comparative analysis of genomes sequenced at JGI [1].
In addition, IMG provided tools for reviewing the quality of annotations of the JGI genomes, thus serving as an extension of, and a feedback mechanism for, JGI's microbial genome annotation pipeline.

When IMG was launched in March 2005, the sequencing of environmental samples at JGI and other sequencing centers worldwide was starting to produce microbial community aggregate genome (metagenome) datasets. Although methods for processing metagenome sequence data were in their infancy at that time, it was clear that metagenome analysis needed to be conducted in the context of a large set of reference isolate genomes. Consequently, a metagenome-specific version of IMG, tightly coupled with JGI's metagenome annotation pipeline, was developed in 2006 [2].

In its ten years of existence, IMG has expanded substantially in terms of content, analytical tools, and community of users. As of July 1, 2015, IMG had over 44,000 genome and metagenome datasets with a community of over 12,500 registered users from 93 countries. IMG has been used for teaching methods of comparative analysis in tens of colleges and universities worldwide, and has assisted scientists in conducting research published in thousands of papers.

We discuss below the evolution of IMG in terms of its data content and analytical capabilities (from the scientific user perspective), and the infrastructure it relies on (from the engineering perspective).

IMG data content

From its inception, IMG aimed at providing a rich comparative analysis context for examining genome and metagenome datasets, in particular in terms of their genes and functional annotations. Consequently, the content of IMG has been updated on a regular basis since 2005 (first quarterly, then monthly, and eventually bi-weekly) by incorporating into the system the genomes sequenced at JGI as well as all new public genomes published in NCBI's GenBank archive [3].
Core data

The first version of IMG, released in March 2005, contained about 300 genome datasets; the first 7 metagenome datasets were added in March 2006. Since 2006 the content of IMG has grown steadily, as illustrated in Figure 1. As of July 1, 2015, it contained 37,985 bacterial, archaeal, eukaryotic, and viral genome datasets, and 6,664 metagenome datasets from about 500 metagenome studies, with over 31 billion protein-coding genes.

A growing number of genomes have been generated using high-throughput sequencing of single amplified genomes [4] and from the assembly and binning of genomes from metagenomes [5]. IMG contains about 1,900 such single-cell genomes and 2,500 genomes extracted from metagenomes. Since such genome datasets are often contaminated with DNA introduced from the environmental sample or during the sequencing process, they undergo an automated quality control and decontamination process using a recently developed tool called ProDeGe (Protocol for fully automated Decontamination of Genomes) [6].

New omics data have been gradually included into IMG in order to assist in further examining the functions of genes. Proteomics and transcriptomics (RNA-Seq) datasets started to be included into IMG in 2009 and 2011, respectively, with IMG currently containing 90 proteomic datasets across 5 studies and 2,411 transcriptomic datasets across 32 studies. For genomes involved in protein expression and RNA-Seq studies, the experiments and samples are recorded in IMG together with experimental conditions, while the protein expression and read counts are associated with expressed genes. The organization and analysis of proteomic and transcriptomic data in IMG is discussed in [7-8].
Data consistency

The efficiency of microbial genome and metagenome comparative analysis depends on the coherence of annotations, whereby proteins with the same activity are assigned the same functional roles across genomes encoding them. Different functional assignments for the same gene may indicate that only one (or even none) of these assignments is correct. IMG assists users in identifying such genes by integrating data from different annotation sources and by providing tools for assessing the consistency of functional annotations [9]. IMG's annotation procedure (https://img.jgi.doe.gov/edu/doc/MGAandDI_SOP.pdf) attempts to assign every protein-coding gene to three types of sequence-similarity based protein families, namely COG
doi:10.1016/j.tim.2015.07.012 pmid:26439299 fatcat:lle62btblrbehdepp2x7wntyee