Segmenting the human genome based on states of neutral genetic divergence

Prabhani Kuruppumullage Don, Guruprasad Ananda, Francesca Chiaromonte, Kateryna D. Makova
2013 Proceedings of the National Academy of Sciences of the United States of America  
Many studies have demonstrated that divergence levels generated by different mutation types vary and covary across the human genome. To improve our still-incomplete understanding of the mechanistic basis of this phenomenon, we analyze several mutation types simultaneously, anchoring their variation to specific regions of the genome. Using hidden Markov models on insertion, deletion, nucleotide substitution, and microsatellite divergence estimates inferred from human-orangutan alignments of
neutrally evolving genomic sequences, we segment the human genome into regions corresponding to different divergence states, each uniquely characterized by specific combinations of divergence levels. We then parse the mutagenic contributions of various biochemical processes, associating divergence states with a broad range of genomic landscape features. We find that high divergence states inhabit guanine- and cytosine (GC)-rich, highly recombining subtelomeric regions; low divergence states cover inner parts of autosomes; chromosome X forms its own state with the lowest divergence; and a state of elevated microsatellite mutability is interspersed across the genome. These general trends are mirrored in human diversity data from the 1000 Genomes Project, and departures from them highlight the evolutionary history of primate chromosomes. We also find that genes and noncoding functional marks [annotations from the Encyclopedia of DNA Elements (ENCODE)] are concentrated in high divergence states. Our results provide a powerful tool for biomedical data analysis: segmentations can be used to screen personal genome variants, including those associated with cancer and other diseases, and to improve computational predictions of noncoding functional elements.
Whole-genome sequencing studies have demonstrated that divergence estimates for several mutation types (e.g., nucleotide substitutions, insertions, and deletions) vary substantially across the human genome. This phenomenon has been studied at various genomic scales and evolutionary distances (reviewed in ref. 1) and, although initially of interest solely to evolutionary biologists, is now entering the purview of mainstream biomedical research. Specifically, human population (e.g., ref. 2) and cancer (3, 4) genome resequencing projects have revealed that incidences of single nucleotide polymorphisms (SNPs), insertions and deletions (indels), and copy number variants (CNVs) vary across the genome.
Divergence estimates for different mutation types also covary across the genome (5, 6). For example, substitution rates increase in regions with high indel rates (7), suggesting that regional variation is an important and general characteristic of mutations. Variation in divergence is often linked to genomic landscape features such as base composition, replication timing, and recombination rates (1). For instance, nucleotide substitution rates are elevated in late-replicating regions because of an accumulation of single-stranded DNA susceptible to endogenous damage (8) and are affected by chromatin structure (9) and recombination as a result of either biased gene conversion (BGC) (10) or the mutagenicity of recombination (2). Moreover, nucleotide substitution rates depend nonlinearly on guanine and cytosine (GC) content (11) and are affected by methylation levels and GC content at cytosine-phosphate-guanine (CpG) sites (12) and by replication timing and distance to telomeres at non-CpG sites (13). Covariation in divergence among rates of different mutation types can also be at least partly attributed to the influence of a common genomic landscape (5). Importantly, we note that, whereas selection may indeed operate in noncoding regions, it is unlikely to explain the large degree of variation and covariation in divergence estimates computed from putatively neutral sequences (1, and references therein). Divergence computed for neutral DNA ought to reflect mutation, BGC, and, for relatively distant species, only a minimal amount of diversity. Anchoring variation and covariation of divergence estimates for different mutation types to specific regions of the genome is crucial for elucidating how biochemical processes (e.g., replication and recombination) drive mutagenesis (8, 10) and for understanding genome evolution.
Such a "geographic" characterization may correlate with the spatial distribution of genes; for instance, cellular receptor and housekeeping genes tend to locate, respectively, in high and low nucleotide substitution rate regions (14). It may also aid prediction of noncoding functional elements (15, 16) and studies of the genetic basis of disease. For instance, it could assist in (i) discerning whether a locus exhibits an excess of mutations because it resides in a hotspot, thus preventing false positive associations with a disease; and (ii) identifying loci with mutational signatures typical of a disease, e.g., explaining frequent coincidence in tumors of sites prone to DNA damage and chromosomal instability (17). With this motivation, we used hidden Markov models (HMMs) (18), a well-established statistical tool, to analyze human divergence for different mutation types. An HMM models a sequence of observations as governed by underlying states that are not directly observable (hidden) but can be inferred based on the data. These states alternate along the sequence following a Markovian structure, i.e., the state governing a given observation may depend on the state governing the preceding observation.
Significance
In addition to a significant contribution to our understanding of the intricacies of mutagenesis, this study provides a powerful platform for mining biomedical data, which we make publicly available through the University of California Santa Cruz Genome Browser and the Galaxy portal. The divergence states we characterize serve as local background to benchmark signals used in computational algorithms for prediction of noncoding functional elements and in screening variants from cancer and other disease-affected genomes.
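To make the HMM description in the introduction concrete, the following is a minimal sketch of how hidden states can be inferred from a sequence of per-window divergence values. It is an illustration only, not the authors' pipeline: it assumes a single divergence measure (the study analyzes four mutation types jointly), two hypothetical states named "low" and "high", Gaussian emissions, and made-up parameter values and data. Viterbi decoding is used here as one standard way to recover the most probable state path; the text above does not say which decoding method the study used.

```python
import math


def gaussian_logpdf(x, mean, sd):
    """Log density of a normal distribution with the given mean and sd at x."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)


def viterbi(obs, states, log_start, log_trans, emit_params):
    """Return the most probable hidden-state path for the observations."""
    # best[t][s] is the log probability of the best path ending in state s at t.
    best = [{s: log_start[s] + gaussian_logpdf(obs[0], *emit_params[s])
             for s in states}]
    back = []  # back[t - 1][s] is the best predecessor of state s at position t
    for x in obs[1:]:
        scores, pointers = {}, {}
        for s in states:
            # Markov property: only the preceding state matters.
            prev = max(states, key=lambda p: best[-1][p] + log_trans[p][s])
            pointers[s] = prev
            scores[s] = (best[-1][prev] + log_trans[prev][s]
                         + gaussian_logpdf(x, *emit_params[s]))
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state.
    state = max(states, key=lambda s: best[-1][s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Hypothetical toy model: all names and numbers below are assumptions.
states = ["low", "high"]
log_start = {"low": math.log(0.5), "high": math.log(0.5)}
log_trans = {
    "low": {"low": math.log(0.9), "high": math.log(0.1)},
    "high": {"low": math.log(0.1), "high": math.log(0.9)},
}
emit_params = {"low": (0.02, 0.005), "high": (0.04, 0.005)}  # (mean, sd)

# Made-up divergence values for eight consecutive genomic windows.
windows = [0.021, 0.019, 0.023, 0.041, 0.039, 0.043, 0.020, 0.018]
segmentation = viterbi(windows, states, log_start, log_trans, emit_params)
print(segmentation)
```

Runs of identical labels in `segmentation` correspond to contiguous genomic segments assigned to one divergence state. In a real analysis the transition and emission parameters would be estimated from the data rather than fixed by hand, and the emissions would be multivariate so that each state captures a combination of divergence levels across mutation types.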
doi:10.1073/pnas.1221792110 pmid:23959903 pmcid:PMC3767554 fatcat:jsrrdrajtvetrbovkgmdxk3lse