Forces Shaping the Fastest Evolving Regions in the Human Genome
We define a human diff as a base where (i) chimp, mouse, and rat have the same nucleotide, (ii) human has a different nucleotide, (iii) the ancestral consensus sequence is not in a CpG dinucleotide, and (iv) the chimp base is high quality (Phred score and in an 11 base window with no indels, no more than 2 human/chimp differences, and all Phred scores ). If the chimp base is not high quality, we refer to the base as a low quality human diff. The number of human diffs in an element can be
... d to the expected number if the element were evolving neutrally in the human lineage: 0.67 diffs per 100bp, based on a genome-wide estimate of P(human≠chimp | chimp=mouse=rat). Table S3 contains the distribution of the number of human diffs in our set of 96% conserved regions. Nearly 24% of the regions have at least one human diff. In addition, they contain between 0 and 6 human indel events (mean=0.38), affecting between 0 and 28 bases (mean=0.52). 30 ≥ 25 ≥ S1.2 Model for nucleotide evolution. The methods we use to detect substitution rate acceleration all make use of a fitted molecular evolutionary model. For this purpose, we use a general reversible singlenucleotide model (REV) 1 with parameters estimated from a genome-wide data set of evolutionarily conserved bases constituting approximately of the human genome. These sites were identified in an independent analysis of a 17 species multiple alignment of human, chimp, macaque, mouse, rat, rabbit, cow, dog, armadillo, elephant, tenrec, opossum, chicken, frog, fugu, tetraodon, and zebrafish using the methods described in Siepel et al. 3% 2 . Using the 17 species alignments and a topology for their phylogenetic tree, we globally estimate the free model parameters (rate matrix, branch lengths) using the phyloFit program 3 . Summaries of the estimated parameters are given in Figure S1 . We call this model the CONS model. The REV model is the most general model for nucleotide substitution subject to the time-reversibility constraint. This model implies a particular parameterization of the rate matrix for nucleotide substitutions, which has four nucleotide frequencies and five rate parameters. Other parameterizations of the substitution rate matrix could be considered. However, we selected the REV model because of its general applicability and do not expect the results to depend heavily on this choice. Note that for simplicity we use a model that assumes independence between bases. Employing a context dependent model would allow us to easily model substitution patterns over two or three adjacent bases, alleviating the need to remove CpG dinucleotides from the data set. However, a context dependent model might present problems with estimation in this setting. It remains to be shown whether employing a more heavily parameterized model will in fact provide more power to detect human-specific changes. Our model for nucleotide evolution could be modified to include indels, allowing us to utilize rather than discard information from alignment columns with gaps. We explored the possibility of treating gaps as a fifth character and identified a number of genomic elements with more human-specific indels than expected (data not shown). One serious drawback to this approach, however, is that insertions and deletions often affect 1 more than one adjacent base, violating the assumption of independence between bases. Consequently, large indels are assigned a higher probability then they deserve. A solution would be to model indel events, rather than indel bases (see ref. 3 for some work on this problem). S1.3 Likelihood Ratio Test. LRT statistics. For each region, we compute the likelihood ratio test (LRT) statistic as follows. First, we fit two models to the multiple alignment data for the region. Both are scaled versions of the CONS model, a technique that avoids re-estimation all model parameters on a small amount of data. The null model has a single scale parameter representing a shortening (more conserved) or lengthening (less conserved) of all branches in the CONS tree. The alternative model has an additional parameter for the human branch, which is constrained to be . This extra parameter allows the human branch to be relatively longer (less conserved) than the branches in the rest of the tree. Both models are fit using the phyloFit function (phast library) with the --init-model and --scale-only options. The model with a human rate parameter is fit with the additional option --scale-subtree human:loss. The LRT statistic is the log ratio of the likelihood of the alternative model to that of the null model. Regions are ranked based on the magnitude of the LRT statistic, with larger values indicating more evidence for acceleration in human. The ranking on LRT statistics agrees well with other methods 1 ≥ . LRT p -values. It is also of interest to assign a measure of statistical significance to each LRT statistic. We compute empirical p -values by simulation from the CONS model. One million simulated data sets are generated using the phyloBoot program (phast library). These are of variable lengths (median=140, as in the observed data). For each simulated data set, the LRT statistic is computed as above. The distributions of these statistics are similar for different length elements, so we pool all simulated LRT statistics to form a single null distribution. For each observed LRT statistic, the empirical p -value is the proportion of simulated data sets with a larger LRT statistic. Note that the smallest p -value that can be estimated by this method ( here) depends on the number of simulated data sets. For observed LRT statistics that exceed all simulated LRT statistics, we can only say . Computational burden prevents more precise estimation. S1.4 Multiple comparisons. Because the genomic regions we study in this paper are on different chromosomes or are separated by significant distance we can assume independence of their nucleotide substitution processes. This assumed independence allows us to employ a simple multiple testing correction throughout this study, the Benjamini & Hochberg False Discovery Rate (FDR) controlling procedure 5 , which requires independence or weak dependence between tests. The smallest FDR adjusted p -value that we can compute in the LRT is . 4 4.5 − e 2 S1.5 Filtering. In order to illustrate the types of erroneous elements that would be found in the list of HARs if we did not perform filtering (Section 4.3), we describe the following elements that were removed from the analysis. One high probability element, hg17.chr13:22,408,812-22,408,911, has a paralog in chimp and human, but not the rodents. This element is eliminated because we could not determine conclusively (due to gaps in the chimp assembly) which chimp sequence should align to each human sequence. There are two relatively significant elements that contain multiple adjacent human changes: hg17.chr10: 127,180,121-127,180,192 and hg17.chr11:118,305,215-118,305,352. In both cases, recomputing the element ranking without the adjacent changes seriously reduces the overall significance of the element (regardless of which specific base is retained). Hence, we eliminate these elements from further study. For two elements, hg17.chrX:95,820,769-95,820,993 and hg17.chrX:95,820,618-95,820,729, both of which fall in an intron of the DIAPH2 (O60879) gene, the chimpanzee reads in NCBI in fact agree with the human sequence. This suggests that the chimp whole genome assembly may be incorrect at this position and the substitutions are primate specific, but not human-specific. Furthermore, the macaque sequence in these two elements agrees with human, supporting the chimp reads and not the chimp assembly. We believe that the chimpanzee sequence in the corresponding Contig #300019 is an assembly error, potentially caused by mouse contamination of the chimpanzee library. One high scoring element, hg17.chr18:74,236,384-74,236,609, is removed based on contradictory findings in our resequencing data. A 4bp deletion in the human genome relative to the chimp and rodent genomes is not found in any of the humans in the PDR panel. Furthermore, all reads in the NCBI trace repository also do not have the deletion. This suggests that it is either a rare mutation or an assembly or read error. Because the apparent human-specific changes in this element are explained by a shift in the alignment due to this questionable indel, we remove the element from the analysis. S1.6 Background substitution rates based on ENCODE data. Background substitution rates are estimated using 4-fold degenerate (4d) sites in the ENCODE regions 6 (http://www.genome.gov/10005107), which cover 1% of the human genome. We fit a REV substitution model 1 to 4d sites from all ENCODE regions and from the five ENCODE regions that fall in the last (distal) band of their chromosomes. The rate matrix and GC content parameters were adjusted to correct for known bias in 4d sites using genome-wide estimates from ancestral repeats. These regions have a similar distribution of distances to the chromosome end as the HAR elements, making their 4d sites a suitable data set to estimate background substitution rates near chromosome ends. For each fitted model, we compute the posterior expected value of the number of substitutions on each lineage with the program phyloP with option --subtree (phast library). The background rates in the human-chimp tree are compared to the estimated chimp and human rates in the HAR elements.