Causes and Effects of N-Terminal Codon Bias in Bacterial Genes
D. B. Goodman, G. M. Church, S. Kosuri
Codon usage is biased in natural genes and can strongly affect heterologous expression (1). Many organisms are enriched for poorly adapted codons at the N terminus of genes (2-5). Several studies suggest that these codons slow ribosomal elongation during initiation and lead to increased translational efficiency (2, 4, 6). Most organisms also display reduced mRNA secondary structure at the N terminus (7), and studies using synthetic codon gene variants have resulted in conflicting
theories on which mechanisms are causal for expression changes (REF ALL) (8). Information about the causes and effects of codon bias has been restricted to relationships inferred from natural sequences using genome-wide correlation (2, 3, 5, 9, 10), conservation among species (4), or relatively small libraries of synthetic genes with synonymous codon changes (3, 8, 11-15).
Here, we separate and quantify the factors controlling expression at the N terminus of genes in E. coli by building and measuring expression from a large synthetic library of defined sequences. We used array-based oligonucleotide libraries (16) to generate 14,234 combinations of promoters, ribosome binding sites (RBSs), and 11 N-terminal codons in front of super-folder GFP (sfGFP) on a plasmid that constitutively co-expresses mCherry (fig. S1) (17-19). The sequences for the N-terminal peptides correspond to the first 11 amino acids (including the initiating methionine) of 137 endogenous E. coli essential genes (20) that utilize the entire codon repertoire (fig. S2). We expressed these sfGFP fusions from two promoters and three RBSs of varying strengths (19). We also included the natural RBS for each endogenous gene. For each combination of promoter, RBS, and peptide sequence, we designed a set of 13 codon variants to represent a wide range of codon usages and secondary structure free energies across the translation initiation region. We studied the interactions between the 5′ untranslated region (UTR) and N-terminal codon usage because initiation is thought to be the rate-limiting step for translation (1), this region has been previously implicated in determining most expression variation (8), N-terminal codons are more highly conserved (21), and rare codons are enriched at the N terminus of natural genes and especially those that are highly expressed (2).
We measured DNA, RNA, and protein levels from the entire library using a multiplex assay (Fig. 1C and figs.
S3 and S4) (19). DNA and RNA levels were determined using DNASeq and RNASeq. Protein levels were determined by FlowSeq; 7327 (51.5%) constructs were within the quantitative range of our assay (R² = 0.955, p < 2 × 10⁻¹⁶; fig. S5). We normalized the expression measurements across each 13-member codon variant set as fold change from the log-average to control for changes in promoters, RBSs, and peptide sequence (fig. S6).
Changing synonymous codon usage in the 11-aa N-terminal peptide resulted in a mean 60-fold increase in protein abundance from the weakest to strongest codon variant even though >96% of the gene remained unchanged. For over 160 codon variant sets (25% of sets within range), the difference was >100-fold. For each codon variant set, we included sequences encoding the most common or the rarest synonymous codon in E. coli for every amino acid. The rare codon constructs displayed a mean 14-fold (median 4-fold) increase in protein abundance compared to common codon constructs (Fig. 1A; p < 2 × 10⁻¹⁶, two-tailed t test) even though common codons are generally thought to increase protein expression and fitness (1, 9, 22, 23).
To understand why rare codons cause increased expression, we first examined several codon usage metrics, but they could only explain <5% of expression differences (fig. S7A). New metrics that take into account both tRNA availability and usage (nTE) show stronger N-terminal enrichment (4). We calculated nTE scores for E. coli and found that nTE scores were similar to the tRNA adaptation index (tAI) (R² = 0.847, p < 2 × 10⁻¹⁶), did not correlate well with N-terminal codon enrichment in the E. coli genome (R² = 0.107, p = 0.00654), and did not significantly correlate with codons that increased protein expression in our data set (R² = 0.024, p = 0.124). Others have proposed that slow ribosome progression at the N terminus due to rare codons increases translational efficiency (2, 13, 14).
This 'codon ramp' hypothesis should apply primarily in the context of strong translation, but we found that using rare codons at the N terminus increases expression regardless of translation strength (Fig. 1B). Finally, ribosome occupancy profiling in E. coli has shown that tRNA abundance does not correlate with translation rate, but that specific rare codons can create internal Shine-Dalgarno-like motifs that can alter translational efficiency (6). We looked for an association between the presence of internal Shine-Dalgarno-like motifs and changes in expression, and found it to be weak but statistically significant (R² = 0.002, p < 1.3 × 10⁻⁵).
We built a simple linear regression model correlating the use of each individual synonymous codon with expression changes (Fig. 2A and fig. S8). For most amino acids, we found a link between the rarity of the codon and increased expression (Fig. 2B). There is a strong correlation between codons that affected expression and their relative N-terminal enrichment in E. coli (R² = 0.73, p < 2.3 × 10⁻⁹; Fig. 2C). Using relative translation efficiency instead of relative expression produced similar results (fig. S9).
Decreased GC content correlated with increased protein expression (R² = 0.12, p < 2 × 10⁻¹⁶; Fig. 3A). Rare codons in E. coli are frequently A/T-rich at the third position, and codons ending in A/T more frequently correlate with increased expression than synonymous codons ending in G/C (fig. S10). This association suggested a link to mRNA transcript
Most amino acids are encoded by multiple codons, and codon choice has strong effects on protein expression. Rare codons are enriched at the N terminus of genes in most organisms, although the causes and effects of this bias are unclear. Here, we measure expression from >14,000 synthetic reporters in Escherichia coli and show that using N-terminal rare codons instead of common ones increases expression by ~14-fold (median 4-fold).
We quantify how individual N-terminal codons affect expression and show that these effects shape the sequence of natural genes. Finally, we demonstrate that reduced RNA structure and not codon rarity itself is responsible for expression increases. Our observations resolve controversies over the roles of N-terminal codon bias and suggest a straightforward method for optimizing heterologous gene expression in bacteria.
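Two of the quantitative steps described in the text can be illustrated with a minimal Python sketch: the normalization of each codon variant set as fold change from its log-average, and the GC-content and third-position A/T summaries of an N-terminal coding sequence. This is not the authors' analysis code. The function names, the example expression values, and the two 12-nt sequences are hypothetical and chosen only for illustration, and the sketch assumes that "log-average" means the geometric mean of the expression values within a set.

```python
import math
from statistics import mean


def fold_change_from_log_average(levels):
    """Divide each expression level by the set's geometric mean.

    Assumption: 'log-average' is read as exp(mean(log(x))) over one
    codon variant set (same promoter, RBS, and peptide).
    """
    if not levels or any(x <= 0 for x in levels):
        raise ValueError("levels must be non-empty and strictly positive")
    log_avg = mean(math.log(x) for x in levels)
    center = math.exp(log_avg)
    return [x / center for x in levels]


def gc_fraction(seq):
    """Fraction of bases in seq that are G or C."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must be non-empty")
    return sum(base in "GC" for base in seq) / len(seq)


def third_position_at_fraction(cds):
    """Fraction of codons in an in-frame coding sequence ending in A or T."""
    cds = cds.upper()
    if not cds or len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a positive multiple of 3")
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    return sum(codon[2] in "AT" for codon in codons) / len(codons)


if __name__ == "__main__":
    # Hypothetical protein levels for three members of one variant set.
    print(fold_change_from_log_average([1.0, 10.0, 100.0]))

    # Two hypothetical synonymous encodings of Met-Lys-Leu-Ala.
    for cds in ("ATGAAACTGGCG", "ATGAAATTAGCA"):
        print(cds, gc_fraction(cds), third_position_at_fraction(cds))
```

Dividing by the geometric mean rather than the arithmetic mean keeps the normalized values symmetric on a log scale, so a variant 10-fold above the set center and one 10-fold below it sit at equal distances from 1.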