Open Access
Research Article
Genome Landscapes and Bacteriophage Codon Usage
1 FAS Center for Systems Biology, Harvard University, Cambridge, Massachusetts, United States of America, 2 Lyman Laboratory of Physics, Harvard University, Cambridge, Massachusetts, United States of America, 3 Department of Biology, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America
Abstract
Across all kingdoms of biological life, protein-coding genes exhibit unequal usage of synonymous codons. Although alternative theories abound, translational selection has been accepted as an important mechanism that shapes the patterns of codon usage in prokaryotes and simple eukaryotes. Here we analyze patterns of codon usage across 74 diverse bacteriophages that infect E. coli, P. aeruginosa, and L. lactis as their primary host. We use the concept of a “genome landscape,” which helps reveal non-trivial, long-range patterns in codon usage across a genome. We develop a series of randomization tests that allow us to interrogate the significance of one aspect of codon usage, such as GC content, while controlling for another aspect, such as adaptation to host-preferred codons. We find that 33 phage genomes exhibit highly non-random patterns in their GC3-content, use of host-preferred codons, or both. We show that the head and tail proteins of these phages exhibit significant bias towards host-preferred codons, relative to the non-structural phage proteins. Our results support the hypothesis of translational selection on viral genes for host-preferred codons, over a broad range of bacteriophages.
Author Summary
Any protein can be encoded by multiple, synonymous spellings. But organisms typically prefer one spelling over another—a phenomenon known as codon bias. Codon bias is generally understood to result from selection for synonymous spellings that increase the rate and accuracy of protein translation. In this work, we have examined the complete genomes of all sequenced viruses that infect the bacteria E. coli, P. aeruginosa, and L. lactis, and have found that many of these viral genomes also exhibit codon bias. Moreover, the degree of codon bias varies across the viral genome, as visualized using a technique called a “genome landscape.” By comparing the observed genomes to randomly drawn genomes, we demonstrate that the regions of high codon bias in these viral genomes often coincide with regions encoding structural proteins. Thus, the proteins that a virus needs to produce in high copy number utilize the same encoding as its host organism does for highly expressed proteins. Our results extend the translational theory of codon bias to the viral kingdom: parts of the viral genome are selected to obey the preferences of its host.
Citation: Lucks JB, Nelson DR, Kudla GR, Plotkin JB (2008) Genome Landscapes and Bacteriophage Codon Usage. PLoS Comput Biol 4(2): e1000001. doi:10.1371/journal.pcbi.1000001
Editor: Aviv Regev, Massachusetts Institute of Technology and Harvard University, United States of America
Received: July 17, 2007; Accepted: January 22, 2008; Published: February 29, 2008
Copyright: © 2008 Lucks et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: Work by DRN was supported by the National Science Foundation through grant DMR-0654191 and through the Harvard Materials Science and Engineering Center via grant DMR-0213805. JBL acknowledges the financial support of the Fannie and John Hertz Foundation. JBP is supported by a career award from the Burroughs Wellcome Fund.
Competing interests: The authors have declared that no competing interests exist.
* E-mail: jplotkin@sas.upenn.edu
Introduction
The genomes of most organisms exhibit significant codon bias—that is, the unequal usage of synonymous codons. There are longstanding and contradictory theories to account for such biases. Variation in codon usage between taxa, particularly within mammals, is sometimes attributed to neutral processes—such as mutational biases during DNA replication, repair, and gene conversion [1]–[4].
There are also theories for codon bias driven by selection. Some researchers have discussed codon bias as the result of selection for regulatory function mediated by ribosome pausing [5], or selection against pre-termination codons [6],[7]. However, the dominant selective theory of codon bias in organisms ranging from E. coli to Drosophila posits that preferred codons correlate with the relative abundances of isoaccepting tRNAs, thereby increasing translational efficiency [8]–[13] and accuracy [14]. This theory helps to explain why codon bias is often more extreme in highly expressed genes [15], or at highly conserved sites within a gene [14]. Translational selection may also explain variation in codon usage between genes selectively expressed in different tissues [16],[17]. However, recent work suggests that synonymous variation, particularly with respect to GC content, affects transcriptional processes as well [18].
The codon usage of viruses has also received considerable attention [19],[20], particularly in the case of bacteriophages [21]–[26]. Most work along these lines has focused on individual phages, or on the patterns of genomic codon usage across a handful of phages of the same host.
Here, we provide a systematic analysis of intragenomic variation in bacteriophage codon usage, using 74 fully sequenced viruses that infect a diverse range of bacterial hosts. Motivated by energy landscapes associated with DNA unzipping [27],[28], we develop a novel methodological tool, called a genome landscape, for studying the long-range properties of codon usage across a phage genome. We introduce a series of randomization tests that isolate different features of codon usage from each other, and from the amino acid sequence of encoded proteins. Thirty-three of the phages in our analysis are shown to exhibit non-random variation in synonymous GC content, non-random variation in codons adapted for host translation, or both. Additionally, we demonstrate that phage genes encoding structural proteins are significantly more adapted to host-preferred codons than non-structural genes. We discuss our results in the context of translational selection and lateral gene transfer amongst phages.
Results
Genome Landscapes
We start by introducing the concept of a genome landscape, which provides a simple means for visualizing long-range correlations of sequence properties across a genome [29]. A genome landscape is simply a cumulative sum of a specified quantitative property of codons. The calculation of the cumulative sum is straightforward, and it consists of scanning over the genome sequence one codon at a time, gathering the property of each codon, and summing it with the properties of previous codons in the genome sequence. Similar cumulative sums are used in solid-state physics for, e.g., the calculation of energy levels [30]. In the case of the GC3 landscape, we have
$$F_{GC3}(m) = \sum_{i=1}^{m} \left[ \eta_{GC3}(i) - \bar{\eta}_{GC3} \right] \qquad (1)$$
where ηGC3(m) equals one or zero, depending upon whether the mth codon ends in a G/C or A/T, respectively. Note that we subtract the genome-wide average GC3 content, $\bar{\eta}_{GC3}$, so that FGC3(0) = FGC3(N) = 0, where N is the length of the genome. In other words, we convert the genome codon sequence into a binary string of 1's and 0's according to whether each codon is of type GC3 or AT3, and we cumulatively sum this sequence to compute FGC3(m).
The interpretation of a GC3 landscape is straightforward. Regions of the genome whose landscape exhibits an uphill slope contain higher than average GC3 content, whereas regions of downhill slope contain lower than average GC3 content. The genome landscape provides an efficient visualization of long-range correlations in sequence properties across a genome, similar to the techniques introduced by Karlin [31].
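As a minimal sketch of this cumulative sum, assuming the coding sequence is already available as an in-frame string, the GC3 landscape could be computed as follows. The function names and the toy sequence are illustrative and do not come from the paper.

```python
def split_codons(seq):
    """Split an in-frame coding sequence into a list of codons."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def gc3_landscape(codons):
    """Mean-centered cumulative sum of the GC3 indicator, so F[0] = F[N] = 0."""
    # eta is 1 for a codon ending in G or C, and 0 for one ending in A or T.
    eta = [1.0 if codon[2] in "GC" else 0.0 for codon in codons]
    mean = sum(eta) / len(eta)
    landscape = [0.0]
    for value in eta:
        landscape.append(landscape[-1] + value - mean)
    return landscape


# Toy sequence (hypothetical, not a real phage gene).
toy_codons = split_codons("ATGGCCAAACTGTTTGGC")
toy_landscape = gc3_landscape(toy_codons)
```

An uphill step in `toy_landscape` marks a GC3 codon and a downhill step an AT3 codon, relative to the average of the toy sequence.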
Traditional visualizations of GC3 content involve moving window averages of %GC3 over the genome [32]. In order to compare these techniques with the landscape approach, we focus on the E. coli phage lambda as an illustrative example. Figure 1A shows the lambda phage GC3 landscape above its associated “GC3 histogram”. The histogram shows the GC3 content of each gene, and the width of each histogram bar reflects the length of the corresponding gene. Thus, the gene-by-gene histograms mimic a sliding window average view of nucleotide content across the genome, but focus on the contributions of individual genes to these sequence properties. Figure 1A reveals a striking pattern of lambda phage codon usage: the genome is apparently divided into two halves that contain significantly different GC3 contents [33],[34]. The large region of uphill slope on the left half of the GC3 landscape reflects the fact that the majority of the genes in this region contain an excess of codons that end in G or C. This trend is also reflected in the GC3 histogram bars, which are higher than average in the left half of the genome (Figure 1).
Figure 1. GC3 and CAI landscapes for lambda phage.
Landscapes of GC3 (left) and CAI (right) measures of codon usage in lambda phage. Only coding sequences are considered, which when concatenated together are 40,773 bp long (see Table 2). The GC3 landscape is the mean-centered cumulative sum of the GC3 content (GC3 = 1, AT3 = 0) of codons. The CAI landscape is the mean-centered cumulative sum of the log w-value for each codon. For each landscape, a region exhibiting an uphill slope corresponds to higher than average GC3 or CAI. The horizontal purple band represents the expected amount of variation in a random walk of GC3 or AT3 choices, given by Equation 2. Both landscapes exhibit features far outside of the purple bands, indicating that the patterns of codon usage are highly non-random. Gene boundaries are represented by the bars in the histograms below each landscape. The height of the bars in the histogram indicates the GC3 and CAI values for each gene.
doi:10.1371/journal.pcbi.1000001.g001
It is clear that genome landscapes contain the same information as gene-by-gene histograms. However, as has been noted before [29], genome landscapes also represent a powerful visualization tool that emphasizes genome-wide trends in sequence properties. As we demonstrate below, gene-by-gene histograms offer a mechanism by which to quantify these trends, while the landscapes offer striking views of these trends that can aid in their interpretation. In addition, GC-landscapes are directly useful for modeling physical properties of DNA unzipping [28].
Genome landscapes also provide a natural means of evaluating whether or not features of codon usage are due to random chance. Under a null model in which the η(i)'s above are chosen as independent random variables with $\mathrm{var}(\eta(i)) = \langle \eta(i)^2 \rangle - \langle \eta(i) \rangle^2 = \Delta$, one can show (see Methods) that the standard deviation of FGC3(m) is
$$\sigma_{F}(m) = \sqrt{\Delta \, m \left( 1 - \frac{m}{N} \right)} \qquad (2)$$
This quantity is shown as a purple band in Figure 1. For η(i)'s chosen to be 0 or 1 at random, ΔGC3 = 1/4 and the maximum width is obtained at m = N/2. Since the scale of variation across the lambda phage GC3 landscape is much greater than its expectation under the null, we can conclude that the distribution of G/C versus A/T ending codons is highly non-random in the lambda phage genome.
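For independent η(i)'s with variance Δ, the mean-centered cumulative sum has standard deviation sqrt(Δ·m·(1 − m/N)), which vanishes at both ends and peaks at m = N/2. The following sketch evaluates that band; the function name and the choice of N = 100 are assumptions made only for illustration.

```python
import math


def null_band(n_codons, delta):
    """Standard deviation of a mean-centered random landscape at each position m."""
    return [math.sqrt(delta * m * (1.0 - m / n_codons)) for m in range(n_codons + 1)]


# Delta = 1/4 corresponds to eta values drawn as 0 or 1 with equal probability.
band = null_band(100, 0.25)
widest = max(range(len(band)), key=band.__getitem__)
```

Plotting plus and minus `band` around zero would give an envelope of the kind described for the purple band, under the stated independence assumption.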
We can also gain intuition about the degree of non-randomness in the GC3 landscape by considering what would happen if the lambda phage genome were to accumulate random synonymous mutations. Figure 2A shows snapshots of the lambda GC3 landscape as we simulate synonymous mutations to the genome. Between each snapshot, N synonymous mutations were introduced by picking a codon at random along the genome, and then choosing a new synonymous codon at random according to the global lambda phage codon distribution. By preserving the global codon distribution in each synonymous variation of the genome, this procedure inherently controls for any mutational bias or other source of global codon usage bias that may be present in the phage genome nucleotide content. The same is true for all randomization tests discussed in this paper. As more mutations are introduced, the GC3 landscape of the synonymously mutated lambda genome approaches the purple band, indicating that the GC3 pattern in the real lambda phage genome is highly non-random.
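The mutation procedure described above can be sketched as follows. This is an illustrative reading of the text, not the authors' code: the standard genetic code is assumed, the helper names are hypothetical, and the toy genome is not real phage sequence.

```python
import random
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, built in TCAG order ('*' marks stop codons).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def mutate_synonymously(codons, n_mutations, rng):
    """Apply random synonymous changes drawn from the genome's own codon counts."""
    # Group the observed codons, with their genome-wide counts, by amino acid.
    synonyms = defaultdict(dict)
    for codon, count in Counter(codons).items():
        synonyms[CODON_TABLE[codon]][codon] = count
    mutated = list(codons)
    for _ in range(n_mutations):
        site = rng.randrange(len(mutated))  # pick a codon position at random
        options = synonyms[CODON_TABLE[mutated[site]]]
        # Draw the replacement according to the global codon distribution.
        mutated[site] = rng.choices(list(options), weights=list(options.values()))[0]
    return mutated


rng = random.Random(0)  # fixed seed so the sketch is reproducible
genome = ["CTG", "CTT", "GCC", "GCA", "CTG", "AAA", "GCC", "CTG"]
snapshot = mutate_synonymously(genome, len(genome), rng)  # one snapshot: N mutations
```

Because replacements are weighted by the original genome's codon counts, the expected global codon usage is unchanged while its arrangement along the genome is scrambled.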
Figure 2. Snapshots of simulated synonymous mutation in the lambda phage genome.
(A) Shows GC3 and (B) shows CAI landscapes. In between successive snapshots (labeled by integers), N synonymous mutations are introduced into the genome and the resulting landscape is shown, where N is the number of codons in the lambda phage genome (see the Genome Landscapes section). These snapshots show that the simulated genome landscapes approach the random null model, indicated by the purple band (see Figure 1). The final CAI landscape (3) lies almost completely within the purple band. Using the lambda phage mutation rate of $7.7 \times 10^{-8}$ mutations/bp/replication [57], we can estimate that approximately $10^{7}$ genome replications would be required to relax within the purple bars.
doi:10.1371/journal.pcbi.1000001.g002
The procedure of producing a genome landscape can be applied to other properties of codon usage. In addition to GC3, we will study patterns in the Codon Adaptation Index (CAI). CAI measures the similarity of a gene's codon usage to the ‘preferred’ codons of an organism [35], in this case the host bacterium of the phage under study. Every bacterium has a preferred set of codons defined as the codons, one for each amino acid, that occur most frequently in genes that are translated at high abundance. These genes are often taken to be the ribosomal proteins and translational elongation factors [35] (see Methods).
In order to calculate CAI, the preferred codons are each assigned a weight w = 1. The remaining codons are assigned weights according to their frequency in the highly-translated genes, relative to the frequency of the w = 1 codon. The CAI of a gene is defined as the geometric mean of the w-values for its codons
$$\mathrm{CAI} = \left( \prod_{i=1}^{M} w_i \right)^{1/M} \qquad (3)$$
where wi is the w-value of the ith codon, and M is the length of the gene. This quantity can be re-written as
$$\mathrm{CAI} = \exp\left( \frac{1}{M} \sum_{i=1}^{M} \ln w_i \right) \qquad (4)$$
The latter formulation is more useful for calculating genome landscapes, because the argument of the exponential function is now a sum of the logs of the w-values. Therefore, we define the CAI landscape as
$$F_{CAI}(m) = \sum_{i=1}^{m} \left[ \eta_{CAI}(i) - \bar{\eta}_{CAI} \right] \qquad (5)$$
where ηCAI(m) = ln(wm) and $\bar{\eta}_{CAI}$ is its genome-wide average.
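A small sketch of these definitions, assuming a dictionary of w-values is already available: CAI is computed as the exponential of the mean log w-value, and the CAI landscape as the mean-centered cumulative sum of ln(w). The w-values below are hypothetical placeholders, not real E. coli weights.

```python
import math


def cai(codons, w):
    """Geometric mean of the codon w-values, computed via the mean of their logs."""
    return math.exp(sum(math.log(w[codon]) for codon in codons) / len(codons))


def cai_landscape(codons, w):
    """Mean-centered cumulative sum of ln(w) along the codon sequence."""
    eta = [math.log(w[codon]) for codon in codons]
    mean = sum(eta) / len(eta)
    landscape = [0.0]
    for value in eta:
        landscape.append(landscape[-1] + value - mean)
    return landscape


# Hypothetical w-values for three leucine codons (illustration only).
toy_w = {"CTG": 1.0, "CTT": 0.25, "TTA": 0.5}
toy_gene = ["CTG", "CTT", "CTG", "TTA"]
toy_cai = cai(toy_gene, toy_w)
toy_cai_landscape = cai_landscape(toy_gene, toy_w)
```

Working with logs keeps the per-codon contributions additive, which is what allows the same cumulative-sum construction used for GC3.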
The CAI landscape for lambda phage is shown in Figure 1B, along with the CAI histogram of lambda phage. For the CAI histograms, the height of each bar represents the CAI value of that gene (Equation 3). As in the case with the GC3 landscape, we find that the lambda phage CAI landscape corresponds closely to the CAI histogram, but it offers a more striking global view of the long-range CAI structure in the lambda phage genome. One contiguous half of the lambda phage genome exhibits elevated CAI, whereas the other half exhibits depressed CAI. The observed CAI landscape lies far outside the purple band in Figure 1, calculated according to Equation 2, indicating that the pattern of CAI across the lambda phage genome is non-random. However, the purple band is wider for the CAI landscape than for the GC3 landscape, because the variance in the ln(wi)'s, ΔCAI, is greater than ΔGC3.
The GC3 and CAI landscapes for lambda phage are highly correlated with each other (Figure 1). In particular they both have large uphill regions on the left-hand side of the genome, indicating a region containing codons with elevated GC3-content and CAI values, compared to the genome average. It is possible that the observed correlation between the GC3 and CAI landscapes could be caused by the conflation between high CAI and GC3 in the preferred E. coli codons, as we discuss below.
We note that the genes in the region of elevated CAI primarily encode the highly translated structural proteins that form the capsid and tail of the lambda phage virions. This pattern suggests the hypothesis that, because of the need to produce structural proteins in high copy number during the viral life cycle, structural genes preferentially use codons that match the host's preferred set of codons. We will explore this translational-selection hypothesis in greater detail below.
The Effect of Amino Acid Content on Genome Landscapes
The previous section illustrated that the codon usage across the lambda phage genome is highly non-random with respect to both GC3 and CAI. In this section we quantify this statement, and we focus on aspects of lambda's codon usage patterns that are independent of the amino acid sequences of the encoded proteins.
Since we are interested in studying the patterns of synonymous codon usage, it is important that we control for the amino acid sequence of encoded proteins. Phages utilize a diverse spectrum of proteins, ranging from those that form the protective capsid for nascent progeny, to those encoding for the tail and tail fibers, to those that regulate the switch between lytic or lysogenic infection pathways. As with other organisms, phage proteins have been selected at the amino acid level for function and folding. Some portion of a phage's codon usage is surely influenced by selection for amino acid content.
We can construct a simple randomization test to interrogate the potential influence of the amino acid sequence on the GC3 and CAI landscapes of lambda phage. In this test, we generate random genomes that have the exact same amino acid sequence as lambda phage, but shuffled codons, such that the genome-wide, or global, codon distribution is preserved in each random genome (see Methods). As summarized in Table 1, we refer to this test as the ‘aqua’ randomization test. For each of the randomized genomes, we calculate the GC3 and CAI landscapes. Similar to a recent randomization method [36], we then compare the observed landscapes of the actual genome to the distribution of landscapes generated from the randomized genomes.
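One simple way to realize such a randomization, offered here as an assumption rather than as the procedure in the Methods, is to permute codons among all sites that encode the same amino acid. This keeps the amino acid sequence and the global codon counts exactly fixed. The helper names and the toy genome are hypothetical.

```python
import random
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, built in TCAG order ('*' marks stop codons).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(codons):
    """Amino acid sequence encoded by a list of codons."""
    return "".join(CODON_TABLE[codon] for codon in codons)


def aqua_shuffle(codons, rng):
    """Permute synonymous codons among the sites of each amino acid."""
    sites_by_aa = defaultdict(list)
    for index, codon in enumerate(codons):
        sites_by_aa[CODON_TABLE[codon]].append(index)
    shuffled = list(codons)
    for indices in sites_by_aa.values():
        pool = [codons[i] for i in indices]
        rng.shuffle(pool)  # in-place permutation of this amino acid's codons
        for i, codon in zip(indices, pool):
            shuffled[i] = codon
    return shuffled


rng = random.Random(1)  # fixed seed for reproducibility
genome = ["CTG", "GCC", "CTT", "AAA", "GCA", "CTG", "AAG", "GCC"]
random_genome = aqua_shuffle(genome, rng)
```

Repeating `aqua_shuffle` many times and computing a landscape for each draw would give the distribution against which an observed landscape is compared.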
Figure 3 shows the results of this comparison, with the observed landscapes plotted as black lines, and the mean±one and two standard deviations of random trials shown in dark and light aqua, respectively. As the figures show, the observed landscapes lie in the far extremes of the randomized distributions – indicating that the amino acid sequence of the lambda phage genome does not determine the extraordinary features of the observed landscapes.
Figure 3. Observed and randomized landscapes for lambda phage.
The figure shows the observed GC3 (left) and CAI (right) landscapes, plotted in black, along with the mean±1, and ±2 standard deviations of randomized trials, shown in aqua (bold line, dark and light regions, respectively). The aqua randomization test shown here draws random synonymous codons that preserve the exact amino acid sequence, according to probabilities that preserve the global codon usage distribution of the lambda genome. For the most part, the observed landscapes lie significantly outside the distribution of randomized landscapes–implying that the amino acid content of genes is not responsible for the observed pattern of the CAI landscape. In the lower panel, however, genes whose GC3 (left) or CAI (right) values fall between the 0.025 and 0.975 quantile of the random trials are shadowed in grey; the GC3/CAI values of such genes are not significantly different from random, given their amino acid sequence.
doi:10.1371/journal.pcbi.1000001.g003
It is also instructive to query the influence of amino acid content on codon usage in each gene individually. The histogram view of these randomization tests allows us to ask this question precisely. Because the amino acid sequence is preserved exactly across the genome, each histogram bar in Figure 3 can be considered as its own randomization test, one for each gene. The position of the horizontal black bar reflects the actual codon usage of each gene, and it can be compared to the distribution of random trials in order to compute a quantile for each gene:
$$q_> = \frac{\#\{\text{random trials} < \text{observed}\}}{\#\{\text{random trials}\}}, \qquad q_< = \frac{\#\{\text{random trials} > \text{observed}\}}{\#\{\text{random trials}\}} \qquad (6)$$
Note that we have defined two quantiles, q> and q<, that describe the proportion of random trials strictly less than or strictly greater than the observed data, respectively. These two quantities sum to a value of at most one (equal to one if there are no ties). A value of q> > 0.5 signifies that the observed statistic (e.g. GC3 or CAI) is greater than most of the random trials.
Associated with each of these quantiles is a p-value quantifying whether the observed gene sequence has significantly different codon usage than the random trials: p< = 1−q< and p> = 1−q>. If either one of these p-values is low, it signifies that the GC3 (or CAI) content of the gene is significantly different than the genomic average, controlling for the amino acid sequence of the gene. p< tests for significantly depressed GC3 (or CAI) in a gene; and p> tests for significantly elevated GC3 (or CAI) in a gene. We will use these p-values, which arise from the ‘aqua’ randomization test, in two ways.
Since we are interested in studying the effects of synonymous codon usage alone, we first wish to filter out any genes whose codon usage does not significantly deviate from random, given the amino acid sequence. Therefore, in the subsequent gene-by-gene analyses reported in this paper, we retain only those genes whose quantiles fall in the extreme 5% of random trials. That is, we only keep those genes for which p< < 0.025 or p> < 0.025. These genes are said to ‘pass’ the aqua test, and they are unshaded in Figure 3.
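The per-gene quantiles, p-values, and the 5% filter can be sketched as below. The function names and the list of trial values are hypothetical; the 0.025 cutoff on each tail follows from retaining genes in the extreme 5% of random trials.

```python
def gene_p_values(observed, trials):
    """Return (p_less, p_greater) for one gene from its randomized trial values."""
    n = len(trials)
    q_greater = sum(t < observed for t in trials) / n  # trials strictly below observed
    q_less = sum(t > observed for t in trials) / n     # trials strictly above observed
    return 1.0 - q_less, 1.0 - q_greater


def passes_aqua(observed, trials, tail=0.025):
    """True if the gene lies in the extreme 5% of trials (2.5% in either tail)."""
    p_less, p_greater = gene_p_values(observed, trials)
    return p_less < tail or p_greater < tail


# Hypothetical per-gene GC3 values from ten random trials (illustration only).
trials = [0.40, 0.45, 0.50, 0.50, 0.55, 0.60, 0.65, 0.70, 0.75, 0.80]
p_less, p_greater = gene_p_values(0.50, trials)
```

In practice many more than ten trials would be needed before a p-value below 0.025 could be resolved at all.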
We also use the gene-by-gene p-values to quantify the degree to which codon usage is independent of amino acid sequence across the genome as a whole. To do so, we combine all the gene-by-gene p-values into an aggregate p-value for the entire genome, paqua, using the method of Fisher [37]. We calculate the combined p-value by summing the logs of twice the minimum of each gene-specific p-value
$$f_{aqua} = -2 \sum_{i=1}^{k} \ln\left[ 2 \min\left( p_<^{i}, p_>^{i} \right) \right] \qquad (7)$$
where $p_<^{i}$ represents the aqua p<-value for gene i, $p_>^{i}$ the corresponding p>-value, and k is the number of genes in the genome. It is well known that, under the null hypothesis, faqua is chi-squared distributed with 2k degrees of freedom [37]. Thus, the combined p-value for the entire genome is $p_{aqua} = 1 - \chi^2_{2k}(f_{aqua})$, where $\chi^2_{2k}$ is the cumulative chi-squared distribution with 2k degrees of freedom. In the case of lambda phage, we find paqua = […] for GC3 and paqua = […] for CAI. Thus, we conclude that neither the GC3 nor the CAI patterns across the lambda phage genome are determined by the genome's amino acid sequence.
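Fisher's combination can be sketched with the standard library alone, because the chi-squared upper tail has a closed form when the number of degrees of freedom is even. The function names are hypothetical, and the clamp of the two-sided p-value to 1 is a precaution added here for genes with many ties, not something stated in the text.

```python
import math


def chi2_upper_tail_even(x, dof):
    """P(X > x) for a chi-squared variable with an even number of degrees of freedom."""
    k = dof // 2
    term, total = 1.0, 1.0
    for j in range(1, k):
        term *= (x / 2.0) / j  # builds (x/2)^j / j! incrementally
        total += term
    return math.exp(-x / 2.0) * total


def two_sided(p_less, p_greater):
    """Twice the smaller one-sided p-value, clamped to 1 (clamp is an assumption)."""
    return min(1.0, 2.0 * min(p_less, p_greater))


def fisher_combined(p_values):
    """Return Fisher's statistic f and the combined p-value for k two-sided p-values."""
    f = -2.0 * sum(math.log(p) for p in p_values)
    return f, chi2_upper_tail_even(f, 2 * len(p_values))


# Hypothetical two-sided p-values for two genes (illustration only).
f_stat, p_combined = fisher_combined([0.05, 0.05])
```

A p-value of exactly zero, which can occur with a finite number of random trials, would make the logarithm undefined, so some lower bound tied to the number of trials would be needed in practice. Where SciPy is available, `scipy.stats.chi2.sf` gives the same upper tail for any number of degrees of freedom.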
In the following sections we will use the aqua test (see Table 1) and its associated gene-by-gene and combined p-values as a control to verify that features of codon usage are not driven by the amino acid sequence.
Disentangling CAI from GC3
Depending upon the preferred codons of the host species, the effect of selection for high CAI in a viral gene is not necessarily independent from the effect of selection for other features of viral codon usage, such as high GC3. For example, codons with high CAI values associated with a given host may be biased towards high GC3 values as well (see Figure 4). It is important, therefore, to disentangle the effects of selection for CAI versus selection for GC3, in order to determine which one of these forces is responsible for the non-random patterns of codon usage observed in the lambda genome.
Figure 4. E. coli codon usage master table.
The table of 61 codons along with their associated w-values is shown for E. coli. The w-value of each codon reflects its frequency in highly transcribed E. coli genes (see main text). The table is divided into four regions: codons with high CAI (w≥0.9) ending in G or C (dark red); codons with high CAI ending in A or T (dark blue); codons with low CAI (w<0.9) ending in G or C (light red); codons with low CAI ending in A or T (light blue). As the table shows, there is a slight bias for GC3 in the high-CAI codons (58%), and slight bias away from GC3 in the low-CAI codons (48%).
doi:10.1371/journal.pcbi.1000001.g004
The weights used to compute CAI for E. coli are shown in Figure 4. The 61 codons are placed into one of four groups according to whether they are GC3 or not (red or blue, respectively), and whether they have high CAI or not (dark or light, respectively). High CAI is determined by an arbitrary cutoff of w≥0.9. As this table demonstrates, the set of preferred codons in E. coli is slightly biased towards GC-ending codons (58%).
The GC bias of preferred codons, although slight, could conflate the results of selection for CAI versus GC3 in phages that infect E. coli, such as lambda. We therefore introduce another randomization test that allows us to disentangle patterns of CAI content from patterns of GC3 content. Similar to the aqua randomization test described above, we draw random phage genomes such that the amino acid sequence is conserved, but we add the additional constraint of conserving the exact GC3 sequence as well (see Methods). For example, at a site containing a GC3 codon for leucine, in our random trials we only allow those leucine codons terminating in G or C. By comparing the observed landscapes of the genome with the distribution of randomly drawn landscapes, we can isolate the features of codon usage driven by CAI, independent of GC3 and amino acid content. We refer to this randomization procedure as the ‘orange’ randomization test (Table 1).
Conversely, we also wish to assess the strength of patterns in GC3 content, independent of CAI and amino acid content. The appropriate randomization procedure in this case requires that we constrain the amino acid sequence and the sequence of codon CAI values while allowing GC3 to vary. However, because CAI values are not binary, CAI cannot be constrained exactly while still allowing for enough variability to produce a meaningful randomization test. Thus, we introduce a binary version of the CAI measure, called BCAI, that is qualitatively the same as and, for our purposes, interchangeable with CAI.
The BCAI w-value for a codon is defined to be 0.7 if the codon is high CAI, and 0.3 if the codon has low CAI. High CAI is defined by the threshold of w≥0.9 (see Figure 4). The threshold value w≥0.9 is arbitrary, and our results are robust to changing this threshold (see Figures S1 and S2). Our use of the term ‘binary’ here refers to the binary classification scheme and not the particular values of BCAI. The actual values assigned for BCAI are arbitrary, for the most part, and have no effect on our results. Nevertheless, we cannot assign low BCAI a value of zero, because this value would be problematic when included in the geometric averaging procedure, or when computing the logarithm of w-values for BCAI landscapes.
BCAI provides a useful surrogate for CAI because its values are binary, thereby allowing us to constrain a gene's amino acid sequence and BCAI sequence exactly, while varying GC3 content in random trials. The BCAI landscapes and histograms are calculated in the same way as CAI landscapes and histograms, except using BCAI w-values. As expected, the BCAI landscape of a genome is qualitatively similar to its CAI landscape (compare Figures 5B and 3B), and the two landscapes are highly correlated (e.g. r = 0.72 for lambda phage). Thus BCAI is interchangeable with CAI for the purposes of our randomization tests.
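The binary classification behind BCAI can be sketched as follows, using the threshold of w≥0.9 and the values 0.7 and 0.3 given in the text. The function names and the toy w-values are hypothetical and are not real E. coli weights.

```python
import math

HIGH_CAI_THRESHOLD = 0.9
BCAI_HIGH, BCAI_LOW = 0.7, 0.3  # the two w-values assigned by the binary scheme


def bcai_weights(w, threshold=HIGH_CAI_THRESHOLD):
    """Map each codon's CAI w-value onto its binary BCAI w-value."""
    return {codon: BCAI_HIGH if value >= threshold else BCAI_LOW
            for codon, value in w.items()}


def geometric_mean_w(codons, w):
    """Geometric mean of w-values; gives CAI or BCAI depending on the table passed."""
    return math.exp(sum(math.log(w[codon]) for codon in codons) / len(codons))


# Hypothetical CAI w-values (illustration only).
toy_w = {"CTG": 1.0, "CTT": 0.25, "GCC": 0.95, "GCA": 0.6}
toy_bcai_w = bcai_weights(toy_w)
toy_bcai = geometric_mean_w(["CTG", "CTT"], toy_bcai_w)
```

Because neither BCAI value is zero, the logarithm is always defined, which is the reason the text gives for not assigning zero to low-CAI codons.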
Figure 5. Observed and randomized landscapes for lambda phage.
Observed landscapes are shown along with randomized landscapes associated with the green and orange tests. The green randomization procedure tests the significance of the GC3 landscape controlling for the observed CAI (actually, BCAI) variation across the genome. The orange randomization procedure tests the significance of the BCAI landscape, controlling for the observed GC3 variation across the genome. Both tests preserve the amino-acid sequence exactly. Both observed landscapes lie outside the distribution of random trials, indicating there is non-random GC3 content controlling for CAI, and non-random CAI content controlling for GC3.
doi:10.1371/journal.pcbi.1000001.g005
Figure 5 shows the results of the two randomization tests outlined above: the ‘green’ test that compares the observed GC3 landscape to a distribution of random trials constraining the amino acid sequence and the BCAI sequence; and the ‘orange’ test that compares the observed BCAI landscape to a distribution of random trials constraining the amino acid sequence and the GC3 sequence. Our convention for naming these two tests is summarized in Table 1.
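Both constrained tests can be viewed as shuffling codons only among sites that share the constrained properties. The sketch below takes that view as an assumption (the Methods may draw codons differently): the orange key fixes amino acid and GC3 state, and the green key fixes amino acid and high/low CAI class. All names and w-values are hypothetical.

```python
import random
from collections import defaultdict
from itertools import product

# Standard genetic code, built in TCAG order ('*' marks stop codons).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def shuffle_within_classes(codons, key, rng):
    """Permute codons among sites that share the same key (the constrained class)."""
    sites = defaultdict(list)
    for index, codon in enumerate(codons):
        sites[key(codon)].append(index)
    shuffled = list(codons)
    for indices in sites.values():
        pool = [codons[i] for i in indices]
        rng.shuffle(pool)
        for i, codon in zip(indices, pool):
            shuffled[i] = codon
    return shuffled


def orange_key(codon):
    """Orange test: amino acid and GC3 state are fixed, so only BCAI can vary."""
    return CODON_TABLE[codon], codon[2] in "GC"


def green_key(w, threshold=0.9):
    """Green test: amino acid and high/low CAI class are fixed, so only GC3 can vary."""
    return lambda codon: (CODON_TABLE[codon], w[codon] >= threshold)


# Hypothetical w-values and genome (illustration only).
toy_w = {"CTG": 1.0, "CTC": 0.3, "CTT": 0.2, "CTA": 0.1,
         "GCC": 0.95, "GCG": 0.9, "GCA": 0.5, "GCT": 1.0}
genome = ["CTG", "CTC", "CTT", "CTA", "CTG", "GCC", "GCG", "GCA", "CTC", "GCT"]
rng = random.Random(2)  # fixed seed for reproducibility
orange_genome = shuffle_within_classes(genome, orange_key, rng)
green_genome = shuffle_within_classes(genome, green_key(toy_w), rng)
```

Passing a different key function is the only change between the two tests, which makes the symmetry between them explicit.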
As seen in Figure 5A, the observed GC3 landscape lies significantly outside of the random trials that preserve amino acid sequence and BCAI sequence. Combining the gene-by-gene p-values for this test, we find a combined p-value of […], indicating that the lambda phage genome as a whole has non-random GC3 variation independent of amino acid and CAI (actually, BCAI) sequence. Conversely, Figure 5B shows that the BCAI landscape contains non-random features when controlling for both GC3 and amino acid sequence (combined p-value […]). In other words, the lambda phage genome exhibits highly non-random patterns of both GC3 and CAI codon variation, independent of one another and independent of the amino acid sequence.
Non-Random Patterns of CAI and GC3 in Bacteriophages
In the sections above we have demonstrated and quantified highly non-random patterns of GC3 and CAI codon usage variation across the lambda phage genome. We have also demonstrated that these trends are independent of one another. In this section, we will extend our analysis to a large range of diverse phages.
In this section we consider all sequenced phages that infect E. coli, Pseudomonas aeruginosa or Lactococcus lactis as their primary host. The latter two hosts were chosen because they contain unusually extreme GC3 content: 88 %GC3 for P. aeruginosa and 25 %GC3 for L. lactis, genome-wide. The extreme GC3 content of these hosts gives rise to opposing relationships between high CAI and GC3, as indicated schematically in Figure 6. In particular, P. aeruginosa strongly favors GC3 in high-CAI codons (94%), and L. lactis strongly favors AT3 in high-CAI codons (72%). Thus, these three hosts span a large spectrum of relationships between CAI and GC3. Since our randomization tests constrain amino acid and BCAI exactly (the ‘green’ test), and amino acids and GC3 exactly (the ‘orange’ test), we can control for any possible conflation between GC3 and CAI trends. Thus, the randomization tests are equally applicable to all of the phage genomes, regardless of their host.
Figure 6. Schematics of preferred codon usage tables for E. coli, P. aeruginosa, and L. lactis following the conventions of Figure 4.
Unlike E. coli, P. aeruginosa strongly favors GC3 in high-CAI codons (94%), and L. lactis strongly favors AT3 in high-CAI codons (72%).
doi:10.1371/journal.pcbi.1000001.g006
We performed the aqua, green, and orange randomization tests on the 45 phages of E. coli, 12 phages of P. aeruginosa, and 17 phages of L. lactis whose genomes have been sequenced (see Methods). In the first step of our analysis, we removed any phages which failed either the aqua GC3 or aqua CAI tests, because the codon usage of such genomes is influenced by their amino acid sequence. A phage was said to pass these two control tests if its Fisher combined p-values for both aqua GC3 and aqua CAI were significant. The significance criterion for each test is pcombined<5%/74, which incorporates a Bonferroni correction for multiple tests. With this cutoff, 50 of the initial 74 phages passed the aqua control tests.
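The control filter amounts to comparing both combined aqua p-values against a Bonferroni-corrected cutoff of 5%/74. A minimal sketch, with a hypothetical function name and hypothetical p-values:

```python
def passes_control(p_aqua_gc3, p_aqua_cai, n_phages=74, alpha=0.05):
    """True if both combined aqua p-values beat the Bonferroni-corrected cutoff."""
    cutoff = alpha / n_phages  # 5%/74, roughly 6.8e-4
    return p_aqua_gc3 < cutoff and p_aqua_cai < cutoff


# Hypothetical combined p-values for two phages (illustration only).
keeps_first = passes_control(1e-10, 1e-6)
keeps_second = passes_control(1e-10, 1e-3)
```

Requiring both tests to be significant means a phage is dropped as soon as either its GC3 or its CAI pattern cannot be separated from the effect of its amino acid sequence.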
Figure 7 shows results of these tests for several example genomes. P2, a temperate phage, and T3, a non-temperate phage, both infect E. coli; both pass the control tests and exhibit significant ‘orange’ and ‘green’ results, as does D3112, a temperate phage that infects P. aeruginosa. However, not all phages that pass the control test exhibit significant ‘orange’ and ‘green’ results, as evidenced by bIL286, a temperate phage infecting L. lactis.
Figure 7. Green (left) and orange (right) randomization tests for several phages.
Bacteriophages P2 (A) and T3 (B) both infect E. coli. Phage D3112 (C) infects P. aeruginosa. Phage bIL286 (D) infects L. lactis. T3 is the only non-temperate phage of this group. See Table 2 for combined Fisher p-values for these tests. In the case of bIL286, note the lack of evidence for codon bias in the green and orange tests, as confirmed by the insignificant p-values in Table 2. In this case, we cannot rule out the possibility that the observed pattern in GC3 is determined completely by the amino acid and CAI sequence (green), or that the observed pattern in CAI is determined by the amino acid and GC3 sequence (orange).
doi:10.1371/journal.pcbi.1000001.g007
Figure 8 plots the distribution of combined Fisher p-values of the orange and green tests for the 50 phages that pass the control tests. The majority of these p-values are highly significant. Using a Bonferroni-corrected threshold of 5%/50, a total of 22 genomes show significance in the orange test, 29 in the green test, and 17 in both orange and green. These results indicate that non-random patterns in codon usage are not unique to lambda phage. Indeed, over a range of bacterial hosts and a range of phage viruses, there is apparent pressure for non-random patterns of both GC3 content and CAI content, independent of one another and independent of the amino acid sequence.
Figure 8. Combined Fisher p-values for the green and orange randomization tests across 50 phage genomes.
Phage names are listed on the x-axis, and are sorted by their orange p-value. A total of 29 genomes exhibit non-random GC3 content controlling for CAI (green test), and a total of 22 genomes exhibit non-random CAI content controlling for GC3 (orange test). Seventeen genomes pass both of these tests. The dashed horizontal line indicates the threshold for significance after Bonferroni correction (i.e. 5%/50). Upward arrows indicate p-values that lie beyond the limits of the y-axis. See Table 2 for phage properties, including the p-values for these tests. The twenty-four phage genomes that failed the aqua GC3 or CAI control tests are not included in this figure.
doi:10.1371/journal.pcbi.1000001.g008
Translational Selection on Phage Structural Proteins
In this section, we investigate a natural hypothesis concerning the patterns of non-random CAI usage we have observed in phage genomes – namely, that these patterns may be driven by selection for translational accuracy and efficiency, which is stronger in more highly expressed proteins [9],[21].
Among all phage proteins, the structural proteins are the most highly expressed [38]. The structural proteins form the protective capsid that encloses the viral genome, as well as the tail, which is often used for transmission of the phage genome to the inside of the host [39]. These proteins must be produced in high copy number: many tens of copies of each type of structural protein are needed to form each of hundreds of viral progeny [38]. For each gene in a phage genome, we assigned a structural annotation of 1 if the gene was known to encode a structural protein and 0 otherwise (see Methods).
According to the standard hypothesis of translational selection, the structural genes of phages should exhibit elevated CAI levels compared to other phage genes, since they are translated (by the host) in high copy numbers. To test this hypothesis, we performed regressions between the structural annotation of phage genes and their aqua CAI and orange BCAI p-values. In other words, we compared the structural properties of genes against their CAI content, controlling for amino acid sequence, and against their BCAI content, controlling for both amino acid sequence and GC3 sequence.
In the case of lambda phage, Figure 9 shows the results of the aqua CAI and orange BCAI randomization tests, with the structural genes highlighted. The plot reveals a striking pattern: the vast majority of the structural proteins lie on the left half of the genome, exactly in the region where genes have elevated CAI values. In order to quantify this association we performed ANOVAs. Before regressing structural annotations against codon usage, we first removed the non-informative genes, i.e. genes whose codon usage is influenced by their amino acid content, as indicated by a failure to pass the aqua CAI test.
Figure 9. The relationship between codon usage and protein function in lambda phage.
The figure shows the aqua (CAI, as in Figure 3) and orange (BCAI, as in Figure 5) randomization tests overlaid with information about protein function: genes classified as structural are shown with a white background and all other genes with a grey background. The histograms indicate a clear relationship between the structural classification of a gene and its significance under the aqua and orange tests: structural genes typically have elevated quantiles in the aqua test, whereas other genes typically have depressed quantiles. In other words, structural genes exhibit elevated CAI values when controlling for their amino acid sequence, compared to codon usage in the genome as a whole. Moreover, as the orange histograms indicate, this pattern is not caused by variation in GC3 content: the structural genes exhibit elevated BCAI values after controlling for both their amino acid sequence and their GC3 sequence.
doi:10.1371/journal.pcbi.1000001.g009
Table 3 shows the results of the regression of aqua CAI and orange BCAI p-values against structural annotations in lambda phage. The results are highly significant: structural annotations explain roughly half of the variation in CAI, even when controlling for genes' amino acid sequences (aqua test, r² = 56%) as well as GC3 sequences (orange test, r² = 46%). The median p-value among structural genes is close to zero, whereas the median p-value among non-structural genes is close to one, indicating that structural genes exhibit significantly elevated CAI values. These highly significant results are consistent with the hypothesis of translational selection on structural proteins.
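To make the reported r² values concrete, the following sketch computes the fraction of variance in per-gene p-values explained by a binary structural annotation, which is what a one-way ANOVA with a two-level factor (equivalently, a regression on a 0/1 indicator) reports. The function name and the example inputs are hypothetical; the sketch assumes non-informative genes have already been removed and is not the authors' implementation.

```python
from statistics import fmean
from typing import Sequence


def anova_r_squared(pvalues: Sequence[float], is_structural: Sequence[int]) -> float:
    """Fraction of variance in `pvalues` explained by a 0/1 annotation.

    r^2 = SS_between / SS_total for a one-way ANOVA with two groups.
    """
    if len(pvalues) != len(is_structural):
        raise ValueError("pvalues and annotations must have the same length")
    groups = {0: [], 1: []}
    for value, label in zip(pvalues, is_structural):
        groups[int(bool(label))].append(value)
    if not groups[0] or not groups[1]:
        raise ValueError("both structural and non-structural genes are required")
    grand_mean = fmean(pvalues)
    ss_total = sum((v - grand_mean) ** 2 for v in pvalues)
    if ss_total == 0.0:
        raise ValueError("p-values have zero variance")
    # Between-group sum of squares: group size times squared mean offset.
    ss_between = sum(
        len(members) * (fmean(members) - grand_mean) ** 2
        for members in groups.values()
    )
    return ss_between / ss_total
```

For example, with hypothetical p-values `[0.1, 0.3, 0.7, 0.9]` and annotations `[1, 1, 0, 0]`, the group means are 0.2 and 0.8 and the annotation accounts for 90% of the variance.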
Table 4. Comparison between codon usage and refined structural annotations.
doi:10.1371/journal.pcbi.1000001.t004
In order to examine the relationship between structural annotation and CAI across all 74 phages in our study, we performed the same ANOVA on the 1,309 informative genes (i.e. genes that pass the aqua CAI randomization test). Once again, Table 3 shows a highly significant relationship between structural annotation and CAI values, controlling for amino acid content and GC3. Thus, the tendency toward elevated CAI values in structural genes holds across all the phages in this study, despite the fact that they infect a diverse range of hosts with a wide variety of GC contents.