Conceived and designed the experiments: JB. Performed the experiments: JB. Analyzed the data: JB ES DU. Contributed reagents/materials/analysis tools: JB. Wrote the paper: JB ES DU. Revised and drafted the manuscript: DU.
The authors have declared that no competing interests exist.
Oligonucleotide usage in archaeal and bacterial genomes can be linked to a number of properties, including codon usage (trinucleotides), DNA base-stacking energy (dinucleotides), and DNA structural conformation (di- to tetranucleotides). We wanted to assess the statistical information potential of different DNA ‘word-sizes’ and explore how oligonucleotide frequencies differ in coding and non-coding regions. In addition, we used oligonucleotide frequencies to investigate DNA composition and how DNA sequence patterns change within and between prokaryotic organisms. Among the results, we found that prokaryotic chromosomes can be described by hexanucleotide frequencies, suggesting that prokaryotic DNA is predominantly short-range correlated, i.e., information in prokaryotic genomes is encoded in short oligonucleotides. Oligonucleotide usage varied more within AT-rich and host-associated genomes than in GC-rich and free-living genomes, and this variation was mainly located in non-coding regions. Bias (selectional pressure) in tetranucleotide usage correlated with GC content, and coding regions were more biased than non-coding regions. Non-coding regions were also found to be approximately 5.5% more AT-rich than coding regions, on average, in the 402 chromosomes examined. Pronounced DNA compositional differences were found both within and between AT-rich and GC-rich genomes. GC-rich genomes were more similar and biased in terms of tetranucleotide usage in non-coding regions than AT-rich genomes. The differences found between AT-rich and GC-rich genomes may possibly be attributed to lifestyle, since tetranucleotide usage within host-associated bacteria was, on average, more dissimilar and less biased than that within free-living archaea and bacteria.
There are potentially many factors responsible for how archaeal and bacterial genomes are composed. Recent advances in DNA sequencing have made it possible to use computational and statistical methods to examine the interplay between evolution and genomic composition. We wished to see whether particular properties could be extracted that would provide clues to how prokaryotic DNA is composed. For instance, we wondered whether protein coding regions carried a greater information potential than non-coding regions, whether there is a link between genome size and GC content, whether GC content differs between coding and non-coding regions, and whether DNA composition is associated with environment. Our results indicated that genomic nucleotide frequencies are a determinant of many DNA compositional properties, but also that other influences are at work. For instance, bacteria are known to frequently exchange DNA with the environment and other organisms. Acquired DNA can therefore have different compositional properties than host DNA, and since pathogenicity and antibiotic resistance in bacteria are often associated with foreign DNA, advancing the knowledge of DNA composition is of great importance.
Prokaryotic DNA can be considered as a long chain of overlapping oligonucleotides, and frequencies of differently sized oligonucleotides can reveal different properties and patterns of bacterial and archaeal genomes. On average, roughly 86% of prokaryotic DNA codes for proteins
Oligonucleotide frequencies are very much influenced by codon distributions which, in turn, are correlated with GC-content
In addition to the properties described above, transcription and regulation sites are also coded by certain oligonucleotide patterns
The examples above illustrate some of the properties that can be extracted from DNA sequences by examining oligonucleotide usage variance. This motivated us to explore how oligonucleotide distributions change within and between prokaryotes in coding and non-coding regions, how biased oligonucleotide frequencies are, and whether any particular trends could be detected. To do this, a series of statistical tests was performed on all sequenced bacterial and archaeal chromosomes (up to September 2006). We found that tetranucleotide frequencies carried considerable genomic information potential, and they were therefore used in all statistical tests based on oligonucleotide usage. The statistical tests included oligonucleotide usage deviation from mean (OUD, a measure of oligonucleotide frequency variations in genomes), and oligonucleotide usage variance from expected (OUV, a measure of randomness or bias
We measured the statistical information carried by the differently sized oligonucleotides from di- to octanucleotides in prokaryotes with GC contents between 47% and 53%. From
Cumulative information potential is measured in di- to octanucleotide frequencies in prokaryotic genomes with GC content between 47% and 53%. These genomes were selected because of the increased sensitivity of the Pearson correlation measure for chromosomes with similar AT/GC content. The archaeal and bacterial chromosomes are represented along the horizontal axis, sorted by increasing GC content from left to right, with corresponding correlation scores between observed
GC content was measured in coding and non-coding regions (see
Why non-coding regions are ∼5.5% more AT rich than coding regions may be related to the DNA curvature in promoter regions, and possibly termination sites
Although the link between GC content and genome size has been debated
Prokaryotic chromosomes are sorted by increasing GC content from left to right on the horizontal axis. The vertical axis represents chromosome size in Mbp.
We measured how tetranucleotide usage varied in genomes compared with expected tetranucleotide usage. This expected tetranucleotide usage was calculated from mean genomic GC content, and implicitly assumes that each nucleotide in every tetranucleotide, and therefore also the whole chromosome, is independent of its neighbors. In other words, the more similar observed and expected tetranucleotide frequencies are, the more random (i.e. less biased) are the observed tetranucleotide frequencies, and thus the genomic DNA composition.
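The independence assumption described above can be illustrated with a minimal sketch. It assumes, beyond what the text states, that the GC fraction is split evenly between G and C and the AT fraction evenly between A and T. The function names and the `squared_deviation` helper are hypothetical and serve only to compare the two frequency tables; the helper is not the OUV formula used in this study.

```python
from collections import Counter
from itertools import product


def observed_kmer_freqs(seq, k=4):
    """Relative frequencies of overlapping k-mers; words with non-ACGT symbols are ignored."""
    seq = seq.upper()
    words = ["".join(w) for w in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[w] for w in words)
    if total == 0:
        raise ValueError("sequence contains no unambiguous k-mers")
    return {w: counts[w] / total for w in words}


def expected_kmer_freqs(gc, k=4):
    """Expected k-mer frequencies when every base is independent of its neighbors.

    Assumption: P(G) = P(C) = gc / 2 and P(A) = P(T) = (1 - gc) / 2.
    """
    p = {"A": (1 - gc) / 2, "T": (1 - gc) / 2, "G": gc / 2, "C": gc / 2}
    freqs = {}
    for w in product("ACGT", repeat=k):
        prob = 1.0
        for base in w:
            prob *= p[base]
        freqs["".join(w)] = prob
    return freqs


def squared_deviation(observed, expected):
    """Illustrative distance only: sum of squared differences over all words."""
    return sum((observed[w] - expected[w]) ** 2 for w in expected)
```

Under this sketch, a small `squared_deviation` between the observed and expected tables corresponds to a more random (less biased) tetranucleotide composition, and a large value to a more biased one.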
Prokaryotic chromosomes are sorted by increasing GC content from left to right (horizontal axis), with red and blue regression lines representing OUV values (vertical axis) for chromosomes and coding regions, respectively. Larger values imply more bias, or stronger selectional pressure, in genomic tetranucleotide usage. The surrounding dotted lines indicate 99% prediction intervals.
Thus, bias in tetranucleotide usage in non-coding regions was less affected by global GC content than coding regions.
Preliminary tests on a set of sequenced genomes involving randomization by increasing AT/GC content similarly to non-coding % size produced significantly larger differences in tetranucleotide usage bias than what was observed for non-coding regions. This indicates that non-coding regions do carry information and are exposed to selectional pressures and bias, although considerably less so than coding regions. This can be seen from
OUD gives a measure of how homogeneous or heterogeneous genomes are in terms of DNA composition. The OUD (but not OUV) measure can also detect to what extent tetranucleotide patterns are distributed throughout the genome. Low OUD values thus indicate increased similarity and not increased randomness. In contrast to OUV, the OUD measure is calculated as the average variance of oligonucleotide occurrences within the chromosome based on oligonucleotide frequencies from a non-overlapping 40 kbp sliding window compared with mean genomic oligonucleotide frequencies.
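The windowing scheme described above can be sketched as follows. This is an illustration under stated assumptions, not the OUD formula used in this study: the function names are hypothetical, the per-window score is taken to be the mean squared difference between window and whole-sequence tetranucleotide frequencies, and a trailing partial window is simply dropped.

```python
from collections import Counter
from itertools import product


def kmer_freqs(seq, words, k):
    """Relative frequencies of overlapping k-mers, or None if no valid word is present."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[w] for w in words)
    if total == 0:
        return None
    return [counts[w] / total for w in words]


def windowed_deviation(seq, window=40_000, k=4):
    """Average deviation of window k-mer frequencies from whole-sequence frequencies.

    Assumption: non-overlapping windows, mean squared difference per window,
    then an unweighted average over all complete windows.
    """
    seq = seq.upper()
    words = ["".join(w) for w in product("ACGT", repeat=k)]
    genome = kmer_freqs(seq, words, k)
    if genome is None:
        raise ValueError("sequence contains no unambiguous k-mers")
    scores = []
    for start in range(0, len(seq) - window + 1, window):
        local = kmer_freqs(seq[start:start + window], words, k)
        if local is None:
            continue  # window made up entirely of ambiguous symbols
        diff = sum((l - g) ** 2 for l, g in zip(local, genome))
        scores.append(diff / len(words))
    if not scores:
        raise ValueError("sequence is shorter than one window")
    return sum(scores) / len(scores)
```

In this sketch a compositionally uniform sequence scores close to zero, whereas a sequence whose halves differ sharply in composition scores higher, which matches the reading that low values indicate similarity along the chromosome rather than randomness.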
From
Each archaeal and bacterial chromosome is sorted on the horizontal axis with increasing GC content from left to right, with corresponding OUD values on the vertical axis. The red and blue lines represent OUD scores for whole chromosomes and coding regions, respectively. Smaller OUD values mean more homogeneous chromosomes in terms of tetranucleotide usage. The surrounding dotted lines indicate 99% prediction intervals.
Difference in tetranucleotide usage within genomes was supported by a ratio test where observed tetranucleotide usage variance within genomes was divided by expected tetranucleotide variance approximated by nucleotide frequencies. From
The vertical axis shows the ratio of observed divided by expected OUD values for each chromosome sorted with respect to genomic GC content from left to right on the horizontal axis. The ratio test measures how observed oligonucleotide usage varies within chromosomes (red line) and coding regions (blue line) compared with what is expected based on GC content. Rising ratio values above 1 (vertical axis) mean increased observed variance compared with expected. The dotted lines represent 99% prediction intervals.
This variance is, however, connected to larger fluctuations in GC content within genomes in AT rich prokaryotes. See the section below on variance in GC content within genomes for more detail. No correlation was found between OUD values and genome size.
Predicted tetranucleotide usage based on genomic nucleotide frequencies was used to estimate variance in GC content within genomes (
The vertical axis shows the variance of nucleotide frequencies within chromosomes (red line) and coding regions (blue line) compared with corresponding mean genomic GC content on the horizontal axis. Lower average nucleotide variance scores (vertical axis) mean more similar distributions of GC content within chromosomes (and vice versa).
This result showed that there was significant correlation between global GC content and how GC content varied locally within genomes. The obtained correlation was also higher than what was observed for the OUD measure (
Restricting the test to coding sections, correlation was found between global GC content and expected tetranucleotide usage variance, using the following equation:
Thus, only weak correlation was found between fluctuations in intrinsic nucleotide frequencies and global GC content in coding regions. Since roughly 86% of prokaryotic DNA codes for proteins, on average,
Although a link was established between genomic GC content and OUV, (
Considering prokaryotic genomes from an oligonucleotide perspective, there seemed to be little increase in information potential in oligonucleotide sizes larger than hexanucleotides.
Comparing observed to expected tetranucleotide usage (OUV), we found that coding regions are, in general, more biased than non-coding regions, and more homogeneous according to the OUD test. GC rich genomes were also found to have more biased tetranucleotide frequencies than AT rich genomes. Although AT content increased in non-coding regions in both AT rich and GC rich genomes, it did not appear to be a consequence of tetranucleotide preference. OUD was found to decrease with increasing mean genomic GC content, indicating that GC rich genomes have a more homogeneous DNA composition, especially in non-coding regions. This result was additionally supported by a ratio test based on observed and expected OUD scores indicating that coding regions varied similarly for AT and GC rich genomes alike, while non-coding regions varied more within AT rich genomes. No correlation was found between OUV and OUD scores, implying that bias in tetranucleotide usage is not connected to intra-chromosomal homogeneity in prokaryotes.
All sequenced archaea and bacteria available up to September 2006 (402 chromosomes and corresponding open reading frames, from 366 genomes in total) were downloaded from NCBI GenBank
For notation we let
Average values are found using the equation:
The statistical information potential of the different oligonucleotide sizes was estimated by comparing frequency functions
GC-content in non-coding regions was calculated by finding GC-content in open reading frames and whole chromosomes with the following formula:
The superscripts
GLM based regression analysis was used to examine the relationship between global GC content (predictor) and genome size (response). Because of the highly non-linear nature of the data, Spearman's rank based correlation test was additionally used between genome size and global GC content.
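A rank-based correlation of the kind mentioned above can be sketched in a few lines. The sketch below is a generic textbook construction (Pearson correlation of average ranks, with ties sharing their mean rank) and is not necessarily the software or exact procedure used in this study; all function names are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            ranks[idx] = rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson product-moment correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

Because only ranks enter the calculation, any monotonic relationship between GC content and genome size yields a coefficient of 1 or -1 regardless of how non-linear it is, which is the reason for supplementing a regression fit with this test.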
OUV was calculated for each genome with the following formula:
OUV was calculated for both whole chromosomes and coding regions. Regression analysis was then carried out between OUV values (response), both coding and whole chromosomes, and global GC content (predictor). The resulting regression equations, with corresponding coefficient of determination, here denoted by
The OUD test gives an average estimate of how oligonucleotide frequencies vary within prokaryotic chromosomes. Variance is calculated between the oligonucleotide frequencies calculated from a 40 kbp non-overlapping sliding window and mean genomic oligonucleotide frequencies with the following formula:
Regression analysis was performed for both whole chromosomes and coding regions with OUD as response and global GC content as predictor. The resulting regression equations with corresponding coefficient of determination,
The ratio of observed divided by expected OUD values was additionally used to test whether any fluctuations in tetranucleotide frequencies could be detected in coding and non-coding regions. To calculate this, the OUD values obtained for each chromosome with Formula 6 were divided by the following equation:
Variation of nucleotide frequencies within genomes was calculated for each chromosome similarly to the OUD test, but expected oligonucleotide frequencies based on Formula 5 were used instead of observed. This is the same as calculating the variance between local and global GC content (i.e. GC content) and Formula 7 was used for this as well.
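The equivalence noted above, that the calculation amounts to a variance between local and global GC content, can be illustrated with a simplified sketch. It works directly on GC fractions rather than on expected tetranucleotide frequencies, so it is not Formula 7 itself; the function names, the unweighted average and the dropping of a trailing partial window are assumptions made for illustration.

```python
def gc_fraction(seq):
    """GC fraction among unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        raise ValueError("sequence contains no unambiguous bases")
    return (seq.count("G") + seq.count("C")) / acgt


def local_gc_variance(seq, window=40_000):
    """Mean squared difference between window GC content and global GC content.

    Assumption: non-overlapping windows; every complete window must contain
    at least one unambiguous base, otherwise gc_fraction raises ValueError.
    """
    global_gc = gc_fraction(seq)
    local = [
        gc_fraction(seq[start:start + window])
        for start in range(0, len(seq) - window + 1, window)
    ]
    if not local:
        raise ValueError("sequence is shorter than one window")
    return sum((g - global_gc) ** 2 for g in local) / len(local)
```

A chromosome whose windows all share the global GC content scores zero in this sketch, and the score grows as local GC content fluctuates more strongly around the genomic mean.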
Regression analysis was performed on values obtained with Formula 7 and global GC content for both coding and whole chromosomes. The resulting equations, with the corresponding coefficients of determination,
Microsoft Excel file consisting of the data used to generate the results in the manuscript. Each column is labeled according to the abbreviations used in the text and additionally explained on a separate sheet.
(0.17 MB XLS)
Special thanks to Simon Hardy and the reviewers for constructive comments and for critically reading through the manuscript. Also, thanks to Stig Larsen for help with the statistical analysis and to Stein Marvold for the graphical image.