JSP, GB, and DH conceived and designed the experiments. JSP and GB performed the experiments. JSP analyzed the data. JSP, AS, KR, KLT, ESL, JK, and WM contributed reagents/materials/analysis tools. JSP and DH wrote the paper.
¤ Current address: Department of Biological Statistics and Computational Biology, Cornell University, Ithaca, New York, United States of America
The authors have declared that no competing interests exist.
The discoveries of microRNAs and riboswitches, among others, have shown functional RNAs to be biologically more important and genomically more prevalent than previously anticipated. We have developed a general comparative genomics method based on phylogenetic stochastic context-free grammars for identifying functional RNAs encoded in the human genome and used it to survey an eight-way genome-wide alignment of the human, chimpanzee, mouse, rat, dog, chicken, zebra-fish, and puffer-fish genomes for deeply conserved functional RNAs. At a loose threshold for acceptance, this search resulted in a set of 48,479 candidate RNA structures. This screen finds a large number of known functional RNAs, including 195 miRNAs, 62 histone 3′UTR stem loops, and various types of known genetic recoding elements. Among the highest-scoring new predictions are 169 new miRNA candidates, as well as new candidate selenocysteine insertion sites, RNA editing hairpins, RNAs involved in transcript auto regulation, and many folds that form singletons or small functional RNA families of completely unknown function. While the rate of false positives in the overall set is difficult to estimate and is likely to be substantial, the results nevertheless provide evidence for many new human functional RNAs and present specific predictions to facilitate their further characterization.
Structurally functional RNA is a versatile component of the cell that comprises both independent molecules and regulatory elements of mRNA transcripts. The many recent discoveries of functional RNAs, most notably miRNAs, suggest that many more are yet to be found. Computational identification of functional RNAs has traditionally been hampered by the lack of strong sequence signals. However, structural conservation over long evolutionary times creates a characteristic substitution pattern, which can be exploited with the advent of comparative genomics. The authors have devised a method for identification of functional RNA structures based on phylogenetic analysis of multiple alignments. This method has been used to screen the regions of the human genome that are under strong selective constraints. The result is a set of 48,479 candidate RNA structures. For some classes of known functional RNAs, such as miRNAs and histone 3′UTR stem loops, this set includes nearly all deeply conserved members. The initial large candidate set has been partitioned by size, shape, and genomic location and ranked by score to produce specific lists of top candidates for miRNAs, selenocysteine insertion sites, RNA editing hairpins, and RNAs involved in transcript auto regulation.
Many new classes of functional RNA structures (fRNAs), such as snoRNAs, miRNAs, splicing factors, and riboswitches [
The development of computational methods that can efficiently identify fRNAs by comparative genomics has been hampered by the fact that fRNAs often exhibit only weakly conserved primary-sequence signals [
The many non-human vertebrate genomes now sequenced can be aligned against the human genome, leading to a multiple alignment with considerable information about the evolutionary process at every position [
We constructed a whole-genome alignment of the human [
(A) Schematic representation of human genome and conserved elements. The conserved elements define the input alignments.
(B) Segment of eight-way genomic alignment.
(C) The SCFG of the fRNA model defines a distribution over all possible secondary-structure annotations. One of the many possible secondary structures is shown in parenthesis format. Substitutions in pairing regions of the alignment are color-coded relative to human: compensatory double substitutions are green, and compatible single substitutions are blue.
(D) Color-coded fold corresponding to the secondary-structure annotation of the alignment.
(E) Two phylogenetic models are used to evaluate the possible secondary-structure annotations: unpaired columns are evaluated using a single-nucleotide phylogenetic model. Paired columns are combined and evaluated using a di-nucleotide phylogenetic model. Horizontal branch lengths reflect the expected number of substitutions.
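The color coding described in (C) can be illustrated with a minimal sketch. It assumes that a pair is "compatible" when it is one of the six canonical or GU wobble pairs, and it labels a substitution by how many of the two bases differ from human. The function name, the labels, and the pairing set are illustrative assumptions, not part of EvoFold itself.

```python
# Illustrative sketch only: the pairing set and labels are assumptions,
# not taken from the EvoFold implementation.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def classify_pair(human, other):
    """Classify a paired position in another species relative to human.

    human, other: tuples (left base, right base) from the two paired columns.
    """
    if other == human:
        return "unchanged"
    if other not in PAIRS:
        # Includes gaps and any non-pairing combination.
        return "incompatible"
    # Count how many of the two bases differ from the human pair.
    n_diff = sum(h != o for h, o in zip(human, other))
    # Both bases changed and pairing kept: compensatory double substitution.
    # One base changed and pairing kept: compatible single substitution.
    return "compensatory" if n_diff == 2 else "compatible"
```

For example, a human G-C pair aligned to A-U in another species would be labeled compensatory, whereas G-C aligned to G-U would be labeled compatible.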
We classified these candidate folds according to three different criteria: their size, their genomic location, and their overall shape. We distinguished two size ranges: short (between five and 15 pairing bases, 39,075 folds) and long (more than 15 pairing bases, 9,404 folds); five types of genomic location: coding (12,736 folds), 3′UTR (3,331 folds), 5′UTR (334 folds), intronic (11,777 folds), and intergenic (20,301 folds); and four shape-types: hairpins (42,964 folds), Y-shaped (3,479 folds), clover-shaped (250 folds), and more complex shapes (1,786 folds). This scheme results in 40 different RNA fold prediction categories. Candidate folds were also clustered by proximity in the genome or overlap with cDNAs into sets of folds that are likely to be part of a single underlying RNA transcript. This grouped the 48,479 candidate RNA folds into 23,287 candidate structure–containing transcripts. Finally, the folds within each category were ranked by a length-normalized likelihood-ratio score that we call the folding potential score (fps), and a shuffling scheme was used to tentatively estimate the rate of false-positive predictions in each category as a function of score (
We mapped all available human and non-human mRNAs and ESTs to the human genome and determined the enrichment of hits to our set of candidate RNA folds relative to the background hit rate in genomic DNA. These were found to vary from 3.6× (cDNA from humans) to 11.4× (non-human EST). This is significantly higher than the enrichments observed for the full set of conserved elements from which these candidates were chosen (
We also found that predictions at known fRNAs generally score higher on the strand of the fRNA compared to its reverse complement (this is, e.g., the case for 89% of the known miRNAs we predict). The asymmetry is primarily caused by the ability of GU (or UG) to pair, but not its reverse complement AC (CA). Since the most common types of substitutions in RNA stems involve GU (or UG) pairs, this can have a pronounced effect on the EvoFold score, thus allowing the strand association of a fold to be inferred by comparing the score of an alignment with the score of its reverse complement. In cases where the candidate RNA is contained in a known transcript, the EvoFold score for the sense strand (i.e., the strand complementary to the template strand for transcription) is often significantly higher than for the anti-sense strand (
Using a shuffling approach, we estimate that the set of 48,479 candidates contains 18,500 partially correct fRNAs (see
See
Folds are classified according to (A) size (number of pairing bases), (B) location in the genome, and (C) shape. The relative abundance of each class of folds is indicated. For (B), also shown is the genomic span of the conserved segments relative to their genomic location, for comparison.
Three quarters of the predicted folds are short. These are likely to represent a mix of small complete folding units and partial predictions of larger folds, where only a small core element had sufficient evolutionary covariation to be detected by our method. Among the long folds, about 82% are intergenic or intronic, 5.5% are in 3′UTRs, 0.5% in 5′UTRs, and a surprising 12% (550 folds) overlap known coding regions. These are discussed further below. As expected, the small folds are predominantly single hairpins; there are usually not enough paired bases in these to support more complex stable structures. The long folds show a more varied shape distribution, but are also dominated by simple hairpins. Again, since these are often partial structural predictions, this breakdown is likely to be somewhat biased toward the simpler fold types.
Because EvoFold is designed to look for RNAs that are conserved in structure and remain in the same genomic context in all vertebrates, there are likely to be additional fRNAs not detected in this survey. There are some classes of known functional RNAs that are too mobile or rapidly evolving for EvoFold to detect, such as tRNAs and snoRNAs. The vertebrate tRNAs spawn many lineage-specific copies that land in different places in the genome, most of which are pseudogenes, so that the remaining functional copies often end up in a different genomic context in different vertebrate lineages [
For other known classes of RNAs, such as miRNAs, EvoFold achieves a high rate of sensitivity, finding nearly all known members. To evaluate EvoFold's sensitivity, we performed a 5-fold cross-validation test using various curated sets of known RNAs. These tests showed that EvoFold is quite good at detecting some known classes of RNAs, such as miRNAs and histone 3′UTR stem loops (
EvoFold Sensitivity
Since the fps used by EvoFold ranks deeply conserved compact folds highly, we also defined an alternative score directly based on the substitution evidence and used it to define a ranked set of 517 ncRNA candidates (see
We evaluated the relative benefit of using an eight-way alignment instead of a pair-wise alignment by redoing the sensitivity experiments and part of the shuffling experiments using only the mouse–human subalignment. The sensitivity on the mixed set of Rfam Seed decreased by 59% and the false-positive rate increased slightly (
The higher-ranked candidate RNAs in several of the fold classifications are greatly enriched for certain classes of known RNAs. In particular, we see a strong enrichment for known miRNAs among the higher-ranked candidates in the class of long intronic and intergenic hairpins (
Top-Scoring Long-Intergenic Hairpins
Top-Scoring Long-Intronic Hairpins
The known miRNAs tend to reside in short conserved segments (70% in segments of at most 200 bp), and their stems have relatively few bulges (86% have at most 20% of their bases in bulges). Using these additional criteria we defined a more specific set of 277 miRNA candidates from among the 3,500 predicted long intergenic and intronic hairpins. This set contained 108 known miRNAs and 169 novel candidates, with an estimated false-positive rate of 15% (see
While miRNAs probably comprise a significant fraction of the high-scoring intergenic and intronic hairpins, it is quite possible that the majority of the folds in these categories have other functions. In particular, the three highest-scoring long intronic hairpins all are found in introns of ion channel genes, which are frequently targets of RNA editing by A-to-I conversion involving hairpins such as these [
The candidate RNAs contain a surprising number of long folds that overlap coding regions. Coding folds are fascinating for at least two reasons. First, they often function in genetic recoding, which, as in the RNA editing in
The 15 top-ranking long-coding hairpins contain eight well-studied RNAs, five of which are involved in genetic recoding in the form of RNA editing (R-G site of
Top-Scoring Long-Coding Hairpins
Among the seven novel candidate RNAs in the top 15, we predict at least three to be involved in genetic recoding. Two of them are associated with the known selenoproteins
(A) Gene structure, EvoFold predictions, and conservation around the selenocysteine insertion site of selenoprotein T (SELT). The pairing regions of the hairpin are shown in dark green and can be seen to start only eight bases downstream of the UGA insertion site (indicated by *). Arrows indicate direction of transcription.
(B) Annotated segment of eight-way alignment spanning the predicted hairpin.
(C) Depiction of hairpin, which is shown with T instead of U to facilitate comparison with the genomic sequences. Pairs are color-coded by presence of substitutions in the eight-way alignment (see B).
The third is the highest-ranking long-coding hairpin, found in the
(A) Gene structure, EvoFold predictions, cDNAs, conservation, and eight-way alignment are shown at the start of the second exon of the
(B) Depiction of hairpin (see
(C) Which would lead to a lysine to arginine amino acid change.
Of the four remaining candidate long-coding hairpins, two are in genes of unknown function
(A) Gene structure and EvoFold predictions are shown around the first exon of
(B) Annotated segment of the eight-way alignment spanning the long, miRNA-like 5′UTR-hairpin (see
(C) Depiction of folds.
In addition to new examples of previously known RNA families, the high-ranking candidate RNAs also include several completely novel families. One of these is represented by the highest and fourth-highest ranking candidates in the category of long clover-shaped folds. These are located less than 3,500 bases apart, and both are overlapped by transcripts of the little-characterized gene
(A) Gene structure, EvoFold predictions, and cDNAs around the end of the gene
(B) Annotated segment of eight-way alignment spanning the 3′UTR fold (see
(C) Depictions of 3′UTR fold (left) and intronic fold (right).
(D) Annotated alignment of human primary sequences of 3′UTR and intronic folds. The alignment is annotated with the secondary structures of the folds and substitution differences in corresponding pairs are color-coded (see
In the spirit of the last example above, we grouped the RNA-fold predictions into paralogous families based on their primary-sequence homology. We disregarded sequences that could cause homology to be inferred for trivial reasons, i.e., repeats, pseudogenes, coding regions, etc. (see
Known families of fRNAs were recovered, such as the histone 3′UTR stem loops (46 known folds, one family), families of known miRNAs (72 known folds, 29 families), and families of RNA editing hairpins in GRIA genes (three known folds, one family). But most of the families were completely new. Some contain long intergenic and intronic hairpins and are likely to be new families of miRNAs (e.g., 17 of our miRNA candidates are found in 11 families). Others contain hairpins in ion-channel genes not previously characterized as undergoing RNA editing (e.g., a cluster of three coding hairpins overlapping sodium channel exons in
We have conducted a survey of the human genome to identify functional RNA structures through comparative genomics using an eight-way whole-genome sequence alignment. While this alignment contains considerably more evolutionary information than has been previously available, these currently available genomes are still quite limited in terms of their statistical power to detect negative selection [
This initial survey suggests that there are many more functional RNAs in the human genome than are represented in the current RNA sequence databases. We estimate that these databases annotate 1,207 RNA genes in the human genome (see
The RNA folds we predict with the highest confidence include many known fRNAs, such as miRNAs and genetic recoding signals, as well as thousands of new fRNA candidates, a large fraction of which are supported by the presence of compensatory substitutions. Some of these new fRNAs enlarge existing families while others group into small new families. Detailed analysis of individual candidates has revealed additional supporting evidence and has allowed specific functional hypotheses to be formulated in some cases, including the new SECIS elements, RNA editing hairpins, regulatory hairpins, and miRNA candidates discussed above. We estimate that about 500 coding regions contain overlapping functional RNA structures, and that a non-negligible fraction of these may contain undocumented examples of genetic recoding.
The EvoFold method we have developed was trained to only predict RNA stems that are well-supported by a consistent evolutionary signal in clearly orthologous copies from many species. To guarantee orthology, the alignments used require that aligned sequences from different species appear in the same genomic context, i.e., have orthologous flanking DNA, in each species. This greatly reduces the number of false-positive predictions due to mobile elements such as transposons and retroposed pseudogenes. However, it causes us to miss some highly mobile known fRNAs, such as tRNAs and snoRNAs, even with a relatively liberal threshold that allows an estimated 62% false positives in our overall set of predictions. Identifying mobile fRNAs with a general model of molecular evolution will require logic for lineage-specific duplication and loss of function in addition to the simple evolution of orthologous copies that the EvoFold model embodies.
Alignment errors can also disrupt the evolutionary signal of true fRNAs, and thus improvements to the current sequence-alignment scores might improve the results. Local alignment errors involving only a few bases are unlikely to affect the entire structure and thus should normally allow at least a partial structure with a reduced signal to be identified. However, more extensive errors, where non-orthologous regions are aligned, will most likely cause the fRNA to be missed completely as discussed above.
EvoFold's rate of false positives is much lower among the highest-scoring predictions, but it never goes completely to zero, even for the largest predicted structures. One problem is that the elements where negative selection is strongest, the ultraconserved regions [
Sequence comparisons between novel predicted fRNAs verify that some of these can be grouped into small paralogous families, but most appear as singletons. Since many fRNAs undergo lineage-specific expansions [
The EvoFold scoring scheme very highly ranks compact folds with a high ratio of paired to unpaired bases, such as miRNAs and histone 3′UTR stem loops. Indeed, these two families stand out prominently in this survey, and their existence would have been a clear-cut new outcome of this study had it not already been known. One of the reasons they rank so highly is that the fps is a length-normalized likelihood ratio, which tends to emphasize the ratio of paired to unpaired bases rather than the total number of paired bases. Other normalization schemes may emphasize other families of fRNAs as shown by the substitution-ranked ncRNA candidates (see
This set of fold predictions represents what we believe is the first general survey of evolutionarily conserved human fRNAs. (Another survey, based on our multiple alignments and PhastCons detection of conserved segments as well, has come to our attention during the final stages of preparing this paper [
The EvoFold program takes a multiple alignment and a phylogenetic tree as input, and outputs a specific RNA secondary-structure prediction and an fps (
Phylo-SCFGs were developed by Knudsen and Hein in 1999 and can be seen as an extension of phylo-HMMs [
Two types of phylogenetic models are used by the phylo-SCFGs: a single-nucleotide model and a di-nucleotide model (
The phylo-SCFGs are composed of two components: a structural and a nonstructural one (
The fRNA model contains both the structural and the nonstructural component. In contrast, the background model contains only the nonstructural component. See
EvoFold uses the fRNA model to assign a specific RNA secondary-structure prediction to an input alignment (
The fps measures the overall tendency for the alignment to contain any fRNA. It is calculated as a log-odds score between the likelihood of observing the alignment (
The false-positive rate of EvoFold was estimated by applying it to a set of alignments that have been randomized to remove the signal of any true fRNAs, but which retain the same base composition, substitution pattern, and conservation pattern as the original alignments. The false-positive rate can be seen to depend on the size of the predicted folds (
The alignments used to train EvoFold were prepared from a conserved subset of the Rfam Full database (version 6.0) [
EvoFold was applied to the conserved elements of an eight-way multiz [
A single phylogenetic tree, including branch lengths, was estimated from the genomic alignment using the PhastCons program [
The fold predictions were compared against different classes of fRNAs: the 207 human microRNAs found in the miRNA Registry version 5.1 [
The known gene annotation from the UCSC Human Genome Browser (May 2004 assembly) [
Folds likely to be nonfunctional based on other annotations, alignments, or genomic location were discarded from the initial set. The filtering comprised certain types of repeats (many trivial folds), regions with synteny breaks (many pseudogenes), and regions homologous to the mitochondrial genome (many pseudogenes). The filters were based on the following UCSC Human Genome Browser data: simple and low-complexity repeats from the RepeatMasker track, synteny information from the mouse net track [
5′UTR, coding, and 3′UTR folds were considered part of the same transcript if overlapped by a known gene annotation (see above). Intronic and intergenic folds were considered part of the same transcript if separated by fewer than 250 bases. The false-positive rate was estimated from the folds of the relevant genomic types using the randomization procedure described below (see also Validation).
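The proximity rule for intronic and intergenic folds can be sketched as a single pass over sorted coordinates. This is a minimal illustration, assuming half-open `(chrom, start, end)` intervals and ignoring strand; the function name and the `max_gap` parameter are hypothetical and do not come from the original pipeline.

```python
def group_by_proximity(folds, max_gap=250):
    """Group folds on the same chromosome separated by fewer than max_gap bases.

    folds: iterable of (chrom, start, end) tuples with half-open coordinates,
    so that start - previous_end is the number of bases between two folds.
    Returns a list of groups, each a list of folds.
    """
    groups = []
    for fold in sorted(folds):
        chrom, start, _end = fold
        if groups:
            last = groups[-1]
            last_chrom = last[-1][0]
            # Right-most end seen so far in the current group.
            last_end = max(f[2] for f in last)
            if chrom == last_chrom and start - last_end < max_gap:
                last.append(fold)
                continue
        # Otherwise this fold starts a new candidate transcript.
        groups.append([fold])
    return groups
```

Whether strand should also be required to match is left open here; the text above does not specify it.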
All input alignments shorter than 450 bases (98% of total) were randomized by first permuting columns with no substitutions and then permuting columns with some substitutions. The resulting alignments thus maintain the conservation pattern, the substitution pattern, and the nucleotide bias of the original alignments, but have lost the signal of any true fRNA stems.
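The two-step column permutation can be sketched as follows. This is an illustration under stated assumptions rather than the code used in the study: an alignment is represented as a list of equal-length strings, a column counts as having "no substitutions" when every row shows the same character, and gap characters are treated like any other character. The function name and the `seed` parameter are hypothetical.

```python
import random


def shuffle_alignment(rows, seed=None):
    """Permute alignment columns within two classes: conserved and variable.

    rows: list of equal-length aligned strings, one per species.
    Returns a new list of strings with the same per-row base composition and
    the same positions of conserved versus variable columns.
    """
    rng = random.Random(seed)
    columns = list(zip(*rows))
    # Assumption: a column is conserved when all rows show the same character.
    conserved = [i for i, col in enumerate(columns) if len(set(col)) == 1]
    variable = [i for i, col in enumerate(columns) if len(set(col)) > 1]
    shuffled = list(columns)
    for idx in (conserved, variable):
        perm = idx[:]
        rng.shuffle(perm)
        # Move whole columns, but only between positions of the same class.
        for dst, src in zip(idx, perm):
            shuffled[dst] = columns[src]
    return ["".join(chars) for chars in zip(*shuffled)]
```

Because whole columns are moved and each column stays within its class, the per-species base composition, the set of substitution columns, and the conserved/variable pattern along the alignment are all retained, while the pairing of specific columns into stems is broken up.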
The folds were clustered according to primary-sequence homology, as given by the human Blastz self track of the UCSC browser, thereby defining a set of paralogous families [
Top, length of folds; bottom, length of conserved segments.
There are 252 folds longer than 250 nucleotides and 1,727 conserved segments longer than 1,000 nucleotides, which are not included in the above plots.
(18 KB PDF)
(A) Count of false positives for different size-ranges of folds. Black bars indicate number of predictions made in randomized alignments (false positives), gray bars indicate the additional number of predictions made in original alignments (true positives). The estimated fraction of false positives is indicated above each column.
(B and C) Fraction of false positives in different top-score–ranked subsets of short folds (B) and long folds (C). Same color coding as for (A).
(30 KB PDF)
Left column, short folds; right column, long folds.
For all parts the
Definition of properties:
(A) The sequence conservation scores are measured at the input element level and the percentiles are relative to their distribution among all the folds.
(B) The bulge fraction is the percentage of bases in stems found in bulges.
(C and D) The genic location and the fold shape are taken from the fold classification scheme (see
(34 KB PDF)
The
(28 KB PDF)
(A) Nonstructural component, (B) structural component.
Nomenclature: | denotes a choice between different productions; x, single column emissions; xl and xr, left and right part of pair emissions, respectively.
A corresponding graphical overview of these grammar components is given in
(45 KB PDF)
(A) Nonstructural component, (B) structural component.
The state types are given in parentheses. Arrows indicate possible state transitions. The transition from the bifurcation state leads to two states, a left (l) and a right (r), as indicated on the graph. The unpaired and the loop & bulge states have associated single-column emission distributions (specified by a single-nucleotide phylogenetic model). The stem pair state has an associated di-column emission distribution (specified by a di-nucleotide phylogenetic model).
(31 KB PDF)
(176 KB PDF)
The fold counts, estimated true positive rate (in parentheses), and estimated true positive counts are given for each location/shape class of short folds. The “any shape” row and the “any location” column give the marginalized counts for each set of fold classes. The entry at the lower right corner thus holds the overall counts for the set of short folds.
(26 KB PDF)
See legend for
(29 KB PDF)
(33 KB PDF)
The sensitivity column gives the number of known fRNAs recognized by EvoFold using the human–mouse subalignments divided by the total number of fRNAs in the input segments. The relative sensitivity column gives the ratio between the sensitivity using only the human and mouse subalignment and the complete eight-way alignment.
(147 KB PDF)
Accession numbers from Swiss-Prot (
The GenBank (
We thank Todd Lowe, Terry Furey, and Charles Sugnet for rewarding discussions; Katherine Pollard for statistical advice; the UCSC Genome Browser staff for the UCSC browser and their help with alignments and data management; and Jane Rogers for providing the zebra-fish genome.
ADAR, adenosine deaminase acting on RNA
bp, base pair
DGCR, DiGeorge syndrome critical region
fps, folding potential score
fRNAs, functional RNA structures
phylo-SCFG, phylogenetic stochastic context-free grammar
SECIS, selenocysteine insertion sequence
UCSC, University of California Santa Cruz