The authors have declared that no competing interests exist.
The assumption that RNA can be readily classified into either protein-coding or non-protein–coding categories has pervaded biology for close to 50 years. Until recently, discrimination between these two categories was relatively straightforward: most transcripts were clearly identifiable as protein-coding messenger RNAs (mRNAs), and readily distinguished from the small number of well-characterized non-protein–coding RNAs (ncRNAs), such as transfer, ribosomal, and spliceosomal RNAs. Recent genome-wide studies have revealed the existence of thousands of noncoding transcripts, whose function and significance are unclear. The discovery of this hidden transcriptome and the implicit challenge it presents to our understanding of the expression and regulation of genetic information has made the need to distinguish between mRNAs and ncRNAs both more pressing and more complicated. In this Review, we consider the diverse strategies employed to discriminate between protein-coding and noncoding transcripts and the fundamental difficulties that are inherent in what may superficially appear to be a simple problem. Misannotations can also run in both directions: some ncRNAs may actually encode peptides, and some of those currently thought to do so may not. Moreover, recent studies have shown that some RNAs can function both as mRNAs and intrinsically as functional ncRNAs, which may be a relatively widespread phenomenon. We conclude that it is difficult to annotate an RNA unequivocally as protein-coding or noncoding, with overlapping protein-coding and noncoding transcripts further confounding this distinction. In addition, the finding that some transcripts can function both intrinsically at the RNA level and to encode proteins suggests a false dichotomy between mRNAs and ncRNAs. Therefore, the functionality of any transcript at the RNA level should not be discounted.
Numerous studies have demonstrated that the true catalog of RNAs encoded within the
genome (the “transcriptome”) is more extensive and complex than
previously thought (reviewed in
Unsurprisingly, a great deal of attention is now focused on the noncoding
transcriptome. Dominating this field of inquiry has been the discovery of thousands
of small RNAs (<200 nt in length). Many of these have since been classified
into novel categories (e.g., microRNAs, PIWI-associated RNAs, and endogenous small
interfering RNAs) on the basis of function, length, biogenesis, structural/sequence
features, and protein-binding partners (reviewed in
The biological significance of these long ncRNAs is controversial. Despite an
increasing number of long ncRNAs having been shown to fulfill a diverse range of
regulatory roles (reviewed in
One of the most fundamental criteria used to distinguish long ncRNAs from
mRNAs is ORF length. Since short putative ORFs can be expected to occur by
chance within long noncoding sequences, minimum ORF cutoffs are usually
applied to reduce the likelihood of falsely categorizing ncRNAs as mRNAs.
For instance, the FANTOM consortium originally used a cutoff of 300 nt (100
codons) to help identify putative mRNAs
Twenty thousand transcripts of varying length and random nucleotide composition were computationally generated and scanned for ORFs. The maximum ORF length was plotted against transcript length and fitted to a logarithmic curve. The shaded regions represent incidences of randomly occurring ORFs at 1, 2, or 3 standard deviations from the mean. The red line indicates the 300 nt ORF threshold that is commonly used to distinguish protein-coding genes in transcript classification pipelines. The plot illustrates that, for transcripts longer than ∼1000 bp, such a threshold may classify as protein-coding transcripts whose ORFs would be expected to occur by chance. The function y = 91 ln(x) − 330, which approximates random ORF incidence according to transcript length at two standard deviations above the mean (i.e., 95% confidence interval, indicated in green), could be used to discriminate noncoding from protein-coding transcripts in a transcript-length–dependent manner.
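The length-dependent cutoff described in this caption can be illustrated with a short Python sketch. It is not the pipeline used to produce the figure, and several details are assumptions made here: an ORF is taken to run from an ATG to the next in-frame stop codon on the forward strand only (stop codon counted), "Ln" is read as the natural logarithm, lengths are in nucleotides, and random transcripts draw the four bases with equal probability. The function names (`longest_orf_length`, `random_orf_threshold`, `exceeds_random_expectation`, `simulate_max_orfs`) are hypothetical.

```python
import math
import random

STOP_CODONS = {"TAA", "TAG", "TGA"}  # stop codons of the standard genetic code


def longest_orf_length(seq):
    """Length (nt) of the longest ATG-to-stop ORF in the three forward frames.

    Assumption: an ORF starts at the first ATG following the previous stop and
    ends at the next in-frame stop codon (included). ORFs without a stop codon
    are ignored.
    """
    seq = seq.upper()
    best = 0
    for frame in range(3):
        start = None  # position of the ATG that opened the current ORF
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - start)
                start = None
    return best


def random_orf_threshold(transcript_length):
    """y = 91 ln(x) - 330, the curve quoted in the caption (natural log assumed)."""
    return 91 * math.log(transcript_length) - 330


def exceeds_random_expectation(seq):
    """True if the longest ORF is longer than the length-dependent cutoff."""
    return longest_orf_length(seq) > random_orf_threshold(len(seq))


def simulate_max_orfs(n_transcripts, min_len, max_len, seed=0):
    """Return (transcript length, longest ORF length) pairs for random sequences."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_transcripts):
        length = rng.randint(min_len, max_len)
        seq = "".join(rng.choice("ACGT") for _ in range(length))
        pairs.append((length, longest_orf_length(seq)))
    return pairs


if __name__ == "__main__":
    pairs = simulate_max_orfs(200, 200, 5000)
    fixed = sum(orf >= 300 for _, orf in pairs)
    scaled = sum(orf > random_orf_threshold(length) for length, orf in pairs)
    print(f"{fixed} of {len(pairs)} random transcripts pass a fixed 300 nt cutoff")
    print(f"{scaled} of {len(pairs)} pass the length-dependent cutoff")
```

Under these assumptions, the same 303 nt ORF exceeds the cutoff when it sits in a 400 nt transcript but not in a 3000 nt transcript, which is the behavior a transcript-length–dependent threshold is meant to capture.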
Using putative ORF length alone, although straightforward to apply across
large datasets, is problematic for various reasons. First, bona fide long
ncRNAs will by chance contain putative ORFs that are quite long. For
instance,
Given the problems of relying solely upon ORF size, an alternative approach
to discriminating long ncRNAs from mRNAs is to assess putative ORFs for
similarity to known proteins or protein domains, since such homology
provides indirect evidence of function as an mRNA. Indeed, the vast majority
of putative human ORFs without cross-species counterparts are likely to be
random occurrences
A few methods designed to detect ORF conservation can be used to distinguish
ncRNAs from mRNAs on a transcriptome-wide scale. These comparative
approaches include the programs CSTminer
There are also other problems in using ORF conservation to identify
protein-coding RNAs. First, these approaches are limited by the
comprehensiveness and accuracy of current protein annotations. For instance,
The approaches described above are primarily designed to identify mRNAs.
Consequently, long ncRNAs are typically defined indirectly through an
absence of mRNA-like characteristics. In contrast, a number of studies have
used the presence of conserved predicted RNA secondary structure to identify
ncRNAs imputed to have functional properties. These include the programs
QRNA
In addition to computational methods, several experimental strategies have
been used to distinguish mRNAs from ncRNAs. For instance, in vitro
translation assays have been performed in individual cases to test whether a
putative ORF is translated into protein
Reliable classification of novel transcripts into mRNAs or ncRNAs is
predicated on the assumption that they represent genuine, full-length
transcripts. However, incomplete reverse transcription, internal priming of
pre-mRNAs, and genomic contamination can all result in the generation of
spurious or truncated transcripts, many of which are likely to masquerade as
ncRNAs
Many of the strategies described above are complementary, and can be combined
to good effect. For instance, CRITICA uses statistical techniques in
addition to its comparative approach
Two recently described tools, CPC and CONC, use supervised learning
algorithms known as support vector machines to distinguish mRNAs from ncRNAs
Despite recent advances, such as the use of support vector machines to distinguish
ncRNAs from mRNAs, large numbers of novel transcripts remain ambiguous and difficult to
definitively categorize. CONC, for instance, estimated that ∼28,000 FANTOM
cDNAs were ncRNAs, but >50% of these predictions fell outside the
reliable range
The first report of a bifunctional RNA was the human
Additional examples of bifunctional RNAs have recently emerged. The
The number of documented cases of bifunctional RNAs is limited. However, as mentioned
earlier, conserved secondary structures are commonly found in mRNAs, which suggests
that bifunctional RNAs might be widespread. Indeed, it was recently predicted that
in yeast as much as 5% of mRNAs function independently as RNA, and it was
estimated that this proportion is likely to be significantly greater in higher
eukaryotes
As the number of protein-coding genes continues to be revised downward