ÖS, LA, and JL conceived and designed the experiments. ÖS performed the experiments. ÖS, LA, and JL analyzed the data. ÖS contributed reagents/materials/analysis tools. ÖS, LA, and JL wrote the paper.
The authors have declared that no competing interests exist.
According to current estimates there exist about 20,000 pseudogenes in a mammalian genome. The vast majority of these are disabled and nonfunctional copies of protein-coding genes which, therefore, evolve neutrally. Recent findings that a Makorin1 pseudogene, residing on mouse Chromosome 5, is indeed vital in vivo and also evolutionarily preserved encouraged us to conduct a genome-wide survey for other functional pseudogenes in human, mouse, and chimpanzee. We identify, to our knowledge, the first examples of conserved pseudogenes common to human and mouse, originating from one duplication predating the human–mouse species split and having evolved as pseudogenes since the species split. Functionality is one possible way to explain the apparently contradictory properties of such pseudogene pairs, i.e., high conservation and ancient origin. The hypothesis of functionality is tested by comparing expression evidence and synteny of the candidates with proper test sets. The tests suggest potential biological function. Our candidate set includes a small set of long-lived pseudogenes whose unknown potential function has been retained since before the human–mouse species split, and also a larger group of primate-specific ones found from human–chimpanzee searches. Two processed sequences are notable, their conservation since the human–mouse split being as high as that of most protein-coding genes; one is derived from the protein Ataxin 7-like 3 (ATXN7L3), and one from the Spinocerebellar ataxia type 1 protein (ATX1). Our approach is comparative and can be applied to any pair of species. It is implemented by a semi-automated pipeline based on cross-species BLAST comparisons and maximum-likelihood phylogeny estimations. To separate pseudogenes from protein-coding genes, we use standard methods, utilizing in-frame disablements, as well as a probabilistic filter based on Ka/Ks ratios.
Svensson, Arvestad, and Lagergren conducted a genome-wide survey for and analysis of human pseudogenes, i.e., gene copies with lost protein-coding ability, with the aim of discovering biologically functional ones. Their main motivation was a 2002
Pseudogenes are sequences of genomic DNA lacking the protein-coding capability of their paralogous counterpart [
Studies of pseudogene populations are often motivated by the problem that their similarity to ordinary genes poses for gene finders and hybridization experiments. Pseudogene sequences can, given their nonfunctionality, be viewed as molecular fossils and have been used to measure background genomic substitution rates [
However, evidence has occasionally been found, in
Several surveys [
In a recent article [
A human–mouse comparative study [
That a pseudogene is transcribed is not sufficient evidence of biological function. To obtain functional candidates, we decided to look for conserved pseudogenes common to human and mouse, originating from one duplication predating the human–mouse species split and having evolved as pseudogenes since the species split. In cases where the species split occurred sufficiently early, strong conservation and ancient origin together give evidence of the potential functionality of the pseudogenes. We have developed a pairwise comparative genomics methodology based on an explicit evolutionary model, which focuses on pseudogenes common to the two lineages. We also test the potential functionality of the pseudogenes found using enrichment of transcription and synteny.
We describe our methodology using the example of a human–mouse comparison. Our procedure takes as input a quartet of sequences representing, respectively, a human gene, a corresponding human pseudogene, the orthologous mouse gene, and a corresponding mouse pseudogene, and analyzes how they have evolved. All four basic evolutionary scenarios that can occur with respect to duplication and gene-to-pseudogene transitions are described below. When analyzing how well a scenario describes the evolution of a quartet, different models of sequence evolution are used for gene and pseudogene lineages.
The first scenario,
G and
An alternative scenario,
The third scenario,
A fourth scenario where the human gene has the mouse pseudogene as a sibling in the gene tree is conceivable. We have never observed this scenario.
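As a minimal illustration of the tree shapes involved, four labeled sequences admit exactly three unrooted topologies, each defined by which sequence is paired with the human gene. The sketch below enumerates them; the leaf labels and function names are illustrative assumptions, and the sketch deliberately ignores rooting and the choice of gene versus pseudogene substitution model on each branch, which the scenario analysis also depends on.

```python
# Illustrative sketch only: enumerates unrooted quartet topologies.
# Leaf labels and function names are hypothetical, not part of the pipeline.
LEAVES = ("human_gene", "human_pseudogene", "mouse_gene", "mouse_pseudogene")


def quartet_topologies(leaves=LEAVES):
    """Return the three unrooted topologies as sets of two sibling pairs."""
    first = leaves[0]
    topologies = []
    for partner in leaves[1:]:
        pair_a = frozenset((first, partner))
        pair_b = frozenset(leaf for leaf in leaves if leaf not in pair_a)
        topologies.append(frozenset((pair_a, pair_b)))
    return topologies


def sibling_of(topology, leaf):
    """Return the leaf paired with `leaf` in the given topology."""
    for pair in topology:
        if leaf in pair:
            (other,) = pair - {leaf}
            return other
    raise ValueError(f"{leaf!r} is not a leaf of this topology")


if __name__ == "__main__":
    for topology in quartet_topologies():
        print("human_gene is paired with", sibling_of(topology, "human_gene"))
```

The topology in which the human gene is paired with the mouse pseudogene is the one underlying the fourth, unobserved scenario.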
We have applied our comparative methodology to human–mouse as well as to human–chimpanzee and found the first examples of human pseudogenes showing signs of functionality.
We started with the 12,687 presumably orthologous protein pairs retrieved (see
This initial search with subsequent refinement resulted in 168,855 such quartets. For the vast majority of these quartets, one or both pseudogene sequences overlap regions of
The set that remains after filtering constitutes 11,146 sequence quartets originating from 1,349 protein pairs. The distribution of quartets per protein pair is highly nonuniform. While many gene pairs lack corresponding pseudogene pairs, a handful (EF12, G3PT, LDHB, TSY1, UB46, and several ribosomal genes) are origins of large pseudogene families in both species. Using the mutual-best-hit filtering outlined in
The tree shows the evolutionary history for a sequence set associated with the ATXN7L3 orthologous proteins. We have here found two potentially pseudogenic sequences in each species and this gives us a total of four quartets to investigate; the gene-sequence pair together with any human–mouse combination of the pseudogenes. It is unlikely that the human chrX pseudogene is closely related to any of the mouse ones and therefore any quartet including the sequence from the X chromosome should be of limited interest. If we pair a particular human pseudogene only with the most similar mouse pseudogene (and vice versa), the sole remaining example is the human chr12–mouse chr10 pair.
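The pairing rule described here can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the function name is an assumption, and the similarity scores in the usage example are invented toy numbers chosen only to reproduce the qualitative outcome described for the ATXN7L3 set (the identifier `m_other` stands in for the second mouse sequence, whose location is not given above).

```python
# Illustrative sketch of mutual-best-hit pairing; names and scores are hypothetical.
def mutual_best_hits(scores):
    """Return (human_id, mouse_id) pairs that are each other's best-scoring partner.

    `scores` maps (human_id, mouse_id) to a similarity score, higher being better.
    Ties are resolved in favor of the pair encountered first.
    """
    best_for_human = {}
    best_for_mouse = {}
    for (human_id, mouse_id), score in scores.items():
        current = best_for_human.get(human_id)
        if current is None or score > scores[(human_id, current)]:
            best_for_human[human_id] = mouse_id
        current = best_for_mouse.get(mouse_id)
        if current is None or score > scores[(current, mouse_id)]:
            best_for_mouse[mouse_id] = human_id
    return sorted(
        (human_id, mouse_id)
        for human_id, mouse_id in best_for_human.items()
        if best_for_mouse[mouse_id] == human_id
    )


if __name__ == "__main__":
    toy_scores = {  # invented numbers, for illustration only
        ("h_chr12", "m_chr10"): 900.0,
        ("h_chr12", "m_other"): 300.0,
        ("h_chrX", "m_chr10"): 250.0,
        ("h_chrX", "m_other"): 200.0,
    }
    print(mutual_best_hits(toy_scores))
```

With these toy scores only the chr12–chr10 pair survives, because the chrX sequence is not the best partner of either mouse sequence.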
We used the partition of our data induced by this classification in combination with mutual-best-hit filtering. The number of sequence quartets belonging to the classes are: class 1—247 quartets; class 2—299; class 3—146; and class 4—761 (see Table 1).
Our aim is to find those quartets for which
For the majority of our 1,453 quartets, data support
Interestingly, we note a bimodal pattern with one large hump distributed around 1.1 and another one distributed around 0.9. That is, in a large majority of cases, data show clear preference for
We now use the same technique to compare
For 73 of the 425 quartets,
To summarize, we have 30 quartets for which the sequences suggest that 1) the pseudogenes have been evolutionarily conserved since before the human–mouse speciation, and 2) they have been pseudogenes since before the speciation.
Because we find 30 such quartets, whereas the number of quartets expected to pass our scenario test by chance is 1,453 × 0.001 × 0.1 ≈ 0.15, it is reasonable to conclude that a significant number of these 30 quartets are ancient pseudogenes, i.e., that they satisfy 1) and 2).
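To give a rough sense of how far 30 observations lie from the chance expectation, the arithmetic can be sketched as below. The Poisson approximation, and the independence of quartets that it implies, are assumptions introduced here for illustration; they are not part of the analysis above. The function and parameter names are likewise hypothetical, and the two probability factors are simply the ones used in the product above.

```python
# Illustrative sketch: chance expectation and an assumed Poisson upper tail.
import math


def expected_by_chance(n_quartets, p_first_test, p_second_test):
    """Expected number of quartets passing both tests by chance alone."""
    return n_quartets * p_first_test * p_second_test


def poisson_upper_tail(lam, k_min, n_terms=200):
    """P(X >= k_min) for X ~ Poisson(lam), summed directly over the upper tail.

    Summing the tail terms in log space avoids the cancellation that would
    occur when computing 1 - P(X < k_min) for very small tail probabilities.
    """
    total = 0.0
    for k in range(k_min, k_min + n_terms):
        log_term = -lam + k * math.log(lam) - math.lgamma(k + 1)
        total += math.exp(log_term)
    return total


if __name__ == "__main__":
    lam = expected_by_chance(1453, 0.001, 0.1)
    print("expected by chance:", lam)
    print("P(at least 30 | Poisson assumption):", poisson_upper_tail(lam, 30))
```

Under these assumptions the tail probability is vanishingly small, which is in line with the qualitative conclusion drawn above.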
We are now going to investigate these 30 sequence quartets further, with the aim of testing their potential biological function. The criteria that will be our focus are synteny, expression evidence, and conservation.
Synteny can be used as a means to evaluate our methodology's capacity to separate S1 and S3 from S2 quartets. It is also interesting to compare the fraction of syntenic quartets among S1, S3, and genes. The latter can be seen as a test of functionality.
It has long been known that eukaryotic genomes undergo rearrangements on both microscopic (intrachromosomal with a span < 1 Mb) and macroscopic (intrachromosomal with larger span, as well as interchromosomal) level during evolution [
The orthologous pairs of protein-coding genes in our data set have the following synteny relations: 69% syntenic, 2% reversed syntenic, 11% corresponding chromosomes, 4% nonsyntenic, and 13% unknown synteny (see
It is reasonable that sequences that have originated from duplication events prior to the species split (sequences belonging to
Number of Human–Mouse Sequence Pairs prior to and following Mutual-Best-Hit Filtering
Number of Sequence Pairs in Each Class Favoring a Particular Scenario
It is notable that within classes 1, 2, and 3, ten of the 13
If we again consider the ATXN7L3 tree (
We investigated whether our candidates for potential function are enriched for transcription or not, by searching publicly available databases for transcript sequences, expressed sequence tags (ESTs) and mRNAs. An EST or mRNA sequence is postulated to come from a specific pseudogene if its sequence is more similar to the pseudogene than it is to any other known gene or pseudogene (see
Among the 20 syntenic S1 sequence pairs, the only completely unexpressed example is the IMB1 copy found on the X chromosome. This pair shows clear preference for
The majority of these pseudogenes are much less expressed than their respective genes (to the latter, one can generally map large numbers of ESTs originating from tissues throughout the body).
To perform an enrichment test we need a good comparison set. We believe that
Instead, we focus on the correlation between the
If we assume that the rate of pseudogene creation does not vary over time, then the low number of detected early pairs—only nine out of 198 non-unclear
Human and Mouse Expression for 262 S3 Quartets Selected as Described in
According to estimates in [
We will now address where our putatively functional pseudogenes are placed along that scale. We note (
Blue stars indicate genes. Red circles indicate pseudogenes. The histogram shows, for reference, the conservation of all genes giving rise to pseudogenes. Compare with
Conservation Percentage in and around the Pseudogene
(A) An alignment, visualized with TeXshade [
The most interesting part consists of columns 1–468 (boxed green), which according to several EST and mRNA sequences is the only segment expressed. It consists of a highly conserved part, 1–288 (red), which is a potential open reading frame, followed by part 289–468 with pseudogenic disablements.
(B) Selected parts of the alignment of the ATX1 copies, which are also processed. The protein-coding genes contain eight exons, of which only parts of the last two code for protein. The entire segment of the pseudogenes corresponding to the protein-coding parts of the genes is expressed. The possibility that the processed copies are protein-coding cannot be completely ruled out: each pseudogene consists of a single 2,068-bp-long open reading frame. However, the frame induced by the alignments to the protein-coding genes contains several pseudogenic disablements.
Substitution rates vary along and between chromosomes. To make sure that it is the pseudogenes only, and not their genomic vicinities in general, that are conserved, we also aligned a 1,000-bp section upstream and downstream of each pseudogene. We observe in most cases (
Conceivably, a potential pseudogene could have protein-coding exons originating from the same gene in its close vicinity. To exclude this possibility, we also checked the vicinity of each pseudogene for exons originating from the same gene with potentially intact protein-coding ability. No such additional protein-coding exons were found. For the vast majority of our pseudogenes, no hit could be found on the same chromosome, and in no case was any hit found closer than 10,000 bp.
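The distance criterion in this check can be sketched as follows. The coordinate representation (chromosome, start, end in bp, with start not exceeding end) and the function names are assumptions made for illustration; only the 10,000-bp figure comes from the text.

```python
# Illustrative sketch of the proximity check; data layout and names are hypothetical.
def nearest_hit_distance(pseudogene, hits):
    """Return the smallest gap in bp to any hit on the same chromosome, or None.

    `pseudogene` and each element of `hits` are (chromosome, start, end) tuples.
    Overlapping intervals are reported as distance 0.
    """
    chrom, start, end = pseudogene
    distances = []
    for hit_chrom, hit_start, hit_end in hits:
        if hit_chrom != chrom:
            continue  # hits on other chromosomes cannot be neighboring exons
        distances.append(max(hit_start - end, start - hit_end, 0))
    return min(distances) if distances else None


def has_nearby_exon(pseudogene, hits, min_distance=10_000):
    """True if some same-chromosome hit lies closer than `min_distance` bp."""
    distance = nearest_hit_distance(pseudogene, hits)
    return distance is not None and distance < min_distance
```

A pseudogene for which `nearest_hit_distance` returns `None` corresponds to the common case described above, in which no hit exists on the same chromosome at all.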
We also applied our methodology to the human–chimpanzee pair of genomes. This choice was motivated by our desire to discover young pseudogenes. Remember that the mouse Makorin pseudogene, although vital, has only been functional over a relatively short evolutionary period [
The procedure was the same as for human–mouse. The chimpanzee data was downloaded from Ensembl, including assembly 1 as of April 2005 together with protein sequences and gene-sequence data. For human–chimpanzee, sequence conservation is less effective as a means to separate functional from nonfunctional pseudogenes. The reason is of course that many pseudogenes originating before the comparatively recent primate species split can be expected to be nonfunctional, although they have not diverged sufficiently to be easily recognized as such. So, while in the human–mouse case we can be relatively confident that syntenic pseudogenes that prefer
As expected, the human–chimpanzee comparison resulted in a large set of pseudogene pairs. We therefore restricted our analysis to the most interesting class, i.e., class 1. We found 742 class 1 pseudogenes belonging to quartets favoring
Percentage of Expressed Pseudogenes in Relation to Their Conservation
Many pseudogenes have regulatory regions showing high similarity to those of the corresponding protein-coding genes. This is either because few mutations have occurred in these regions, or alternatively because many of the mutations that have occurred have been selected against, due to functionality of the pseudogenes.
To further purify our result set, i.e., the 742 pseudogenes favoring
Typical values for the background mismatch percentage range from 1.2% to 3% (counting the first but not subsequent indels in a gap), which conforms well with previous estimates of 1.4% [
Human–Chimpanzee Conserved and Expressed Pseudogene Pairs
We have presented and applied a semi-automated methodology to identify pseudogenes of potential biological function. To the best of our knowledge, functional pseudogenes have never been observed in human. Our method uses no prior knowledge other than publicly available data on orthologous relationships for proteins, gene sequences, gene positions, and synteny maps.
The term
We use conserved ancient pseudogenes as candidates of potential function. A computational approach based on support for four different evolutionary scenarios is used to obtain putative ancient pseudogenes. The
We test functionality of our candidates by means of enrichment of synteny as well as of transcriptional activity and degree of conservation. We see, as expected, a clear overrepresentation of synteny for human–mouse pseudogene pairs originating before the species split. Interestingly, we also see tendencies for those examples that have evolved as pseudogenes since the species split to be both more abundantly expressed and more often syntenic than those that have not evolved as pseudogenes. For the latter finding, we believe that enrichment of functionality among our pseudogenes is the most likely explanation.
Judging from what is known from earlier work, the number of detectable pseudogenes originating from before the human–mouse speciation is limited. In [
Three out of four (PDZRN3 being the exception) of our
To determine functionality of a human pseudogene, it is probably not sufficient to use information about whether it has a mouse ortholog or not, because many young pseudogenes can be found among orthologous pairs. Instead, we select only those human pseudogenes with orthologous mouse pseudogenes that satisfy the additional constraint that the least common ancestor was a pseudogene.
The results we present suggest that while functional pseudogenes are relatively rare on a long evolutionary timescale, they nevertheless exist. Our findings include a handful of sequences that have been conserved since before the split of primates and rodents. Some of these are sequences predicted by gene finders to be protein-coding. We have found examples with, as well as without, detectable in-frame disablements. Apart from their apparent functional conservation and sometimes extensive expression activity, all of these are poorly characterized. This may be because some of the originating proteins are themselves not very well known, or because of the common assumption that pseudogenes are nonfunctional. Further characterization of these genes, their respective pseudogenes, and the interactions between them is an area for further study.
We have noted with interest recent research activity concerning two of our top candidates, ATX1 and ATXN7L3. As is the case for the Ataxin gene family in general, these are associated with a number of neurodegenerative disorders primarily caused by expanded polyglutamine [
When extending the search to younger pseudogenes, i.e., applying our methodology to human–chimpanzee, the number of obtained pseudogenes is substantially larger than what was obtained in the human–mouse comparison. In this case, however, the assumption that nonfunctional pseudogenes originating before the speciation have diverged beyond recognition is not true. Consequently, filtering out nonfunctional pseudogenes is much harder than for the human–mouse case. Encouragingly, we found that the conservation of many pseudogenes is similar to that of nonsynonymous nucleotides in protein-coding genes (estimated to be 99.4% [
There is an apparent tradeoff between the number of pseudogenes in the result set and the certainty with which we can state that they are functional. It is quite possible that both our choices of species pairs are in fact suboptimal, human–mouse being too evolutionarily distant and human–chimpanzee not distant enough. It will be interesting to apply our methodology on an intermediate timescale, and we plan to conduct a comparison between human and rhesus macaque.
Our methodology includes three main parts (see
To locate pseudogenes, we adopted a large part of the methodology presented in [
We used repeat-masked genomic sequence data NCBI35 (human) and NCBIm33 (mouse) downloaded in January 2005 from the Ensembl database version 27. The Expasy protein database (sprot44, human_trembl and rodent_trembl) was used to assemble protein sequence sets for the two species. We used the Inparanoid [
The mutual-best-hit filtering was performed for each pair of orthologous proteins, by aligning each pair of pseudogenes from the respective species, using bl2seq (from the BLAST package) and then selecting the pair with the best score. We aligned the thereby obtained quartets using the Dialign package [
To select the scenario (
We used the nonparametric version of the Kishino–Hasegawa bootstrap test with 1,000 bootstraps to obtain
To infer whether two pseudogenes are in synteny, we used synteny maps from [
Synteny relations were established for the 7,244 out of 12,678 genes for which gene position data was available.
To find transcription evidence we applied a reciprocal BLAST-based methodology to databases of ESTs and mRNAs. The EST-human, EST-mouse, and Unigene mRNA databases were downloaded from NCBI. Any reciprocal best hit longer than 100 bp and with more than 99% sequence identity to the query sequence was retrieved.
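One way to read this filter is sketched below, assuming BLAST output in the standard 12-column tabular layout (query, subject, percent identity, alignment length, mismatches, gap openings, query start, query end, subject start, subject end, E-value, bit score). All function names are hypothetical, and the exact reciprocity rule is an interpretation: a transcript is retained for a pseudogene if the forward hit passes both thresholds and the transcript's best-scoring hit in a combined gene and pseudogene database is that same pseudogene.

```python
# Illustrative sketch of a reciprocal-best-hit filter on tabular BLAST output.
# Function names and the precise reciprocity rule are assumptions.
def parse_tabular(lines):
    """Parse 12-column tab-separated BLAST hits into dictionaries."""
    hits = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        hits.append({
            "query": fields[0],
            "subject": fields[1],
            "identity": float(fields[2]),
            "length": int(fields[3]),
            "bitscore": float(fields[11]),
        })
    return hits


def best_hits(hits):
    """Map each query to its highest-scoring hit."""
    best = {}
    for hit in hits:
        current = best.get(hit["query"])
        if current is None or hit["bitscore"] > current["bitscore"]:
            best[hit["query"]] = hit
    return best


def reciprocal_support(forward, reverse, min_length=100, min_identity=99.0):
    """Return sorted (pseudogene, transcript) pairs passing the filter.

    `forward`: pseudogene queries against a transcript (EST/mRNA) database.
    `reverse`: transcript queries against a gene plus pseudogene database.
    """
    reverse_best = best_hits(reverse)
    supported = set()
    for hit in forward:
        if hit["length"] <= min_length or hit["identity"] <= min_identity:
            continue
        back = reverse_best.get(hit["subject"])
        if back is not None and back["subject"] == hit["query"]:
            supported.add((hit["query"], hit["subject"]))
    return sorted(supported)
```

Requiring the reverse search to include the parent gene as well as the pseudogene is what keeps transcripts of the functional gene from being counted as pseudogene expression.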
According to Hoeffding's theorem [
Given an alignment of length
We thank Henrik Kaessman, Per Svensson, and three anonymous reviewers for their valuable comments on the manuscript; Ali Tofigh, Johannes Frey-Skött, and Samuel Andersson for constructive discussions; and the Center for Parallel Computers for computational support.
ATX1: Spinocerebellar ataxia type 1 protein
ATXN7L3: Ataxin 7-like 3
ESTs: expressed-sequence tags