Conceived and designed the experiments: WC SP PK. Performed the experiments: ZY CM NL HH. Analyzed the data: AGX LH ZL YX ML XF HH. Contributed reagents/materials/analysis tools: YY MS WC SP PK. Wrote the paper: LH WC SP PK.
The authors have declared that no competing interests exist.
Transcription is the first step connecting genetic information with an organism's phenotype. While expression of annotated genes in the human brain has been characterized extensively, our knowledge about the scope and the conservation of transcripts located outside of the known genes' boundaries is limited. Here, we use high-throughput transcriptome sequencing (RNA-Seq) to characterize the total non-ribosomal transcriptome of human, chimpanzee, and rhesus macaque brain. In all species, only 20–28% of non-ribosomal transcripts correspond to annotated exons and 20–23% to introns. By contrast, transcripts originating within intronic and intergenic repetitive sequences constitute 40–48% of the total brain transcriptome. Notably, some repeat families show elevated transcription. In non-repetitive intergenic regions, we identify and characterize 1,093 distinct regions highly expressed in the human brain. These regions are conserved at the RNA expression level across the primates studied and at the DNA sequence level across mammals. A large proportion of these transcripts (20%) represents 3′UTR extensions of known genes and may play roles in alternative microRNA-directed regulation. Finally, we show that while transcriptome divergence between species increases with evolutionary time, intergenic transcripts show more expression differences among species and exons show fewer. Our results show that many as yet uncharacterized, evolutionarily conserved transcripts exist in the human brain. Some of these transcripts may play roles in transcriptional regulation and contribute to the evolution of human-specific phenotypic traits.
Phenotypic differences between closely related species, such as humans and chimpanzees, might be determined to a large extent by differences between their transcriptomes. Recent studies using microarray and high-throughput sequencing technologies have demonstrated that, besides annotated genes, a large proportion of the human genome can be transcriptionally active. Little is known, however, about the extent and the conservation of human brain transcripts located outside of the known genes' boundaries. Here, we use high-throughput transcriptome sequencing to characterize the non-ribosomal transcriptome of the human cerebellum and compare it to the transcriptomes of chimpanzee and rhesus macaque. Our results show that close to 40% of all transcripts expressed in the human brain map within repetitive elements. By contrast, less than 10% of the human brain transcriptome corresponds to non-repetitive intergenic regions. Nonetheless, within these regions we identify more than a thousand novel, highly transcribed, evolutionarily conserved locations. Some of the intergenic transcripts show distinct human-specific expression and may have contributed to the evolution of human-specific phenotypic traits.
Transcriptome studies conducted by various methodologies, such as conventional sequencing, tiling arrays, and, most recently, high-throughput sequencing, have consistently indicated that a large proportion of transcription takes place outside known gene boundaries (see
To systematically characterize the transcriptome in a particular brain region, cerebellar cortex, and identify its human-specific features, we performed high-throughput sequencing using the Illumina platform to analyze transcripts expressed in ten humans, four chimpanzees, and five rhesus macaques. All individuals are adult males (
For each sample, we obtained an average of ∼10,000,000 sequence reads of 36 nt corresponding to ∼7,200,000 unique sequences. From these reads, we can map on average 51% to the corresponding reference genomes and annotated exon junctions (
(A) Outer circle: average proportions of transcriptome sequence reads from the two human samples that map within annotated exons (green), introns (light orange), intronic repeats (orange), intergenic repeats (blue), intergenic regions (light blue), mitochondrial DNA (purple), and ncRNA (maroon). Middle circle: the proportions occupied by the corresponding regions in the human genome. Inner circle: the proportions of transcriptome sequence reads for polyadenylated human brain RNA (data adapted from
Within intronic and intergenic regions, more than half of transcription originates from repetitive sequence elements, occupying in total ∼42% of the entire transcriptome (
More than 90% of repeats present in the human genome result from transposable element (TE) activity taking place over hundreds of millions of years. Estimating the transcriptional activity of different TE families, we find that the most recently expanded ones, the Alu elements, show elevated transcriptional activity per genomic fraction occupied by the family (
Excluding repeats, intergenic regions contain 7% of all non-ribosomal human brain transcriptome sequences. These sequences are not distributed evenly, but concentrate within distinct regions (
(A) Examples of igHTR. The black track shows sequence read density (in counts) in the four samples studied. The blue tracks show human EST density and PhastCons scores. (B) igHTR categories. The inner circle shows the proportion of igHTR with (red) or without (blue) EST support. The outer circle shows proportions of igHTR with protein-coding potential (green), supported by lincRNA (blue) or EvoFold (light blue) ncRNA predictions, adjacent to a gene's 5′-UTR (light orange) or 3′-UTR (orange), and uncharacterized igHTR (grey) among EST-supported and non-supported igHTR. (C) Expression levels within intergenic regions (blue), genic regions including both exons and introns (light orange), exons (green), and igHTR (red). (D) Sequence conservation of nucleotides in human exons, genic regions, intergenic regions, and igHTR (all colors as in panel (C)) based on PhastCons scores among 18 placental mammal genomes. PhastCons scores close to 1 indicate high conservation. The heights of the bars show mean values and error bars show 95% confidence intervals based on sampling, 1,000 times, the same number of nucleotides as located within igHTR from the corresponding genomic regions. For igHTR, the values are based on all nucleotides located within them. (E) Size distributions of igHTR in the two human samples (red - Human1, blue - Human2), annotated human exons (grey), and exonic HTR (black). (F) Distributions of genomic distances between nearest pairs of igHTR (red – Human1, blue – Human2), annotated exons (black), and simulated randomly distributed igHTR (grey). The dashed line shows 10 kb distance. (G) Examples of splicing within igHTR clusters (red) and between annotated genes (blue) and downstream igHTR supported by EST (green).
Similar to humans, we can identify igHTR in chimpanzee and rhesus macaque brain transcriptomes. Expression levels of individual igHTR show significant positive correlation between the two human samples and among the three species (Spearman correlation,
Do igHTR represent extensions of known genes or independent coding and/or non-coding transcripts? The size distribution of transcription clusters shows two distinct peaks: a minor one at 45 nt and a major one at 500 nt (
With respect to the genomic location, igHTR tend to be situated within gene-rich regions, with 49% of human igHTR located within 10 kb of the nearest gene (simulation,
With respect to function, 251 genes that contain igHTR within 10 kb of the gene boundaries (204 of them are situated downstream of the gene and may correspond to 3′-UTR extensions) show significant enrichment among GO terms
With respect to protein coding capacity, as determined by codon substitution frequencies (CSF)
To determine the extent of expression divergence between human, chimpanzee, and rhesus macaque brain transcriptomes, we first tested whether expression of known protein coding genes could separate species according to their phylogenetic relationship. Based on expression of 13,832 genes detected in at least two out of four samples in our dataset, we found that in agreement with previously reported results based on microarray data, gene expression differs significantly among the three species (
(A) UPGMA tree based on the expression level of 13,832 genes in 4 sample pools. The numbers at the nodes indicate node stability in 1,000 bootstraps over genes. (B) The gene expression divergence between sample pairs plotted against the species divergence time. The box plot represents variation of the divergence estimated from 1,000 bootstraps over genes (same set of genes as (A), see
Next, we identified genes with species-specific expression using a Bioconductor package for differential expression analysis of digital gene expression data, “edgeR”
Functional analysis of the 118 genes with human-specific expression did not yield significant results, but showed an enrichment trend among genes involved in transcriptional regulation (
In addition to gene expression differences, we compared the extent of expression divergence among the three species for different types of transcripts: exonic, intronic, intergenic, and repeats. To compare expression divergence of these different transcript types on the same basis, we used two approaches. In the first approach, in addition to igHTR, we identified all other highly transcribed regions (HTR) present in human, chimpanzee, and rhesus macaque brain transcriptomes and compared their expression levels across species. From a total of 16,159 HTR found among the three species, 10,654 (65.9%) correspond to exons, 904 (5.6%) to introns, 528 (3.3%) to intergenic regions, and 3,007 (18.6%) and 1,066 (6.6%) to intronic and intergenic repeats, respectively. To identify the HTR with species-specific expression, we applied the methodology described above, based on the edgeR package. Following this approach, 24 HTR (11% of all species-specific HTR) can be classified as human-specific, 32 (15%) as chimpanzee-specific, and 159 (74%) as specific to rhesus macaque in the three-species comparison (
In the second approach, we identified regions showing extreme species-specific divergence by comparing transcriptome coverage in a sliding window over the entire human-chimpanzee-macaque (HCM) genome alignment (
Our study, although based on a few samples, uncovers basic features of the brain transcriptome that are shared among the three primate species and identifies the most divergent expression patterns specific to the human brain. Among shared features, we find that exons alone contribute approximately a quarter of the total non-ribosomal transcriptome, while exons and introns together contribute three-quarters. Previously published human brain transcriptome sequencing data based on polyadenylated transcripts contains a higher proportion of exonic and a lower proportion of intronic transcription (54% and 24%, respectively,
While 42% of the human brain transcriptome originates within repetitive elements, most of the repeat expression is directly proportional to the occupied genomic length and, therefore, might represent “transcriptional background”. Some of the repeat families, however, are transcribed above the background level. While some of these families, such as snRNAs, snpRNAs and 7SK RNA, which are derived from functional ncRNA, might be actively transcribed, high expression of simple and low complexity repeats is unusual. Notably, analysis of cap-selected mouse and human transcript tags across 12 tissues shows that simple and low complexity repeats have distinct tissue-specific expression profiles and are highly expressed in brain in both species
Besides repeats, intergenic transcription is highly non-uniform, containing distinct highly transcribed regions conserved between species both in terms of their expression and DNA sequence. A substantial proportion of these regions (23%) may represent alternative or extended 3′-UTR of known genes, enriched in conserved microRNA binding sites. In mouse brain, 3′-UTR extensions containing miRNA binding sites were found in microRNA-Argonaute complexes, indicating their role in miRNA-directed expression regulation
Another substantial proportion of identified intergenic transcripts (29%) overlap recently identified lincRNA and ncRNA predicted by EvoFold. Since our analysis is limited to highly expressed transcripts, most of them are expressed at higher levels than protein-coding genes. This indicates that at least some of these intergenic transcripts represent novel ncRNA functioning in the primate brain. We have to note, however, that these transcripts represent a small fraction of all identified lincRNA and ncRNA predicted by EvoFold: 1.7% and 0.3%, respectively. Thus, the vast majority of lincRNA and ncRNA predicted by EvoFold are not expressed in human cerebellum, or are expressed at levels below our igHTR detection threshold.
With respect to evolutionary features, the extent of expression divergence increases with greater species' phylogenetic divergence time. In our study, we do not observe an excess of expression divergence on the human lineage, previously reported in another brain region, cerebral cortex
Informed consent for use of the human tissues for research was obtained in writing from all donors or the next of kin. All non-human primates used in this study suffered sudden deaths for reasons other than their participation in this study and without any relation to the tissue used.
We dissected postmortem cerebellar cortex samples from ten male humans (8–54 years old), four male chimpanzees (8–40 years old), and five male rhesus macaques (4–20 years old). All human
Total RNA was extracted from dissections with Trizol reagent (Invitrogen, Carlsbad, CA) and treated for 30 min at 37°C with RNase-free DNase I (Ambion, Austin, TX). RNA was purified with the RNeasy MinElute Kit according to the manufacturer's instructions (Qiagen, Valencia, CA). This procedure depletes RNA molecules shorter than 200 nt. Resulting RNA samples from five human, five macaque, or four chimpanzee individuals were mixed in equal proportions within species, resulting in two human, one chimpanzee, and one rhesus macaque pooled samples (
All raw sequencing reads were mapped to the corresponding reference genomes (hg18, panTro2, and rheMac2), allowing a maximum of four mismatches, using Short Oligonucleotide Alignment Program (SOAP, version 1)
We estimated the expression levels of repeat families based on uniquely mapped sequences. Including sequence reads mapped to multiple positions increased the total number of reads mapped to repeat regions by approximately 10%, but did not affect the results qualitatively. To normalize the expression by the lengths of unique DNA in each repeat family, we calculated the numbers of potential positions in repeat elements that can be mapped uniquely, then we summed up these numbers of all the elements and that of the expressed elements separately. This length calculation was done for both the analysis of repeat expression level
Pair-wise genome alignments of human-chimpanzee and human-macaque were downloaded from the UCSC genome browser (genome versions: hg18, panTro2 and rheMac2). Based on these alignments, a Human-Chimpanzee-Macaque (HCM) three-way alignment was constructed using the Multiz software package
We used two parameters to determine whether a region is an HTR. The first is the maximum spacing (maxspacing) between two neighboring reads (from 5′ to 3′ on the forward strand). The second is the minimum number of mapped sequences (minhits) within the region. For convenience, we use maxspacing = 150 nt and minhits = 10 for all HTR analysis shown in the paper, except
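The two-parameter rule above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `find_htr`, the choice to measure spacing between read start positions, the restriction to a single chromosome and strand, and the fixed 36-nt read length used to close each region are all assumptions made for illustration.

```python
from typing import Iterable, List, Tuple


def find_htr(read_starts: Iterable[int],
             max_spacing: int = 150,
             min_hits: int = 10,
             read_len: int = 36) -> List[Tuple[int, int, int]]:
    """Cluster read start positions on one chromosome/strand into candidate HTR.

    Consecutive reads whose start positions are at most `max_spacing` apart
    (an assumed interpretation of "maxspacing") are merged into one region.
    Regions with fewer than `min_hits` reads are discarded.
    Returns (start, end, n_reads) tuples with a half-open end coordinate.
    """
    positions = sorted(read_starts)
    regions: List[Tuple[int, int, int]] = []
    if not positions:
        return regions

    cluster_start = positions[0]
    prev = positions[0]
    hits = 1
    for pos in positions[1:]:
        if pos - prev <= max_spacing:
            hits += 1  # read extends the current cluster
        else:
            if hits >= min_hits:
                regions.append((cluster_start, prev + read_len, hits))
            cluster_start = pos  # start a new cluster
            hits = 1
        prev = pos
    if hits >= min_hits:  # flush the last cluster
        regions.append((cluster_start, prev + read_len, hits))
    return regions


if __name__ == "__main__":
    # Ten reads spaced 100 nt apart form one region; two isolated reads do not.
    reads = list(range(0, 1000, 100)) + [5000, 5010]
    print(find_htr(reads))
```

In this sketch a lower `min_hits` admits weakly supported regions, and a larger `max_spacing` merges neighboring clusters, so both parameters trade sensitivity against specificity.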
To calculate the expression correlation of individual igHTR in the three species, we unified igHTR identified in the four samples. We mapped igHTR identified in chimpanzee and macaque onto the human genome using the LiftOver tool from UCSC genome browser (
All simulation tests were done by randomly selecting the same number of genomic regions with the same length distribution as the actual igHTR 1,000 times. The sample genomic regions differed depending on the tested variable and are described specifically in each case (see Supplementary Information for details).
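As a rough illustration of this kind of simulation, the sketch below places regions with the observed lengths at random positions inside a set of background intervals, repeats this 1,000 times, and compares an observed statistic with the resulting null distribution. The function names, the representation of the background as a list of `(start, end)` intervals, and the one-sided empirical p-value with a +1 correction are assumptions for illustration; the source only states that length-matched regions were sampled 1,000 times from backgrounds that differ by test.

```python
import random
from typing import Callable, List, Sequence, Tuple

Interval = Tuple[int, int]  # half-open (start, end)


def simulate_null(lengths: Sequence[int],
                  background: Sequence[Interval],
                  statistic: Callable[[List[Interval]], float],
                  n_iter: int = 1000,
                  seed: int = 0) -> List[float]:
    """Build a null distribution from randomly placed, length-matched regions."""
    rng = random.Random(seed)
    null: List[float] = []
    for _ in range(n_iter):
        sample: List[Interval] = []
        for length in lengths:
            # Background intervals that can hold a region of this length,
            # weighted by the number of valid start positions they offer.
            fits = [(s, e) for s, e in background if e - s >= length]
            weights = [e - s - length + 1 for s, e in fits]
            s, e = rng.choices(fits, weights=weights, k=1)[0]
            start = rng.randint(s, e - length)
            sample.append((start, start + length))
        null.append(statistic(sample))
    return null


def empirical_p(observed: float, null: Sequence[float]) -> float:
    """One-sided empirical p-value: fraction of null values >= observed."""
    return (1 + sum(v >= observed for v in null)) / (1 + len(null))


if __name__ == "__main__":
    # Hypothetical example: fraction of regions starting within 10 kb of position 0.
    near_gene = lambda regs: sum(s < 10_000 for s, _ in regs) / len(regs)
    null = simulate_null([500] * 50, [(0, 1_000_000)], near_gene)
    print(empirical_p(0.49, null))
```

Overlaps between simulated regions are not excluded here; a stricter version would reject and redraw overlapping placements.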
Sequence conservation analysis was based on the sequence conservation measures provided for each nucleotide position by the PhastCons conservation scores for 18-way multiple alignments between the human genome and 17 other placental mammalian species
We tested protein-coding potential of human igHTR by determining the maximum CSF (codon substitution frequency) score observed across the entire genomic locus, following
For overlap between lincRNA (large intergenic non-coding RNA) and igHTR, we used published lincRNA identified in mouse
For overlap between EvoFold predictions and igHTR, we downloaded a total of 47,510 predicted RNA from the UCSC browser
Among all annotated human protein-coding genes (Ensembl release 50), 18,391 can be matched between the three species based on the HCM alignment. Out of these genes, 13,832 expressed in at least two of the four samples were used in this analysis. The gene expression levels were calculated as the number of sequence reads uniquely mapped in exons, normalized by the gene's exonic region length. Reads mapped to exon junctions were not counted here, because some exon boundaries might not have been matched accurately between genomes based on the HCM genome alignment. The expression levels were normalized across samples using quantile normalization (normalize.quantiles function in R)
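The two normalization steps described above can be sketched in plain Python as follows. This is an illustrative re-implementation, not the R `normalize.quantiles` function used in the study: the function names are hypothetical, and ties are resolved by input order rather than by any tie-averaging rule.

```python
from typing import List, Sequence


def length_normalize(counts: Sequence[float],
                     exon_lengths: Sequence[float]) -> List[float]:
    """Divide each gene's uniquely mapped exonic read count by its exonic length."""
    return [c / l for c, l in zip(counts, exon_lengths)]


def quantile_normalize(samples: Sequence[Sequence[float]]) -> List[List[float]]:
    """Force all samples to share the same distribution of expression values.

    `samples` holds one list per sample, with genes in the same order.
    The k-th smallest value in every sample is replaced by the mean of the
    k-th smallest values across samples (ties broken by input order).
    """
    n_genes = len(samples[0])
    sorted_cols = [sorted(col) for col in samples]
    rank_means = [sum(col[k] for col in sorted_cols) / len(samples)
                  for k in range(n_genes)]

    normalized: List[List[float]] = []
    for col in samples:
        order = sorted(range(n_genes), key=lambda i: col[i])
        out = [0.0] * n_genes
        for rank, gene_idx in enumerate(order):
            out[gene_idx] = rank_means[rank]
        normalized.append(out)
    return normalized


if __name__ == "__main__":
    # Two hypothetical samples with three genes each.
    print(quantile_normalize([[5, 2, 3], [4, 1, 6]]))
```

After this step every sample has an identical set of values, so remaining differences between samples reflect only changes in the rank of each gene.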
To identify species-specific expression of genes, HTR, or GW, we used a Bioconductor package for differential expression analysis of digital gene expression data, “edgeR”
HTR were determined over the entire HCM alignment using standard parameters (maxspacing = 150 nt and minhits = 10) and assigned to the annotation categories according to the hierarchy mentioned above (
For 118 genes with human-specific expression, 251 genes containing igHTR (within 10 kb from the gene boundaries in both directions in the human samples), and for 204 (of 251) genes with igHTR near 3′-UTR, we performed GO-term/KEGG-pathway enrichment analysis using 15,263 genes expressed in at least one out of four samples as background. For the GO function enrichment analysis, we downloaded the Ensembl gene-GO annotation from the Ensembl database
To compare human-chimpanzee expression differences, we used expression data measured using Affymetrix arrays in three human and three chimpanzee adult cerebellar samples
We compared selective constraints in 118 genes with human-specific expression to those of 15,263 genes expressed in at least one out of four samples based on three measures: (1) Ka/Ks between human and mouse: the data were downloaded from Ensembl (release 50)
Composition of primate brain transcriptome
(0.14 MB DOC)
DNA length and transcriptional activity of repeat families
(0.25 MB DOC)
Relationship between age rank and transcriptional activity of transposable element families
(0.15 MB DOC)
The proportion of igHTR overlaps between the two human samples
(0.10 MB DOC)
Expression correlation of igHTR
(1.29 MB DOC)
Distribution of DNA sequence conservation scores at different igHTR expression cutoffs
(0.67 MB DOC)
Correlation of expression levels of igHTR that cluster within the genome
(0.09 MB DOC)
Connections between igHTR within clusters supported by EST
(0.04 MB DOC)
Expression correlation between igHTR and the nearest protein-coding gene
(0.23 MB DOC)
Number of igHTR/gene 3 prime end connections verified by at least one EST sequence
(0.15 MB DOC)
Number of conserved miRNA target sites in igHTR
(0.10 MB DOC)
Codon Substitution Frequency (CSF) score in different types of regions
(0.08 MB DOC)
Overlap between igHTR and lincRNAs
(0.10 MB DOC)
Overlap between igHTR and EvoFold predictions
(0.10 MB DOC)
Gene expression trees based on different measures of gene expression divergence
(0.16 MB DOC)
Gene expression divergence vs. evolutionary time
(0.27 MB DOC)
Genomic annotation of expressed regions within HTR
(0.21 MB DOC)
Genomic annotation of expressed regions within GW
(0.21 MB DOC)
Sample information
(0.02 MB XLS)
Numbers of sequence reads
(0.02 MB XLS)
Genomic coordinates, category and EST overlap of igHTR
(0.18 MB XLS)
GO term (biological process) enrichment analysis results
(0.02 MB XLS)
KEGG pathway enrichment analysis results
(0.03 MB XLS)
Species-specific Genes
(0.13 MB XLS)
Evolutionary conservation of human specific genes
(0.02 MB XLS)
Species-specific HTR
(0.11 MB XLS)
Annotation of genomic DNA expressed in HTR
(0.02 MB XLS)
Species-specific GW
(1.54 MB XLS)
Annotation of genomic DNA expressed in GW
(0.02 MB XLS)
We thank NICHD Brain and Tissue Bank for Developmental Disorders and H. R. Zielke in particular for providing the human samples, Suzhou Drug Safety Evaluation and Research Center and C. Lian, H. Cai and X. Zheng in particular for providing the macaque samples, E. Lizano, T. Giger and F. Xue for assistance, M. Kircher and J. Kelso for help with data analysis, H. Lockstone and J. Dent for editing the manuscript, M. Lachmann, M. Vingron, E. Green and all members of the Comparative Biology Group in Shanghai for helpful discussions. Finally, we thank three anonymous Reviewers for a very detailed and critical evaluation of our manuscript.