DB and CS conceived and designed the experiments and wrote the paper. DB performed the experiments. DB and RS analyzed the data. DB, RS, and DSM contributed reagents/materials/analysis tools.
The authors have declared that no competing interests exist.
The recent discovery of a new class of 30-nucleotide long RNAs in mammalian testes, called PIWI-interacting RNA (piRNA), with similarities to microRNAs and repeat-associated small interfering RNAs (rasiRNAs), has raised puzzling questions regarding their biogenesis and function. We report a comparative analysis of currently available piRNA sequence data from the pachytene stage of mouse spermatogenesis that sheds light on their sequence diversity and mechanism of biogenesis. We conclude that (i) there are at least four times as many piRNAs in mouse testes as are currently known; (ii) piRNAs, which originate from long precursor transcripts, are generated by quasi-random enzymatic processing that is guided by a weak sequence signature at the piRNA 5′ ends, resulting in a large number of distinct sequences; and (iii) many of the piRNA clusters contain inverted repeat segments capable of forming double-strand RNA fold-back segments that may initiate piRNA processing analogous to transposon silencing.
The discovery of a new class of mammalian small regulatory RNAs termed PIWI-interacting RNA (piRNA) has extended the diverse family of small regulatory RNAs. PIWI proteins are a subclass of the larger Argonaute protein family, of which the Ago members bind microRNAs and play a critical role in gene silencing. Despite the homology between PIWI and Ago proteins, piRNAs are strikingly different from microRNAs in their length, expression pattern, and genomic organization. In contrast, piRNAs are similar to repeat-associated small interfering RNA (rasiRNAs), a class of small RNAs that are responsible for transposon silencing in
A recent landmark discovery has identified a novel class of small RNAs in mammalian testes that is expressed during spermatogenesis [
To better understand the origin of piRNAs, we compared the three largest available mouse piRNA datasets (identified at the pachytene stage of spermatogenesis) in terms of sequence similarities and cluster organization. Given the comprehensive nature of these efforts and the focus on a common specific stage in mouse spermatogenesis, we expected close agreement between the datasets. Indeed, the three groups report
Although the three studies identified the same piRNA clusters, they are distinct at the level of piRNA sequences.
(A) View of Chromosome 5 piRNA sequences and clusters from datasets A, B, C. Top panel (1) is the karyotype view with cluster positions of the datasets: A (green lines), B (top purple triangles), and C (bottom yellow triangles). Lower panels (2–5) are magnified views of the sequences and cluster locations from the three datasets. Top three tracks in each panel are the sequence locations from datasets C (yellow), B (purple), and A (green), and lower three tracks are the cluster positions in the same color scheme. The Venn diagram of the cluster overlaps (B) shows a good agreement between the datasets while sequence overlaps, using 95% identity measure, are small (C). Note that the number of piRNAs used in this comparison is different from the number of sequences reported in the original studies (see
We further show that 25% of piRNA clusters are bracketed by inverted repeats of varying length, suggesting that some of the long single-stranded piRNA precursors [
Five groups reported the discovery of small RNAs expressed exclusively in mammalian testes (mouse, rat, and human) that bind MIWI (murine PIWI) or MILI proteins [
The short length of piRNAs and the structural homology between PIWI and Argonaute proteins are suggestive of functional similarities between piRNAs and microRNAs. However, the combined evidence indicates that both the biogenesis and function of these two classes of RNA are distinct (
Comparison of microRNAs and piRNAs
The majority of piRNAs (81%–96%) is organized in clusters (
piRNAs bind MILI and MIWI proteins, which are members of the PIWI protein family, a subclass of the Argonaute family. In eukaryotes, Argonaute proteins are key components of the interfering RNA pathway in which they bind mature microRNAs or siRNAs to form the RNA-induced silencing complex (RISC) [
rasiRNAs are a class of interfering RNA with a size distribution of 23–28 nucleotides that were identified in a number of organisms [
The length characteristics, testis-specific expression, PIWI interaction, genomic organization, and 5′ uridine enrichment suggest that piRNAs may be the mammalian equivalent of rasiRNAs. This would support the idea that mammalian piRNAs might be involved in silencing transposable elements. However, at present, there are a number of differences that cast doubt on this functional analogy. First, genomic annotation of piRNAs indicates that only 12%–20% are repeat derived [
The discovery of large sets of piRNAs raises a number of important biological questions. In particular, what is the biochemical role and cellular function of PIWI-bound piRNAs during spermatogenesis? Are they involved in transposon silencing, chromosome rearrangements (as are 30-nt PIWI-bound RNAs in
We first compared the cluster locations in the mouse genome from datasets A–C and found extensive agreement between the datasets. The majority of clusters overlap by more than 75% of the length of the shorter cluster. All 42 genomic clusters from dataset C, the smallest of the three, matched clusters of datasets A and B (
We compared the sets of individual sequences from the three groups (A–C). Contrary to the agreement between clusters, we found surprisingly small overlaps between the sets of unique sequences, irrespective of the criteria used for sequence comparison (100%, 95%, or 90% sequence identity,
This small overlap between the piRNA datasets points to an apparent contradiction—how can different sets of piRNA sequences originate from a common set of genomic clusters? The simplest explanation is that each experiment identified only a subset of sequences from a larger pool of unique piRNA sequences. To quantify this effect, we first asked whether the observed overlaps are within the expected range assuming that the complete piRNA pool is simply the union of the three datasets. To facilitate the comparison we restricted this analysis to the intersection of clusters from the three datasets, termed “intersection clusters” (
From this estimate we conclude that the current datasets analyzed here have so far identified only 25%–30% of all potential piRNA sequences from the pachytene stage of mouse spermatogenesis. This implies that in the complete set ∼20%–25% of all “U” positions in the clusters are potential start sites for piRNA sequences when taking into account the pronounced preference for 5′ uridine. Extrapolating to saturation in all clusters reported by any of the three groups, we arrive at the overall conservative estimate of
We estimate that the total number of piRNAs in mouse testes is ∼2 × 10⁵ (red), roughly four times the number of currently known piRNAs (blue). The estimated number of piRNAs corresponds to ∼23% of all “U” positions (green) or 5%–6% of all nucleotides (yellow) in piRNA clusters.
The details of piRNA biogenesis are not yet known. In particular, what is the precursor form of piRNAs? Is it single-strand or double-strand? What are the components of the nuclease-processing complex? By which mechanism, in which order, and under which regulatory control do thousands of different ∼30 nt transcripts originate from a limited number of genomic regions? The large differences in piRNA datasets and the relatively weak evolutionary conservation of piRNA sequences suggest that the processing of piRNAs from a primary precursor is not a precise step, in contrast to microRNA maturation. Instead, it appears, to a first approximation, that piRNAs are generated by a random mechanism in which any U position is a potential 5′ piRNA start. This notion is supported by the fact that sequence overlap between the datasets remains low even when we compare only the more abundant sequences (
We therefore attempted to identify a distinguishing sequence signal that predicts which U bases are 5′ piRNA cleavage sites. Using a sequence classification algorithm, we distinguished the correct 5′ U piRNA sites from all other U positions with 61% accuracy, both in 10-fold cross-validation on the training set and when testing on a randomly withheld test set excluded from training (see
piRNAs originate from long RNA precursors transcribed from a small number of genomic regions (A). Some clusters contain inverted repeats that can potentially form dsRNA fold-back structures. In this genomic view of a cluster chr2: 150870000–150910000 (B), the inverted repeats are represented as linked colored bars. These inverted repeats originate from inverted LINE transposable elements that flank the piRNA cluster (red and blue bars in the LINE track). A long transcript containing the pair of inverted LINE elements can potentially form a precursor with a dsRNA segment (C). piRNAs are processed by a quasi-random mechanism with a weak sequence preference near the 5′ U that is most pronounced in frequent clones (D).
The precursor form of the piRNA primary transcript (single- or double-stranded) is currently unknown. However, the strong 5′ uridine bias and the presence of the 5′ phosphate group [
In
Recent studies propose that mammalian piRNAs may be involved in transposon silencing analogous to
We hypothesize that one plausible model of piRNA biogenesis involves long transcripts that contain flanking inverted transposable elements, one at each end of the cluster (
We cannot exclude the possibility that the bracketing inverted transposable elements are not part of the primary transcript but simply the result of statistical coincidence. In fact, similar numbers of such repeats are found in randomly chosen genomic regions (unpublished data), as remnants of transposable elements account for over a third of the mouse genome [
The discovery of piRNAs has extended the multifaceted family of small interfering RNAs that includes microRNAs, siRNAs, and rasiRNAs. The tens of thousands of distinct mouse piRNAs observed so far map to ∼117 distinct genomic locations. The details of piRNA transcriptional control, such as promoter sites and transcription factors, remain to be determined. Our analysis has revealed low sequence overlap between the currently known pachytene-stage mouse piRNA datasets, although the sequences originate from a common set of genomic clusters. This apparent contradiction is resolved by noting the lack of saturation in each individual experiment. We interpret the low sequence overlap as suggestive of quasi-random sub-saturation processing from common precursors, such that different experiments yield different and only partially overlapping sets of piRNAs. In addition, based on the observation of repeat structures bracketing some of the clusters, we propose that one plausible mechanism for initiation of piRNA biogenesis involves long transcripts with terminal inverted repeats, possibly derived from (remnants of) transposable elements. Such transcripts may form partial dsRNA intermediates initiating enzymatic degradation. Subsequent stages of piRNA biogenesis may then follow the ping-pong model proposed by [
The notion that piRNAs both direct the degradation and are the degradation products of their own precursors suggests that piRNA transcripts are under strict regulation at a crucial stage of meiosis. What is their function? The PIWI proteins are highly expressed in the pre-pachytene and pachytene stages of meiosis, when chromosome pairing is completed (zygotene) and synapsis peaks. This raises the intriguing possibility that the transcripts from which the piRNAs derive, and/or the piRNAs themselves, are involved in one of the crucial processes of meiosis, correct chromosome pairing, for which the molecular mechanism remains a mystery. The connection between this and the proposed piRNA function of transposon silencing remains to be elucidated. We look forward to directed biochemical and genomic experiments that will invalidate or confirm the models proposed here and explain the function of piRNAs.
Mouse piRNA sequences were collected from the following sources: Dataset A from Lau et al. [
Duplicates and subsequences were removed from each dataset at 100% nucleotide identity (
-W=6 -X=50 -gapX=50 -S2=50 -gapS2=50 -hspmax=1000 -gspmax=1000 -E=1000 -filter=none.
Over 90% of the sequences mapped to unique genomic locations. In the remaining cases where there was more than one match to the genome, all positions were considered as a possible origin of the piRNA.
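The removal of duplicates and subsequences at 100% identity described above can be illustrated with a minimal sketch. The function name and the quadratic scan are assumptions made for illustration; they are not the in-house software used in the study.

```python
def remove_duplicates_and_subsequences(seqs):
    """Keep one copy of each sequence and drop any sequence that is fully
    contained (100% identity) in a longer retained sequence.

    Hypothetical helper for illustration; O(n^2) in the number of sequences.
    """
    # Collapse exact duplicates, then process the longest sequences first so
    # that every potential "container" is seen before its subsequences.
    unique = sorted(set(seqs), key=lambda s: (-len(s), s))
    kept = []
    for s in unique:
        if not any(s in longer for longer in kept):
            kept.append(s)
    return kept


if __name__ == "__main__":
    reads = ["UGACCU", "UGACCU", "GACC", "UUUA"]
    print(remove_duplicates_and_subsequences(reads))
```

For datasets with tens of thousands of sequences, an index-based approach (for example, a suffix structure) would scale better than this pairwise scan.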
Coordinates of piRNA clusters from dataset C were translated from mm6 to mm7, in some cases resulting in a change in cluster length due to partial mapping:
mm6|chr9|+|67822641|67883254
mm7|chr9|+|67751406|67785923
mm6|chr14|+|22446408|22484616
mm7|chr14|+|21745838|21783387
mm6|chr9|+|54305216|54360650
mm7|chr9|+|54231430|54253257
mm6|chr17|+|63838569|63952874
mm7|chr17|+|64406371|64449447
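The effect of the assembly translation on cluster length can be checked directly from the coordinate records listed above. The sketch below assumes the pipe-delimited layout `assembly|chrom|strand|start|end` and takes length as `end - start`; the function names are hypothetical.

```python
def parse_record(record):
    """Parse 'assembly|chrom|strand|start|end' into a dictionary."""
    assembly, chrom, strand, start, end = record.split("|")
    return {"assembly": assembly, "chrom": chrom, "strand": strand,
            "start": int(start), "end": int(end)}


def length_change(old_record, new_record):
    """Return (old length, new length, difference) for one translated cluster.

    Length is taken as end - start (an assumption about the coordinate
    convention; the difference is the same under an inclusive convention).
    """
    old, new = parse_record(old_record), parse_record(new_record)
    old_len = old["end"] - old["start"]
    new_len = new["end"] - new["start"]
    return old_len, new_len, new_len - old_len


# Coordinate pairs copied from the list above.
PAIRS = [
    ("mm6|chr9|+|67822641|67883254", "mm7|chr9|+|67751406|67785923"),
    ("mm6|chr14|+|22446408|22484616", "mm7|chr14|+|21745838|21783387"),
    ("mm6|chr9|+|54305216|54360650", "mm7|chr9|+|54231430|54253257"),
    ("mm6|chr17|+|63838569|63952874", "mm7|chr17|+|64406371|64449447"),
]

if __name__ == "__main__":
    for old, new in PAIRS:
        print(old, "->", new, length_change(old, new))
```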
The datasets were not significantly biased to specific sequences or nucleotide composition by experimental protocol. The two larger datasets (A and B) were produced using similar ligation adaptors and sequencing methods, excluding the possibility of sequence bias due to different methodologies. Indeed, we found no differences in mononucleotide or dinucleotide frequencies between the datasets.
Overlaps between genomic clusters from different datasets were determined by intersection of their genomic locations. The length of the overlaps ranged from 19% to 100% of the shorter cluster. In the majority (70%) of the overlapping clusters, the extent of the overlap covered >75% of the length of the shorter cluster. Instances where two clusters from one dataset overlapped a single cluster from another dataset were counted as one overlap. Intersection clusters were defined as the genomic regions where clusters from all three datasets overlapped (see
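The overlap measure and the definition of intersection clusters above can be sketched as interval arithmetic. The tuple layout `(chrom, start, end)` with an exclusive end, and both function names, are assumptions for illustration.

```python
def overlap_fraction(a, b):
    """Overlap length divided by the length of the shorter cluster.

    a and b are (chrom, start, end) tuples; end is treated as exclusive
    (an assumed convention).
    """
    if a[0] != b[0]:
        return 0.0
    overlap = min(a[2], b[2]) - max(a[1], b[1])
    if overlap <= 0:
        return 0.0
    shorter = min(a[2] - a[1], b[2] - b[1])
    return overlap / shorter


def intersection_region(clusters):
    """Region shared by all given clusters, or None if they do not all overlap."""
    chroms = {c[0] for c in clusters}
    if len(chroms) != 1:
        return None
    start = max(c[1] for c in clusters)
    end = min(c[2] for c in clusters)
    return (chroms.pop(), start, end) if start < end else None


if __name__ == "__main__":
    # Toy coordinates, not taken from the datasets.
    a, b, c = ("chr5", 100, 200), ("chr5", 150, 400), ("chr5", 120, 180)
    print(overlap_fraction(a, b))
    print(intersection_region([a, b, c]))
```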
Sequence comparison was performed as follows: All sequences (after initial processing) from all datasets were combined and compared all-against-all using WU-BLAST and in-house software. Sequences were grouped into similarity sets by hierarchical clustering under a defined identity measure. To explore the sensitivity of the analysis to variation in parameters, we performed three clustering procedures using these identity measures: (i) 100% sequence identity over the entire length of the shortest sequence; (ii) 95% sequence identity over 95% of the length of the shortest sequence; and (iii) 90% sequence identity over 90% of the length of the shortest sequence. Considering all sequences in a similarity cluster to be essentially identical, the degree of overlap between two datasets was determined by counting the number of similarity clusters that contain sequences from both datasets (
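The counting of shared similarity clusters can be illustrated with a small sketch. It departs from the study in two labeled ways: an ungapped sliding comparison stands in for WU-BLAST, and single-linkage grouping (via union-find) stands in for the hierarchical clustering, whose linkage rule is not stated here. All function names are hypothetical.

```python
import math


def similar(a, b, identity=0.95, coverage=0.95):
    """True if some ungapped alignment covers at least `coverage` of the
    shorter sequence with at least `identity` matching positions."""
    short, long_ = sorted((a, b), key=len)
    min_aln = math.ceil(coverage * len(short))
    # offset = position of short[0] relative to long_[0]; negative offsets
    # let the shorter sequence overhang the start of the longer one.
    for offset in range(-(len(short) - min_aln), len(long_) - min_aln + 1):
        s_start, l_start = max(0, -offset), max(0, offset)
        aln = min(len(short) - s_start, len(long_) - l_start)
        if aln < min_aln:
            continue
        matches = sum(short[s_start + i] == long_[l_start + i] for i in range(aln))
        if matches >= identity * aln:
            return True
    return False


def similarity_clusters(labelled, **criteria):
    """Group (dataset, sequence) pairs by single linkage; return, for each
    similarity cluster, the set of dataset labels it contains."""
    n = len(labelled)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if similar(labelled[i][1], labelled[j][1], **criteria):
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), set()).add(labelled[i][0])
    return list(groups.values())


def shared_clusters(labelled, ds1, ds2, **criteria):
    """Number of similarity clusters containing sequences from both datasets."""
    return sum(1 for g in similarity_clusters(labelled, **criteria)
               if ds1 in g and ds2 in g)
```

Calling `shared_clusters(labelled, "A", "B", identity=1.0, coverage=1.0)` corresponds to criterion (i); the defaults correspond to criterion (ii).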
Human piRNA sequences were retrieved from the Girard et al. (dataset B) and Aravin et al. (dataset C) studies. As with the mouse piRNAs, sequences that matched known small RNAs and mRNAs were removed, resulting in 9,600 unique piRNA sequences from dataset B and 120 sequences from dataset C. Sequence comparison was performed as outlined above. Under the 95% sequence identity measure, 29 sequences were shared between the two datasets, corresponding to ∼24% of dataset C sequences.
The degree of overlap between two independent datasets, say
The value of total number of piRNAs
For each intersection cluster we computed three estimates of
The total number of piRNAs was computed as the average of the three approximations summed over all clusters:
To approximate the total number of piRNAs in the mouse genome we extrapolated the total number in all intersection clusters to the union of all clusters from datasets A, B, and C (
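The formulas for this estimate are not reproduced in the text above, so the following is only a sketch under a stated assumption: that each pairwise estimate takes the standard capture-recapture (Lincoln-Petersen) form, in which two independent samples of sizes n1 and n2 sharing m sequences imply a pool of about n1·n2/m. The description (three pairwise estimates per intersection cluster, averaged, then summed over clusters) is consistent with this form, but the study's exact expressions may differ. Function names and the toy counts are hypothetical, and the extrapolation to union clusters is not modeled.

```python
def pairwise_estimate(n1, n2, shared):
    """Lincoln-Petersen estimate of pool size (ASSUMED form, see lead-in).

    Returns None when the two samples share nothing, since the estimate
    is then undefined.
    """
    if shared == 0:
        return None
    return n1 * n2 / shared


def cluster_estimate(counts, shared):
    """Average of the available pairwise estimates for one cluster.

    counts: {"A": nA, "B": nB, "C": nC}
    shared: {("A", "B"): mAB, ("A", "C"): mAC, ("B", "C"): mBC}
    """
    estimates = [pairwise_estimate(counts[x], counts[y], m)
                 for (x, y), m in shared.items()]
    estimates = [e for e in estimates if e is not None]
    return sum(estimates) / len(estimates) if estimates else None


def total_estimate(clusters):
    """Sum of per-cluster estimates; clusters without any estimate are skipped."""
    values = [cluster_estimate(counts, shared) for counts, shared in clusters]
    return sum(v for v in values if v is not None)


if __name__ == "__main__":
    # Toy counts, not taken from the datasets.
    counts = {"A": 100, "B": 200, "C": 50}
    shared = {("A", "B"): 50, ("A", "C"): 10, ("B", "C"): 25}
    print(cluster_estimate(counts, shared))
```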
Sequences assigned to genomic positions were sorted by chromosomal position. The distance between two adjacent sequences
To identify a distinguishing signal for 5′ piRNA processing in cluster regions, we trained a support vector machine classifier to discriminate between 5′ piRNA and all other uridine positions.
The positive set included all piRNA 5′ uridine positions, each extended by ten bases upstream and downstream, for a total of 24,604 sequences. Similarly, the negative set was constructed by selecting random non-piRNA uridine positions in the intersection clusters together with the ten nucleotides upstream and downstream. Both sets were split into two parts, one used for training and the other for testing. Feature vectors were constructed by converting the 21-base sequences into 84-bit vectors (21 nt × 4 bases); that is, each nucleotide position was converted to a 4-bit vector representing the RNA base.
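The feature construction described above amounts to one-hot encoding of 21-nt windows centered on uridine positions. The sketch below is a minimal illustration; the bit order (A, C, G, U), the handling of ambiguous bases, and the function names are assumptions.

```python
BASES = "ACGU"  # assumed bit order within each 4-bit group


def one_hot(window):
    """Encode an RNA (or DNA) string as a flat list of 4 bits per nucleotide.

    Bases outside A/C/G/U (for example N) are encoded as four zeros.
    """
    vec = []
    for nt in window.upper().replace("T", "U"):
        vec.extend(1 if nt == base else 0 for base in BASES)
    return vec


def uridine_windows(seq, flank=10):
    """Yield (position, window) for every U that has `flank` bases on both sides."""
    seq = seq.upper().replace("T", "U")
    windows = []
    for i, nt in enumerate(seq):
        if nt == "U" and i >= flank and i + flank < len(seq):
            windows.append((i, seq[i - flank:i + flank + 1]))
    return windows


if __name__ == "__main__":
    toy = "A" * 10 + "U" + "C" * 10  # toy sequence, not from the datasets
    for pos, window in uridine_windows(toy):
        print(pos, window, len(one_hot(window)))
```

Vectors produced this way can be passed to any support vector machine implementation; positive and negative labels would come from whether the central U is a known piRNA 5′ end.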
Support vector machine training and classification was performed using an R interface of “libsvm” (
For detection of inverted repeats in the vicinity of clusters, sequences were collected from the union clusters (
We are grateful to Boris Reva, Sven Nelander, Nikolaus Schultz, and Tom Tuschl for comments, and to Greg Hannon for early access to piRNA sequences.
dsRNA, double-strand RNA
MIWI, murine PIWI
piRNA, PIWI-interacting RNA
rasiRNA, repeat-associated small interfering RNA
siRNA, small interfering RNA