GL conceived and designed the experiments, performed the experiments, and analyzed the data. GL, CPP, and JH contributed to ongoing discussions on molecular evolution and comparative genomics. GL and CPP wrote the paper.
The authors have declared that no competing interests exist.
It has become clear that a large proportion of functional DNA in the human genome does not code for protein. Identification of this non-coding functional sequence using comparative approaches is proving difficult and has previously been thought to require deep sequencing of multiple vertebrates. Here we introduce a new model and comparative method that, instead of nucleotide substitutions, uses the evolutionary imprint of insertions and deletions (indels) to infer the past consequences of selection. The model predicts the distribution of indels under neutrality, and shows an excellent fit to human–mouse ancestral repeat data. Across the genome, many unusually long ungapped regions are detected that are unaccounted for by the neutral model, and which we predict to be highly enriched in functional DNA that has been subject to purifying selection with respect to indels. We use the model to determine the proportion under indel-purifying selection to be between 2.56% and 3.25% of human euchromatin. Since annotated protein-coding genes comprise only 1.2% of euchromatin, these results lend further weight to the proposition that more than half the functional complement of the human genome is non-protein-coding. The method is surprisingly powerful at identifying selected sequence using only two or three mammalian genomes. Applying the method to the human, mouse, and dog genomes, we identify 90 Mb of human sequence under indel-purifying selection, at a predicted 10% false-discovery rate and 75% sensitivity. As expected, most of the identified sequence represents unannotated material, while the recovered proportions of known protein-coding and microRNA genes closely match the predicted sensitivity of the method. The method's high sensitivity to functional sequence such as microRNAs suggests that as yet unannotated microRNA genes are enriched among the sequences identified. Furthermore, its independence of substitutions allowed us to identify sequence that has been subject to heterogeneous selection, that is, sequence subject to both positive selection with respect to substitutions and purifying selection with respect to indels. The ability to identify elements under heterogeneous selection enables, for the first time, the genome-wide investigation of positive selection on functional elements other than protein-coding genes.
Despite the major impact of sequencing the human genome on our understanding of biology, a fundamental problem remains. Many of the genome's functional elements, particularly those that do not encode protein, are proving difficult to distinguish from neutrally evolving DNA. Lunter et al. introduce a method that exploits the evolutionary imprint of sequence insertions and deletions (so-called indels) to pinpoint functional DNA regions that have been subject to purifying selection. This method hinges on a simple theoretical prediction for the distribution of indels across the human genome. Despite its simplicity, the model shows an excellent fit to human and mouse alignments. This tight fit has been exploited to show that virtually all ancient transposable elements are evolving neutrally, which has long been suspected but not quantified. Indeed, the model estimates the probability that, among all alignable human sequence, a region has been purged of deleterious indels since the human–mouse split. This leads to the prediction that between 2.56% and 3.25% of the human genome sequence is functional. Importantly, the method is independent of conventional nucleotide substitution approaches, and thus immediately presents an initial opportunity to investigate the impact of positive selection on non-coding functional elements.
The human genome has been shaped by the evolutionary forces of mutation, genetic drift, and selection, with the latter acting, in the main, to purify functional regions of deleterious mutations. A previous comparison of the human and mouse genomes estimated that about 5% of the human genome has undergone fewer point mutations than expected under a neutral substitution model [
To begin to understand the biological role of the remaining non-genic functional elements, the essential first step is their identification. Recent studies have focussed on the most highly conserved of these elements, namely ultraconserved elements (defined as segments of >200 base pairs [bp] without substitutions) [
Of all mutation processes, point substitutions are the most prevalent, with insertions and deletions (indels) approximately 10-fold less frequent. While nucleotide substitution models have been studied intensively [
We first applied this neutral indel model to derive upper and lower bounds on the proportion of genome under purifying selection with respect to indels (indel-purifying selection). Our observations can be explained by proposing that between 78.8 ± 0.6 Mb and 100.0 ± 0.8 Mb (2.56%–3.25%) of the human genome has been under indel-purifying selection since the human–mouse split. Although still much higher than the 1.2% represented by coding exons, this represents a substantially lower estimate than the previous 5% estimate based on substitution-level conservation [
As a second application of the neutral indel model, we identified a large proportion of sequence elements that have evolved under indel-purifying selection. The model allowed us to calculate the predicted false-discovery rate (FDR) for the entire set, as well as Bayesian posterior probabilities for individual elements to be under indel-purifying selection. Correlating this set with various independent functional indicators, both positive (for example, overlap with, or close proximity to, known exons) and negative (TE annotation), shows it to be highly enriched with functional DNA.
The key strength of the proposed method lies in its independence of selection with respect to point mutations. Consequently, the method can provide independent confirmation of selection, thereby improving the specificity of methods based on substitutions alone. Moreover, an exciting possibility is that the method allows identification of sequence elements that have been under heterogeneous selection, i.e., that have been subject to purifying selection with respect to indels, but subject to positive selection or relaxed constraints with respect to substitutions. Examples of such elements would include spacers between regulatory elements whose relative distance is functionally constrained, such as those shown to exist in
The neutral indel model hinges on two assumptions: that distinct indel events are independent, and that they occur uniformly across the genome. The first assumption likely holds to high accuracy, but indel rate uniformity can only be expected to be approximately valid; we therefore account for indel rate variation later in the analysis (see the section Accounting for Indel Rate Variation). However, accepting both assumptions as a first approximation, we can immediately draw the conclusion that the distance between successive indels, measured as the number of homologous nucleotides surviving in between, follows a geometric distribution. Note that this conclusion holds irrespective of the distribution of indel lengths themselves, and of the relative incidence of insertions and deletions.
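The geometric prediction can be made concrete with a minimal sketch. The function names, the illustrative rate ρ = 0.05, and the convention that intergap distances start at 1 are assumptions made here for illustration and are not taken from the text. The sketch shows the property that the rest of the analysis relies on: under the model, the logarithm of the expected count falls off linearly with intergap distance, with a slope set by ρ alone.

```python
import math

def geometric_pmf(n, rho):
    """Probability of an intergap distance of exactly n sites (n >= 1),
    given a per-site indel probability rho (support convention assumed)."""
    return rho * (1.0 - rho) ** (n - 1)

def log10_slope(rho):
    """Slope of log10(expected count) against intergap distance.
    It is constant, so the neutral histogram is a straight line on a log scale."""
    return math.log10(1.0 - rho)

def mean_intergap_distance(rho):
    """Expected intergap distance under the geometric model."""
    return 1.0 / rho

rho = 0.05  # illustrative value only
ratio = geometric_pmf(31, rho) / geometric_pmf(30, rho)  # equals 1 - rho
```

Because the ratio of successive probabilities is always 1 − ρ, any systematic excess of long ungapped segments shows up as an upward departure from a straight line.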
The fact that indel events often involve several nucleotides simultaneously introduces co-dependencies in the survival probabilities of nearby sites. In other words, the probability that an ancestral nucleotide survives as a homologous nucleotide in two descendant species is dependent on whether neighbouring nucleotides survive. However, assuming independence of indel events, survival probabilities do become independent conditional on the survival of the left (or right) neighbour. Indeed, if
Although indels cannot be observed directly, for the low indel rates observed in mammals they closely correspond to gaps in the alignment. It thus may be predicted that, under neutrality, the lengths of ungapped sequence between successive alignment gaps—intergap segments (IGS)—would be distributed similarly to the geometric distribution predicted for the distance between successive indels. A whole-genome histogram of IGS lengths, obtained from BlastZ human–mouse alignments [
Histogram of intergap distance counts (log10 scale) in human–mouse alignments, (A) within the whole genome and (B) within ARs. Blue lines indicate predictions of the neutral model (central line, geometric distribution; the slope is related to the per-site indel probability
Outside of the range of 20–50 bp, histogram counts deviate from the neutral model predictions, with IGS of less than 20 bp being underrepresented, and IGS longer than 50 bp being overrepresented. The underrepresentation of short intergap distances is caused by a systematic alignment artefact termed gap attraction [
To investigate whether the overrepresentation of long ungapped segments is, to a large extent, caused by indel-purifying selection, a similar histogram was constructed using only alignments of ARs (see
To quantify the extent of any deviation of the intergap histogram from the neutral model, we introduced a parameter
Generally, a local reduction of the effective indel rate (i.e., the indel rate resulting from mutation and selection combined) gives rise to an overrepresentation of long ungapped segments, as measured by
To partially account for this, we divided the human genome into 20 bins on the basis of G+C content within 250-bp windows, adjusting thresholds to make bins contain equal fractions of the genome, and IGS-length histograms were generated for each bin (
Intergap distance histograms, per G+C content bin, for all of the autosomes and the Y chromosome (left-hand columns) and restricted to ARs within these chromosomes (right-hand columns). Horizontal axes, intergap distance (nucleotides); vertical axes, log10 counts. Red anchors denote the segment over which the weighted linear regression was performed to determine the neutral model's indel rate parameter ρ (central blue curve). An overrepresentation of long ungapped segments is apparent in all whole-genome histograms, and especially for higher G+C contents. In contrast, the histograms that include only AR data show a tight fit to the neutral model, with only modest overrepresentation of long segments.
(A) Whole genome (blue) and AR (red) averages of indel rates. Error bars denote 95% confidence intervals in ρ as determined by weighted linear regression on log frequencies in the intergap length histogram.
(B) Indel rates per G+C content for individual chromosomes (error bars not included for clarity), and autosomal averages (whole autosome, blue; ARs, red). Most autosomes have undergone similar indel rates, with mildly increased rates for the small chromosomes (22 and 19 in particular), and a marked reduction for X, as expected from its distinct germline history. Because of its small size, measurements on the Y chromosome lack accuracy, but are consistent with an increase in indel rates.
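The indel rates ρ in these panels come from a weighted linear regression on log histogram counts. A minimal sketch of such a fit follows. The function name, the dictionary layout of the histogram, and the choice of the raw count as the regression weight (an approximation to the inverse sampling variance of a log count) are assumptions made here, not the authors' implementation.

```python
import math

def fit_indel_rate(hist, lo, hi):
    """Fit log(count) = a + b * length over lengths lo..hi (inclusive).

    hist maps intergap length -> count. Returns (rho, n_segments), where
    rho = 1 - exp(b) and n_segments is the implied number of neutral
    segments, assuming count(n) = n_segments * rho * (1 - rho)**(n - 1).
    """
    xs, ys, ws = [], [], []
    for n in range(lo, hi + 1):
        c = hist.get(n, 0)
        if c > 0:
            xs.append(n)
            ys.append(math.log(c))
            ws.append(float(c))  # assumed weight: var(log count) ~ 1/count
    sw = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / sw
    my = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    rho = 1.0 - math.exp(slope)
    n_segments = math.exp(intercept) * (1.0 - rho) / rho
    return rho, n_segments

# Toy histogram that follows the neutral model exactly (illustrative numbers).
toy = {n: 1e6 * 0.05 * 0.95 ** (n - 1) for n in range(1, 101)}
rho_hat, n_hat = fit_indel_rate(toy, 20, 50)
```

Restricting the fit to an interior length interval, such as the 20–50 bp range over which the neutral prediction was reported to hold, keeps both the alignment artefacts at short lengths and the excess at long lengths out of the estimate.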
A second expected source of indel rate variation is germline history, which is different for sex chromosomes and autosomes. To test this, indel rates were measured for each chromosome separately (
Accounting for indel rate variation as described above thus reduces overall
Investigating the AR histogram more closely, we observed a number of remarkably long ungapped segments, with 25 of these longer than 500 bp. All 25 align to mouse fragment sequence that is unplaced within assembled mouse chromosomes and that shows extraordinarily few (<1%) substitutions compared with human, whereas the corresponding dog, rat, and chimpanzee sequences exhibit gap and substitution patterns that are consistent with neutral evolution. It thus appears likely that these segments represent contaminants, and are primate, not mouse, sequence. A whole-genome scan identified 146 fragments exhibiting similar characteristics, contributing 285 kb to the human–mouse alignment (0.03%,
The stratification of the genome by G+C content allowed the distribution of material under indel-purifying selection by G+C content to be investigated, using
Vertical axis shows
The results above imply that less than 1.2 Mb of human DNA annotated as TEs is unaccounted for by the neutral model (a proportion of 0.0067 of the 177 Mb of human–mouse ARs, or less than 0.09% of all human TEs). A fraction of this will be due to residual indel rate variation that has not been accounted for in the analysis. Other contributions may include non-orthologous alignments, which are more prevalent for repetitive sequence, and misannotations. Finally, a fraction of the 1.2 Mb of TEs may truly be under indel-purifying selection, and thus have (or have had) a functional role in human biology.
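The arithmetic behind these figures can be checked directly from the quoted numbers. The total amount of annotated human TE sequence is not stated in this passage; the second line below only shows what the quoted 0.09% bound implies for it, and should be read as an inference rather than a reported value.

```latex
\[
0.0067 \times 177\ \text{Mb} \approx 1.19\ \text{Mb} < 1.2\ \text{Mb},
\qquad
\frac{1.2\ \text{Mb}}{T_{\text{TE}}} < 0.0009
\;\Longrightarrow\;
T_{\text{TE}} > \frac{1.2\ \text{Mb}}{0.0009} \approx 1.3\ \text{Gb}.
\]
```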
Bounds on the proportion of material under indel-purifying selection in the human genome were then derived. To do this, we had to account for the fact that not all DNA in long ungapped segments is expected to be under selection owing to the relatively low density of indels. As a simple model, the genome was considered to consist of segments of functional material that is purified of any indel, separated by neutral material that accepts all indels (
Indel events (modeled as point events, and represented by arrows) affecting functional DNA (red) are purified from the population and are not observed in extant species. The remaining indels (green arrows) delineate ungapped segments. Those subtending a segment of functional DNA (dark blue) are longer than the functional element itself, and the amount of neutral sites included in these long ungapped segments is on average twice the expected distance between indels on neutrally evolving DNA (see
To derive the upper bound on the proportion of indel-purified human DNA, it was assumed that only purifying selection contributes to the observed whole-genome
For the lower bound, the observed
The resolution for identifying DNA under indel-purifying selection is limited by the relatively low human–mouse indel rate of one per 16–22 surviving homologous sites. To improve resolution, the dataset was augmented by the dog genome. This choice was motivated by the high quality of the dog assembly [
We identified a set of segments highly enriched with indel-purified DNA by setting thresholds on the length of ungapped segments. The neutral indel model was used to predict the number of segments expected to exceed any length threshold under neutrality, from which we calculated the FDR. Adjusting for neutral overhang and false positives, we computed the amount of material under indel-purifying selection among the identified segments, and from this the sensitivity was estimated (see
Axes show predicted proportion of neutral nucleotides (horizontal) and proportion of identified nucleotides among mouse-aligning nucleotides within annotation class (vertical). Red, yellow, and green curves show partial sensitivity to (known or likely) functional DNA, with the predicted sensitivity to DNA under indel-purifying selection (blue curve) following their general trend. For a fair comparison, the partial sensitivities were computed relative to the material common to human, mouse, and dog. The purple curve charts the sensitivity for neutrally evolving ARs, for comparison. Note that the false positive fraction (relative to mouse-aligning neutral elements) is considerably lower than the predicted FDR (relative to the identified set). Converting to the false positive fraction, we calculate the area under the resulting receiver-operating-characteristic curve to be high at 0.93, indicative of the method's discriminatory power.
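The area under a receiver-operating-characteristic curve of this kind can be computed from a list of (false positive fraction, sensitivity) pairs with the trapezoidal rule. The sketch below is generic; the function name and the example points are hypothetical and are not data from the figure.

```python
def roc_auc(points):
    """Trapezoidal area under an ROC curve.

    points is an iterable of (false_positive_fraction, sensitivity) pairs.
    The end points (0, 0) and (1, 1) are added if they are missing.
    """
    pts = sorted(points)
    if pts[0] != (0.0, 0.0):
        pts.insert(0, (0.0, 0.0))
    if pts[-1] != (1.0, 1.0):
        pts.append((1.0, 1.0))
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Hypothetical operating points, one per length threshold.
example_auc = roc_auc([(0.01, 0.40), (0.05, 0.70), (0.20, 0.90)])
```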
At a 1% FDR, we obtained 54.44 Mb of human DNA that is refractory to indels, which includes 64.0% (23.44/36.61 Mb) of known coding exons [
Besides these known elements, the majority of the segments identified consists of currently unannotated sequence (66.6% at 10% FDR), which we predict to predominantly represent DNA that has been under indel-purifying selection. This prediction, implied by the predicted FDR and sensitivity, is supported by the distribution of the identified elements with respect to various annotations (
Annotation of Sequence under Indel-Purifying Selection
To investigate the relationship between purifying selection with respect to either substitutions or indels, we computed the empirical distribution of percent nucleotide identity (PID) of aligned human and mouse sequence, using the segments identified at the 1% FDR level (
Shown is the PID distribution for human segments under indel-purifying selection (at a 1% FDR; blue), and a background distribution obtained on putatively neutrally evolving segments (non-exonic, and not in identified set of segments at 10% FDR; grey). The blue distribution can be decomposed as a mixture of 6% background (shaded) and a remainder (red), suggesting that a proportion of ungapped elements (≈ 5%, mixture coefficient minus FDR of 1%) are under purifying selection with respect to indels, while evolving under relaxed constraints or positive selection with respect to substitutions (see
We have introduced a simple model for the neutral genomic distribution of indels, predicting a geometric drop-off in the frequency of intergap distances across whole-genome alignments. This prediction was observed to hold nearly exactly in human–mouse alignments across a range of intergap distances (20–50 bp). By realigning human Chromosome 21 sequence, this effect was shown to be independent of the alignment procedure, and appears instead to reflect the signature of past mutation. The distribution of between-indel distances within human ARs was shown to agree closely with the neutral model predictions (no statistically significant deviations across the range 20–80 bp), and the total amount of ARs under sustained purifying selection with respect to indels was shown to be at most 1.2 Mb, or 0.09% of all TEs.
A few examples of co-opted and functional TEs are known [
Substantial indel rate variation with local (250-bp) G+C content was observed, with up to 35% increased indel rates for both high and low G+C content. This observation is consistent with polymerase slippage as a main cause of indels, since extremes in G+C content imply higher expected sequence similarities, thereby facilitating slippage. Indel rates also vary with chromosome type, with X having a 15% lower average indel rate than the autosomes. For substitutions, a similar pattern has been observed [
In contrast to ARs, the distribution of inter-indel distances for the whole genome significantly departs from the neutral model, exhibiting a large excess of long (≫50-bp) ungapped segments. Because purifying selection is expected to result in long ungapped segments, and because at most a vanishing fraction of ARs is believed to have been under purifying selection, in stark contrast to the rest of the human genome, these observations are consistent with purifying selection being the predominant cause for this departure. In principle, variations in indel rates that were not accounted for will also cause deviations of this qualitative type. However, ARs are ubiquitous in the human genome, and large-scale rate variations would also influence the distribution of indels within ARs. Local rate variations due to sequence features specific to AR (or non-AR) DNA may also give rise to residual rate variations, and indeed small differences in indel rates were found between ARs and general genomic DNA, after accounting for G+C content. These differences may have various causes, including differences of G+C content distribution within bin thresholds and indel rate variations due to differences in sequence composition other than G+C content. However, simulations showed that rate variations as large as 35% cause a departure from the neutral model of almost three orders of magnitude less than that observed for the whole genome (
Using several conservative assumptions, the total amount of DNA under indel-purifying selection was estimated to be between 2.56% and 3.25% of human euchromatin. These estimates are lower than an earlier estimate of 5% based on an analysis of nucleotide substitutions between human and mouse [
We identified a set of ungapped segments highly enriched with material under indel-purifying selection, and containing a small and predetermined fraction of neutrally evolving segments (FDR). This set was found to be highly enriched with coding exons, known microRNAs, and DNA sharing homology with chicken and
While most of the material identified is unannotated, the low overall density of ARs, the high average PID to mouse, and the high density of chicken- and
The method's sensitivity to microRNAs and protein-coding exons is remarkable for a method that uses neither structural nor evolutionary models particular to exon or microRNA sequence, nor any substitution-based conservation approach, and suggests that the present method will be advantageous as part of a computational gene or microRNA discovery tool. The simplicity of the proposed method easily allows other signals to be included, too. For example, functional material is expected to be highly clustered, and indeed a high degree of clustering was found among the segments we identified (50% of identified segments are within 250 bp of another, and 20% are within 10 bp; expected proportions for a uniform distribution are 4% and 0.2%, respectively). One way of exploiting this would be to consider consecutive indels, and derive a neutral model for such configurations. Finally, although indel spectra vary considerably between organisms [
An analysis of the pattern of substitutions among the identified elements revealed an unexpectedly large fraction whose conservation with respect to substitutions was indistinguishable from neutrality (about 6%, within a set predicted to contain 1% false positives). This result could be explained if, by a failure of the model, the false positive rate was grossly underestimated. However, considering only those elements exhibiting subneutral conservation with respect to substitutions, it was found that the resulting set, although naturally enriched with false positives, still contained an appreciable fraction of functional elements, as indicated by a strong depletion of ARs, and an enrichment with coding exons and
The full data set of identified segments, at both 1% and 10% FDRs, is available for download and visualisation in genome browsers at
We installed mirrors of four sets of BlastZ alignments [
IGS were defined as aligned segments of homology, uninterrupted by gaps in either of the two alignment tracks (or in any of the three in the case of human–mouse–dog alignments). The neutral model was fitted to the observed histogram counts by weighted linear regression on the log frequencies, with weights derived from the expected sampling error per length bin (binomial distribution) in log-space. The length intervals over which this regression was performed were determined by maximizing the coefficient of determination
Histograms on the AR portion of the genome were obtained by intersecting IGS with segments annotated as AR by RepeatMasker [
Realignment was performed using a probabilistic aligner, implementing a pair hidden Markov model (p. 82 of [
The expected distance to the nearest indel downstream (or upstream) of any site is
For a given FDR (defined as the predicted proportion of neutral segments among those identified, weighted by sequence length), we obtain thresholds on ungapped segment length for all G+C bins, and for the X chromosome and other chromosomes separately, by constrained maximization of the predicted total amount of identified segments under selection, using the method of Lagrange multipliers. The predicted sensitivity was computed by adjusting for the contribution of neutral sites using the upper bound method described, and dividing by the whole-genome upper bound of 100 Mb.
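The relationship between a length threshold and the length-weighted FDR can be illustrated for a single bin. This is a deliberately simplified sketch: it treats one G+C bin in isolation and picks the smallest threshold that meets the target, rather than performing the joint constrained maximization over all bins by Lagrange multipliers described above. The function names, the inputs `rho` and `n_neutral_segments` (assumed to come from a prior fit of the neutral model), and the geometric support starting at length 1 are assumptions.

```python
def neutral_bases_at_or_above(threshold, rho, n_neutral_segments):
    """Expected number of neutral bases in segments of length >= threshold,
    for geometric segment lengths with per-site indel probability rho."""
    q = 1.0 - rho
    return n_neutral_segments * q ** (threshold - 1) * (threshold - 1 + 1.0 / rho)

def length_threshold_for_fdr(hist, rho, n_neutral_segments, target_fdr):
    """Smallest observed length T whose length-weighted FDR is <= target_fdr.

    hist maps segment length -> observed count (counts must be positive).
    Returns (T, predicted_fdr), or None if no threshold meets the target.
    """
    lengths = sorted(hist)
    bases_at_or_above = {}
    running = 0.0
    for n in reversed(lengths):  # suffix sums of observed bases
        running += n * hist[n]
        bases_at_or_above[n] = running
    for n in lengths:
        predicted = neutral_bases_at_or_above(n, rho, n_neutral_segments)
        fdr = predicted / bases_at_or_above[n]
        if fdr <= target_fdr:
            return n, fdr
    return None

# Toy data: a neutral background plus an excess of long segments (illustrative).
toy = {n: 1e6 * 0.05 * 0.95 ** (n - 1) for n in range(1, 201)}
toy[300] = 1000.0
result = length_threshold_for_fdr(toy, 0.05, 1e6, 0.10)
```

In the toy data the excess at 300 bp makes the observed total exceed the neutral prediction for long segments, so a threshold exists at which the predicted neutral share of identified bases drops below 10%.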
We computed the histogram of
Horizontal axis, IGS length; vertical axis, log10 of histogram counts. Blue curves, weighted linear regression fit (centre curve; interval 16–77), and 95% confidence limits for the histogram counts under the model (outer curves).
(A) Histogram of IGS truncated to AR boundaries (
(B) Same histogram after stochastic extension of truncated IGS overlapping the rightmost end of ARs (see
Histogram of intergap distance counts (log10 scale) in human–mouse–dog alignments, (A) within the whole genome and (B) within ARs. See
These 146 fragments, of which 285 kb align to human DNA in the BlastZ alignments we used, most likely represent sequence contamination from primates and were thus subsequently excluded from our analysis.
We thank Manolis Dermitzakis for helpful discussions, and Andrea Rocco for the implementation of the realignment algorithm. This work was funded by the MRC UK, grant HAMKA.
AR, ancestral repeat
bp, base pairs
FDR, false-discovery rate
IGS, intergap segments
PID, percent nucleotide identity
TE, transposable element