Current address: HudsonAlpha Institute for Biotechnology, Huntsville, Alabama, United States of America.
Conceived and designed the experiments: EVD GMC AS SB. Performed the experiments: EVD DLG MS. Analyzed the data: EVD DLG MS. Wrote the paper: EVD DLG MS AS SB.
The authors have declared that no competing interests exist.
Computational efforts to identify functional elements within genomes leverage comparative sequence information by looking for regions that exhibit evidence of selective constraint. One way of detecting constrained elements is to follow a bottom-up approach by computing constraint scores for individual positions of a multiple alignment and then defining constrained elements as segments of contiguous, highly scoring nucleotide positions. Here we present GERP++, a new tool that uses maximum likelihood evolutionary rate estimation for position-specific scoring and, in contrast to previous bottom-up methods, a novel dynamic programming approach to subsequently define constrained elements. GERP++ evaluates a richer set of candidate element breakpoints and ranks them based on statistical significance, eliminating the need for biased heuristic extension techniques. Using GERP++ we identify over 1.3 million constrained elements spanning over 7% of the human genome. We predict a higher fraction than earlier estimates largely due to the annotation of longer constrained elements, which improves the one-to-one correspondence between predicted elements and known functional sequences. GERP++ is an efficient and effective tool to provide both nucleotide- and element-level constraint scores within deep multiple sequence alignments.
There are millions of sequences in the human genome that perform essential functions, such as protein-coding exons, noncoding RNAs, and regulatory sequences that control the transcription of genes. However, these functional sequences are embedded in a background of DNA that serves no discernible function. Thus, a major challenge in the field of genomics is the accurate identification of functional sequences in the human genome. One approach to identify functional sequences is to align the genome sequences of many divergent species and search for sequences whose similarity has been maintained during evolution. We have developed GERP++, a software tool that utilizes this “comparative genomics” approach to identify putatively functional sequences. Given a multiple sequence alignment, GERP++ identifies sites under evolutionary constraint, i.e., sites that show fewer substitutions than would be expected to occur during neutral evolution. GERP++ then aggregates these sites into longer, potentially functional sequences called constrained elements. Using GERP++ results in improved resolution of functional sequence elements in the human genome and reveals that a higher proportion of the human genome is under evolutionary constraint (∼7%) than was previously estimated.
The identification and annotation of all functional elements in the human genome is one of the main goals of contemporary genetics in general, and the ENCODE project in particular
Several computational methods for constrained element (CE) detection have been developed, with most falling into one of two broad categories: generative model-based approaches, which attempt to explicitly model the quantity and distribution of constraint within an alignment, and bottom-up approaches, which first estimate constraint at individual positions and then look for clusters of highly constrained positions. A widely used generative approach, phastCons
One of the leading bottom-up approaches is GERP
In this work we present GERP++, a novel bottom-up method for constrained element detection that, like GERP, uses rejected substitutions as a metric of constraint. GERP++ uses a significantly faster and more statistically robust maximum likelihood estimation procedure to compute expected rates of evolution, resulting in a more than 100-fold reduction in computation time. In addition, we introduce a novel criterion for grouping constrained positions into constrained elements, using statistical significance as a guide and assigning p-values to our predictions. We apply a dynamic programming approach to globally predict a set of constrained elements ranked by their p-values, together with a concomitant false positive rate estimate. Using GERP++ we analyzed an alignment of the human genome and 33 other mammalian species, identifying over 1.3 million constrained elements spanning over 7% of the human genome with high confidence. Compared to previous methods, we predict a larger fraction of the human genome to be contained in constrained elements due to the annotation of many fewer but longer elements, with a very low false positive rate.
Like other bottom-up approaches, the GERP++ algorithm consists of two components: calculation of position-specific constraint scores for each column of a multiple alignment, and subsequent aggregation of neighboring columns into segments that score significantly higher than expected by chance (
(1) For each position of the multiple alignment we compute the conservation score in rejected substitutions by subtracting the estimated evolutionary rate from the neutral rate. The neutral rate is computed by removing species gapped at that position from the phylogenetic tree and summing the branch lengths of the resulting projected tree; the evolutionary rate is estimated by computing the maximum likelihood rescaling of the projected tree. (2) Given position-specific conservation scores, we generate a set of candidate elements. (3) For each candidate element, we compute a p-value to represent the likelihood of observing a segment of equal length and greater than or equal score under the null model. We then select a non-overlapping set of elements in order of increasing p-value.
Constraint intensity at individual alignment positions is quantified in terms of “rejected substitutions” (RS), defined as the number of substitutions expected under neutrality minus the number of substitutions “observed” at the position
Then, in the element-finding step, GERP++ uses the position-specific RS scores to generate a set of candidate elements. For each putative element it computes a p-value based on the element's length and score (defined as the sum of RS scores for each position within the element) that represents the probability of observing such an element in the null model. These p-values are used to rank CEs in order of significance and report a set of non-overlapping predictions, starting with the lowest (best) p-value. Rather than applying a fixed cutoff, GERP++ estimates the false positive rate by randomly permuting the input RS-scores and treating any prediction within the shuffled sequence as a false positive, similar to the first version of GERP
We used GERP++ to analyze the TBA alignment of the human genome to 33 other mammalian species (the most distant mammalian species is Platypus) spanning over 3 billion positions with a phylogenetic scope of 5.83 substitutions per neutral site. We identified 1,354,034 constrained elements covering 214,749,502 nucleotides, or approximately 7% of the human genome, with an estimated false positive rate of 0.86% at the nucleotide level (see
We observe significant variation among entire chromosomes of both average RS score and fraction of positions predicted to belong to constrained elements (
(A) Mean RS score for all alignment positions where evolutionary rate was computed. Note the elevated average score for chromosome X. (B) Fraction of chromosome that falls into predicted constrained elements. Light green bars show fraction of entire chromosome, while dark green bars show fraction adjusted for regions where no rate computation was performed and no elements could span (see
The only major parameter for GERP++ is a false positive rate cutoff that determines at what point the algorithm should stop generating predictions in order to avoid too many false discoveries. Throughout its execution GERP++ keeps track of the constrained elements predicted so far, as well as estimates of the number and total size of false positive predictions for the specified cutoff level. Examining how these quantities grow as the cutoff parameter increases permits us to estimate the amount of total constraint that can be detected using this methodology and give an approximate upper bound on the amount of constraint within the human genome.
Let B(
The red curve represents the number of bases within predicted constrained elements as a function of the false positive cutoff parameter. The blue curve represents the number of predicted bases minus the expected number of false positive bases, also as a function of the false positive cutoff.
We next examine the relationship between evolutionary constraint and several classes of biologically important regions. Overall, coding exons exhibit by far the strongest levels of constraint, as quantified both by the average RS score within functional elements (
(A) Mean rejected substitution scores for entire human genome, constrained elements predicted by GERP++, and known annotated exons, introns, and UTR regions. (B) Breakdown of constrained element positions by region type.
Annotation | % Coverage by CEs |
Exons | 84.6% |
Introns | 6.9% |
5′ UTR | 23.7% |
3′ UTR | 33.9% |
ncRNA | 10.1% |
Over 94% of the coding exons in the human genome overlap at least one predicted CE; conversely, only about 16% of constrained elements overlap a coding exon. CEs that overlap exons are on average ∼60 nucleotides or 40% longer, and consequently have more than two-fold higher scores, than elements that do not overlap exons (both t-tests significant at p-value < 2.2·10^−16). While overall these results are consistent with what was observed using the previous version of GERP
To further test this hypothesis, and to investigate a potentially useful signal for detecting coding exons, we introduce a metric that rigorously quantifies this pattern of constraint for any region. For any given segment, we define the 3-periodicity bias as the maximum over the 3 possible reading frames of the mean RS score at positions 1 and 2 minus the mean RS score at position 3. This metric quantifies a periodic bias in constraint and effectively deals with unknown reading frame location and lack of a reading frame altogether, since the maximum is taken over all 3 possibilities. As
Type | Mean 3-periodicity Bias |
Exons | 2.96 |
5′ UTR | 0.57 |
3′ UTR | 0.32 |
Introns | 0.18 |
CEs overlapping exons | 2.46 |
CEs not overlapping exons | 0.55 |
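The 3-periodicity bias defined above can be sketched in a few lines of Python. The function name and the requirement of at least three positions are assumptions made for this illustration; the input is simply the list of RS scores for a segment.

```python
def three_periodicity_bias(rs_scores):
    """Max over the 3 frames of mean(RS at codon positions 1 and 2) - mean(RS at position 3)."""
    if len(rs_scores) < 3:
        raise ValueError("need at least 3 positions")
    best = float("-inf")
    for frame in range(3):
        # In this frame, index i is a third codon position when (i - frame) % 3 == 2.
        third = [s for i, s in enumerate(rs_scores) if (i - frame) % 3 == 2]
        first_second = [s for i, s in enumerate(rs_scores) if (i - frame) % 3 != 2]
        bias = sum(first_second) / len(first_second) - sum(third) / len(third)
        best = max(best, bias)
    return best


# A perfectly periodic toy segment and a flat one (scores are invented).
periodic = three_periodicity_bias([4, 4, 0] * 5)
flat = three_periodicity_bias([2.0] * 9)
```

Because the maximum is taken over all three frames, a segment with the same pattern shifted by one or two positions receives the same bias, while a segment with uniform constraint scores zero.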
We compared the GERP++ constrained element predictions in placental mammals (see
(A) Mean length (left), number (middle) and total length (right) of constrained elements predicted by GERP++ (blue) and phastCons (yellow). (B) Nucleotide-level fraction of annotated exons, introns, UTRs and noncoding RNA genes covered by GERP++ (blue) and phastCons (yellow) predictions. (C and D) Histogram of number of distinct predicted GERP++ (blue, D) and phastCons (yellow, C) constrained elements overlapping each annotated coding exon. Note the difference in scale on the y-axis. (E) A constrained region slightly over 200 base pairs in length that contains a known exon, as annotated by GERP++ (labeled ‘GERP++’, black) and phastCons (purple track labeled ‘Mammal El’). Note how phastCons fragments the exon into multiple CE predictions.
Part of the reason for these differences is that often phastCons predicts multiple elements where GERP++ makes one longer prediction. PhastCons thus skips intermediate positions which may be under weaker constraint yet still part of one large functional element, as the example in
Due in part to its ability to annotate larger elements in one piece, GERP++ is more effective at predicting constraint within several types of known functional regions. At the nucleotide level GERP++ elements cover a substantially larger fraction of several major types of functional elements, especially coding exons and UTRs (
One of the main challenges in constrained element detection is the lack of a clear gold standard for evaluating the quality of predictions. Human functional elements are sometimes unconstrained at the mammalian scope or missed at the assembly or alignment stages, and CE predictions that do not correspond to any known annotations may have unknown function, and cannot be definitively considered false positives. Given these limitations, we have shown that GERP++ offers several advantages over its predecessor GERP and makes fewer assumptions about the shape of conservation than previous approaches such as PhastCons. Previous bottom-up approaches have been limited largely by the simple heuristics used to merge constrained positions into longer elements; these heuristics may introduce biases in element length due to patterned constraint such as the 3-periodicity in coding exons. With GERP++ we evaluate a much richer set of candidate elements, selecting and ranking final predictions according to statistically meaningful p-values.
Despite the added computational cost at this stage, GERP++ overall is more than 100 times faster than GERP due to the speedup in rate estimation. Because GERP++ estimates a single parameter that directly translates into evolutionary rate, rather than an independent parameter for each branch of the tree, the computation is not only faster but also results in more statistically robust estimates as alignment depth increases. GERP++ takes a few days on a typical machine or a few hours on a small cluster to complete an analysis of the human genome aligned to 33 mammalian species, and can scale to virtually any reasonable genome size and alignment depth.
Our understanding of the evolutionary forces constraining sequence variation is still limited, especially in noncoding regions. This presents a challenge for generative model-based approaches, which model implicitly or explicitly the distribution of length and intensity of constrained elements and the total genomic fraction under constraint. In contrast, rate estimation and element prediction in GERP++ are largely independent procedures, and while GERP's rejected substitution metric
One drawback of GERP++ and other similar approaches is sensitivity to variation in, and erroneous estimates of, the neutral rate of substitution. Neutral rate estimates are often subject to some uncertainty and can vary depending on the methodology, alignment quality, and genomic region. To test the ability of GERP++ to tolerate a reasonable amount of error in neutral rate estimates, we repeated our analysis with the neutral tree scaled up or down by 5% or 10%. Not surprisingly, overestimating the neutral rate leads to overprediction of constraint, and vice versa. For a fixed false positive cutoff, we observed a linear relationship between the input neutral rate and the number of constrained element bases predicted; a 5% or 10% change in neutral rate leads to a change of approximately 8% or 15%, respectively, in the number of predicted constrained bases.
It is important to note that our false positive rates and p-values are computed based on the implicit assumption that the score distribution is homogeneous within a region and all sites are independent. While this assumption has been present in previous approaches that also relied on permuted alignments for false positive rate estimation, it is central to the GERP++ p-value computation. Finally, the greedy manner of resolving candidate element overlap conflicts by smallest p-value presents another potential limitation, as for elements with equal average constraint this will break ties in favor of the longer element. This may or may not be biologically meaningful, especially if complicated conservation patterns are involved or two strongly conserved functional elements are very close together (and the segment between them is at least somewhat constrained). These hypothetical effects are likely mitigated by GERP++'s position-specific scores, which enable higher resolution analysis within individual CEs, and which ultimately may be the criterion upon which to decide whether any particular long element may better be regarded as two shorter ones.
GERP++ recapitulates known biology, at both the nucleotide level and on the scale of entire functional elements and even chromosomes. GERP++ scores are accurate enough to obtain a strong signal of synonymous substitution in coding exons, and the elevated average RS score for chromosome X (
Computationally, GERP++ is efficient enough to perform whole-genome analysis of deep mammalian alignments within a few CPU-days, making it suitable for high-throughput analysis of the ever-increasing amounts of genomic data. We hope GERP++ will prove to be a useful tool in analyzing, quantifying, and annotating constraint and discovering novel functional elements in the human and other genomes for which sufficient comparative data exist.
GERP++ is available at
Given a multiple sequence alignment and a phylogenetic tree with branch lengths representing the neutral rate between the species within that alignment, GERP++ quantifies constraint intensity at each individual position in terms of rejected substitutions
For each individual alignment column GERP++ labels the leaves of the phylogenetic tree with the corresponding nucleotides c1, …, ck; gapped species are projected out. Although this is not necessarily ideal and sometimes leads to information loss, it avoids some of the common difficulties and potentially serious biases that accompany modeling gaps in alignments: aligner errors and artifacts that result from simplified gap penalties and incorrect handling of duplications and rearrangements, assembly mistakes, and missing sequence data. Furthermore, this treatment of gaps avoids explicitly penalizing constrained elements that have undergone lineage-specific deletion
Once the gapped species are removed, the site-specific neutral rate is computed as the sum of the branch lengths in the trimmed tree. When there are fewer than 3 species remaining no rate estimation is performed for that position, as there are not enough species to even form a valid tree. We estimate by maximum likelihood a homogeneous scaling factor of the neutral tree at each position; similar but independently developed methods were used for rate estimation in
Since the leaf nucleotides are observed, this equation can be used to compute the subtree probability for all internal nodes, starting at the bottom and reaching the root, where we can compute L(
Using this algorithm as a subroutine to calculate L(
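The rate estimation can be sketched as a bottom-up likelihood recursion over the tree combined with a one-dimensional search for the scaling factor. The sketch below is only illustrative: it assumes a Jukes-Cantor substitution model for brevity (the model used by GERP++ is not specified in this section), a hypothetical `(name, branch_length, children)` tree encoding, and a golden-section search with an arbitrary upper bound on the scale.

```python
import math

BASES = "ACGT"


def jc_prob(same, t):
    """Jukes-Cantor transition probability over a branch of length t (subs/site)."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if same else 0.25 - 0.25 * e


def partials(node, column, scale):
    """P(observed leaves below node | base at node), for each of the 4 bases.

    `column` maps leaf name -> observed nucleotide; every branch is multiplied
    by `scale` before transition probabilities are computed.
    """
    name, _, children = node
    if not children:
        return [1.0 if b == column[name] else 0.0 for b in BASES]
    result = [1.0] * 4
    for child in children:
        child_partial = partials(child, column, scale)
        t = scale * child[1]
        for i in range(4):
            result[i] *= sum(jc_prob(i == j, t) * child_partial[j] for j in range(4))
    return result


def log_likelihood(root, column, scale):
    """Log-likelihood of the column under the uniformly rescaled tree."""
    lik = sum(0.25 * p for p in partials(root, column, scale))
    return math.log(lik) if lik > 0.0 else float("-inf")


def ml_scale(root, column, lo=0.0, hi=3.0, iters=60):
    """Golden-section search for the scale maximising the likelihood.

    Assumes the likelihood is unimodal on [lo, hi]; `hi` is an arbitrary cap.
    """
    g = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    for _ in range(iters):
        c, d = b - g * (b - a), a + g * (b - a)
        if log_likelihood(root, column, c) > log_likelihood(root, column, d):
            b = d
        else:
            a = c
    return (a + b) / 2.0


# Toy tree and columns; branch lengths and nucleotides are invented.
tree = ("root", 0.0, [
    ("anc", 0.2, [("human", 0.1, []), ("chimp", 0.1, [])]),
    ("mouse", 0.5, []),
    ("dog", 0.4, []),
])
constant = {"human": "A", "chimp": "A", "mouse": "A", "dog": "A"}
variable = {"human": "A", "chimp": "A", "mouse": "A", "dog": "G"}
s_const = ml_scale(tree, constant)
s_var = ml_scale(tree, variable)
```

Only one parameter is fitted per column, which is what keeps the search cheap. If the estimated rate is taken to be the scale times the site-specific neutral rate N, the RS score follows as N·(1 − scale), so a fully conserved column scores close to N.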
Given position-specific constraint scores, GERP++ generates a list of elements that exhibit evidence of evolutionary constraint beyond what is likely to occur by chance. For each element, we compute a p-value that represents the probability of a random neutral segment of equal length having an equal or higher RS score. In addition to being used to select final predictions from the set of candidate elements, these p-values in conjunction with position-specific scores provide useful information for biological analysis.
Every segment of contiguous multiple alignment columns is a candidate element. Because considering all possible segments within the alignment is computationally infeasible, GERP++ generates a list of candidate elements using several simple biological heuristics to prune the possibilities. First, we impose a user-specified minimum and maximum on candidate element length; while real functional elements vary in length, very few extend beyond several thousand bases, and even these will not be missed entirely as GERP++ will identify their most constrained parts. Second, since positive RS scores indicate constraint, GERP++ allows only candidate elements that start and end at positions with RS≥0 and cannot be extended further in either direction; this rule has the additional benefit of imposing sensible boundary conditions on predicted elements. Finally, we only consider candidate elements with score above a certain value, which is a function of the element length and the median neutral rate of the region. This allows pruning of candidate elements that have low scores relative to their lengths; since they would end up with poor p-values anyway, ignoring them early reduces the memory requirements considerably.
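These pruning rules can be sketched as follows. The sketch reads "cannot be extended further" as meaning that the position just outside each boundary has a negative score or does not exist. The parameter `min_mean_score` is a hypothetical stand-in for the score threshold, whose actual dependence on length and median neutral rate is not given here.

```python
def candidate_elements(rs, min_len=4, max_len=2000, min_mean_score=0.0):
    """Enumerate candidate segments as (start, end, score), with end inclusive.

    A candidate starts and ends at a position with RS >= 0 and is not
    extendable: the neighbour just outside each boundary is negative or absent.
    """
    n = len(rs)
    prefix = [0.0]  # prefix[k] = sum of rs[:k], so segment sums are O(1)
    for score in rs:
        prefix.append(prefix[-1] + score)
    starts = [i for i in range(n) if rs[i] >= 0 and (i == 0 or rs[i - 1] < 0)]
    ends = [j for j in range(n) if rs[j] >= 0 and (j == n - 1 or rs[j + 1] < 0)]
    candidates = []
    for i in starts:
        for j in ends:  # ends are in increasing order
            length = j - i + 1
            if length < min_len:
                continue
            if length > max_len:
                break
            score = prefix[j + 1] - prefix[i]
            if score >= min_mean_score * length:
                candidates.append((i, j, score))
    return candidates


# Toy RS profile (values invented for illustration).
toy_rs = [-1, 2, 3, -0.5, 1, 2, -2, 4]
toy_candidates = candidate_elements(toy_rs, min_len=2, max_len=10)
```

Restricting boundaries to the edges of non-negative runs reduces the number of segments from quadratic in the alignment length to quadratic in the number of runs within the length window.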
Using neutrality as the null hypothesis, we can now define p-values for candidate and predicted elements on the basis of score and length. If the probability of a single neutral position having RS score x is given by P(x), then for an element of length L and score S the p-value is the probability of having score at least S in exactly L positions, and is given by:
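Under the independence assumption, a p-value of this kind is the upper tail of the L-fold convolution of P with itself: p(L, S) = Σ P(x1)·…·P(xL), summed over all (x1, …, xL) with x1 + … + xL ≥ S. The Python sketch below computes this tail by repeated convolution. It assumes that scores have been discretised to integer bins and that `single_site` is a dictionary from bin to null probability; both are choices made for this illustration rather than details of GERP++.

```python
def element_p_value(single_site, length, score):
    """P(sum of `length` independent draws from `single_site` >= score).

    `single_site` maps a discretised RS score (integer bin) to its probability
    under the null model. The distribution of the sum is built one position at
    a time, which is a simple dynamic program over (positions used, total).
    """
    dist = {0: 1.0}
    for _ in range(length):
        nxt = {}
        for total, p_total in dist.items():
            for x, p_x in single_site.items():
                nxt[total + x] = nxt.get(total + x, 0.0) + p_total * p_x
        dist = nxt
    return sum(p for total, p in dist.items() if total >= score)


# Toy null distribution over three score bins (probabilities invented).
null = {-1: 0.5, 0: 0.25, 1: 0.25}
p_strict = element_p_value(null, length=2, score=2)
p_loose = element_p_value(null, length=2, score=1)
```

The intermediate distributions for lengths 1 through L are all produced along the way, so in practice they could be cached and shared across candidate elements of different lengths.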
Once all candidate elements have been assigned p-values, GERP++ selects elements in a greedy manner, from smallest to largest p-value, discarding any elements that overlap previously reported elements. As the p-value increases so does the expected false positive rate of our predictions; when this reaches a user-specified threshold the algorithm terminates. While it would be ideal to compute this directly from the p-values, the multiple hypothesis correction in this case is non-trivial because GERP++ reports a non-overlapping set of predictions. Therefore, we adopt the approach of Cooper et al
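The greedy selection and a permutation-based false positive estimate can be sketched as below. This is a simplified illustration: `predict` is a hypothetical callable standing in for the whole element-finding pipeline, the rate is measured in number of predictions, and the incremental stopping rule at the user-specified threshold is omitted.

```python
import random


def select_non_overlapping(candidates):
    """Greedy selection by increasing p-value; candidates are (start, end, p_value)."""
    chosen = []
    for start, end, p in sorted(candidates, key=lambda c: c[2]):
        # Keep the candidate only if it is disjoint from everything kept so far.
        if all(end < s or start > e for s, e, _ in chosen):
            chosen.append((start, end, p))
    return sorted(chosen)


def shuffled_false_positive_rate(rs, predict, trials=10, seed=0):
    """Estimate the element-level false positive rate by permuting RS scores.

    Every element predicted on shuffled scores is counted as a false positive;
    the mean count over `trials` shuffles is divided by the real count.
    """
    rng = random.Random(seed)
    real = len(predict(rs))
    if real == 0:
        return 0.0
    shuffled_counts = []
    for _ in range(trials):
        permuted = list(rs)
        rng.shuffle(permuted)
        shuffled_counts.append(len(predict(permuted)))
    return (sum(shuffled_counts) / trials) / real


# Toy candidates with invented coordinates and p-values.
picked = select_non_overlapping(
    [(0, 10, 1e-5), (5, 15, 1e-3), (20, 30, 1e-4), (11, 19, 0.01)]
)
```

In the toy input, the second candidate overlaps the most significant one and is discarded, while the weakest candidate is kept because it fits between the two already selected.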
TBA
To limit memory requirements and allow parallelization of the constrained element computation, each chromosome was broken up into regions of approximately 2 megabases, with long segments where no RS score was computed chosen as boundaries. These boundary segments contain no information usable by GERP++ and because the algorithm never annotates constrained elements spanning them, excluding such segments did not sacrifice any predictive ability. These boundary regions made up approximately 6.8% of the human genome, including a 30.2 megabase region that made up more than half of chromosome Y. Constrained element predictions were generated using default parameters and a 5% false positive cutoff measured in terms of number of predictions; the estimated nucleotide-level false positive rate was under 1%. As additional validation, we computed overlap between our predictions and a set of ancestral repeats (L2) annotated by RepeatMasker. We found the overlap to be in line with what we expected given our estimated false positive rates: about 5% of the repeats overlap a predicted CE, with around 1.6% nucleotide-level overlap.
Gene, noncoding RNA, and PhastCons conserved element annotations were obtained from the UCSC genome browser's
PolII binding regions were defined as 50 bp upstream and downstream of PolII binding ‘peaks’ as identified from ChIP-seq experiments performed by the ENCODE Consortium
Phylogenetic tree used for GERP++ analysis. Tree is drawn to scale with respect to estimated neutral branch lengths.
(0.12 MB PDF)
Distribution of constrained element lengths. (A) GERP++. (B) PhastCons.
(0.15 MB PDF)
Distribution of GERP++ RS scores for 2Mb region of chromosome 1, excluding shallow (neutral rate<0.5) positions.
(0.01 MB PDF)
We thank Anshul Kundaje for his help in obtaining PolII binding site data from the UCSC browser. We also thank the anonymous reviewers for their helpful feedback and suggestions.