The authors have declared that no competing interests exist.
Discussion on algorithm development: JE MK BR. Advice on computational validations: WW. Overall guidance: BR. Conceived and designed the experiments: WX JS. Performed the experiments: YL. Analyzed the data: NR UW. Wrote the paper: NR.
Transcriptional enhancers play critical roles in regulation of gene expression, but their identification in the eukaryotic genome has been challenging. Recently, it was shown that enhancers in the mammalian genome are associated with characteristic histone modification patterns, which have been increasingly exploited for enhancer identification. However, only a limited number of cell types or chromatin marks have previously been investigated for this purpose, leaving the question unanswered whether there exists an optimal set of histone modifications for enhancer prediction in different cell types. Here, we address this issue by exploring genome-wide profiles of 24 histone modifications in two distinct human cell types, embryonic stem cells and lung fibroblasts. We developed a Random-Forest based algorithm, RFECS (
Enhancers are regions in the genome that can activate the expression of a gene irrespective of their location with respect to the gene. Identifying these elements is critical in understanding regulatory differences between different cell-types. Since enhancers lack characteristic sequence features and can be far away from the gene they regulate, their identification is not trivial. Experimentally determining the genome-wide binding sites of transcriptional co-activator p300 is one way of finding enhancers but it can only identify a subset of enhancers. A few years ago, it was observed that the binding sites of p300 are marked by distinctive, post-translational histone modifications. Several groups have exploited this discovery to predict genome-wide enhancers based on their similarity to the histone modification profiles of p300 binding sites. We here report a novel algorithm for this purpose and show that it has much greater accuracy than existing methods. Another unique feature of our algorithm is the ability to automatically deduce the most informative set of histone modifications required for enhancer prediction. We expect that this method will become increasingly useful with the expanding number of known histone modifications and rapid accumulation of epigenomic datasets for various cell types and species.
Enhancers are distal regulatory elements with key roles in the regulation of gene expression. In higher eukaryotes, a diverse repertoire of transcription factors bind to enhancers to orchestrate critical cellular events including differentiation
Recently, several high-throughput experimental approaches have been developed to identify enhancers in an unbiased, genome-wide manner. The first is mapping the binding sites of specific transcription factors by ChIP-seq
Previously, we and others observed that distinct chromatin modification patterns were associated with transcriptional enhancers
Some researchers have tried to tackle this issue by using algorithms such as simulated annealing
As part of the NIH Epigenome Roadmap project, we have generated genome-wide profiles for 24 chromatin modifications and DNase-I hypersensitivity sites in two distinct cell types: a human embryonic stem cell line (H1) and a primary lung fibroblast cell line (IMR90)
Random forests have recently become a popular machine learning technique in biology
Genome-wide distal p300 binding sites were found using ChIP-seq in H1 and IMR90 cell-lines. We selected p300 binding sites overlapping DNase-I hypersensitive sites and distal to annotated TSS as active p300 binding sites representative of enhancers. We found 5899 such p300 binding sites in H1 and 25109 such sites in IMR90 (
A.)Chromatin states for p300 binding sites in H1 cells. B.)Chromatin states for p300 binding sites in IMR90 cells. Both sets of states were identified by clustering using ChromaSig
To train the forest, active and distal p300-binding sites (BS) were selected as representative of the enhancer class. As non-enhancer classes, we considered annotated transcription start sites (TSS) that overlap DNase-I, and random 100 bp bins that are distal to known p300 or TSS (see Methods). The confidence of each enhancer prediction is given by the percentage of trees that predict this site to be an enhancer. In general, a genomic region is predicted as an enhancer if its voting score exceeds a cutoff of 0.5 (>50% of trees vote in its favor). At higher cutoffs, confidence of prediction is higher, but fewer enhancers are predicted.
We used Receiver Operating Characteristic (ROC) curves to determine optimal parameters for our classification algorithm
In the case of Random forests, the main parameter to be determined is the number of trees. Since the non-enhancer class is assumed to be several times enriched compared to the enhancer class in the genome, we select a greater number of non-p300 training sites as compared to p300 sites and this proportion is also adjusted using the above-described methods. Previous algorithms
To determine the optimal number of trees for the random-forest, we examined the area under the ROC curve in H1 and IMR90 and found both to be stable beyond 45 trees (
Area under the 5-fold cross-validated ROC curve increases with the number of trees, stabilizing gradually, in A.)H1 and B.)IMR90 cells. C.)Validation Rate of enhancers predicted in H1 cells, as measured by overlap with DNase-I HS and binding sites of p300, NANOG, OCT4 and SOX2. D.)Misclassification Rate of enhancers predicted using RFECS in H1, measured as overlap with UCSC TSS. E.)Validation Rate of enhancers predicted by RFECS in IMR90, as measured by overlap with DNase-I HS or p300 binding sites in the same cells. F.)Misclassification Rate of enhancers predicted by RFECS in IMR90, as measured by overlap with UCSC TSS. In C.) to F.), rates are plotted versus total number of enhancers (up to 40000 enhancers) determined by taking different enrichment cutoffs, and are shown for forest trained in the same cell type (⋅red), forest trained in other cell type and predictions made on modifications with averaged RPKM (⋅black), replicate 1 only (⋅blue), and replicate 2 only (⋅green). Training on one replicate and prediction on the other replicate of the same cell-type are indicated by asterisks.
In order to estimate the accuracy of the enhancer prediction by RFECS, we applied this algorithm to chromatin profiles of 24 marks obtained in H1 and IMR90. We then calculated the validation rate as the percentage of predicted enhancers overlapping with DNase-I hypersensitivity sites and binding sites of p300 and a few sequence-specific transcription factors known to function in each cell type (true positive markers). We also computed the misclassification rate as the percentage of predicted enhancers overlapping with known promoters. These overlaps were computed using a window of −2.5 to +2.5 kb. When both a true positive marker and a promoter lie within this window, the criteria used to decide whether the enhancer is “validated” or “misclassified” are discussed in detail in the Methods section. In H1 cells, we obtained a total of 55382 predicted enhancers at the lowest voting cutoff of 0.5. Over 80% of these predicted enhancers overlap with distal DNase-I hypersensitive sites and the binding sites of p300, NANOG, OCT4 and SOX2. Upon randomly generating enhancer predictions in the H1 genome 100 times, we found the average validation rate to be 18.43%, making the actual validation rate of 80% highly significant (one-sided t-test p-value of 10^-256). Additionally, 5% of the predicted enhancers overlap with UCSC TSS, indicating a low misclassification rate (
We next tried to assess the linear resolution of RFECS predictions. We calculated the distance between the predicted enhancers and locations of enhancer markers such as DNase-I hypersensitive sites, or p300 binding sites in each cell type, and found that the majority of predicted enhancers are within 200 bp of these sites (
We also confirmed that our enhancer predictions showed an activation of gene expression in the proximal TSS. In order to do this, we compared RNA-seq datasets (Wei Xie et al., manuscript under revision) in H1 and IMR90 using edgeR
As further evidence that RFECS accurately predicts enhancers, chromatin modifications at the predicted enhancers showed presence of all chromatin states observed in the training sets comprised of a subset of distal p300 binding sites (
In summary, we showed that RFECS accurately predicted enhancers in the two cell lines H1 and IMR90 using a set of 24 chromatin modifications. These enhancers showed high validation rates, low misclassification rates and sharp linear resolution.
To make enhancer predictions, our approach requires a construction of a random forest trained on promoter-distal p300 binding sites. It is time-consuming and expensive to create a new training set for enhancer prediction in each new cell type, so it is desirable to use a random forest developed in one cell type to predict enhancers in another. To evaluate the feasibility of such approach, we first trained a random-forest using chromatin modification profiles obtained in H1, and then applied it to the IMR90 cells. Compared to RFECS predictions using IMR90 chromatin profiles as training set, RFECS predictions using H1 training dataset reduces the validation rate by ∼5–8% and increases the misclassification rate by ∼2% (
We sought to examine if this moderate decrease in performance was largely due to cell-type specific differences or was within the limits of technical or biological variability between replicates. To this end, we trained a random forest on one replicate of a cell-type, and made predictions on the other replicate of the same cell type. RFECS trained on IMR90 and then applied to the replicate 1 of the H1 profiles (blue dot vs asterisk) actually showed a higher validation rate and lower misclassification rate than RFECS trained using replicate 2 of H1 (
With the increasing number of histone modifications being discovered and mapped, determination of the relative importance of each mark in defining genomic elements is important. An out-of-bag measure of variable importance is a natural by-product of random forest classification scheme
The average variable importance of histone modifications across 5 cross-sections of data in 2 sets of replicates as well as averaged replicates using all 24 modifications in A.)H1 and B.)IMR90 cells. Out-of-bag variable importance was calculated from the random-forest based classification of p300 binding sites against TSS+genomic background. Robust appearance of H3K4me1, H3K4me3 and H3K4me2 among the most important marks across replicates and cell types indicates these may form a minimal set for prediction of enhancers. Differences observed in correlation clustering of the same 24 modifications in C.)H1 and D.)IMR90 explain some of the differences in ordering of variables in the two cell types. Same non-black colors of modifications indicate clusters that co-occur in both cell-types.
Beyond the top 3 modifications, there is variability among the cell types. In IMR90, the other modifications appear to contribute almost equally, while in H1 there is a much clearer difference in variable importance. These differences are supported by correlation analyses in H1 and IMR90 (
Having established the relative importance of each histone modification in predicting enhancers, we next examined the accuracy of predictions using different sets of modifications. Validation rates obtained by using the minimal set of H3K4me1-3 are within 2% of those for all 24 modifications in H1 (
A.) Validation Rate in H1 measured by overlap with DNase-I HS, p300, NANOG, OCT4 or SOX2, B.) Misclassification Rate in H1 measured as overlap of UCSC TSS, C.) Validation Rate in IMR90 measured by overlap with DNase-I HS or p300, D.) Misclassification Rate in IMR90 measured as overlap of UCSC TSS, versus total number of enhancers determined by taking different enrichment cutoffs, are shown for all 24 modifications (red), predicted minimal set of H3K4me1/H3K4me2/H3K4me3 (green) and conventionally used marks H3K4me1/H3K4me3 (black) or H3K4me1/H3K4me3/H3K27ac (blue). E.) Comparison of average validation rates for enhancer predictions using all combinations of 3 histone modifications for 2 replicates of H1.
It can also be observed that in conjunction with H3K4me1 and H3K4me3, using H3K4me2 picks up a larger proportion of enhancers with weaker acetylation enrichment as compared to H3K27ac (
We also made enhancer predictions using all possible combinations of 3 modifications in chromosome 1 for replicate 1 and replicate 2 of H1. The average validation rate for a fixed range of enhancers was compared across replicates, and the set corresponding to H3K4me1, H3K4me2 and H3K4me3 (marked with *) is the highest performing combination common to both replicates (
In many currently existing datasets, H3K27ac is more commonly sequenced than H3K4me2 because it is perceived as a marker of active enhancers. While using H3K4me2 may improve enhancer prediction in some cell-types, use of H3K27ac in addition to H3K4me1 and H3K4me3 marks does show considerable improvement over using just the top 2 marks H3K4me1 and H3K4me3 (
Overall, these comparisons indicate the suitability of selecting H3K4me1, H3K4me2 and H3K4me3 as three minimal chromatin marks for purposes of enhancer prediction. Additional chromatin modifications required for improving upon enhancer predictions may depend on cell-type specific characteristics, as indicated by the differences in variable importance between H1 and IMR90 (
We next asked if our enhancer prediction algorithm performed better than several other current techniques for enhancer prediction – CSIANN, ChromaGenSVM and Chromia
A.) In CD4. True positive rates were measured as overlap with either DNase-I hypersensitive sites (DHS), p300 or CBP binding sites, while false positives were measured as overlap with UCSC TSS. B.) In H1. True positive rates were measured as overlap with either DNase-I hypersensitive sites (DHS), p300 or transcription factor binding sites such as NANOG, OCT4 and SOX2, while false positives were measured as overlap with UCSC TSS.
To compare these different sets of enhancer predictions, we computed validation rates by comparing them to TSS-distal DNase-I hypersensitive sites, p300 binding sites, and CBP binding sites and misclassification rates by comparing to known UCSC TSS using a window of −2.5 kb to +2.5 kb as described in the methods. (
In the above comparison, we selected our enhancer-representative training set as p300 peaks called using MACS
Comparing enhancer predictions across diverse cell-types can contribute to understanding differences in regulatory mechanisms between cell-types. The ENCODE dataset is an example of a collection of high-throughput datasets such as histone modifications and transcription factor binding data that are available for multiple cell-types
We trained our random forest on the p300 ENCODE data in H1 and made enhancer predictions in 12 ENCODE cell-types using the three marks H3K4me1, H3K4me3 and H3K27ac since these were available for all the cell-types. Validation rates were assessed based on overlap with existing DNase-I hypersensitivity data while misclassification rates were calculated based on overlap with UCSC TSS. It can be seen that the majority of cell-types show high validation rates between 80 and 95%, while the misclassification rates lie within acceptable levels of 2–7% (
A.)Validation Rate in the 12 cell-types measured by overlap with DNase-I HS, B.)Misclassification Rate in the cell-types measured as overlap of UCSC TSS, C.)Average false discovery rate (FDR) over the 22 autosomal chromosomes for each cell-type plotted as a function of voting percentage of trees, D.)Validation rate and misclassification rate for each cell-type at a FDR of 5% with number of enhancer predictions shown above the bar.
In order to compare enhancers across cell-types, it is preferable to have enhancer predictions with the same level of confidence. To determine the appropriate cutoff across multiple cell-types, we calculate a false discovery rate (FDR) by randomly permuting 100 bp bins across the genome and computing the ratio of enhancers predicted in permuted data to enhancers predicted in real data at various voting-percentage cutoffs. In
Using an FDR of 5%, we obtained a consistent set of high-confidence enhancer predictions in the 12 ENCODE cell-types. In
In summary, we obtained a high-confidence set of enhancer predictions in multiple ENCODE cell-lines with the same level of confidence. This will enable more rigorous comparisons of regulatory characteristics of these cell-types in the future.
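The FDR calculation described in this section can be illustrated with a minimal sketch. It assumes each predicted enhancer carries a voting score between 0.5 and 1, and that scores are available for predictions made on both the real and the permuted data. The function names `fdr_by_cutoff` and `lowest_cutoff_meeting_fdr`, and the choice of counting scores at or above the cutoff, are illustrative assumptions rather than part of RFECS.

```python
def fdr_by_cutoff(real_scores, permuted_scores, cutoffs):
    """Estimate the FDR at each voting cutoff.

    FDR is taken as (# predictions in permuted data) / (# predictions in
    real data), counting predictions whose score is at or above the cutoff
    (ASSUMPTION: inclusive comparison).
    """
    fdr = {}
    for cutoff in cutoffs:
        n_real = sum(1 for s in real_scores if s >= cutoff)
        n_perm = sum(1 for s in permuted_scores if s >= cutoff)
        # No real predictions at this cutoff: the ratio is undefined.
        fdr[cutoff] = n_perm / n_real if n_real else None
    return fdr


def lowest_cutoff_meeting_fdr(fdr, target=0.05):
    """Return the smallest cutoff whose estimated FDR is at or below target."""
    for cutoff in sorted(fdr):
        value = fdr[cutoff]
        if value is not None and value <= target:
            return cutoff
    return None


# Toy scores, invented purely for illustration.
real = [0.55, 0.6, 0.7, 0.8, 0.9, 0.95]
permuted = [0.52, 0.58, 0.62]
fdr = fdr_by_cutoff(real, permuted, [0.5, 0.6, 0.7])
chosen = lowest_cutoff_meeting_fdr(fdr, target=0.05)
```

Choosing the lowest cutoff that meets the target keeps as many predictions as possible while holding every cell type to the same estimated error level.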
We describe here a novel machine-learning algorithm to accurately predict enhancers in a genome-wide manner based on chromatin modifications. We trained this algorithm using novel p300 training sets in H1 and IMR90 and 24 chromatin modifications in each cell-type. We showed that models trained on one cell-type could be effectively applied on another cell-type. Random forests enable detection of the most informative features required for a classification task. In the case of enhancer prediction, we identified a set of 3 histone modifications that appeared to be the most informative and robust across cell-types and replicates. Such an approach can once again be applied when the number of genome-wide modification maps is expanded in various different cell types and the most informative set of modifications can be further refined. We show that RFECS outperforms other machine-learning based prediction tools in CD4+ T cells, and can be applied in the future to multiple cell types. We successfully applied our enhancer prediction tool to 12 cell-lines in the publicly available ENCODE database and obtained a set of enhancers with a consistently high level of confidence across the cell-types.
In the future, we could potentially adapt the RFECS method to detect other regulatory genomic elements that can be observed to have a distinct chromatin signature and find the minimal set of chromatin marks for this purpose. The ability to detect diverse patterns of features within the training set indicates that the RFECS approach could be used to train on a composite training set comprised of different transcription factors. Combining information from different enhancer-binding proteins may improve prediction of regulatory elements. Random forests are non-parametric and have been shown to integrate a large number of diverse features. This could suggest the addition of other discrete and continuous data types such as sequence or motif based information or DNA methylation to the prediction of genomic elements.
The H1 and IMR90 datasets used in this study were generated as part of the NIH Roadmap Epigenome Project and have been released to the public prior to publication (
For CD4, previously generated datasets for p300
The ChIP-seq reads for the histone modification as well as corresponding input were binned into 100 bp intervals. The binned modification file was normalized against the binned input file using an RPKM (Reads per kilobase per million) measure
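As a minimal sketch of this normalization, the RPKM of a bin is its read count divided by the bin length in kilobases and by the total number of mapped reads in millions. The text does not spell out how the input is combined with the ChIP signal, so the subtraction floored at zero below is an assumption made for illustration, as are the function names.

```python
def rpkm(bin_counts, total_reads, bin_size_bp=100):
    """Convert raw per-bin read counts to RPKM values."""
    kilobases = bin_size_bp / 1000.0
    millions = total_reads / 1_000_000.0
    return [count / (kilobases * millions) for count in bin_counts]


def input_normalized_rpkm(chip_counts, chip_total, input_counts, input_total,
                          bin_size_bp=100):
    """Normalize ChIP bins against input bins.

    ASSUMPTION: normalization is done by subtracting input RPKM from ChIP
    RPKM and flooring at zero; the source only states that an RPKM measure
    was used.
    """
    chip = rpkm(chip_counts, chip_total, bin_size_bp)
    control = rpkm(input_counts, input_total, bin_size_bp)
    return [max(c - i, 0.0) for c, i in zip(chip, control)]


# Toy counts for two 100 bp bins, invented for illustration.
signal = input_normalized_rpkm([10, 2], 1_000_000, [4, 6], 2_000_000)
```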
MACS
We constructed the forest using the concept of binary classification trees, with each feature being a 20-dimensional vector of 100 bp bins from −1 to +1 kb along the genomic element. At each node in the tree, a linear classifier was constructed using the Fisher Discriminant approach using the histone modification vector, allowing for utilization of shape as well as abundance information (
Enhancer prediction involved two stages, which are classification of p300 vs non-p300 and peak-calling.
Classification of p300 vs non-p300 for enhancer prediction purposes
i. Training
In the first stage, a forest was constructed with two classes – a class containing p300 binding sites and a second class with an equal number of TSS and x times the number in random background sequences, where x = 9 for CD4 and x = 7 for H1 and IMR90.
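The class composition can be sketched as follows, reading the description as: the non-enhancer class holds as many TSS as there are p300 sites, plus x times that number of random background bins. This reading, the random sampling step and the function name `build_training_classes` are assumptions for illustration.

```python
import random


def build_training_classes(p300_sites, tss_sites, background_sites, x, seed=0):
    """Assemble the enhancer and non-enhancer training classes.

    ASSUMPTION: TSS and background sites are drawn uniformly at random
    without replacement; the source does not describe the sampling.
    """
    n = len(p300_sites)
    if len(tss_sites) < n or len(background_sites) < x * n:
        raise ValueError("not enough TSS or background sites to sample from")
    rng = random.Random(seed)
    enhancer_class = list(p300_sites)
    non_enhancer_class = (rng.sample(tss_sites, n)
                          + rng.sample(background_sites, x * n))
    return enhancer_class, non_enhancer_class


# Toy genomic positions; x = 7 as used for H1 and IMR90.
enh, non_enh = build_training_classes(
    p300_sites=list(range(10)),
    tss_sites=list(range(100, 150)),
    background_sites=list(range(1000, 1200)),
    x=7,
)
```

With 10 p300 sites and x = 7, the non-enhancer class holds 10 TSS and 70 background bins, giving an 8:1 ratio of non-enhancer to enhancer training sites.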
ii. Prediction
In order to make predictions, each 100 bp bin along a chromosome is assigned either enhancer or non-enhancer status. The output from the forest is in the form of percentage of trees predicting a 100 bp bin to be one element or another. Only bins for which >50% of trees vote for the enhancer class are considered for further analysis.
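A minimal sketch of this voting step is shown below. It assumes each tree emits a binary vote per bin (1 for enhancer, 0 for non-enhancer); the function names are hypothetical.

```python
def enhancer_vote_fraction(tree_votes):
    """Fraction of trees voting for the enhancer class (votes are 0 or 1)."""
    return sum(tree_votes) / len(tree_votes)


def candidate_bins(votes_per_bin, cutoff=0.5):
    """Return (bin_index, vote_fraction) for bins strictly above the cutoff."""
    candidates = []
    for index, votes in enumerate(votes_per_bin):
        fraction = enhancer_vote_fraction(votes)
        if fraction > cutoff:
            candidates.append((index, fraction))
    return candidates


# Three toy bins scored by a four-tree forest.
kept = candidate_bins([[1, 1, 0, 0], [1, 1, 1, 0], [0, 0, 0, 1]])
```

The comparison is strict, so a bin with exactly half of the trees in favor is not carried forward.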
Peak-calling
The random forest trained to predict whether a 100 bp bin along a chromosome is an enhancer often yields values >50% for bins on either side of the exact location of a p300-binding site. However, the percentage of trees voting in favor of p300 decreases symmetrically on either side of the actual peak (
A major advantage of the random forest is the inherent ability to select more important variables versus less important ones. In order to compute the order of variable importance, in this case, the importance of individual histone modifications for making enhancer predictions, we use an out-of-bag measure of variable importance
Based on the ordering of the variable importance across 5 different cross-sections of the training dataset of multiple replicates and cell types, certain modifications may always be observed to have priority. Due to the non-redundant nature of the ordering of variables as well as their robustness across replicates and samples, these modifications may be selected as the most informative ones that are required to make enhancer predictions.
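The idea behind a permutation-based importance measure can be sketched in simplified form. In a random forest the measure is computed per tree on its out-of-bag samples and then averaged; the sketch below instead uses a single held-out set and a hypothetical toy predictor, so it illustrates the principle only and is not the RFECS implementation.

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of rows for which the predictor returns the true label."""
    return sum(predict(row) == y for row, y in zip(rows, labels)) / len(rows)


def permutation_importance(predict, rows, labels, feature, rng):
    """Drop in accuracy after shuffling one feature column across rows."""
    baseline = accuracy(predict, rows, labels)
    column = [row[feature] for row in rows]
    rng.shuffle(column)
    permuted = [row[:feature] + [value] + row[feature + 1:]
                for row, value in zip(rows, column)]
    return baseline - accuracy(predict, permuted, labels)


# Toy data: feature 0 is informative, feature 1 is noise (invented values).
rows = [[0.9, 0.1], [0.8, 0.7], [0.1, 0.4], [0.2, 0.9]]
labels = [1, 1, 0, 0]


def toy_predict(row):
    # Hypothetical stand-in for a trained forest: thresholds feature 0 only.
    return 1 if row[0] > 0.5 else 0


rng = random.Random(0)
importance_informative = permutation_importance(toy_predict, rows, labels, 0, rng)
importance_noise = permutation_importance(toy_predict, rows, labels, 1, rng)
```

A feature the predictor never uses has an importance of exactly zero, because shuffling it leaves every prediction unchanged.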
Cross-validated ROC curves were used to estimate parameters for use within the same algorithm. However, comparisons across different algorithms may be biased depending upon the composition of the training set, so we validated enhancer predictions as described below.
Enhancer predictions output by the random forest predictor carry “voting percentage” scores ranging from 0.5 to 1, enabling detection of enhancers at different levels of confidence. At higher cutoffs, confidence of prediction is higher, but fewer enhancers are detected. The availability of large-scale datasets such as DNase-I hypersensitive sites, p300 binding sites, CBP binding sites and transcription factor binding sites enabled an estimate of the number of true positives at every cutoff. Further, the number of enhancers misclassified as TSS at each cutoff was also determined. Within the same cell type, an enhancer prediction method that performs better should pick up more true positive validation markers and fewer TSS, given that the number of predictions is the same.
Predicted enhancers are classified as “validated”, “misclassified” or “unknown” based on the criteria below. True Positive Markers (TPM) refer to DNase-I hypersensitivity sites, p300, CBP and transcription factor binding sites.
If the nearest TPM lies within 2.5 kb of the enhancer and the nearest TSS is greater than 1 kb away from the TPM, the enhancer is “validated”
If a TSS lies within 2.5 kb of the enhancer, and the nearest TPM is either greater than 2.5 kb away from the enhancer or within 1 kb of the TSS, the enhancer is “misclassified”
If there is no TPM or TSS within 2.5 kb of the enhancer, it is “unknown”.
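The three criteria above can be sketched as a small classifier over genomic positions on a single chromosome. Two points are assumptions of this sketch: "within 1 kb of the TSS" is checked against the TSS nearest to the TPM, and any case not covered by the three rules falls back to "unknown". The function names are hypothetical.

```python
def nearest_site(position, sites):
    """Closest site to a position, or None if there are no sites."""
    return min(sites, key=lambda s: abs(position - s), default=None)


def nearest_distance(position, sites):
    """Distance to the closest site, or infinity if there are no sites."""
    return min((abs(position - s) for s in sites), default=float("inf"))


def classify_prediction(enhancer, tpm_sites, tss_sites,
                        window=2500, tss_exclusion=1000):
    """Label a predicted enhancer as validated, misclassified or unknown."""
    tpm = nearest_site(enhancer, tpm_sites)
    tpm_in_window = tpm is not None and abs(enhancer - tpm) <= window
    tss_in_window = nearest_distance(enhancer, tss_sites) <= window
    # ASSUMPTION: the TPM is compared with the TSS closest to the TPM itself.
    tpm_near_tss = (tpm is not None
                    and nearest_distance(tpm, tss_sites) <= tss_exclusion)

    if tpm_in_window and not tpm_near_tss:
        return "validated"
    if tss_in_window and (not tpm_in_window or tpm_near_tss):
        return "misclassified"
    return "unknown"  # also used as a fallback for uncovered cases


def rate(labels, label):
    """Percentage of predictions carrying a given label."""
    return 100.0 * labels.count(label) / len(labels)


# Toy positions in bp, invented for illustration.
labels = [
    classify_prediction(10000, [10500], [20000]),
    classify_prediction(10000, [10500], [10800]),
    classify_prediction(10000, [50000], [11000]),
    classify_prediction(10000, [], []),
]
```

The validation and misclassification rates are then the percentages of predictions labeled "validated" and "misclassified", respectively.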
The Pearson correlation coefficient between any two modifications was computed for RPKM-normalized histone modification reads between −1 and +1 kb for all elements within the selected training set. The correlation patterns of the histone modifications were used to cluster the modifications and order them using MATLAB tools.
This enabled visualization of which modifications are the most similar in their correlation patterns. In the ordering of variable importance, if certain variables showed up as important in two different cell types, the redundancy based on their correlation plots could be used to explain away this variability.
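The pairwise correlation step can be sketched in plain Python. The clustering and ordering were done with MATLAB tools and are not reproduced here; the signal values below are toy numbers and the function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(signals):
    """Pairwise correlations keyed by (mark_a, mark_b)."""
    marks = sorted(signals)
    return {(a, b): pearson(signals[a], signals[b])
            for a in marks for b in marks}


# Toy RPKM profiles; real profiles would concatenate the bins from
# -1 to +1 kb over all training elements.
signals = {
    "H3K4me1": [1.0, 2.0, 3.0, 4.0],
    "H3K4me2": [2.0, 4.0, 6.0, 8.0],
    "H3K27me3": [4.0, 3.0, 2.0, 1.0],
}
corr = correlation_matrix(signals)
```

Each mark's row of this matrix is its correlation pattern, which is the input a clustering step would operate on.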
ChromaSig
We thank Gary Hon and Feng Yue for valuable comments on this manuscript. We thank Fulai Jin for sharing unpublished data (GSM929091). We also acknowledge the contribution of Michael Fernandez Llamosa for running ChromaGenSVM on our H1 data.