Conceived and designed the experiments: MS DB DIKM LP. Performed the experiments: MS JD AS GPS DIKM. Analyzed the data: MS DB JD AS DIKM LP. Contributed reagents/materials/analysis tools: GPS. Wrote the paper: MS DB DIKM LP.
The authors have declared that no competing interests exist.
The ability to assay genome-scale methylation patterns using high-throughput sequencing makes it possible to carry out association studies to determine the relationship between epigenetic variation and phenotype. While bisulfite sequencing can determine a methylome at high resolution, cost inhibits its use in comparative and population studies. MethylSeq, based on sequencing of fragment ends produced by a methylation-sensitive restriction enzyme, is a method for methyltyping (survey of methylation states) and is a site-specific and cost-effective alternative to whole-genome bisulfite sequencing. Despite its advantages, the use of MethylSeq has been restricted by biases in MethylSeq data that complicate the determination of methyltypes. Here we introduce a statistical method, MetMap, that produces corrected site-specific methylation states from MethylSeq experiments and annotates unmethylated islands across the genome. MetMap integrates genome sequence information with experimental data, in a statistically sound and cohesive Bayesian Network. It infers the extent of methylation at individual CGs and across regions, and serves as a framework for comparative methylation analysis within and among species. We validated MetMap's inferences with direct bisulfite sequencing, showing that the methylation status of sites and islands is accurately inferred. We used MetMap to analyze MethylSeq data from four human neutrophil samples, identifying novel, highly unmethylated islands that are invisible to sequence-based annotation strategies. The combination of MethylSeq and MetMap is a powerful and cost-effective tool for determining genome-scale methyltypes suitable for comparative and association studies.
In the vertebrates, methylation of cytosine residues in DNA regulates gene activity in concert with proteins that associate with DNA. Large-scale genomewide comparative studies that seek to link specific methylation patterns to disease will require hundreds or thousands of samples, and thus economical methods that assay genomewide methylation. One such method is MethylSeq, which samples cytosine methylation at site-specific resolution by high-throughput sequencing of the ends of DNA fragments generated by methylation-sensitive restriction enzymes. MethylSeq's low cost and simplicity of implementation enable its use in large-scale comparative studies, but biases inherent to the method inhibit interpretation of the data it produces. Here we present MetMap, a statistical framework that first accounts for the biases in MethylSeq data and then generates an analysis of the data that is suitable for use in comparative studies. We show that MethylSeq and MetMap can be used together to determine methylation profiles across the genome, and to identify novel unmethylated regions that are likely to be involved in gene regulation. The ability to conduct comparative studies of sufficient scale at a reasonable cost promises to reveal new insights into the relationship between cytosine methylation and phenotype.
New methods that assay epigenetic modifications over the whole genome promise to reveal insights into cell differentiation and development.
Cytosine methylation, which in vertebrates is mostly confined to CG dinucleotides, cooperates with other epigenetic modifications to suppress transcription initiation.
High-throughput sequencing technologies have catalyzed the development of new methods for measuring DNA methylation. These methods can be broadly classified as
Method | Site specific | Pre-chosen regions | Spanning of human genome | Spanning of CpG islands | #CG sites | Bisulfite conversion | Read length | Constraints on analysis | Compatible with low amounts of input DNA |
MethylSeq | Yes | No | 9.2% | 92.9% | | Not Needed | 32bp suffice | Inference Procedure Needed | Yes |
RRBS | Yes | No | 8.1% | 69.8% | | Needed | Longer = more coverage | Low Sequence Complexity | Yes |
Affinity-based (MeDIP, mDIP, mCIP)-Seq | No | No | whole genome | all | - | Not Needed | 32bp suffice | Binding Biases | No |
Affinity-based (MeDIP, mDIP, mCIP)-Array | No | Yes | pre-chosen | pre-chosen | - | Not Needed | - | Binding Biases+Array Biases | No |
For definition of spanning and determination of number of sites, see
MethylSeq is a convenient methyltyping strategy because it is cost-effective, requires only small amounts of material, and avoids bisulfite conversion. Briefly, the assay works by digestion of DNA with a methylation-sensitive enzyme (HpaII) that cuts unmethylated CGs at CCGG sites. Subsequent sequencing and mapping to the genome reveal unmethylated CGs.
Genomic DNA is digested with the methylation-sensitive restriction enzyme HpaII. Unmethylated HpaII sites (open circles) are digested and thus found at the ends of restriction fragments, while methylated HpaII sites (black circles) are not digested. Restriction fragments are size-selected according to the Illumina protocol; fragments that are either too long or too short are removed. Fragments that pass the size-selection are used to construct sequencing libraries. After sequencing, the raw reads are aligned against the reference genome and processed with MetMap to derive maps of genome-scale methylation.
In order to make effective use of MethylSeq for genome-scale methyltyping we developed a freely available program, called MetMap, that infers methylation at individual CGs by modeling biases inherent in MethylSeq experiments. An additional important feature of MetMap is the annotation of strongly unmethylated islands (SUMIs) which, as opposed to the current definition of CpG islands, incorporate information from both a reference sequence and genome-scale methylation data. We have validated MetMap's site-specific analysis, as well as its unmethylated-island annotation, with bisulfite sequencing of specific sites.
We demonstrate the use of MethylSeq with MetMap by methyltyping four male human individuals, and annotating their unmethylated islands. We show that the picture revealed by such analysis is sufficient to survey methylation states across the genome. Such analysis gives significant insight into the methylome of each specimen, inside and outside of CpG islands, at site-specific resolution. We show evidence that the mean extent of methylation of an island is more informative than the methylation state of the different sites in the island, because the correlation between the methylation states of any two samples improved when considering the mean. MetMap identifies numerous unmethylated regions, of varying lengths, which have not previously been annotated as CpG islands and are associated with other features indicative of transcriptional function. We conclude that MetMap leverages the cost-efficiency and practical ease of MethylSeq to produce informative genome-scale methylation annotations (methyltypes) that are suitable for both region- and site-specific comparative and case-control studies.
The remainder of this paper is organized as follows. We begin by explaining in detail significant biases present in MethylSeq experiments. We then describe the MetMap framework, which is designed to correct for such biases, starting with a description of MetMap's graphical model and continuing with a description of the software's different outputs. We then describe the validation of MetMap's procedure, using the methyltypes of four human individuals, and our discovery of new unmethylated regions in the human neutrophil genome, found through the use of MetMap on MethylSeq data. Finally, we discuss the advantages of using MetMap with MethylSeq to generate and analyze large numbers of samples, and outline our plans for the extension of MetMap's framework.
MetMap is a statistical inference framework that exploits MethylSeq data to accurately ascertain the extent of methylation across a genome. It uses a novel graphical model to assign probabilities of methylation at single-HpaII-site resolution, annotate regions of the genome that are hypomethylated along with a score indicating their extent of hypomethylation, and indicate the sites that are in the scope of the MethylSeq experiment and may be included in comparative studies.
A central feature of MetMap is its ability to normalize the bias introduced in short-read sequencing experiments in which the genome is not randomly fragmented. In MethylSeq experiments, all unmethylated HpaII sites are present at the ends of the digested fragments, but a size selection step required by the sequencing protocol limits sequencing to fragments of a narrow size range. The methylation status of the neighbors of an unmethylated HpaII site determines whether the fragments with this site at their ends will pass the size selection step and be sequenced.
Suppose that due to the size selection step, only fragments of length 50–300bp are sequenced. The four adjacent restriction sites (denoted by circles) may have different methylation states, resulting in epialleles with different “neighborhood methylation structures” of B. Site B is sequenced only from fragments of type B–C–D, which are the product of alleles in which sites B and D are unmethylated (and cut) and site C is methylated (and not cut).
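The size-selection effect in this example can be enumerated directly. The sketch below is illustrative only: the site coordinates are hypothetical values chosen so that A–B is too long and B–C is too short for the 50–300bp window, and fragment length is assumed to be the coordinate difference between the two cut sites.

```python
from itertools import product

# Hypothetical coordinates (bp) for four adjacent HpaII sites; not taken from the paper.
SITES = {"A": 0, "B": 400, "C": 430, "D": 600}
MIN_LEN, MAX_LEN = 50, 300  # size-selection window used in the example


def sequenced_fragments(unmethylated):
    """Return the (left, right) fragments that are cut out and pass size selection.

    `unmethylated` is the set of site names that are cut on this epiallele.
    A fragment runs between two consecutive cut sites.
    """
    cuts = sorted((SITES[name], name) for name in unmethylated)
    fragments = []
    for (left_pos, left), (right_pos, right) in zip(cuts, cuts[1:]):
        if MIN_LEN <= right_pos - left_pos <= MAX_LEN:
            fragments.append((left, right))
    return fragments


# Enumerate all 16 epialleles and keep those that yield a fragment ending at B.
names = sorted(SITES)
informative = []
for states in product([False, True], repeat=len(names)):
    unmethylated = {n for n, cut in zip(names, states) if cut}
    frags = sequenced_fragments(unmethylated)
    if any("B" in frag for frag in frags):
        informative.append((sorted(unmethylated), frags))

for alleles, frags in informative:
    print(alleles, frags)
```

With these coordinates, B is observed only on epialleles where B and D are cut and C is not, regardless of the state of A, so a low read count at B does not by itself imply that B is methylated.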
Another bias is common to all “shotgun” sequencing experiments. The read count of a given fragment gives only an estimate of its abundance in the solution, and can be viewed as the number of times the fragment was randomly sampled. Therefore, different fragments present in the specimen in similar quantities will not always be sequenced to the same extent. The site-specific inference procedure used by MetMap considers the extent to which all fragments in a HpaII site's neighborhood were sequenced, so that more information is considered to determine the state of each HpaII site.
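The sampling view of read counts can be illustrated with a small simulation. This is a minimal sketch, assuming reads are drawn independently with probability proportional to fragment abundance; the fragment names, abundances, and read depth are hypothetical.

```python
import random
from collections import Counter


def sample_reads(abundance, n_reads, seed=0):
    """Draw n_reads fragments with probability proportional to their abundance."""
    rng = random.Random(seed)
    fragments = list(abundance)
    weights = [abundance[f] for f in fragments]
    return Counter(rng.choices(fragments, weights=weights, k=n_reads))


# Three fragments present in equal quantities (hypothetical values).
abundance = {"frag1": 1.0, "frag2": 1.0, "frag3": 1.0}
counts = sample_reads(abundance, n_reads=30)
print(counts)
```

At low depth the three counts will generally differ even though the abundances are equal, which is why pooling evidence from all fragments in a site's neighborhood is useful.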
The analysis of a MethylSeq experiment by MetMap begins with the generation of a directed graphical model. The model's specific structure is determined by a reference genome and the specifications of the given experiment. We outline the different types of variables in MetMap's model, the dependencies between the variables, and how the data is incorporated into the “observed” states.
For a given reference genome, MetMap denotes every CG that is within a HpaII site (CCGG) by a random variable (denoted by
Dependencies between the variables are modeled using probability distributions of three types, making use of 54 parameters. The first type of probability distribution captures the dependencies between adjacent CG sites with respect to whether the sites are part of an unmethylated island, and is modeled by transition probabilities of a hidden Markov model (HMM) that incorporates the distances between adjacent sites. The second determines, for each
MetMap constructs a directed graphical model (b) from the genome and read counts (a). The methylation state of each CCGG site is represented by a random variable that also encodes whether it is in an unmethylated island. CpG sites are also represented in the model, with the distance between sites affecting the parameters. The read counts are used to set the state of the observed random variables corresponding to the possible sequenced fragments (for simplicity of representation, only a sample of these variables is outlined in the figure). The numbers in the blue circles represent normalized read counts. Dark edges correspond to boundaries of fragments. MetMap inferences of the extent of unmethylation (c) are shown alongside the values attained from a bisulfite sequencing validation. The raw read counts are scaled by the
In summary, the reference genome specifies the structure of the graphical model, and the MethylSeq data is incorporated into this model by fixing the states of all of the
After building the probabilistic model and assigning the
It is important to restrict analysis only to sites that are within the scope of the MethylSeq experiment, namely, to sites for which the MethylSeq experiment holds some information regarding their methylation state. MetMap identifies these sites from the structure of the graphical model.
As unmethylated CGs tend to be clustered in vertebrate genomes, we would like to annotate the coordinates of these clusters. We call such regions SUMIs (strongly unmethylated islands) and emphasize that they are defined by experimental data and so are specific to a dataset. In MetMap's graphical model the posterior probability of a variable to be in an “unmethylated island” state is dependent on both the genome sequence and the experimental data (for any
We carried out MethylSeq on specimens of a single homogeneous and uncultured cell type, the neutrophil, from four male humans. HpaII fragments were size selected in the range 50–300bp and sequenced on a first generation Illumina Genome Analyzer yielding 23,731,359 32bp reads. Although longer reads are currently available, reads for our assay only need to be sufficiently long so that they can be mapped correctly to the reference genome. The reads were aligned to the reference human genome (hg18
To infer methylation states from read depths, we first segmented the genome into 6,000 non-overlapping regions (of size 0.5Mbp) that could be analyzed separately. For each region, MetMap returned methylation probabilities for those CCGG sites for which information on site-specific methylation could be obtained from the MethylSeq experiment, and annotated SUMIs. The CCGG subset contained 59% of the CCGG sites (4.8% of all CG sites) in the human genome. Of the sites for which information could be obtained, 80% (1,035,243 sites) were outside CpG islands as annotated in the UCSC Genome Browser
To test whether MetMap was correcting bias in the raw counts (
We correlated the bisulfite scores (taken as being the true methylation status) with the read counts and with the MetMap predictions. Each of the 46 validated sites had three different scores for the extent to which it was unmethylated: a bisulfite score, a read count score, and a MetMap score. The Pearson correlation coefficient between the raw read counts and the bisulfite values was 0.67 while the Pearson correlation coefficient between the MetMap methylation score of those sites and the bisulfite values was improved to 0.90.
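The comparison above rests on the Pearson correlation coefficient between two score vectors. The following sketch shows the computation; the three short score lists are made-up values for illustration and are not the 46 validated sites.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical scores for six sites (extent to which each site is unmethylated).
bisulfite = [0.0, 0.25, 0.5, 0.75, 1.0, 1.0]
read_score = [0.1, 0.0, 0.9, 0.3, 0.6, 1.0]
metmap = [0.05, 0.3, 0.45, 0.8, 0.9, 0.95]

print(pearson(bisulfite, read_score))
print(pearson(bisulfite, metmap))
```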
As the bisulfite scores may be an imprecise measure of the true extent of methylation (
Examples of MetMap's ability to accurately detect partially and fully methylated sites are shown in
To determine which parameter might be more informative for genome-scale methyltyping, we compared methylation states for individual sites and for SUMIs between pairs of samples. Although the methylation status of individual sites within SUMIs was variable, the average probability of methylation for the whole SUMI was consistent across individuals.
All pairings among the four individuals tested are shown. On the left side of each pair the correlations between the site specific MetMap scores are presented for sites within SUMIs. On the right side of each pairing the correlations of the SUMI scores are presented. The distribution of the sites that are highly unmethylated in one sample but methylated to different extents in the other sample is discussed in
Similar read counts at orthologous restriction sites in two or more samples indicate that their methylation status is similar; however, determination of their true extent of methylation requires a statistical method such as MetMap. Thus the degree of consistency observed among MetMap's site-specific inferences for different samples is supported by the high correlation of the corresponding raw read counts (e.g., a correlation of 0.667 between sample 1 and sample 4).
We mapped the 20,985 SUMIs present in at least one of the four individuals, and examined their relationship to purely sequence-based definitions of CpG islands.
Genomewide SUMI predictions (a) reveal strongly unmethylated islands that are proximal to genes and that do not always correspond to sequence-based annotations of CpG islands shown in the tracks ‘BF islands’ and ‘CpG islands’ (e.g., the promoter of LRG1 and an intron of SH3GL1). (b) SUMI and BF island length distributions have a different shape than the CpG island length distribution, suggesting numerous short false positives in the latter. (c) Some SUMIs appear 5′ of alternative promoter sites.
Sample set | All Neutrophil SUMIs | Overlapping CGIs | Not Overlapping CGIs | Overlapping BFIs | Not Overlapping BFIs | Not Overlapping CGIs or BFIs |
Sample 1 | 16,903 | 14,071 | 2,832 | 12,076 | 4,827 | 2,266 |
Sample 2 | 17,595 | 15,008 | 2,587 | 12,834 | 4,761 | 2,044 |
Sample 3 | 18,178 | 15,273 | 2,905 | 13,082 | 5,096 | 2,308 |
Sample 4 | 18,699 | 15,274 | 3,425 | 13,229 | 5,470 | 2,729 |
Union | 20,985 | 16,334 | 4,651 | 13,931 | 7,054 | 3,797 |
Intersection | 14,308 | 12,838 | 1,470 | 11,123 | 3,185 | 1,116 |
Union - The set of regions annotated as a SUMI in at least one of the four individuals. Intersection - The set of regions annotated as a SUMI in all four individuals. CGIs - UCSC CpG islands. BFIs - BF-islands.
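The overlap columns of the table can be derived by testing each SUMI against an island annotation. The sketch below assumes half-open `(start, end)` intervals on a single chromosome and uses hypothetical coordinates; the function names are illustrative and are not part of MetMap.

```python
def overlaps(a, b):
    """True if half-open intervals a=(start, end) and b=(start, end) share a base."""
    return a[0] < b[1] and b[0] < a[1]


def split_by_overlap(sumis, islands):
    """Partition SUMIs into those overlapping any island and those overlapping none."""
    overlapping, non_overlapping = [], []
    for sumi in sumis:
        if any(overlaps(sumi, island) for island in islands):
            overlapping.append(sumi)
        else:
            non_overlapping.append(sumi)
    return overlapping, non_overlapping


# Hypothetical coordinates, for illustration only.
sumis = [(100, 400), (1000, 1500), (3000, 3200)]
cgis = [(350, 600), (5000, 5500)]
bfis = [(1400, 1600)]

with_cgi, without_cgi = split_by_overlap(sumis, cgis)
with_bfi, without_bfi = split_by_overlap(sumis, bfis)
# "Novel" SUMIs overlap neither annotation.
novel = [s for s in without_cgi if s in without_bfi]
print(len(with_cgi), len(without_cgi), len(with_bfi), len(without_bfi), len(novel))
```

By construction the overlapping and non-overlapping counts sum to the total, as they do in each row of the table (for example, 14,071 + 2,832 = 16,903 for Sample 1).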
We compared the length distribution of our SUMIs with the length distributions of both the UCSC and BF islands.
We therefore validated with direct bisulfite sequencing five regions that are annotated as part of both a UCSC CpG island and a BF island, and did not overlap with SUMIs; we also sequenced three regions in BF islands that did not overlap with SUMIs or with UCSC CpG islands. In all cases those regions were validated as methylated in the neutrophil samples.
A total of 3,797 SUMIs do not overlap with BF islands or CpG islands, revealing new regions that are unmethylated in neutrophil cells. Of these novel SUMIs, 2,317 (61%) are within regions experimentally determined by the ENCODE project as open chromatin.
Feature | Human Neutrophil SUMIs | UCSC CpG islands | BF islands | Novel SUMIs |
Open Chromatin | 70.0% | 52.9% | 65.3% | 61.0% |
UCSC 17-way Conservation Track | 71.1% | 68.5% | 76.2% | 49.6% |
Gene Regions | 76.9% | 77.7% | 79.7% | 59.9% |
TSS Regions | 59.8% | 52.2% | 61.4% | 22.0% |
“Open Chromatin” - the union of the regions determined by the ENCODE project as open chromatin in five different cell types (
Consistent with their similarity to conventional CpG islands, SUMIs are enriched near the transcription start sites (TSSs) of RefSeq genes, with a preference for the downstream side.
The number of SUMIs that overlap each location within 5Kbp from RefSeq transcription start sites is shown for (a) all neutrophil SUMIs (b) Novel SUMIs (SUMIs that do not overlap UCSC CpG islands or BF islands) (c) SUMIs that do not overlap BF islands (d) SUMIs that do not overlap UCSC CpG islands.
The possibilities and potential of DNA methylation analysis with new sequencing technologies have been described as a “revolution”
The annotation of experiment-specific strongly unmethylated islands (SUMIs) reconciles the original definition of CpG islands, based on their sensitivity to methylation-sensitive restriction enzymes
Overall, we predicted 3,797 SUMIs that do not overlap UCSC CpG islands or BF islands. Their sequence conservation and correlation with open chromatin suggests that they are functional, but they are less frequently associated with transcription start sites than the general set of SUMIs. We speculate that many novel SUMIs are enhancers. The discovery of these novel regions illustrates the utility of using experimental data to annotate CpG islands.
As more methylation data becomes available, the MetMap program we have developed can be refined and improved. For example, with the advent of methylation-based case-control studies, it should be possible to define methyl-haplotypes and to leverage MetMap to explore variation within and between individuals. MetMap's graphical model can also be used to learn the dependencies between the methylation states of neighboring CG sites, which will expand the scope of MethylSeq experiments to include sites that are not directly assayed. As more data types are produced together with methylation experiments, we envision expanding MetMap to include information from related genomes, and possibly other related measurements. Ultimately, we look forward to the coupling of methylation data with other functional information, including expression measurements and chromatin structure assays, to fully reveal the roles and consequences of DNA methylation.
Human samples were collected with CHORI's IRB approval after obtaining informed consent.
The MetMap software takes as input: (1) the mapped reads of a MethylSeq experiment, (2) the boundaries on the lengths of the fragments sequenced (determined by the size-selection step), and (3) a reference genome. It outputs two files: (1) a list of the HpaII sites in the scope of the experiment with their MetMap scores, and (2) a list of SUMI regions with their scores.
MetMap is free, open source software, and can be downloaded from the following site:
In the MethylSeq experiment, information regarding the methylation state of a CCGG site can be obtained for the subset of CCGGs that are present on some fragment that has CCGG sites at its ends and that passes the size selection step (see “CG sites in the scope of the MethylSeq experiment” section for details). We computed the number of CCGGs of the human genome that fulfill this criterion to be 1,349,378.
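The scope criterion can be sketched as a scan over sorted CCGG coordinates. This is an illustrative sketch, not MetMap's code: it assumes fragment length is the coordinate difference between the two bounding sites, and the coordinates and the 50–300bp window in the example are hypothetical inputs.

```python
def sites_in_scope(positions, min_len, max_len):
    """Return the CCGG positions that lie on at least one candidate fragment.

    A candidate fragment is bounded by two CCGG sites whose distance is within
    [min_len, max_len]; every site on it (ends and interior) is in scope.
    """
    pos = sorted(positions)
    in_scope = [False] * len(pos)
    for i in range(len(pos)):
        for j in range(i + 1, len(pos)):
            length = pos[j] - pos[i]
            if length > max_len:
                break  # later sites are even farther away
            if length >= min_len:
                for k in range(i, j + 1):
                    in_scope[k] = True
    return [p for p, ok in zip(pos, in_scope) if ok]


# Hypothetical coordinates: the first and last sites are too isolated to be assayed.
print(sites_in_scope([0, 400, 430, 600, 2000], min_len=50, max_len=300))
```

In this example the sites at 400, 430, and 600 are in scope because the fragment from 400 to 600 passes size selection, while the sites at 0 and 2000 are not.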
In the RRBS protocol the genome is digested with the methylation-insensitive restriction enzyme MspI (which cuts at CCGG sites), and fragments of size 40–220bp are size-selected and have their ends sequenced (after bisulfite treatment). For the human genome RRBS determines the methylation status of
We determine the span of a methyltyping method by considering the regions in which that method profiles methylation. Doing so gives insight into the breadth of a method with respect to the regions for which it profiles methylation. In MethylSeq, methylation status is determined for a subset of the CCGG sites, and in RRBS methylation status is determined for CG sites that are within fragments that have CCGG sites on both ends and are of relatively short length (up to 220bp). We therefore computationally categorized all CCGG sites of the human genome as 1/0 based on the ability to infer their methylation state with each method. All regions (bounded by CCGG sites) in which all CCGG sites received a “1” were considered as spanned by the method. When determining the span for CpG islands, the regions spanned were computed in the same manner, but considered only regions within CpG islands. In cases where the CCGG nearest to an edge of the island was determined as “1”, the region between that CCGG and the edge of the island was also considered as spanned.
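The genome-wide span computation described above can be sketched as finding maximal runs of consecutive CCGG sites flagged “1”. The example is a simplified illustration with hypothetical coordinates; it omits the CpG-island edge extension, and the function names are not from the paper.

```python
def spanned_regions(sites):
    """Return (start, end) regions bounded by CCGG sites in which every site is flagged 1.

    `sites` is a list of (position, flag) pairs sorted by position.
    """
    regions = []
    run_start = run_end = None
    for pos, flag in sites:
        if flag:
            if run_start is None:
                run_start = pos
            run_end = pos
        else:
            if run_start is not None and run_end > run_start:
                regions.append((run_start, run_end))
            run_start = run_end = None
    if run_start is not None and run_end > run_start:
        regions.append((run_start, run_end))
    return regions


def span_fraction(regions, genome_length):
    """Fraction of the genome covered by the spanned regions."""
    return sum(end - start for start, end in regions) / genome_length


# Hypothetical flags for seven CCGG sites on a 10,000bp sequence.
sites = [(100, 1), (250, 1), (400, 1), (900, 0), (1200, 1), (1300, 1), (5000, 0)]
regions = spanned_regions(sites)
print(regions, span_fraction(regions, 10_000))
```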
In the protocol used for this study the genome is digested with the methylation-sensitive restriction enzyme HpaII and only CCGG sites that follow certain criteria (as outlined in Ball MP et al.) are considered for their methylation status. One of the requirements is that the CCGG site be at least 40bp away from at least one of its two neighboring CCGG sites. In the human genome 19% of the CCGG sites have both of their neighboring CCGG sites at a distance smaller than 40bp, and are therefore excluded from the analysis.
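The 40bp neighbor requirement can be checked per site as follows. This sketch covers only that single requirement and none of the protocol's other criteria; treating the first and last sites (which have only one neighbor) as retained is an assumption, and the coordinates are hypothetical.

```python
def excluded_sites(positions, min_gap=40):
    """Return sites whose two neighboring CCGG sites are both closer than min_gap."""
    pos = sorted(positions)
    excluded = []
    for k in range(1, len(pos) - 1):
        left_gap = pos[k] - pos[k - 1]
        right_gap = pos[k + 1] - pos[k]
        if left_gap < min_gap and right_gap < min_gap:
            excluded.append(pos[k])
    return excluded


# Hypothetical CCGG coordinates.
positions = [0, 100, 120, 150, 170, 400]
dropped = excluded_sites(positions)
print(dropped, len(dropped) / len(positions))
```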
CpG islands in the UCSC track are defined in
We obtained whole blood from four young adult male humans and obtained neutrophils by first isolating peripheral blood mononuclear cells by Ficoll separation, then purifying neutrophils with anti-CD16 antibodies conjugated to magnetic beads (Miltenyi); we verified that the purified samples contain
MetMap receives as input the output of the MethylSeq experiment mapped to a reference genome, the reference genome, and the minimal and maximal lengths of the fragments sequenced, denoted by
Several types of probability distributions annotate the dependencies between the variables of MetMap's model. The transition probabilities between each pair of adjacent variables of type
Parameters 0.2269, 0.05 and 0.7231 determine the probabilities of having an
MetMap infers the posterior probabilities of its hidden states by building the junction-tree graph and using belief propagation.
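On a simple chain of hidden states, belief propagation reduces to the forward-backward algorithm, which gives a compact picture of posterior inference. The sketch below is not MetMap's model: it uses only two hidden states, a hypothetical distance-dependent transition function (`stay_near` and `scale` are made-up parameters), and emission likelihoods supplied by the caller.

```python
from math import exp


def transition(distance, stay_near=0.95, scale=1000.0):
    """Hypothetical 2x2 transition matrix; same-state probability decays toward 0.5."""
    stay = 0.5 + (stay_near - 0.5) * exp(-distance / scale)
    return [[stay, 1.0 - stay], [1.0 - stay, stay]]


def forward_backward(positions, emissions, prior=(0.5, 0.5)):
    """Posterior state probabilities for each site on a chain.

    emissions[t][s] is the likelihood of the observation at site t given state s.
    """
    n = len(positions)
    fwd = []
    for t in range(n):
        if t == 0:
            pred = list(prior)
        else:
            trans = transition(positions[t] - positions[t - 1])
            pred = [sum(fwd[t - 1][r] * trans[r][s] for r in range(2)) for s in range(2)]
        alpha = [pred[s] * emissions[t][s] for s in range(2)]
        z = sum(alpha)
        fwd.append([a / z for a in alpha])  # rescale to avoid underflow
    bwd = [[1.0, 1.0] for _ in range(n)]
    for t in range(n - 2, -1, -1):
        trans = transition(positions[t + 1] - positions[t])
        beta = [sum(trans[s][r] * emissions[t + 1][r] * bwd[t + 1][r] for r in range(2))
                for s in range(2)]
        z = sum(beta)
        bwd[t] = [b / z for b in beta]
    posteriors = []
    for t in range(n):
        gamma = [fwd[t][s] * bwd[t][s] for s in range(2)]
        z = sum(gamma)
        posteriors.append([g / z for g in gamma])
    return posteriors


# Three sites; the middle one has an uninformative observation (hypothetical values).
posts = forward_backward([0, 50, 100], [[0.9, 0.1], [0.5, 0.5], [0.9, 0.1]])
print(posts)
```

The middle site borrows evidence from its close neighbors, which is the qualitative behavior a distance-aware chain is meant to capture.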
MetMap generates two output files. One holds for each HpaII site in the scope of the experiment a MetMap score, indicating the inferred frequency of alleles in the MethylSeq sample that are unmethylated at that site. The second file holds the coordinates and scores of the annotated SUMIs.
To generate a value for
MetMap outputs methylation scores only for the HpaII sites (CCGGs) that are in the scope of the MethylSeq experiment. A HpaII site is in the scope of an experiment if and only if it lies on some fragment that has HpaII sites at its ends, and is of length
The SUMI regions annotated by MetMap are the union of two sets of regions. The first set consists of those continuous regions in which each
The SUMI lists for the four human neutrophil samples can be found at:
DNA was treated with the MethylEasy bisulfite conversion kit (Human Genetic Signatures), PCR-amplified with locus-specific primers that recognized human target sequences, and sequenced using standard Sanger chemistry. Since all epialleles from a single specimen were sequenced in bulk in the same mixture, we estimated the ratio of unmethylated/methylated alleles at each CG in the sequence by examining the relative heights of the ‘C’ and ‘T’ traces in the sequencing output. Each CG site received a score from the set (0,0.25,0.5,0.75,1), based on the relative C/T peak height
We tested the extent to which our results may be affected by the representation of the bisulfite scores on a discrete five-point scale, since the true proportion of alleles that are unmethylated is close to a continuous measure. Each data point was assigned an ‘adjusted’ bisulfite score, within a tolerance window specified by the true bisulfite value of that data point. The ‘feasible ranges’ allowed for the ‘adjusted’ bisulfite scores were as follows: (0,0.15) for a 0 bisulfite score, (0.15,0.35) for a 0.25 score, (0.35,0.65) for a 0.5 score, (0.65,0.85) for a 0.75 score and (0.85,1) for a 1 score. For example, for a site with bisulfite score 0.25, read count score 0 and MetMap prediction 0.4 we would get two pairings: (0.15,0) for (adjusted-bisulfite, read count score), and (0.35,0.4) for (adjusted-bisulfite, MetMap score). The score ranges were based on an assumption that assignments of “0.5” scores were the least precise. The adjustment of the bisulfite score to the read counts was done by generating a normalized read count value, in the 0–1 range, using the same “capping” value as MetMap.
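The worked example above is consistent with moving the bisulfite score, within its feasible range, as close as possible to the score it is being compared with. The sketch below implements that reading; the clamping rule is inferred from the example and should be treated as an assumption, and the function name is illustrative.

```python
# Feasible ranges for the adjusted score, keyed by the discrete bisulfite score.
FEASIBLE_RANGES = {
    0.0: (0.0, 0.15),
    0.25: (0.15, 0.35),
    0.5: (0.35, 0.65),
    0.75: (0.65, 0.85),
    1.0: (0.85, 1.0),
}


def adjusted_bisulfite(bisulfite_score, compared_score):
    """Clamp compared_score into the feasible range of the given bisulfite score."""
    low, high = FEASIBLE_RANGES[bisulfite_score]
    return min(max(compared_score, low), high)


# The example from the text: bisulfite 0.25, read count score 0, MetMap prediction 0.4.
print((adjusted_bisulfite(0.25, 0.0), 0.0))  # pairing with the read count score
print((adjusted_bisulfite(0.25, 0.4), 0.4))  # pairing with the MetMap score
```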
One file of open chromatin was compiled from:
using the files: wgEncodeUncFAIREseqPeaksH1hesc.narrowPeak
wgEncodeUncFAIREseqPeaksNhek.narrowPeak
wgEncodeUncFAIREseqPeaksGm12878V2.narrowPeak
wgEncodeUncFAIREseqPeaksHuvec.narrowPeak
wgEncodeUncFAIREseqPeaksPanislets.narrowPeak
Validation of MetMap predictions by site-specific bisulfite sequencing.
(0.45 MB PDF)
Read counts of the different samples.
(0.09 MB PDF)
Supporting material on MetMap's algorithms and parameters.
(0.24 MB PDF)
Supporting information on MetMap's performance and sensitivity.
(3.48 MB PDF)
Supporting information for
(0.02 MB PDF)
We thank Lu Zhang from Illumina, Inc., Hayward, CA for the MethylSeq data used in this manuscript, the ENCODE project for generation of the FAIRE datasets, Sriram Sankararaman for many enlightening discussions and careful feedback on the manuscript, and Cole Trapnell for critical reading of the manuscript.