Conceived and designed the experiments: NDC. Performed the experiments: JL. Analyzed the data: WSG YO NDC. Wrote the paper: WSG NDC. Helped edit the paper: YO.
The authors have declared that no competing interests exist.
Genome wide maps of nucleosome occupancy in yeast have recently been produced through deep sequencing of nuclease-protected DNA. These maps have been obtained from both crosslinked and uncrosslinked chromatin in vivo, and from chromatin assembled from genomic DNA and nucleosomes in vitro. Here, we analyze these maps in combination with existing ChIP-chip data, and with new ChIP-qPCR experiments reported here. We show that the apparent nucleosome density in crosslinked chromatin, when compared to uncrosslinked chromatin, is preferentially increased at transcription factor (TF) binding sites, suggesting a strategy for mapping generic transcription factor binding sites that would not require immunoprecipitation of a particular factor. We also confirm previous conclusions that the intrinsic, sequence dependent binding of nucleosomes helps determine the localization of TF binding sites. However, we find that the association between low nucleosome occupancy and TF binding is typically greater if occupancy at a site is averaged over a 600bp window, rather than using the occupancy at the binding site itself. We have also incorporated intrinsic nucleosome binding occupancies as weights in a computational model for TF binding, and by this measure as well we find better prediction if the high resolution nucleosome occupancy data is averaged over 600bp. We suggest that the intrinsic DNA binding specificity of nucleosomes plays a role in TF binding site selection not so much through the specification of precise nucleosome positions that permit or occlude binding, but rather through the creation of low occupancy regions that can accommodate competition from TFs through rearrangement of nucleosomes.
Genomic DNA is largely covered by proteins that compete with one another for binding to regulatory sequences. Most of these proteins are in the form of nucleosomes. How nucleosomes come to occupy particular sites and thereby compete with sequence specific transcription factors is a central problem in developing a systems-level understanding of gene regulation. Here, we performed a series of computational analyses using high-resolution nucleosome position data that has recently become available in yeast, thanks to advances in DNA sequencing technology. Analysis of these data, combined with data on the location and occupancy of transcription factors genome-wide, shows that the precise location of nucleosomes as determined by nucleosome sequence specificity is often less important to transcription factor binding than the broader, regional occupancy of nucleosomes that is encoded in genomic DNA. This result has implications for the evolution of DNA regulatory elements.
Genomic DNA is largely covered in proteins, mostly in the form of nucleosomes. Much of the remainder consists of chromatin-associated proteins, including enzymes that modify histones or DNA, or catalyze the rearrangement of nucleosomes, and sequence specific DNA binding proteins (transcription factors) that mediate the activation or repression of genes. A deep understanding of gene regulation requires an understanding of how each of these cooperate and compete for access to genomic DNA.
That nucleosomes and TFs do, in fact, compete on a genome-wide scale was substantiated several years ago by chromatin immunoprecipitation microarray experiments (ChIP-chip), which determined the distribution of histones along the yeast genome.
Competition between nucleosomes and transcription factors is a simple consequence of each having an inherent probability of binding to the same site. Some transcription factors may be able to bind DNA on the outside surface of a nucleosome, but steric occlusion of sites on the inside and the sharp bending of DNA around the nucleosome preclude most transcription factors from binding to nucleosomal DNA. As a consequence of this competition, nucleosomes can mediate interactions among transcription factors (TFs) in an entirely passive way. For example, a binding motif that is close to a second, TF-occupied, site will tend to have a lower nucleosome occupancy than it would in the absence of the occupied site because nucleosome configurations that span both the motif and a nearby occupied site are disallowed. This lower nucleosome occupancy translates into a higher effective binding affinity of the site. In this scenario cooperative binding of factors is mediated not by direct interactions between the factors but by the passive effects of nucleosomes due to mutual competition. This effect has been demonstrated experimentally.
Passive mediation of TF-TF interactions is one way in which nucleosomes affect the occupancy of TFs in the genome. A second is through the intrinsic sequence specificity of nucleosomes themselves. Nucleosomes lack highly specific amino acid side chain-to-base pair contacts that are characteristic of sequence-specific transcription factors, but they do have sequence preferences that are determined by the capacity of the DNA to be wrapped around the nucleosome. One manifestation of this preference is a subtle but significant tendency towards a 10bp periodicity of certain dinucleotide steps.
Kaplan et al recently provided a simple and elegant demonstration that the intrinsic sequence specificity for nucleosome binding is a partial determinant of nucleosome positioning in vivo.
Here, we revisit the data of Kaplan et al, analyzing it in the context of existing ChIP-chip data
The analyses in this paper make extensive use of the nucleosome mapping data reported recently by Kaplan et al, so it was important to first establish that we can replicate and extend an analysis that they performed. To that end, we took Abf1 motifs in the yeast genome and calculated a set of averaged profiles of nucleosome sequence tags around those motifs (
(A) Nucleosomal tag counts in the vicinity of Abf1 motifs as a function of p-value for ChIP enrichment of the site. Abf1 sequence motifs, and the sites bound at p≤1e-3, were defined by MacIsaac et al.
We extended this analysis by testing other Abf1 sites for which the evidence for Abf1 binding is weaker. Remarkably, as the stringency for defining Abf1 binding is relaxed, from the most stringent (p≤1e-3) to the least (p>0.5), both the depletion of nucleosomes at the Abf1 motif and the enrichment of nucleosomes at adjacent flanking positions decrease, but not to the point where they disappear altogether (
To assess quantitatively the correlation between low nucleosome occupancy and TF binding, we asked how well nucleosome tag counts distinguish TF-bound sites from random sites selected from yeast promoters. We use the area under the ROC curve (ROC AUC: receiver operating characteristic area under the curve) as a measure of this association.
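As an illustration of this measure, the following minimal sketch computes the ROC AUC as the probability that a randomly chosen bound site is more nucleosome-depleted than a randomly chosen control site. The function names and the handling of ties are our assumptions for illustration and are not taken from the original analysis.

```python
def roc_auc(positive_scores, negative_scores):
    """Fraction of (positive, negative) pairs in which the positive scores
    higher; ties count as one half. Equivalent to the area under the ROC curve."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


def depletion_auc(bound_tag_counts, control_tag_counts):
    """ROC AUC for the hypothesis that LOW tag counts predict TF binding.
    Tag counts are negated so that stronger depletion gives a higher score."""
    return roc_auc([-t for t in bound_tag_counts],
                   [-t for t in control_tag_counts])
```

A value of 0.5 corresponds to no association, values above 0.5 indicate that bound sites tend to be nucleosome-depleted, and values below 0.5 indicate the reverse.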
Kaplan et al used two different methods in their nucleosome mapping experiments, one involving formaldehyde crosslinking (two replicates) and the other a more traditional non-crosslinking protocol (four replicates).
For
To investigate this result further, we examined the crosslinked tag count distribution around Abf1 sites, as was done for
(A) (top): Tag counts of uncrosslinked chromatin (orange) and crosslinked chromatin (gray) in the region of bound Abf1 sites. Tag counts have been symmetrized around the Abf1 site. Tag counts for the crosslinked sample were normalized to the uncrosslinked sample between 100–600bp from the Abf1 site to highlight the concordance in the phased nucleosome locations and occupancies. (bottom): Tag count difference map (green) in the vicinity of bound Abf1 sites showing excess tags in crosslinked chromatin vs. uncrosslinked. (B) Predictive value of nuclease-resistant tag counts for binding of 41 TFs. ROC AUC values on the y-axis were calculated based on the difference map (excess tag counts found in the crosslinked sample compared to the uncrosslinked). ROC AUC values on the x-axis were calculated as in
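The construction of the difference map can be sketched as follows. This is a minimal illustration, assuming that each profile is a list of average tag counts per base pair centered on the binding site, and that normalization is a single scale factor matching the summed counts in the 100–600bp flanks; the exact normalization procedure is not specified here, so that choice is an assumption.

```python
def symmetrize(profile):
    """Average a profile with its mirror image about the central position."""
    n = len(profile)
    return [(profile[i] + profile[n - 1 - i]) / 2.0 for i in range(n)]


def difference_map(crosslinked, uncrosslinked, near=100, far=600):
    """Excess tag counts in crosslinked vs. uncrosslinked chromatin.

    Both inputs are per-bp profiles of equal length centered on the site.
    The crosslinked profile is rescaled so that its summed counts between
    `near` and `far` bp from the center match the uncrosslinked profile
    (assumed normalization)."""
    if len(crosslinked) != len(uncrosslinked):
        raise ValueError("profiles must have the same length")
    center = len(crosslinked) // 2
    xl = symmetrize(crosslinked)
    un = symmetrize(uncrosslinked)
    flank = [i for i in range(len(xl)) if near <= abs(i - center) <= far]
    xl_flank = sum(xl[i] for i in flank)
    if not flank or xl_flank == 0:
        raise ValueError("no usable flanking signal for normalization")
    scale = sum(un[i] for i in flank) / xl_flank
    return [scale * x - u for x, u in zip(xl, un)]
```

Positive values in the returned list mark positions where the crosslinked sample has more tags than expected from the flanking, phased nucleosomes.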
The bottom panel in
That crosslinking appears to be trapping nucleosomes over TF binding sites is illustrated by the appearance of a very strong nucleosome peak in the difference map that lies right on top of a set of Gal4 sites in the GAL1–GAL10 promoter (
Regardless of the mechanism, the association between TF bound sites and excess tag counts in crosslinked chromatin suggests that difference maps based on crosslinked and uncrosslinked chromatin might be used to identify non-histone DNA-binding sites without ChIP enrichment for particular proteins. How such sites would compare to DNase hypersensitive sites or nucleosome poor regions defined by FAIRE
Kaplan et al. made the important observation that genomic loci that are bound by TFs in vivo tend to be also depleted for nucleosomes in reconstituted chromatin.
ROC AUC values are generally similar whether nucleosome occupancies are obtained in vivo or in vitro (
(A) ROC AUC values quantifying the predictive value of low nucleosome occupancy based on chromatin in vivo (x-axis) or chromatin reconstituted in vitro from genomic DNA and histones (y-axis). The in vivo data are as shown in
Neither the ROC AUC values nor the raw differences in tag counts used by Kaplan et al lend themselves to a simple interpretation in terms of the amount of TF binding information that lies in the intrinsic binding specificity of nucleosomes. To assess more directly how much of an effect on TF binding is encoded by the intrinsic DNA binding specificity of nucleosomes, we determined the apparent binding occupancies of 107 perfect consensus binding sites in the genome using ChIP-qPCR (
The fact that all four TFs show correlations in the expected direction between ChIP-qPCR enrichment and nucleosome occupancy attests to the sensitivity of this analysis because the ChIP-chip based ROC AUC values for these same TFs are only marginally different than the value expected by chance. It is possible that the ROC AUC values are underestimated due to the definition of unbound sites that we chose to use. We chose to use random sites selected from promoter regions (600bp 5′ to ORFs) thinking they would be more appropriate controls for the TF-bound sites, but the ROC AUC values obtained using this background are lower than what is obtained when sites randomly selected from throughout the genome are used instead (data not shown). A systematic underestimation of the true ROC AUC value would also explain why Nrg1 has an ROC AUC value below 0.5, implying a direct association between nucleosome occupancy and binding, even though our ChIP-qPCR analysis unambiguously shows the expected inverse correlation.
While absolute ROC AUC values should be interpreted with caution, comparisons of ROC AUC values are valid because each calculation was performed using the same set of unbound sequences as background. Since there are 41 TFs for which we have performed analyses using ChIP-chip data, we use those data and the ROC AUC metric for all subsequent analyses, rather than the correlation to ChIP-qPCR enrichment values, for which we have data for only four of the 41 TFs.
As discussed above, most of the 41 TFs are modestly associated with nucleosome depletion in vivo (
If transcription factor binding depends sensitively on the positioning of nucleosomes, we would expect high resolution data to produce a stronger association between nucleosome depletion and TF binding. To test this, we started with high-resolution data and simulated the effects of lower resolution data by averaging the high-resolution nucleosome occupancy data over windows of various sizes.
(A) Windowing scheme to simulate lower resolution data. Nucleosome tag counts around genomic loci (TF binding sites or control loci) were averaged over windows of 15, 40, 75, 150, 300 and 600bp as indicated by the lines at the bottom of the panel. Average tag counts around Abf1-bound sites are shown as in
We repeated this analysis for all 41 TFs, comparing the ROC AUC values obtained with 600bp windows to those obtained with 15bp windows (
Remarkably, the opposite effect is observed when in vitro reconstituted chromatin is used in the calculations rather than in vivo chromatin. An improvement in the association between binding and nucleosome occupancy is found for about three quarters of the TFs when in vitro nucleosome occupancies are averaged over 600bp. Among TFs that show the most dramatic effects (i.e. those exceeding the mean by 2σ), twice as many are improved as are made worse. At a cutoff of 1σ, three times as many are improved as are made worse. The number of TFs that are adversely affected by blurring of the in vivo data, and the number of TFs that are positively affected by blurring of the in vitro data, are each significantly different than the numbers expected by chance (p∼0.02).
The results with reconstituted chromatin are important because it is those data that are most relevant to an understanding of intrinsic nucleosome binding specificity and its effect on TF binding. The analyses of in vivo nucleosome data serve as a kind of computational control, showing that simulation of low resolution data does indeed weaken the association with TF binding, as would be expected if the precise nucleosome location, as defined by high-resolution sequencing experiments, were relevant to TF binding.
The question, then, is why the correlation between TF binding and nucleosome occupancy improves when the high-resolution in vitro nucleosome data are averaged so as to simulate lower resolution data. To investigate this question further, we examined more closely the patterns of nucleosome occupancy around TF binding sites in vivo and in vitro. To that end, we clustered the 41 TFs into five groups based on their in vivo and in vitro nucleosome profiles (
(A) Heat map of the 1200bp nucleosome occupancy profiles surrounding TF binding sites. Each row represents the average profile around binding sites for one of the 41 TFs. The left side is based on the in vivo nucleosome map; the right is based on the in vitro nucleosome map. Tag counts were separately normalized to a mean of 0 for each of 1200bp in vitro and in vivo windows. Yellow represents low tag counts, and blue high. The TFs have been placed into five groups based on k-means clustering (
As a further test of the effect of blurring high resolution nucleosome position data, we incorporated the data into a computational model that predicts TF binding to genomic regions.
Here, we used the same weighting function and parameter values developed previously, but instead of applying weights based on large genomic regions, we used the base-pair resolution, in vitro nucleosome position data of Kaplan et al.
(A) Illustration of how nucleosome occupancies are used to weight the predicted binding affinities of sequence motifs (top panel): Two 2.4kb genomic regions (CAN1/NPR2 and YPL137C/ISU1) showing normalized nucleosome tag counts from in vitro reconstituted chromatin, averaged over 15bp windows (gray line) or 600bp windows (black line). Red dots indicate the location of a perfect Gcn4 consensus site in each region. (middle panel): Same as the top panel except the lines show the conversion of normalized tag counts into weights that can be applied to Position Weight Matrix based estimates of TF binding affinity. Note that the weights are plotted on a log scale. Details of the weighting scheme are given in
For each TF, we obtained three ROC AUC values that express how well binding is predicted: one based on the PWM alone; the second based on the PWM, but with genomic position weights determined by high resolution nucleosome position data; and the third based on the PWM and weights determined by simulated low resolution nucleosome position data (i.e. high resolution data averaged over 600bp windows). For TFs whose binding is well predicted by genomic sequence and the PWM alone, the inclusion of weights based on nucleosome occupancy evidently adds noise to the calculation, worsening the predictions. However, for TFs whose binding is poorly predicted by sequence alone, the inclusion of binding affinity weights can substantially improve the prediction of binding (
Strikingly, the effect of intrinsic nucleosome position data on binding predictions is accentuated with the simulated low resolution data. This is the opposite of what we would expect if precise nucleosome positioning were typically of great relevance to the binding of transcription factors, and it is the opposite of what we observed in most cases with the in vivo data. Of course, the improvement in binding predictions with blurred data is for the set of bound promoters as a whole; within this set, some of the promoters bound by the TF fall in rank even if, overall, weighting improves the ROC AUC value (
The yeast genome sequence (Aug 2008 build) and gene feature files were obtained from the Saccharomyces Genome Database (SGD).
The nucleosome sequence tag data provided by Kaplan et al consists of a 5′ end, determined by sequencing, and a 3′ end 146bp away that is based on knowledge about the size of nucleosomes and on the preparation in the experiment of ∼150bp sized DNA by nuclease treatment and size-selection.
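Converting such tags into a per-base-pair count of spanning nucleosomal sequences can be sketched as follows. This is an illustration only: the tuple representation of a tag, the 0-based coordinates, and the treatment of the footprint as running from the 5′ end to a position `span` bp away (inclusive) are our assumptions.

```python
def nucleosome_coverage(tags, chrom_length, span=146):
    """Number of tag footprints covering each base pair of one chromosome.

    `tags` is an iterable of (five_prime_position, strand) pairs, with strand
    "+" or "-". Each footprint extends `span` bp from the sequenced 5' end,
    toward higher coordinates on "+" and lower coordinates on "-"."""
    diff = [0] * (chrom_length + 1)  # difference array: +1 at start, -1 after end
    for five_prime, strand in tags:
        if strand == "+":
            lo, hi = five_prime, five_prime + span
        else:
            lo, hi = five_prime - span, five_prime
        lo = max(lo, 0)
        hi = min(hi, chrom_length - 1)
        if lo > hi:
            continue  # footprint lies entirely outside the chromosome
        diff[lo] += 1
        diff[hi + 1] -= 1
    coverage = []
    running = 0
    for delta in diff[:chrom_length]:
        running += delta
        coverage.append(running)
    return coverage
```

The difference-array approach keeps the cost proportional to the number of tags plus the chromosome length, rather than to the number of tags times the footprint size.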
Receiver operating characteristic (ROC) curves and the area under those curves (AUC) were used to quantify the ability of a predictor (nucleosome tag counts) to correctly classify sequences as TF bound (defined by ChIP-chip) or unbound (randomly selected from yeast gene promoters). Where error bars are shown for ROC AUC values, these were estimated from 1000-fold bootstrap re-sampling. ROC AUC values were also used to quantify the predictive value of a TF binding prediction algorithm, with and without weights based on nucleosome occupancies. In this case, the predictor is the estimated TF binding occupancy and the question is how well that value classifies promoters as TF bound or unbound. Unbound sites were selected from 1000 randomly picked yeast promoters, defined as the 600bp region 5′ to the start of a gene.
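One way to obtain bootstrap error estimates of this kind is sketched below. Whether bound sites, unbound sites, or both were re-sampled is not stated above; this sketch assumes both sets are re-sampled independently with replacement, and all names are hypothetical.

```python
import random
import statistics


def roc_auc(positive_scores, negative_scores):
    """Pairwise ROC AUC: probability that a positive outscores a negative,
    with ties counted as one half."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(positive_scores) * len(negative_scores))


def bootstrap_auc(positive_scores, negative_scores, n_boot=1000, seed=0):
    """Mean and standard deviation of the ROC AUC over bootstrap replicates.
    Each replicate re-samples both score sets with replacement (assumed scheme)."""
    rng = random.Random(seed)  # fixed seed so that results are reproducible
    aucs = []
    for _ in range(n_boot):
        pos = rng.choices(positive_scores, k=len(positive_scores))
        neg = rng.choices(negative_scores, k=len(negative_scores))
        aucs.append(roc_auc(pos, neg))
    return statistics.mean(aucs), statistics.stdev(aucs)
```

The standard deviation returned here could serve as the error bar on a single ROC AUC value.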
For each of the 41 TFs, and for each of the two nucleosome datasets, we enumerated the number of nucleosomal sequences spanning each basepair in a 1200bp window. The tag counts were averaged across the center of the profiles, and normalized to the mean value in that profile. These 1200bp normalized windows were used to visualize the profiles for each TF (
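A minimal sketch of how such a profile might be assembled for one TF is given below. It assumes a precomputed per-bp coverage list for a single chromosome, 0-based site positions, and normalization by division by the profile mean; sites too close to a chromosome end are skipped, which is our simplification.

```python
def site_profile(coverage, sites, window=1200):
    """Average, symmetrized, mean-normalized occupancy profile around sites.

    `coverage` is a per-bp list of tag counts; `sites` are 0-based positions
    of binding-site centers on the same chromosome."""
    half = window // 2
    total = [0.0] * window
    used = 0
    for site in sites:
        lo, hi = site - half, site + half
        if lo < 0 or hi > len(coverage):
            continue  # window would extend past the data
        for i in range(window):
            total[i] += coverage[lo + i]
        used += 1
    if used == 0:
        raise ValueError("no site has a complete window")
    mean_profile = [t / used for t in total]
    # Average each position with its mirror image about the window center.
    sym = [(mean_profile[i] + mean_profile[window - 1 - i]) / 2.0
           for i in range(window)]
    overall = sum(sym) / window
    if overall == 0:
        raise ValueError("profile has no signal to normalize")
    return [v / overall for v in sym]
```

Profiles built this way have a mean of 1, so that TFs with different numbers of bound sites or different overall tag densities can be compared on a common scale before clustering.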
The computer program GOMER was used to calculate predicted binding affinities in yeast promoters based on TF-specific position weight matrices (PWM) and, optionally, affinity-modifying weights that were applied to genomic regions based on nucleosome tag counts.
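The general idea of combining PWM-based affinities with position-dependent weights can be illustrated with the simplified sketch below. This is not GOMER's code or its scoring function: the relative-affinity formula (a product of PWM probabilities over a uniform background), the forward-strand-only scan, and the use of the weight at the center of each candidate site are all assumptions made for illustration.

```python
def relative_affinity(site, pwm, background=0.25):
    """Relative affinity of one candidate site: product over positions of
    the PWM probability of the observed base divided by the background."""
    affinity = 1.0
    for position, base in enumerate(site):
        affinity *= pwm[position][base] / background
    return affinity


def weighted_region_score(sequence, pwm, weights):
    """Sum of weighted relative affinities over all forward-strand sites.

    `pwm` is a list of dicts mapping "A", "C", "G", "T" to probabilities.
    `weights` holds one hypothetical accessibility weight per base pair
    (for example, derived from nucleosome occupancy); the weight at the
    center of each candidate site scales that site's affinity."""
    width = len(pwm)
    total = 0.0
    for start in range(len(sequence) - width + 1):
        site = sequence[start:start + width]
        center = start + width // 2
        total += weights[center] * relative_affinity(site, pwm)
    return total
```

Setting all weights to 1 recovers a sequence-only score, so the contribution of the nucleosome-derived weights can be assessed by comparing the two scores across promoters.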
Yeast strains expressing TAP-tagged transcription factors BAS1, DIG1, GCN4 and NRG1 were obtained from Open Biosystems. For each TF, we identified a set of perfect consensus binding sites that lay within genomic regions enriched in the ChIP-chip experiments of Harbison et al.
A list of the sites assayed by ChIP-qPCR and their enrichment values is available as supplementary material. The GOMER program has been described previously and is made freely available from the authors on request.
We have confirmed the conclusion of Kaplan et al
The difference in TF binding associations for the in vivo and in vitro nucleosome data is most striking for the outliers Abf1 and Reb1. These two TFs are thought to play key roles in chromatin remodeling and the formation of nucleosome free regions (NFRs) in yeast promoters
The most important question we have addressed in this paper is the following: how much does the preferred location of nucleosomes matter to the selection and occupancy of binding sites by transcription factors? We sought to answer this by asking whether data resolution is important to the conclusion that TF binding is associated with low nucleosome occupancy regions. Lower resolution data were simulated by averaging the high resolution data over increasingly large windows. If the precise location and occupancy of nucleosomes are of great importance for the binding of TFs, then lower resolution data ought to do a worse job of showing the relationship between nucleosome occupancy and TF binding. We find that this is generally true for the in vivo data, which are determined in part by TF binding. However, just the opposite is true for the in vitro nucleosome data: simulating low resolution data by averaging over 600bp windows generally improves the predictions. This could not have been the result if mapped nucleosome positions were both accurate and strongly preferred over alternative positions.
For three reasons, we believe the resolution to this observation lies not in questioning the accuracy of the nucleosome occupancy maps, but in the assumption that the precision of intrinsic nucleosome binding matters a great deal to where transcription factors bind. First, the methodology for mapping nucleosomes was the same in vivo and in vitro. The in vivo maps show the expected trend, wherein a blurring of the data weakens the correlations, and we know of no reason to expect or believe that the accuracy of nucleosome mapping is different in the two different chromatin preparations. Second, the energetic differences between a preferred nucleosome configuration and an alternative are not expected to be large, in general.
There are several exceptions to the rule that the blurring of in vitro nucleosome data improves the association between nucleosome occupancy and binding. These exceptions tend to be TFs like Fhl1 and Rap1 that have relatively high nucleosome density flanking their binding sites. There is also a modest tendency for these TFs to be bound less frequently at TATA+ promoters (15±6% vs 27±9%;
The second important observation we report is the difference between nucleosome maps constructed in the conventional way, from uncrosslinked chromatin, and those constructed from formaldehyde-crosslinked chromatin. This difference is not only well correlated with TF binding but is, if anything, better correlated than nucleosome occupancy in the uncrosslinked sample.
The origin of this effect is uncertain. Conceivably it is a consequence of differences in higher order chromatin around TF binding sites, or it may be that the crosslinked TF itself provides protection against nuclease digestion. However, both explanations would require that at least some of the protected DNA survive size selection for mononucleosome-sized DNA. Another problem with the TF-protection explanation is that we would expect higher TF concentrations to increase the difference between crosslinking and non-crosslinking at binding sites, whereas the opposite is true, at least for Gal4 binding at the GAL1–GAL10 promoter (
Alternatively, and more simply, the excess sequence tags may be due to transient nucleosomes sitting on TF binding sites. Crosslinking might be expected to have the greatest effect on nucleosomes that are ‘volatile’ relative to other nucleosomes in the genome: nucleosomes with slow association/dissociation kinetics (slow relative to nuclease treatment) should be relatively unaffected by crosslinking, while nucleosomes with fast kinetics should have their apparent occupancies increased because of the ease with which nucleosomes can be crosslinked to DNA. Competition with TFs can be expected to alter the apparent kinetics of nucleosomes by competing with them for reassociation, and histone turnover measurements have indeed shown faster exchange kinetics at yeast promoters
Whatever the mechanisms, it seems clear that there is an association between regions of regulatory protein binding and higher nucleosome lability. The crosslink-noncrosslink difference map, which seems to be identifying labile nucleosomes, might therefore be used to discover non-histone protein binding sites in the genome.
The interactions among nucleosomes, transcription factors, and the enzymes that act on DNA and chromatin are complicated, but central to a deep understanding of gene regulation. Nucleosomes are a dominant factor in these interactions because they cover roughly 80% of the genome. Together with their intrinsic DNA sequence specificity, this adds further complexity to the problem. Our analyses suggest a simplifying principle to this complexity, namely that the precise position defined by nucleosome sequence specificity is not (on average, and for most TFs) of critical importance. Instead, the genome has evolved to define regions of lower and higher intrinsic nucleosome occupancy and these broad regions typically matter more than the precise most-favored configuration. Having said that, we expect there will be many exceptions in which precise positions are proven to be important. The technology now exists to explore these phenomena in greater detail and to begin to examine the kinetics of remodeling from one chromatin state to another. As data accumulates, we are confident that the incorporation of DNA-encoded nucleosome position information into computational models of TF binding will continue to improve the predictive quality of these models.
Tab-delimited ChIP-qPCR data used in
(0.01 MB TXT)
Crosslinking of chromatin generally weakens the association between low nucleosome density and TF binding. ROC AUC values for 41 TFs are generally lower when crosslinked chromatin data is used rather than uncrosslinked.
(0.73 MB EPS)
Genome browser tracks showing nucleosome tag counts in the GAL1–GAL10 regions for cells grown in YPD or in galactose-containing media. All data are from Kaplan et al (ref 9). Profiles are shown for crosslinked chromatin, uncrosslinked chromatin, and the difference between the two. A prominent peak in the difference map is apparent in YPD conditions, but is less prominent in galactose, presumably because Gal4 binding reduces the occupancy of the occluding nucleosome. The existence of the difference peak in YPD implies the existence of an unusually labile nucleosome over the Gal4 binding sites. Gal4 is known to have some occupancy of its GAL1–GAL10 promoter sites even in glucose, so the apparent lability of this nucleosome probably reflects modest competition from Gal4.
(2.02 MB EPS)
Nucleosome occupancy profiles around TF binding sites for all 41 TFs with at least 50 bound sites, and for the randomly selected promoter sites used as controls. Nucleosome profiles in black are from the in vivo data; profiles in red are from the in vitro data (ref 9). In vivo and in vitro data were each normalized genome-wide to a mean value of 1, and symmetrized around the center of the binding sites. The TFs are organized into color-coded blocks based on the k-means clusters of
(4.46 MB EPS)
Window-size dependency of nucleosome-TF binding correlations and the relationship to nucleosome occupancy profiles. (A) The difference in ROC AUC values at different window sizes was used to cluster TFs by k-means clustering. k = 5 was chosen to mirror the clustering of TFs by nucleosome occupancy profiles (
(5.02 MB EPS)
Not all binding sites or promoters are affected in the same way by data blurring. (A) Gst1 is the TF that shows the strongest positive effect of data blurring of the in vitro transcription data (
(0.79 MB EPS)
For the minority of TFs whose association between binding and nucleosome-poor regions is worsened by blurring of the nucleosome occupancy data (circles lying below the diagonal), there is a slight tendency for those TFs to also be bound less frequently to TATA-containing promoters (lighter color). However, the effect is small and it is likely that any connection between the two properties is indirect.
(0.43 MB EPS)
We thank our colleagues, especially Jonathan Aow and Mikael Huss, for helpful discussions. We are also grateful to the reviewers. Their numerous helpful suggestions and comments have substantially improved the paper.