Conceived and designed the experiments: ZDZ JC. Performed the experiments: ZDZ. Analyzed the data: ZDZ. Contributed reagents/materials/analysis tools: JR MS. Wrote the paper: ZDZ MBG.
The authors have declared that no competing interests exist.
ChIP sequencing (ChIP-seq) is a new method for genomewide mapping of protein binding sites on DNA. It has generated much excitement in functional genomics. To score data and determine adequate sequencing depth, both the genomic background and the binding sites must be properly modeled. To develop a computational foundation to tackle these issues, we first performed a study to characterize the observed statistical nature of this new type of high-throughput data. By linking sequence tags into clusters, we show that there are two components to the distribution of tag counts observed in a number of recent experiments: an initial power-law distribution and a subsequent long right tail. Then we develop in silico ChIP-seq, a computational method to simulate the experimental outcome by placing tags onto the genome according to particular assumed distributions for the actual binding sites and for the background genomic sequence. In contrast to current assumptions, our results show that both the background and the binding sites need to have a markedly nonuniform distribution in order to correctly model the observed ChIP-seq data, with, for instance, the background tag counts modeled by a gamma distribution. On the basis of these results, we extend an existing scoring approach by using a more realistic genomic-background model. This enables us to identify transcription-factor binding sites in ChIP-seq data in a statistically rigorous fashion.
ChIP-seq is an apt combination of chromatin immunoprecipitation and next-generation sequencing to identify transcription factor binding sites in vivo on the whole-genome scale. Since its advent, this new method has generated much excitement in the field of functional genomics. Proper computational modeling of the ChIP-seq process is needed for both data scoring and determination of adequate sequencing depth, as it provides the computational foundation for analyzing ChIP-seq data. In our study, we show the characteristics of ChIP-seq data and present in silico ChIP sequencing, a computational method to simulate the experimental outcome. On the basis of our data characterization, we observed transcription factor binding sites with excessive enrichment of sequence tags. Our simulation results reveal that neither the genomic background nor the binding sites are uniform. On the basis of our simulation results, we propose a statistical procedure using the more realistic genomic background model to identify binding sites in ChIP-seq data.
Gene expression is carefully regulated in all living cells. Only a fraction of the genes in a genome are expressed to various degrees under a given condition or in a particular cell type. The main control of such regulation occurs at the transcription level: the RNA polymerases transcribe genes following binding of trans-acting transcription factors to cis-acting regulatory DNA sequences within genes or in their vicinities. To determine the biological functions of transcription factors, it is imperative to identify their binding sites and target genes in the genome.
Currently the most commonly used high-throughput method for identifying transcription factor binding sites (TFBSs) is chromatin immunoprecipitation followed by microarray hybridization (ChIP-chip)
Instead of using microarrays to identify the sequences of the immunoprecipitated DNA fragments, new methods have recently been developed to take advantage of the fast-maturing next-generation massively parallel sequencing technologies. In one such method, ChIP-PET
Proper computational modeling of the ChIP-seq process is needed for both data scoring and determination of adequate sequencing depth, as it provides the computational foundation for analyzing ChIP-seq data. Here we show the characteristics of ChIP-seq data and present in silico ChIP sequencing, a computational method to simulate the experimental outcome. Our simulation results reveal that neither the genomic background nor the binding sites are uniform. Such nonuniformity in the background has important implications for ChIP-seq data analysis and binding-site identification.
ChIP-seq data are generated in a straightforward manner, by high-throughput sequencing and subsequent sequence alignment. Because the Illumina/Solexa 1G Genome Analyzer generates a very large number of short sequence reads, ChIP sequencing is currently done mainly with this sequencing platform. This could change in the future, however, as other high-throughput sequencing technologies may become better suited. Here we briefly describe the procedure of ChIP sequencing with the Solexa platform. The immunoprecipitated DNA fragments are sequenced from one end for approximately 30 bp. These short sequence reads are aligned to the human reference genome, and only uniquely mapped reads (typically 60–80% of all sequence reads) are retained for the downstream analysis. Because the DNA fragments are size-selected by gel electrophoresis prior to sequencing, the retained reads are elongated into longer tags by directional extension to the mean length of the size-selected DNA fragments and then transformed into profiles of the number of overlapped DNA fragments at each nucleotide in the reference genome.
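The directional extension and the per-nucleotide overlap profile described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Read` tuple, the function names, the 0-based half-open coordinates, and the 200-bp mean fragment length are assumptions made only for this example.

```python
from collections import namedtuple

# Hypothetical read record: 0-based start, exclusive end, strand "+" or "-".
Read = namedtuple("Read", ["start", "end", "strand"])


def extend_read(read, fragment_length, chrom_length):
    """Extend a read in its sequencing direction to the assumed fragment length."""
    if read.strand == "+":
        start = read.start
        end = min(read.start + fragment_length, chrom_length)
    else:
        end = read.end
        start = max(read.end - fragment_length, 0)
    return (start, end)


def overlap_profile(tags, chrom_length):
    """Return, for each nucleotide, the number of extended tags covering it."""
    # Difference array: +1 where a tag begins, -1 just past where it ends.
    diff = [0] * (chrom_length + 1)
    for start, end in tags:
        diff[start] += 1
        diff[end] -= 1
    profile = []
    depth = 0
    for delta in diff[:chrom_length]:
        depth += delta
        profile.append(depth)
    return profile


# Three 30-bp reads on a toy 400-bp chromosome (illustrative values only).
reads = [Read(10, 40, "+"), Read(150, 180, "-"), Read(100, 130, "+")]
tags = [extend_read(r, fragment_length=200, chrom_length=400) for r in reads]
profile = overlap_profile(tags, chrom_length=400)
```

A reverse-strand read is extended toward lower coordinates, so the tag always covers the fragment from which the read was presumably sequenced.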
For our analysis, we link overlapping tags into tag clusters (
(A) The signal profile map of STAT1 ChIP-seq data on human chromosome 22. (B) The same signal profile map in a small genomic region on human chromosome 22. (C) The sequence tags and the overlap profile of a tag cluster. This cluster, simplified as green lines in (A) and (B), is defined by 16 sequence tags, each of which is a uniquely mapped sequence read (dark green) plus its directional extension (light green). The outer and the inner forms of this cluster, bound by the two gray and the two white dashed lines respectively, have corresponding tag counts 16 and 12.
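The linking of overlapping tags into clusters can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' code: tags are half-open `(start, end)` intervals, the outer count is taken to be the number of tags in a cluster, and the inner count is taken to be the maximum number of tags stacked at any position, which is our reading of the description here. The function names are hypothetical.

```python
def cluster_tags(tags):
    """Link overlapping (start, end) tags into clusters (lists of tags)."""
    clusters = []
    current = []
    current_end = None
    for start, end in sorted(tags):
        if current and start < current_end:
            # The tag overlaps the running cluster: absorb it.
            current.append((start, end))
            current_end = max(current_end, end)
        else:
            if current:
                clusters.append(current)
            current = [(start, end)]
            current_end = end
    if current:
        clusters.append(current)
    return clusters


def max_overlap(cluster):
    """Maximum number of tags covering any single position in the cluster."""
    events = []
    for start, end in cluster:
        events.append((start, 1))
        events.append((end, -1))
    # At equal positions -1 sorts before +1, so touching tags do not overlap.
    events.sort()
    depth = 0
    best = 0
    for _, delta in events:
        depth += delta
        best = max(best, depth)
    return best


# Toy example: four chained tags plus one isolated tag.
tags = [(0, 200), (50, 250), (100, 300), (260, 460), (1000, 1200)]
clusters = cluster_tags(tags)
summaries = [(len(c), max_overlap(c)) for c in clusters]  # (outer, inner)
```

In this toy example the first cluster has an outer count of 4 but an inner count of 3, because the fourth tag overlaps the cluster without stacking on the first two.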
To identify transcription factor binding sites in ChIP-seq data, we assess the statistical significance of each tag cluster found in the actual data by assigning it a
The simulation starts with the removal of sequence gaps and repeats from the genomic region—the entire genome or a part of it—under consideration. It is followed by the random placement of
Given this null distribution, for tag cluster
For our simulation of ChIP sequencing (
The white segments with dashed borders represent the removed sequence gaps and repeats. The 1-Kb background blocks are rendered by the blue segments with a darker blue for a higher sampling weight. The 500-bp binding sites are represented by the dark gray boxes before sampling weight assignment and green boxes afterwards with a darker green for a higher sampling weight. Notice if a background block has a sampling weight high enough, it can “attract” a similar number of tags as a binding site can. See the main text for a detailed description of the procedure.
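The weighted tag placement depicted in this figure can be approximated with the following Python sketch. It is a simplified illustration, not the authors' R implementation: the gamma shape value, the binding-site weight, the additive treatment of site weights on top of the background, and all function names are assumptions chosen for the example. Only the 1-kb block size and the 500-bp site size are taken from the text.

```python
import random

BLOCK = 1000  # background block size in bp (from the text)
SITE = 500    # binding-site size in bp (from the text)


def build_segments(genome_length, site_starts, rng, bg_shape=None, site_weight=50.0):
    """Return (start, end, per-nucleotide weight) segments.

    bg_shape=None gives a uniform background (weight 1 everywhere);
    otherwise each 1-kb block draws its weight from Gamma(bg_shape, 1).
    Both bg_shape and site_weight are hypothetical parameters.
    """
    segments = []
    for start in range(0, genome_length, BLOCK):
        end = min(start + BLOCK, genome_length)
        weight = 1.0 if bg_shape is None else rng.gammavariate(bg_shape, 1.0)
        segments.append((start, end, weight))
    # Simplification: site weights are added on top of the background blocks.
    for start in site_starts:
        segments.append((start, min(start + SITE, genome_length), site_weight))
    return segments


def place_tags(segments, n_tags, rng):
    """Sample tag start positions in proportion to the sampling weights."""
    masses = [(end - start) * weight for start, end, weight in segments]
    chosen = rng.choices(segments, weights=masses, k=n_tags)
    return [rng.randrange(start, end) for start, end, _ in chosen]


rng = random.Random(1)
segments = build_segments(100_000, [20_000, 60_000], rng, bg_shape=0.5)
tag_starts = place_tags(segments, 5_000, rng)
```

Sampling a block first and then a position within it avoids storing one weight per nucleotide, which is the reason the text gives for partitioning the background into blocks.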
The process of chromatin immunoprecipitation and the subsequent unique mapping and extension of sequence reads can be simulated by randomly placing uniquely mapped sequence tags onto the chromosome according to a sampling weight at each nucleotide position. Such weights are generated first for the background nucleotide positions and then for those in the binding sites. For a uniform background, every nucleotide position in the background is given a sampling weight of one. For a varying background, assigning each nucleotide position a different weight would, given the large size of the human genome, make it computationally prohibitive to sample the background as many times as the simulation requires. Instead, we partition the background into adjacent blocks of nucleotide positions. After testing different block sizes ranging from 500 bp to 5 kb, we find that they all give practically identical simulation results. In the end, we choose 1 kb as the block size. Every adjacent 1-kb block in the background is given a random weight drawn from a pre-specified underlying distribution, and all nucleotide positions in a block are assigned the same weight. For the background variation, we assume that most of the background has a low sampling weight, as most of the background is not enriched in the immunoprecipitation (the working principle of ChIP), but that a few places have relatively high weights, comparable to some binding sites. Based on this assumption, we use a gamma distribution,
To specify the sampling weights in the binding sites, we first calculate
We implemented our ChIP-seq simulation method in R and wrote several auxiliary programs for text processing in Perl. The whole software package with source code and documentation is available for download at
For our analysis and simulation of ChIP-seq data, we used the dataset generated from STAT1 DNA binding under IFN-γ stimulation by Robertson et al.
While the majority (1,149,405, >90%) of these tag clusters comprise only one or two tags, a relatively small number (661) of them contain large numbers of tags (50 and more, the outer-overlapping count) and consequently show high stacking peaks (the inner-overlapping count) in their profiles (
(A) The genome-wide outer counts and their frequencies. The black line is the linear regression on the log-log scale from outer count 2 to 100. (B) The genome-wide inner counts and their frequencies. The black line is the linear regression on the log-log scale also from inner count 2 to 100. (C) The outer counts and their fractions on the whole genome and each individual chromosome. (D) The inner counts and their fractions on the whole genome and each individual chromosome.
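The log-log regression used in panels (A) and (B) can be reproduced along the following lines. This is a minimal sketch using ordinary least squares on the base-10 logarithms of tag count and frequency, restricted to counts from 2 to 100 as in the caption; the function name and the toy data are assumptions, and the fitting procedure actually used for the figure may differ in detail.

```python
import math
from collections import Counter


def fit_power_law(tag_counts, lo=2, hi=100):
    """Least-squares line through (log10 count, log10 frequency) for lo <= count <= hi.

    Returns (slope, intercept); a power law appears as a straight line
    whose slope is the (negative) exponent.
    """
    freq = Counter(tag_counts)
    points = [(math.log10(k), math.log10(n)) for k, n in freq.items() if lo <= k <= hi]
    if len(points) < 2:
        raise ValueError("need at least two distinct tag counts in the fitting range")
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy data whose frequencies follow 400 / count**2 exactly.
toy_counts = [2] * 100 + [4] * 25 + [5] * 16 + [10] * 4 + [20] * 1
slope, intercept = fit_power_law(toy_counts)
```

Clusters with a count of 1 are excluded from the fit by default, matching the caption's range of 2 to 100.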
We also examined tag counts on individual human chromosomes separately to check for possible discrepancies in their distributions on different chromosomes. The plots in
In our simulation of the ChIP-seq process, we use either uniform or varying sampling weights on the genomic background and among the binding sites for the tag placement. The four simulated datasets generated from the resultant combinations of the background and the inter-site distributions fit the actual data in very distinct ways (
(A) Uniform background and uniform binding sites. (B) Varying background and uniform binding sites. (C) Uniform background and varying binding sites. (D) Varying background and varying binding sites. The outer counts are used for all four plots. In each plot the actual tag count distribution is plotted as the black line and five simulated distributions with the enrichment coefficient
The four combinations of the background and the inter-site distributions can be seen as a gradual increment in the overall simulation complexity: from a simple model that assumes uniformity in both the background and the binding sites to one that assumes variation in either of them and to the most complex one that assumes variation in both of them. The simplest model assumes that the tag placement is identical everywhere on the background and also identical among the binding sites. Data generated from this model give a distribution of tag counts that is a very poor fit to the actual one: not only is there a depletion of tag clusters with small to medium tag counts due to an excess of single tags being placed onto the genome, but also clusters with large tag counts are completely absent (
The slightly more complicated second model assumes identical binding sites but a varying background for tag placement. The simulated data fit the actual distribution well at small to medium (1 to ∼5) tag counts but there is still a complete absence of clusters with large tag counts (
To identify STAT1 binding sites, we can assess the statistical significance of each tag cluster found in the actual data using a null distribution of tag counts derived from a background model. For the initial assessment, we used a simple background model that assumes equal probabilities for random tag placement at every available nucleotide position in the genome and combined 500 independent replicates of background simulation to generate such a null distribution. After assigning
In light of the simulation results, we can reassess the statistical significance of each tag cluster found in the actual data by using the varying-background model and combining 500 independent replicates of background simulation to generate the null distribution of the tag count. As before, we assign
Using the full sets of reads, we identified 28,434 and 5,307 STAT1 binding sites with and without IFN-γ stimulation respectively (
Four distributions of tag counts are plotted in
Plotted in blue and green, respectively, two null distributions are generated with the uniform- or the varying-background models. The actual distributions with and without IFN-γ stimulation are depicted by the black and purple lines.
To check how closely the varying-background model approximates the background in the actual experimental data, we compared the distribution generated under this model with the actual one without the IFN-γ stimulation. In response to the stimulation, STAT1 binds to numerous promoter elements to upregulate interferon-stimulated genes. Without the stimulation, the role of STAT1 as a transcription factor is limited. Given such a difference in the DNA binding of STAT1 in the presence or absence of IFN-γ, we expect the distribution of tag counts from the experiment without stimulation to be dominated by a significant background with a small right tail from its limited DNA binding. This is exactly what we see in
We generate synthetic ChIP-seq datasets under simulation models with various assumptions for the binding sites and the genomic background. By comparing the simulated dataset with the actual one, we assess the goodness of the assumptions made in each simulation and thus can gain insight into the actual ChIP-seq data generating process: the closer the simulated dataset is to the actual one, the closer the assumptions are to the real process.
We use the uniform and the varying models for both the background and the binding sites in our simulation. In
These simulation results clearly show that neither the binding sites nor the background is uniformly represented in ChIP-seq data. Because of the inherent random noise in the experiment, binding sites are unlikely to contain the same number of mapped sequence tags. However, not all the variance in the number of sequence tags mapped to binding sites can be explained by random noise, which is already accounted for by the uniform-site model because the simulation itself is intrinsically a stochastic process. Because DNA segments containing the binding sites are enriched by immunoprecipitation, the variance should also reflect the different DNA-binding affinities that a transcription factor has for its binding sites. Such variation could be the result of differences in either the nucleotide sequences of the binding sites
Perhaps more importantly, our simulation results also reveal that there is substantial variation in the tag placement on the genomic background. Such background variation cannot be explained by the uniformity of background currently assumed in ChIP sequencing. Instead, our results suggest a varying background that is mildly fluctuating and contains some “hot” spots with relatively high ChIP enrichment comparable to some binding sites. The presence of such background “hot” spots in the ChIP-seq data may be caused by preferential sequencing particular to the sequencing protocol/platform used in the experiment. Their enrichment through immunoprecipitation is precluded, however, as the background DNA segments are not bound by the transcription factor. Our inference of a varying genomic background not only raises questions about both the biology and the technology involved in ChIP sequencing but also has important practical implications for the analysis of ChIP-seq data, as it provides a better background model (see next subsection for explanation).
To examine our simulation results more closely, we plot in
Only the actual tag count distribution and four simulated ones, one (the green line) from each panel of
As marked by the dashed circles and lines in
Reported in two recent studies
The known-sites method has the advantage of giving the sensitivity and the specificity of a particular ChIP-seq experiment at a chosen threshold. Its applicability is, however, problematic since it requires a “gold standard,” a list of true positives and true negatives. Conceptually, the validity of such a “gold standard” is questionable given the dynamic nature of protein–DNA association: under different conditions a transcription factor has different DNA-binding profiles. Operationally, this method is also difficult to use. The prerequisite functional “gold standard” is rarely available, let alone a good one. Moreover, the “known” positives are biased towards binding sites with high enrichment of sequence tags, and as the majority of the genome is never bound by a transcription factor, it is an open question how many “true negatives” should be included in the calculation. That is, given the huge preponderance of negatives, it is very difficult to build a correctly balanced gold standard, which is essential for training an effective classifier.
Instead of using a “gold standard” to identify binding sites in ChIP-seq data, the background-simulation method uses a background model to simulate how sequence reads are distributed in a genome in the absence of binding sites. Since this method does not assume any prior knowledge about the binding sites of the transcription factor under investigation, it avoids the major difficulties encountered by the known-sites method. In their study, Robertson et al. used a background model that implicitly assumes uniform tag placement everywhere on the background. However, our simulation results show that the data generated by this uniform-background model agree poorly with the actual experimental data. Based on our further analysis, we can generate a better null distribution by using a more realistic, varying-background model that assumes most of the background is not enriched but that at a few places it has a high enrichment level on a par with some binding sites.
In our analysis we estimated the background and the foreground together from the ChIP-seq sample data alone. However, if the negative control data from the experiments without immunoprecipitation are available, the estimation of the background becomes simpler as such experiments give a direct empirical estimate of the ChIP-seq background. Because our method can simulate the background alone, the negative control data can thus be easily accommodated. First the control data are used to estimate the parameters of the varying background model. The fitted model is then used to generate the null distribution of the tag count. And finally this null distribution is used to score the ChIP-seq data.
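The final scoring step, in which a simulated null distribution of tag counts is used to score clusters in the actual data, can be illustrated with a generic empirical tail probability. This sketch is an assumption-laden illustration and not the scoring formula of this paper: it pools the tag counts of clusters from background-only simulations and scores an observed cluster by the fraction of null clusters with an equal or larger count, with a conventional add-one correction so that no P-value is exactly zero. The function name and toy numbers are hypothetical.

```python
from bisect import bisect_left


def empirical_p_values(observed_counts, null_counts):
    """Empirical P(null count >= observed count) for each observed cluster.

    null_counts is assumed to pool cluster tag counts from many
    background-only simulation replicates. The +1 in numerator and
    denominator is a standard pseudo-count, added here as an assumption.
    """
    null_sorted = sorted(null_counts)
    n = len(null_sorted)
    p_values = []
    for k in observed_counts:
        n_at_least_k = n - bisect_left(null_sorted, k)
        p_values.append((n_at_least_k + 1) / (n + 1))
    return p_values


# Toy null distribution: mostly singletons, as expected of background clusters.
null_counts = [1] * 90 + [2] * 8 + [3] * 2
p_values = empirical_p_values([1, 3, 10], null_counts)
```

With a finite number of simulated clusters, the smallest attainable P-value is 1/(n + 1), which is one reason to combine many replicates when generating the null distribution.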
We also improve the use of the null distribution in the background-simulation method. In the study of Robertson et al., the false discovery rate is defined as the ratio of the number of peaks at and above a threshold in the simulated data to that at and above the same threshold in the actual data. The implicit assumption behind this definition is that the peaks identified in the simulated data are false positives and that their number is equal to the number of false positives in the actual data. The first half of this assumption is reasonable, but the second half is unwarranted. For direct comparability, the same number of uniquely mapped sequence tags as contained in the actual data is used to simulate the null distribution on the background. Because this number is finite and the actual data contain binding sites (the true positives), the number of peaks identified in the simulated data will be greater than the number of false positives in the actual data at any threshold. This discrepancy is more pronounced at lower thresholds. In fact, at low thresholds there could be more peaks in the simulated data than in the actual data. When this happens, the false discovery rate exceeds one, which is nonsensical. Instead of using the null distribution in such an
The goodness of fit between the simulated and the actual curves.
(0.03 MB PDF)
Publicly available ChIP-sequencing datasets and the number of sites identified using varying-background model.
(0.03 MB PDF)