Conceived and designed the experiments: ES BFP. Analyzed the data: ES. Contributed reagents/materials/analysis tools: ES. Wrote the paper: ES BFP.
The authors have declared that no competing interests exist.
Apparent occupancy levels of proteins bound to DNA in vivo can now be routinely measured on a genomic scale. A challenge in relating these occupancy levels to assembly mechanisms that are defined with biochemically isolated components lies in the veracity of assumptions made regarding the in vivo system. Assumptions regarding the behavior of molecules in vivo can be proven neither true nor false, and thus are necessarily subjective. Nevertheless, within those confines, connecting in vivo protein-DNA interaction observations with defined biochemical mechanisms is an important step towards fully defining and understanding assembly/disassembly mechanisms in vivo. To this end, we have developed a computational program, PathCom, that models in vivo protein-DNA occupancy data as biochemical mechanisms under the assumption that occupancy levels can be related to binding duration and explicitly defined assembly/disassembly reactions. We exemplify the process with the assembly of the general transcription factors (TBP, TFIIB, TFIIE, TFIIF, TFIIH, and RNA polymerase II) at the genes of the budding yeast
For proper cell function, cells need to coordinate the expression of their genes at precise times. In order to better understand how the cell works, it is important to understand how, when, and why a cell needs to turn certain genes on or off at certain times. To help the cell express its genes properly, there are hundreds of proteins that can bind and access DNA. Each protein has a unique function, and these proteins assemble into a very large complex to turn on genes. The assembly of these proteins has been defined to some extent; however, the whole process of assembly and disassembly of this complex in the cell is still poorly understood. In our modeling analysis, we have attempted to use genome-wide binding data to better understand how the transcription machinery that “reads” genes might disassemble, in light of what is known about the assembly process. This knowledge helps us better understand how cells coordinate the on/off switching of their genes.
Eukaryotic genes are thought to be regulated by hundreds of proteins that assemble into pre-initiation complexes (PIC's) at promoters using an ordered pathway. One aspect of the PIC assembly pathway involves the recruitment of the general transcription factors (GTF's), such as TBP and TFIIB, by sequence-specific activators. TBP and TFIIB then contribute to the recruitment of RNA polymerase II (pol II) and other GTF's, which eventually start transcription.
A fundamental question concerning our understanding of gene regulation is the extent to which each assembly and disassembly step is distinct at every gene in a genome. Is the traditional biochemical view, that TBP “locks in” or commits to a promoter and in a recurring manner nucleates PIC formation, valid in vivo? And is the PIC disassembly process in vivo simply the reverse of the assembly process? Parts of the assembly/disassembly pathway have been rigorously defined in vitro with a few purified proteins and DNA, and this has provided us with our current parsimonious view of PIC regulation
The goal here is to evaluate in vivo occupancy data in light of biochemical mechanisms that are intended to reflect the corresponding in vivo reaction. The extent of biological insight is predicated on rather subjective assessments of the assumptions associated with interpretation of in vivo data. Within the context of declared constraints and assumptions, we propose a means to model in vivo protein-DNA occupancy data, so as to better integrate and conceptualize massive genomic datasets. This study is focused on the means of such modeling and the assumptions inherent in the data, using specific examples of PIC assembly.
Currently, perhaps the most widely used assay to measure the occupancy of proteins at genes in vivo is the chromatin immunoprecipitation assay (ChIP). In ChIP, proteins are crosslinked to DNA, the protein is then purified, and the bound DNA identified either through directed PCR or through genome-wide detection platforms (ChIP-chip and ChIP-seq). In this way, for example, the relative occupancy level of TBP, TFIIB, pol II, and many other proteins at every promoter in the genome in a population of cells can be assayed.
Recent studies using differential ChIP and photobleaching experiments have provided compelling evidence for a dynamic state of PIC components in living cells
The existence and origins of distinct occupancy levels of PIC components on genes have not been systematically explored, and thus are the impetus for conducting the modeling studies described here. Differential occupancy patterns for the GTFs have been described
Here, we develop a ChIP modeling program, termed PathCom, in the context of a fixed PIC assembly pathway to infer allowable dissociation mechanisms. We validate the simulation using an existing chemical kinetics simulator COPASI
The overall goal here is to inter-relate ChIP in vivo occupancy data with biochemical assembly/disassembly mechanisms, in a way that attempts to support or dispute such mechanisms. Such inter-relationships can be complex when one considers that hundreds of proteins are involved in transcriptional regulation. Therefore, we start by modeling only two factors (the GTF's TBP and TFIIB), and increase complexity by adding more GTFs one at a time up to six factors. While we focus on PIC assembly/disassembly mechanisms on a genomic scale, any number of factors and combination of assembly/disassembly steps in gene regulation may be considered, given that all proteins (or species) come together to form a complex.
TBP (T) binds to DNA (D) to form a protein-DNA (TD) complex, and in the presence of TFIIB (B) forms a TDB ternary complex (
The constant availability of energy to drive directional processes allows the pairing of any association and dissociation mechanism. Consequently, there are four paths by which an in vivo occupancy level is achieved for a two-component reaction. The availability of only two experimental constraints (TBP and TFIIB occupancy levels on DNA) is insufficient to specify the predominant association and dissociation pathways. In the absence of a necessary additional experimental constraint, we created a hypothetical constraint for the purposes of modeling, in which we eliminated all but one association pathway. That allowed us to evaluate the two possible dissociation pathways. The reciprocal modeling could also be done, by eliminating all but one dissociation mechanism. Since the purpose of this study is to demonstrate how the modeling works and to discuss its assumptions, caveats, and utility, we illustrate the process using a single association pathway that has good experimental support and model all possible dissociation pathways.
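The pairing described above can be sketched in a few lines of Python. This is only an illustration of the counting argument, not PathCom itself; the variable names are ours, and the fixed association order (TBP before TFIIB) is the hypothetical constraint adopted in the text.

```python
from itertools import permutations, product

# One-letter codes as used in the text: T = TBP, B = TFIIB.
factors = ("T", "B")

# Any association order may be paired with any dissociation order.
association_orders = list(permutations(factors))
dissociation_orders = list(permutations(factors))
paths = list(product(association_orders, dissociation_orders))

# Hypothetical modeling constraint: keep only the association pathway
# in which TBP binds before TFIIB.
fixed_association = ("T", "B")
remaining = [p for p in paths if p[0] == fixed_association]

for assoc, dissoc in remaining:
    print("assemble:", " -> ".join(assoc), "| disassemble:", " -> ".join(dissoc))
```

With two factors this gives four association/dissociation pairings, of which two remain to be evaluated once the association order is fixed.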
Biochemical
Using published genomic datasets of TBP and TFIIB occupancy
To compare occupancy levels between proteins, it was necessary to place them on the same scale. We achieved this by scaling ChIP occupancy values (fold over background) for each factor from 0% to 100%. Our rationale, assumptions, and method for doing this are described in the
From this analysis, several insights were obtained: 1) Some occupancy levels simply do not distinguish among mechanisms. 2) In contrast to the simplified in vitro derived biochemical mechanism, TFIIB might remain at most promoters after TBP has dissociated (although TFIIB may nevertheless be dynamic). How TFIIB does so is a matter of speculation that the data do not address.
Based upon known TBP/TFIIB/DNA biochemical interactions, the notion that TFIIB might dissociate after TBP would seem untenable. However, the additional complexity that exists in vivo might accommodate such a mechanism if other proteins not explicitly defined in this model retained TFIIB at the promoter, after TBP had dissociated. TFIIB engages pol II at promoters via specific interactions
Towards our goal of modeling the assemblage of many proteins, we next consider a three-factor assemblage. The interaction of TFIIB with pol II (P) and TBP is structurally and biochemically well defined
These two rules, together, determine which dissociation mechanisms will be compatible with the data given an assumed association pathway. Note that depending on the actual percent occupancies, these rules will have varying effectiveness in narrowing down the dissociation mechanisms. If the rank order of observed occupancy is the same as the order of association, then all dissociation mechanisms will work.
We transformed these queries into a program termed PathCom (short for Pathway Compatibility), which was used to generate the compatibility chart in
We sought to validate the approach taken by PathCom, to ensure that it reflected the enzymological concepts that this modeling attempts to emulate. Our validation employed COPASI, a freely available program that simulates biochemical kinetics
To maximize the parameter search space and avoid local minima, COPASI imposes some randomness in moving through the decision-making process. Since the system is under-constrained and randomness is involved, each repeated modeling run converges on a different solution for each mechanism (i.e., many different combinations of rate constant values can produce the observed occupancy levels, if a solution can be found). The values of the underlying rate constants generated by the Parameter Estimator in COPASI are not meaningful; rather, the resulting E-value provides a quantitative measure of the suitability of a mechanism to fit the data. Re-running COPASI on the same dataset returns essentially the same E-value (not shown). Thus, COPASI provides a robust means of evaluating alternative mechanisms and validating PathCom.
Importantly, the analysis indicates that given a fixed association mechanism, there are a limited number of dissociation mechanisms (green squares in
In principle, dissociation of pol II may proceed via removal into the bulk nucleoplasm and/or translocation down the DNA upon transcription, where ChIP occupancy would not be detected by microarray probes at the 5′ ends of genes. Consistent with the latter possibility, high transcription frequencies are observed at the (H, H, L) set, which has high TBP and TFIIB occupancy but relatively low occupancy of pol II (
The suggestion that TFIIB dissociates after both TBP and pol II dissociation is consistent with some reports in the literature
We further examined the plausibility that TBP might not be fully bound at “high” occupancy promoters by looking at experimentally determined “digital footprints” of TBP bound at those promoters having the highest TBP occupancy (
Groups of genes that had very few members (e.g., (H, L, L) and (H, L, H)), or had very low occupancy of all tested factors (e.g., (L, L, L)), are expected to have higher variation, and thus to be less reliably interpreted. Therefore, these groups were not examined further. For the remaining groups, one to two mechanisms were found to be compatible. A common theme was that TBP dissociated first, then pol II, and then TFIIB, which was consistent with the conclusions drawn from the two-factor assembly analysis described above.
As more factors were added to the modeling, and genes grouped according to low or high occupancy levels of each protein, the number of possible groups grew exponentially (2^n, where n is the number of modeled proteins). Consequently, membership in each group diminished, some to negligible levels. Those with negligible membership did not represent predominant patterns and may have arisen by chance as a consequence of noisy occupancy levels. Therefore, we combined groups of genes that lacked a viable membership level (see
Using the in vitro model for PIC assembly, we next added TFIIH (H) to the mechanism: TBP → TFIIB → pol II → TFIIH. This mechanism is applicable even if pol II and TFIIH were entering together. As shown in
In the four-factor mechanism, groups having a relatively large gene membership typically were limited to being compatible with only one or two of the 24 theoretically possible dissociation mechanisms (
We next added TFIIF (F) (
See
The occupancy levels in the five-factor modeling were compatible with mechanisms in which TBP and pol II dissociate early and TFIIB and TFIIF dissociate late (
In modeling six factors (
Genome-wide occupancy data for the many hundreds of proteins involved in gene regulation is now accumulating. One major challenge has been to inter-relate such occupancy data and conceptualize it in light of models about how these proteins function together. Such models, as in the case of the assembly of the transcription machinery at promoters, are derived from biochemical experiments conducted on isolated components of the transcription machinery. The extent to which inferred biochemical mechanisms reflect in vivo processes is not known. We are not aware of any means of modeling genome-wide occupancy data to determine whether it is compatible with biochemical mechanisms. To this end, we developed the software tool PathCom. PathCom is generic in that it will determine whether any number of user-defined mechanisms is compatible with measured occupancy data of any number of relevant proteins. We applied PathCom to transcription complex assembly/disassembly, which has been extensively defined biochemically and for which genome-wide ChIP-chip occupancy data is available. Biological insight gleaned from the modeling is subject to the veracity of the assumptions regarding what in vivo ChIP occupancy data actually measures, and to the quality of the data being modeled.
Eukaryotic protein coding genes utilize a common set of general transcription factors to assemble RNA polymerase II at promoters. A long-standing question that biochemistry has attempted to explain is the order of assembly of the transcription machinery and what happens to individual components during multiple transcription cycles. As far as the general transcription machinery is concerned, in vitro ordered assembly starts with TBP followed by TFIIB, then pol II and TFIIF, and then TFIIE and TFIIH
In regards to the genome-wide distribution of the GTF's, we did not see a random partitioning of genes into high vs low occupancy states for each factor. Principal component analysis (PCA) indicates the presence of a single major component (not shown), and several minor ones. This would be consistent with the strong tendency of the GTF's to work together. What is interesting about the PCA is that TFIIB, pol II, TFIIF, and TFIIH were the main drivers in the first principal component, despite pol II having relatively low occupancy at the promoter region. TBP contributed the least to the principal components (
When clustering all GTF's and pol II, three high occupancy states stood out as having a large membership. These included genes with high levels of 1) all GTF's, 2) all GTF's except TBP, and 3) all GTF's except TBP and pol II. The group having high levels of all GTF's was by far the most highly transcribed, which is not surprising. This group included the ribosomal protein genes. However, for the major groups, low levels of TBP were more closely linked to low levels of transcription than the occupancy level of any of the other factors including pol II. This confirms on a genomic scale the earlier notion established on a few genes that TBP recruitment or retention is rate-limiting in transcription
While the number of dissociation mechanisms scales factorially (n!) with the number (n) of proteins involved, we did not see an equal distribution of genes into each type of mechanism, and we did not see a corresponding increase in the number of compatible dissociation mechanisms. Instead, the number of compatible mechanisms remained rather fixed at one to two, for a given association mechanism. The general pattern observed for most genes was that if TBP, TFIIB, pol II, and the other GTFs assembled in the listed order, then the dissociation order was generally TBP, then pol II, then the other GTFs, with the latter being less resolved.
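The factorial growth can be made concrete with a short Python sketch that enumerates every candidate dissociation order for two to six factors. It illustrates only the size of the search space; it does not test compatibility with occupancy data. The one-letter codes T, B, P, H, and F follow the text, while the code "E" for TFIIE and the position of TFIIE as the sixth factor are our assumptions.

```python
from itertools import permutations
from math import factorial

def dissociation_orders(factors):
    """Return every possible order in which the given factors could dissociate."""
    return list(permutations(factors))

# Factors in the order they were added to the modeling; "E" is an assumed code.
gtfs = ["T", "B", "P", "H", "F", "E"]

# Number of candidate dissociation mechanisms for 2 to 6 modeled factors.
counts = {n: len(dissociation_orders(gtfs[:n])) for n in range(2, len(gtfs) + 1)}

for n, count in counts.items():
    print(f"{n} factors: {count} possible dissociation orders")
```

For four factors this reproduces the 24 theoretically possible dissociation mechanisms mentioned in the four-factor analysis.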
Factor occupancy data was obtained from ArrayExpress (
This scaling was necessary to compare occupancy levels across different factor datasets. In principle such scaling eliminates differences in crosslinking efficiencies and ChIP yields between factors. Fold-over-background values equal to or less than 1 represent background and thus were re-coded as 0% occupancy. Several limitations of the ChIP assay precluded accurate assessment of 100% occupancy. First, ChIP hybridization signals generally correlated with actual occupancy levels but were not tightly linked (see below), and so the maximum detected fold enrichment over background could not simply be set to 100%, inasmuch as the variance might be quite substantial. Second, ChIP assays do not measure absolute binding, and so even if the variance were eliminated, we could not be certain that the maximum detected level of binding represented 100% occupancy. Nonetheless, if all factors are held to the same standard, and data from groups of similarly behaving genes are aggregated, then approximations can be made. Therefore, we coded any value above the 99th percentile rank (top 200 probes) as 100% (setting the 100% mark to the upper 98th percentile gave essentially the same results). All remaining data were scaled between 0 and 100% occupancy by subtracting background (1.0) from all data, and dividing through by the value at the 99th percentile rank.
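The scaling procedure can be sketched as follows. This is a minimal illustration under stated assumptions rather than the code used in the study: the function name and the nearest-rank percentile definition are ours, and we read "dividing through by the value at the 99th percentile rank" as dividing by that value after background subtraction, so that the cap maps to exactly 100%.

```python
def scale_occupancy(fold_values, background=1.0, top_percentile=99):
    """Scale fold-over-background ChIP values to 0-100% occupancy.

    Values at or below background become 0%; values at or above the chosen
    percentile become 100%; everything else is scaled linearly in between.
    top_percentile is an integer; the nearest-rank definition is an assumption.
    """
    ordered = sorted(fold_values)
    # Nearest-rank percentile using integer ceiling division.
    rank = max(1, -(-top_percentile * len(ordered) // 100))
    cap = ordered[rank - 1]
    span = cap - background
    if span <= 0:
        raise ValueError("percentile value must exceed background")

    scaled = []
    for value in fold_values:
        if value <= background:
            scaled.append(0.0)
        elif value >= cap:
            scaled.append(100.0)
        else:
            scaled.append(100.0 * (value - background) / span)
    return scaled

# Toy example with made-up values and an 80th-percentile cap for readability.
example = scale_occupancy([0.5, 1.0, 2.0, 3.0, 5.0], top_percentile=80)
```

In the toy example the cap falls at a fold enrichment of 3.0, so a value of 2.0 lands halfway between background and the cap.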
It is generally assumed that ChIP signals scale linearly with actual occupancy level. However, it is possible that a factor bound to one type of DNA sequence may crosslink more readily than when bound to a different sequence. To test the effect of underlying DNA sequence on crosslinking efficiency, we examined the distribution of TBP occupancy levels at each of the eight TATA box subtypes
To increase the robustness of the occupancy values, as well as focus the modeling on predominant patterns, we grouped genes in accordance with their occupancy level for each factor. Genes (probes) having a GTF occupancy below 10% were parsed into low (L) occupancy groups. All others were parsed into high (H) occupancy groups, resulting in 2^n theoretically possible groups, where “n” is the number of GTFs being modeled. Parsing the data at a 15% cutoff, or into three groups (low, medium, and high, using 10% and 20% as the low-medium and medium-high cutoffs, respectively), did not substantially alter the outcomes or the main conclusions.
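The grouping step can be illustrated with a short Python sketch. The function names, gene names, and occupancy values below are hypothetical; only the 10% cutoff and the H/L labels come from the text.

```python
from collections import defaultdict

def occupancy_label(percent_by_factor, factor_order, cutoff=10.0):
    """Label a gene as a tuple of 'H'/'L', one entry per factor, e.g. ('H', 'H', 'L')."""
    return tuple(
        "L" if percent_by_factor[factor] < cutoff else "H"
        for factor in factor_order
    )

def group_genes(genes, factor_order, cutoff=10.0):
    """Collect gene names under their high/low occupancy label."""
    groups = defaultdict(list)
    for name, percent_by_factor in genes.items():
        groups[occupancy_label(percent_by_factor, factor_order, cutoff)].append(name)
    return dict(groups)

# Hypothetical scaled occupancies (%) for T = TBP, B = TFIIB, P = pol II.
genes = {
    "geneA": {"T": 42.0, "B": 55.0, "P": 4.0},
    "geneB": {"T": 3.0, "B": 31.0, "P": 18.0},
    "geneC": {"T": 50.0, "B": 60.0, "P": 9.9},
}
groups = group_genes(genes, factor_order=("T", "B", "P"))
```

With n factors there are at most 2^n distinct labels, and only those actually observed appear as keys in the result.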
Groups having low membership do not represent predominant patterns and so were consolidated as follows: Groups having >100 genes were exempt from consolidation because they have substantial membership, and groups having <10 genes were required to be consolidated for lack of viable membership. Otherwise, if the membership of an existing group was split by more than a 4∶1 ratio when an additional factor was added to the model (e.g. from 2-factor models to 3-factor models), then the two resulting clusters were consolidated (i.e., not split; note that the label of the consolidated clusters was assigned the label of the larger cluster). The final occupancy median calculations can be found in
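One possible reading of the consolidation rules is sketched below. The text does not state which of the two resulting clusters the size thresholds are applied to, so this sketch assumes they apply to the smaller one; the function name and parameter names are ours.

```python
def should_consolidate(size_a, size_b, exempt_above=100, required_below=10, max_ratio=4.0):
    """Decide whether two clusters produced by adding a factor should be merged.

    Assumed reading of the rules:
      - a smaller cluster with fewer than `required_below` genes is always merged;
      - a smaller cluster with more than `exempt_above` genes is never merged;
      - otherwise, merge when the split is more uneven than `max_ratio`:1.
    """
    small, large = sorted((size_a, size_b))
    if small < required_below:
        return True
    if small > exempt_above:
        return False
    return large > max_ratio * small

# Hypothetical cluster sizes after adding a third factor to a two-factor group.
decisions = {
    (500, 110): should_consolidate(500, 110),  # both substantial
    (500, 50): should_consolidate(500, 50),    # 10:1 split
    (40, 5): should_consolidate(40, 5),        # smaller cluster below 10 genes
    (80, 20): should_consolidate(80, 20),      # exactly 4:1, not "more than"
}
```

When two clusters are merged under this reading, the merged cluster would carry the label of the larger one, as stated in the text.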
PathCom requires the user to enter protein occupancies in a tab-delimited text file, one cluster per line, with the occupancies followed by the name of the cluster. In a header, before the occupancies are entered, users enter one-letter codes (of the user's choice) to denote protein identities, each followed by a number to indicate the order in which the proteins assemble (See
COPASI conducts chemical kinetic and stochastic simulations
Since each modeling run has a manual component and becomes computationally draining with a large number of factors, it became impractical to run COPASI to fully generate the compatibility charts for four or more factors. Nonetheless, we employed COPASI to spot check these charts, and found 100% agreement with PathCom.
Scatter plot showing the distribution of percent of maximally measured occupancy of TBP and TFIIB.
(0.16 MB TIF)
Scatter plots showing the occupancy level of each replicate. Also shown are two plots comparing the median percent occupancies of TBP and TFIIB in the four two-factor clusters using both the low and high density tiling array data.
(0.26 MB TIF)
All six possible three-factor assembly pathways are shown with their corresponding PathCom compatibility cluster plots, detailing which disassembly pathways are compatible under each assembly pathway. See
(0.23 MB TIF)
Shown are the experimentally determined digital footprints of genes having the highest occupancy of TBP (with TATA-boxes). The bases boxed in red highlight the TATA-boxes. The lack of discernable footprints suggests that TBP does not fully occupy its most occupied sites.
(0.28 MB TIF)
The two strongest principal components in a Principal Components Analysis (PCA) done on the six general transcription factors. They are plotted to show each factor's relative contribution to the principal components.
(0.11 MB TIF)
Compatibility chart for three factor modeling using COPASI, in which the DNA concentration was reduced from 10 to 1.
(0.20 MB TIF)
PathCom code for Windows users.
(0.01 MB TXT)
PathCom code for Mac OSX users.
(0.01 MB TXT)
Instructions on how to use PathCom.
(0.52 MB DOC)
Principal Component Analysis (PCA) of the six GTF's
(0.01 MB XLS)
The results of chi-square tests on whether underlying TATA-sequence variation might have had any effect on the cross-linking efficiencies of TBP and TFIIB.
(0.04 MB XLS)
Median occupancy levels for gene groups
(0.03 MB XLS)
We thank Istvan Albert, Bryan Venters, and other members of the lab for many helpful discussions and guidance on this work. We thank Cizhong Jiang and Shinichiro Wachi for providing computational code.