Search
Advanced Search
Average Rating (0 User Ratings)
    • Currently 0/5 Stars.
    See all categories
      • Currently 0/5 Stars.
      • Currently 0/5 Stars.
      • Currently 0/5 Stars.
    Rate This Article

Open Access

Research Article

Discovering Motifs in Ranked Lists of DNA Sequences

Eran Eden1*, Doron Lipson1, Sivan Yogev1,2, Zohar Yakhini1,3*

1 Computer Science Department, Technion, Haifa, Israel, 2 IBM Research Laboratories, Haifa, Israel, 3 Agilent Laboratories, Santa Clara, California, United States of America

Abstract

Computational methods for discovery of sequence elements that are enriched in a target set compared with a background set are fundamental in molecular biology research. One example is the discovery of transcription factor binding motifs that are inferred from ChIP–chip (chromatin immuno-precipitation on a microarray) measurements. Several major challenges in sequence motif discovery still require consideration: (i) the need for a principled approach to partitioning the data into target and background sets; (ii) the lack of rigorous models and of an exact p-value for measuring motif enrichment; (iii) the need for an appropriate framework for accounting for motif multiplicity; (iv) the tendency, in many of the existing methods, to report presumably significant motifs even when applied to randomly generated data. In this paper we present a statistical framework for discovering enriched sequence elements in ranked lists that resolves these four issues. We demonstrate the implementation of this framework in a software application, termed DRIM (discovery of rank imbalanced motifs), which identifies sequence motifs in lists of ranked DNA sequences. We applied DRIM to ChIP–chip and CpG methylation data and obtained the following results. (i) Identification of 50 novel putative transcription factor (TF) binding sites in yeast ChIP–chip data. The biological function of some of them was further investigated to gain new insights on transcription regulation networks in yeast. For example, our discoveries enable the elucidation of the network of the TF ARO80. Another finding concerns a systematic TF binding enhancement to sequences containing CA repeats. (ii) Discovery of novel motifs in human cancer CpG methylation data. Remarkably, most of these motifs are similar to DNA sequence elements bound by the Polycomb complex that promotes histone methylation. Our findings thus support a model in which histone methylation and CpG methylation are mechanistically linked. Overall, we demonstrate that the statistical framework embodied in the DRIM software tool is highly effective for identifying regulatory sequence elements in a variety of applications ranging from expression and ChIP–chip to CpG methylation data. DRIM is publicly available at http://bioinfo.cs.technion.ac.il/drim.

Author Summary

A computational problem with many applications in molecular biology is to identify short DNA sequence patterns (motifs) that are significantly overrepresented in a target set of genomic sequences relative to a background set of genomic sequences. One example is a target set that contains DNA sequences to which a specific transcription factor protein was experimentally measured as bound while the background set contains sequences to which the same transcription factor was not bound. Overrepresented sequence motifs in the target set may represent a subsequence that is molecularly recognized by the transcription factor. An inherent limitation of the above formulation of the problem lies in the fact that in many cases data cannot be clearly partitioned into distinct target and background sets in a biologically justified manner. We describe a statistical framework for discovering motifs in a list of genomic sequences that are ranked according to a biological parameter or measurement (e.g., transcription factor to sequence binding measurements). Our approach circumvents the need to partition the data into target and background sets using arbitrarily set parameters. The framework is implemented in a software tool called DRIM. The application of DRIM led to the identification of novel putative transcription factor binding sites in yeast and to the discovery of previously unknown motifs in CpG methylation regions in human cancer cell lines.

Introduction

Background

This paper examines the problem of discovering “interesting” sequence motifs in biological sequence data. A widely accepted and more formal definition of this task is: given a target set and a background set of sequences (or a background model), identify sequence motifs that are enriched in the target set compared with the background set.

The purpose of this paper is to extend this formulation and to make it more flexible so as to enable the determination of the target and background set in a data driven manner.

Discovery of sequences or attributes that are enriched in a target set compared with a background set (or model) has become increasingly useful in a wide range of applications in molecular biology research. For example, discovery of DNA sequence motifs that are overabundant in a set of promoter regions of co-expressed genes (determined by clustering of expression data) can suggest an explanation for this co-expression. Another example is the discovery of DNA sequences that are enriched in a set of promoter regions to which a certain transcription factor (TF) binds strongly, inferred from chromatin immuno-precipitation on a microarray (ChIP–chip) [1] measurements. The same principle may be extended to many other applications such as discovery of genomic elements enriched in a set of highly methylated CpG island sequences [2].

Due to its importance, this task of discovering enriched DNA subsequences and capturing their corresponding motif profile has gained much attention in the literature. Any approach to motif discovery must address several fundamental issues. The first issue is the way by which motifs are represented. There are several strategies for motif representation: using a k-mer of IUPAC symbols where each symbol represents a fixed set of possible nucleotides at a single position (examples of methods that use this representation include REDUCE [3], YMF [4,5], ANN-SPEC [6], and a hypergeometric-based method [7]) or using a position weight matrix (PWM), which specifies the probability of observing each nucleotide at each motif position (for example MEME [8], BioProspector [9], MotifBooster [10], DME-X [11], and AlignACE [12]). Both representations assume base position independence. Alternatively, higher order representations that capture positional dependencies have been proposed (e.g., HMM and Bayesian networks motif representations [13]). While these representations circumvent the position independence assumption, they are more vulnerable to overfitting and lack of data for determining model parameters. The method described in this paper uses the k-mer model with symbols above IUPAC.

The second issue is devising a motif scoring scheme. Many strategies for scoring motifs have been suggested in the literature. One simple yet powerful approach uses the hypergeometric distribution for identifying enriched motif kernels in a set of sequences and then expanding these motifs using an EM algorithm [7]. The framework described in this paper is a natural extension of the approach of [7]. YMF [4,5] is an exhaustive search algorithm which associates each motif with a z-score. AlignACE [12] uses a Gibbs sampling algorithm for finding global sequence alignments and produces a MAP score. This score is an internal metric used to determine the significance of an alignment. MEME [8] uses an expectation maximization strategy and outputs the log-likelihood and relative entropy associated with each motif.

Once a scoring scheme is devised, a defined motif search space is scanned (either heuristically or exhaustively) and motifs with significantly high scores are identified. To determine the statistical significance of the obtained scores, many methods resort to simulations or ad hoc thresholds. Several excellent reviews narrate the different strategies for motif detection and use quantitative benchmarking to compare their performance [1418]. A related aspect of motif discovery, which is outside the scope of this paper, focuses on properties of clusters and modules of TF binding sites (TFBS). Examples of approaches that search for combinatorial patterns and modules underlying TF binding and gene expression include [1923].

Open Challenges in Motif Discovery

One issue of motif discovery that is often overlooked concerns the partition of the input set of sequences into target and background sets. Many methods rely on the user to provide these two sets and search for motifs that are overabundant in the former set compared with the latter. The question of how to partition the data into target and background sets is left to the user. However, the boundary between the sets is often unclear and the exact choice of sequences in each set arbitrary. For example, suppose that one wishes to identify motifs within promoter sequences that constitute putative TFBS. An obvious strategy would be to partition the set of promoter sequences into target and background sets according to the TF binding signal (as measured by ChIP–chip experiments). The two sets would contain the sequences to which the TF binds “strongly” and “weakly,” respectively. A motif detection algorithm could then be applied to find motifs that are overabundant in the target set compared with the background set. In this scenario, the positioning of the cutoff between the strong and weak binding signal is somewhat arbitrary. Obviously, the final outcome of the motif identification process can be highly dependent on this choice of cutoff. A stringent cutoff will result in the exclusion of informative sequences from the target set while a promiscuous cutoff will cause inclusion of nonrelevant sequences—both extremes hinder the accuracy of motif prediction. This example demonstrates a fundamental difficulty in partitioning most types of data. Several methods attempt to circumvent this hurdle. For example, REDUCE [3] uses a regression model on the entire set of sequences. However, it is difficult to justify this model in the context of multiple motif occurrence (as explained below). In other work, a variant of the Kolmogorov-Smirnov test was used for motif discovery [24]. This approach successfully circumvents arbitrary data partition. However, it has other limitations such as the failure to address multiple motif occurrences in a single promoter, and the lack of an exact characterization of the null distribution. Overall, the following four major challenges in motif discovery still require consideration: (c1) the cutoff used to partition data into a target set and background set of sequences is often chosen arbitrarily; (c2) lack of an exact statistical score and p-value for motif enrichment. Current methods typically use arbitrarily set thresholds or simulations, which are inherently limited in precision and costly in terms of running time; (c3) a need for an appropriate framework that accounts for multiple motif occurrences in a single promoter. For example, how should one quantify the significance of a single motif occurrence in a promoter against two motif occurrences in a promoter? Linear models [3] assume that the weight of the latter is double that of the former. However, it is difficult to justify this approach since biological systems do not necessarily operate in such a linear fashion. Another issue related to motif multiplicity is low complexity or repetitive regions. These regions often contain multiple copies of degenerate motifs (e.g., CA repeats). Since the nucleotide frequency underlying these regions substantially deviates from the standard background frequency, they often cause false-motif discoveries. Consequently, most methods mask these regions in the preprocessing stage and thereby lose vital information that might reside therein; (c4) criticism has been made over the fact that motif discovery methods tend to report presumably significant motifs even when applied on randomly generated data [25]. These motifs are clear cases of false positives and should be avoided.

Data Lends Itself to Ranking in a Natural Manner

In this paper we describe a novel method that attempts to solve the above-mentioned four challenges in a principled manner. It exploits the following observation: data often lends itself to ranking in a natural manner, e.g., ranking sequences according to TF binding signal: ranking according to CpG methylation signal, ranking according to distance in expression space from a set of co-expressed genes, ranking according to differential expression, etc. We exploit this inherent ranking property of biological data in order to circumvent the need for an arbitrary and difficult-to-justify data partition. Consequently, we propose the following formulation of the motif finding task: given a list of ranked sequences, identify motifs that are overabundant at either end of the list.

Our solution employs a statistical score termed mHG (minimal hypergeometric) [26]. It is related to the concept of rank-imbalanced motifs, which are sequence motifs that tend to appear at either end of a ranked sequence list. In previous work [26], the authors used mHG to identify sequence motifs in expression data. We use this simple yet powerful approach as the starting point for our study.

Overview

The rest of this paper is divided into two main parts, each of which is self-contained: in the Results we briefly outline our method and describe new biological findings that were obtained by applying this method to biological data. We address challenge (c4) by testing the algorithm on randomly ranked real genomic sequences. In the Methods, we describe the mHG probabilistic and algorithmic framework and explain how we deal with challenges (c1)–(c3).

Results

Statistics and Algorithms in a Nutshell

Based on the mHG framework, we developed a software tool termed DRIM (discovery of rank imbalanced motifs) for motif identification in DNA sequences. A flow chart of DRIM is provided in Figure 1. The formal introduction and details of the mHG statistics are given in Methods. However, to facilitate the explanation and interpretation of our biological results, we begin with a brief description of the method.

thumbnail

Figure 1. DRIM Flow Chart

DRIM receives a list of DNA sequences as input and a criterion by which the sequences should be ranked, for example, TF binding signals as measured by ChIP ChIP–chip:

(i) The sequences are ranked according to the criterion.

(ii) A “blind search” is performed over all the motifs that reside in the restricted motif space (in this study the restricted motif space contains ~100,000 motifs, see Methods, The DRIM software). For each motif an occurrence vector is generated. Each position in the vector is the number of motif occurrences in the corresponding sequence, (the figure shows the vector for the motif CACGTGW).

(iii) The motif significance is computed using the mHG scheme, and the optimal partition into target and background sets in terms of motif enrichment is identified. The promising motif seeds are passed as input to the heuristic motif search model and the rest are filtered out.

(iv,v) The motif seeds are expanded in an iterative manner (the mHG is computed in each lap), until a local optimum motif is found.

(vi) The exact mHG p-value of the motif is computed. If it has a p-value < 10−3, then it is predicted as a true motif (the choice of this threshold is explained in Results, Proof of principle). The output of the system is the motif representation above IUPAC, its PSSM, mHG p-value, and optimal set partition cutoff.

doi:10.1371/journal.pcbi.0030039.g001

Suppose we are given a set of DNA sequences and some measured signal associated with each sequence. We rank the sequences according to the signal. Now, given a sequence motif, we wish to assess whether that motif tends to appear more often at the “top” of a list compared with the “remainder” of the list. The mHG score captures this type of motif significance. More precisely, the mHG score reflects the surprise of seeing the observed density of motif occurrences at the top of the list compared with the rest of the list under the null assumption that all configurations of motif occurrences in the list are equiprobable. A unique feature of the mHG statistics is that the cutoff between the top and the rest of the list is chosen in a data-driven manner so as to maximize the motif enrichment. This is done by computing the motif enrichment over all possible set partitions and identifying the cutoff at which maximal statistical significance is observed.

The search for this optimal cutoff introduces a multiple testing problem. To solve this without resorting to multiple testing corrections, which diminish the score's sensitivity, we provide a novel algorithm for computing the exact p-value of mHG scores (see Methods, Calculating the p-value of the mHG score). This eliminates the need to resort to simulations or exhaustively calculated tables.

Our method also includes a new approach to modeling motif multiplicity by incorporating a multidimensional hypergeometric framework (see Methods, Multidimensional mHG score). Unlike some models, which assume linearity (e.g., that two binding motifs have twice the binding capacity as one motif), our model does not make such pre-assumptions. Instead, the degree of surprise is adjusted for each motif according to its own occurrence multiplicity distribution.

DRIM scans through a motif space, computes the mHG p-value of these motifs and reports the significant ones (see Methods, The DRIM software).

Proof of Principle

We begin by testing our method on synthetically generated clear-cut positive and negative control cases. We do this to verify that DRIM accurately identifies motifs in well-characterized and experimentally verified examples and at the same time avoids false identification of motifs in randomly ordered genomic sequences. The latter objective is of particular importance since the issue of false identification has been mentioned as one of the main shortcomings of motif discovery approaches. For example, in a previous study, six different motif discovery applications were used to search for TFBS motifs [25]. Each of the programs attempted to measure the significance of its results using one or more enrichment scores. The authors report that the applications outputted high-scoring motifs even when applied to random selections of intergenic regions. A different paper reports clusters of genes whose expression patterns correlate to the expression of a particular TF [27]. These clusters were then analyzed for enriched motifs. Again, the authors report that random sets, with sizes matching those of the real clusters, contained a large number of motifs with significant scores.

To test our method's false-prediction rate, we performed the following negative control experiment: five different random permutations of ChIP–chip data were generated by randomly selecting 400 promoters and randomly permuting their ranks. DRIM was then applied to these ranked lists and scanned more than 100,000 different motifs in each one. None of the motifs that were scanned had a significant corrected mHG p-value <10−3. Note that to get the corrected p-values, two levels of multiple test corrections are performed: correcting for the number motifs that are tested; and correcting for multiple cutoffs that are tested as part of the mHG optimization process.

How do the p-values of random motifs compare with those of true biological motifs? To test this, we chose five TFs (BAS1, GAL4, CBF1, INO2, and LEU3) whose motif binding sites are well-characterized and experimentally verified. We applied DRIM to the ChIP–chip data of these TFs as reported in [25]. In all instances, the true motifs were identified with corrected p-values of 10−6, 10−9, 10−76, 10−18, and 10−8, respectively. A comparison of the p-value distribution of the motifs in the randomly ordered sequences with that of the verified TFBS motifs is given in Figure S3. In all instances the true TFBS motifs were predicted with p-values that were several orders of magnitude more significant than the best p-value of a motif in the randomly permuted data. This indicates that the enrichment signals of true TFBS, as captured by the mHG p-value, are clearly distinct from the signals we expect to find in random rankings of genomic sequences.

TFBS Prediction Using ChIP–chip

To further test the effectiveness of our method, we used it for identification of TFBS in yeast by applying it to the Harbison and Lee–filtered ChIP–chip datasets [25,28], containing measurements of 207 TF binding experiments in several conditions (for details regarding dataset-filtering see Methods). Interestingly, we observed that in many of these datasets longer intergenic regions are biased toward stronger TF binding. We elaborate on this sequence length bias in the Methods section and in Figure S1.

In each of the ChIP–chip experiments, we ranked the intergenic regions according to the TF binding signal (we use the p-value of enrichment for the sequence represented on the array). This was used as input for DRIM, which then searched for motifs that tend to appear densely at the top of the ranked lists. If such a motif does exist, with a p-value less than 10−3, then we hypothesize that it is biologically significant and that it contributes to the TF's binding, either directly or indirectly.

The results on the Harbison filtered dataset are summarized in Table S2. A TF was assigned a motif if such was found in at least one condition. We compared the DRIM predictions with previously reported TFBS discoveries in ChIP–chip that incorporated predictions of six other motif discovery methods and conservation data [25]. The results of this comparison are summarized in Figure 2.

thumbnail

Figure 2. Comparison between Predictions of DRIM and Published Predictions of Six Other Methods and Conservation Data as Reported in [25]

Overall, out of 162 unique TFs, DRIM identified significant motifs for 82 TFs with p-value <10−3. Out of the 162 TFs, DRIM and the other applications agree on 96 TFs: 27 TFs for which a similar motif was found and 69 TFs for which no significant motifs were found. There are five TFs for which the motifs predicted by DRIM and other applications differ; 11 for which the other applications identified motifs that DRIM did not; and 50 for which DRIM identified a motif that the other applications did not (for details see Tables S2 and S3). Sequence logos were generated using the RNA Structure Logo software [56].

doi:10.1371/journal.pcbi.0030039.g002

Overall, DRIM identified 50 motifs that were not picked up by the six other methods as reported in [25]. We further investigated these putative TFBS for additional evidence that they are biologically meaningful. First, we found that seven of them (ASH1, GCR1, HAP2, MET31, MIG1, RIM101, and RTG3) are in agreement with previously published results that are based on experimental techniques other than ChIP–chip. Second, we compared them with a list of conserved regulatory sites in yeast that was recently inferred using conservation-based algorithms [29]. Ten of our putative TFBS match these conserved sites (ARG81, ARO80, ASH1, CRZ1, DAL81, HAP2, IME1, MET31, MIG1, and RTG3). Taken together, these findings provide a strong indication that at least some of the new motifs identified by DRIM are true biological signals. In the following subsections, we focus on a few of these putative TFBS (see Figure 3) and present additional evidence that supports their biological role. We use these findings to discover new interactions in the yeast genetic regulatory network.

thumbnail

Figure 3. Examples of TFs for Which DRIM Identifies Novel Motifs

We further investigated these motifs and show evidence of their biological function. YPD, H2O2, and SM denote the ChIP–chip experimental conditions [25] in which the motifs were identified.

doi:10.1371/journal.pcbi.0030039.g003
Aro80 transcription regulatory network.

The Aro80 TF regulates the utilization of secondary nitrogen sources such as aromatic amino acids, as part of the Ehrlich pathway [30]. In particular, it is involved in the regulation of 2-phenylethanol, a compound with a rose-like odor, which is the most-used fragrance in the perfume and cosmetics industry [31]. Due to its commercial potential, the optimized production of this substance has received much attention [31].

We identified the remarkably large motif, WWNCCGANRNWNNCCGNRRNNW, in Aro80 rich media ChIP–chip data [25] with p-value < 10−11 (see Figure 3). We refer to this putative binding site as BSAro80. Furthermore, we discovered the same motif in two other independent sources of data: Aro80 rich media experiment in the Lee filtered dataset and Aro80 SM condition (amino acid starvation), both with p-value < 10−6. Only seven copies of this motif occurred in the entire yeast genome. These seven copies are distributed among four promoters, three of which have two copies of BSAro80 each. This unusual motif distribution is combinatorially surprising and therefore suggests biological significance. We note that BSAro8 shares some similarity with a previously reported Aro80 motif [29,32]. However, the sequence of BSAro8 provides new insights into the mechanism of the yeast Ehrlich pathway that cannot be explained by the previously described motif. (i) It was previously shown that Aro80 enhances the transcription of Aro9 and Aro10 [30,32]. We found BSAro80 in the promoters of both genes—two copies in each promoter. (ii) Interestingly, BSAro80 appears in the promoter of the gene coding to the Aro80 protein. Since the BSAro8 motif appears only in four promoters in the entire genome, it is highly unlikely that this occurred by chance. We therefore hypothesize that Aro80 self-regulates its own transcription by directly binding to its own promoter. (iii) The fourth promoter (when ranking according to Aro80 rich media ChIP–chip data [25]) contains two BSAro80 elements, one on the sense and the other on the anti-sense. This configuration is shared by two divergently transcribed genes, NAF1 and Esbp6. The latter gene was previously shown to have increased transcription in the presence of phenylalanine as sole nitrogen source [30], suggesting it may play a role in the Ehrlich pathway. Esbp6 is a monocarboxylate permease and might be involved in the transfer of substrates of the Ehrlich pathway across the plasma membrane. (iv) We analyzed the conservation of BSAro8 in four yeast strains and found all seven of its copies to be conserved in the different strains. (v) Aro80 belongs to the Zn2Cys6 family of TFs that are known to bind CCG elements separated by a spacing. Indeed, in addition to other conserved nucleotides, the motif contains CCG gapped tri-nucleotides. (vi) In a previous study, in order to identify cis-acting sequences involved in Aro9 induction, a series of deletions were produced in the Aro9 promoter region, and the expression of a reporter gene was monitored [32]. The authors concluded that the sequence CCGN7CCGN7CCGN7CCG in the Aro9 promoter is responsible for Aro80 binding. We note, however, that the changes in expression caused by the mutations can be interpreted differently, and in fact they are even more consistent with our BSAro80 motif. Deletions or mutations that simultaneously altered all motif copies in the promoter dramatically reduced expression, while those which altered only some of the copies caused a more mild decrease. Other deletions that did not affect any BSAro80 motif did not affect the expression at all. A detailed analysis of the BSAro80 element with respect to these mutagenesis studies is given in Figure S4.

A putative transcription network of Aro80 that incorporates these findings is shown in Figure 4. Note that GATA binding sites are found adjacent to the BSAro80 motif. We further discuss the potential role of these motifs in the Discussion.

thumbnail

Figure 4. The Hypothetical Regulatory Network of Aro80

Copies of the BSAro80 motif (on the sense and antisense) are shown as rectangles on the promoter regions.

(A) BSAro80 is conserved in four strains of yeast as shown using the University of California Santa Cruz browser conservation plots. Aro80 regulates the utilization of secondary nitrogen sources such as aromatic amino acids by binding genes that participate in the catabolism of aromatic amino acids. We hypothesize that it also binds to its own promoter region and introduces a positive feedback self loop.

(B) Part of the Aro80 promoter sequence is shown with bases of the BSAro80 motif colored in red. Interestingly, there are three GATA binding sites that are adjacent to the BSAro80 motif (bases colored in green). These sites bind GATA factors that are known to play a role in nitrogen catabolite repression. We hypothesize that they are also involved in the repression of Aro80 expression by physically binding to the region near BSAro80, thus making it inaccessible to Aro80 binding. This in turn breaks the positive feedback loop and represses the expression of Aro80 itself and other Aro80 regulated genes.

doi:10.1371/journal.pcbi.0030039.g004

The predicted motif BSAro80 exemplifies the usefulness of the mHG flexible cutoff. Our process partitioned the data into a target set containing the top first four promoters (the only promoters in the genome in which the motif resides) and a background set containing the rest of the promoters. Other methods that used a fixed binding signal cutoff (p-value < 10−3) for partitioning the data included 16 other promoters in the target set, in addition to the four promoters in which BSAro80 appears. Consequentially, the signal-to-noise ratio decreases, which might explain why other methods did not identify the BSAro80 element.

Taken together, our results suggest the predicted BSAro80 motif is indeed an Aro80 binding site.

CA repeats are correlated with TF binding.

We identified a bi-nucleotide CA repeat motif with variable length ranging from six to 62 in the Harbison filtered dataset. The CA repeat motif was found to be highly enriched for seven TFs: ARR1, GCR2, IME4, and ACE2 in rich media condition and AFT2, MAL33, and SFP1 in H2O2Hi condition. Furthermore, for two of these TFs (GCR2, IME4), we rediscovered the same CA repeat motif in the Lee filtered dataset. In other words, for the specified TFs, we identify a highly significant correlation between a sequence's capacity to bind the TF and the presence of a CA repeat in the sequence. This type of low complexity motifs are often filtered by current methods. One exception is a recent work in which a CACACACACAC sequence was found to be enriched in Rap1 experiments [33].

It has been previously hypothesized that CA repeats might have a functional role in TF binding [34]. It was proposed that CA repeats, which are often conserved in evolutionary distant organisms, are likely to impose a unique DNA tertiary structure that aids in the identification of other specific regulatory elements [34]. Our findings constitute concrete evidence to this phenomena in seven (of 82) different TFs. They are also in agreement with another study in which CA repeat–containing sequences in the human gamma-globin gene promoter required for efficient transcription were identified using in vitro site-directed mutagenesis [35]. Taken together, our findings and other observations suggest CA repeats play a role in the DNA binding of some TFs.

Detection of indirect TF–DNA binding using ChIP–chip.

IME1 is a TF that activates transcription of early meiotic genes. We identified a motif, CGGCCG, with p-value < 10−11 that is enriched in the sequences to which IME1 binds in H2O2 condition. Although this motif was not identified by other methods as reported in [25], we found evidence that suggests it is biologically meaningful. First, we note that this motif is a perfect palindrome, which is often characteristic of TF binding sites. Second, the same motif was identified as evolutionarily conserved in IME1-bound sequences as inferred from ChIP–chip data [29]. Third, IME1 interacts with Ume6, also a transcriptional regulator of early meiotic genes, which was previously shown to bind the same DNA motif, CGGCCG [36]. We conclude that the IME1-discovered motif is likely due to the following scenario: IME1 binds to Ume6, which binds to CGGCCG sequences on the DNA. The cross linking in the ChIP–chip protocol fixes these bindings, and the immunoprecipitation of IME1 actually precipitates the entire complex. We therefore get enriched CGGCCG sequences in IME1 experiments due to indirect binding to this DNA motif.

In another example, we identified the same two distinct motifs, M1 = TGTGGCSS and M2 = CACGTG, in rich media ChIP–chip experiments of three different TFs: Met4, Met31, and Met32. Furthermore, we rediscovered the same motifs in other experimental conditions of the same TFs. Met4, Met31, and Met32 are three factors involved in the sulfur amino acid pathway, and the fact that the same two motifs were independently predicted for each of the TFs is unlikely to occur by chance, suggesting the predictions are biologically meaningful. In a previous work it was shown that Met4 is tethered to the DNA sequence AAACTGTG via two alternative complexes, Met4-Met28-Met31 and Met4-Met28-Met32 (the binding is thought to occur via Met31/32) [37]. This sequence partially overlaps motif M1. Furthermore, the complex Met4-Met28-Cbf1 was shown to bind motif M2 [38]. Previous findings are summarized in Figure S5A. The above explains why we predict M1 for Met4 and M2 for Met31/32. However, it does not explain why we also predict M2 for Met4 and M1 for Met31/32. The most likely explanation for this is the existence of a direct interaction between the two complexes Met4-Met28-Cbf1 and Met4-Met28-Met31/32. If such an interaction exists, then the cross linking would fix the two complexes and cause the immunoprecipitation of Met4, Met31, and Met32 to precipitate the same set of sequences, thus causing the same motifs to appear in the experiments of all three TFs, which is exactly what DRIM identifies. This point is illustrated in Figure S5B. The idea of direct interaction between the two complexes is also in agreement with previous results [37].

Overall, the results shown in this subsection demonstrate that DRIM is able to identify previously ignored subtle signals in ChIP–chip data that stem from indirect bindings of factors to DNA. This type of information can be useful for inferring novel protein–protein interactions.

Condition-dependent motifs.

A comparison was made between the predicted motifs of the same TF in different experimental conditions (see Table S2). These seem to fall into two main categories: (i) motifs whose enrichment is condition-dependent, and (ii) motifs whose enrichment is condition-independent, suggesting the TF is bound to the DNA regardless of condition. In the latter, although the same motif was predicted in different conditions, the motif enrichment varied considerably. For instance, the GAL4 binding site CGGN11CCG, previously reported in [1] and other literature, was predicted in both YPD and galactose conditions. However, the enrichment varied considerably with p-values 10−7 and 10−11, respectively. This several-fold difference in enrichment is consistent with what is known about the role of GAL4 in galactose metabolism. It suggests that GAL4 has a preference to bind CGGN11CCG DNA regardless of condition. However, in the presence of galactose and absence of glucose, this preference becomes much more significant. Another example of a condition-invariant motif whose binding strength is subject to experimental condition is that of the Aro80 TF. This demonstrates that DRIM can be used not only to identify binding sites but also to distinguish between different modes of TF binding.

Motif Discovery in Human Methylated CpG Islands

To examine our method's ability to predict sequence motifs that stem from data other than TF binding, DRIM was applied to a dataset containing the human cancer cell line–methylated CpG islands (for dataset details, see Methods) to seek for motifs that are enriched in hypermethylated regions. The promoters were ranked according to methylation signal, with hypermethylated promoters at the top. Note that different replicates of the same cell line may yield different ranking of the promoters.

DRIM identified significantly enriched motifs in each of the four cancer cell lines. Table 1 shows all the motifs that were independently discovered in at least two different replicates of the same experiment or that are in agreement with previous work [2]. Overall, DRIM discovered 13 motifs: ten novel motifs and three that have been previously predicted in hypermethylated CpG island promoters in the same cancer cell lines [2]. Some of these motifs have also been independently identified in methylated CpG regions of other cell lines [39,40].

thumbnail

Table 1.

Enriched Motifs Associated with CpG Methylation in Four Human Cancer Cell Lines and Comparison to Motifs in Regions Bound by the Polycomb Complex

doi:10.1371/journal.pcbi.0030039.t001

Interestingly, nine of the novel ten motifs were independently identified in DNA regions to which the proteins of the Polycomb complex bind [4143]. The Polycomb complex is involved in gene repression through epigenetic silencing and chromatin remodeling, a process that involves histone methylation. The fact that these two distinct key epigenetic repression systems, namely histone methylation and CpG methylation, bind to regions that share a similar set of sequence motifs suggests they are linked. To further establish this link we applied DRIM to Polycomb complex bound promoters in human embryonic fibroblasts [44]. We found four motifs that are similar to the CpG methylation motifs (Table 1). Our findings are consistent with a recent paper that showed that the EZH2 Polycomb protein binds methyltransferases via the Polycomb complex [45].

Most of the motifs we found are similar across more than one type of cancer cell line, e.g., variants of the GCTGCT motif appear in Caco-2, PC3, and Polyp1 cancer cell lines. This suggests that the same DNA binding factors are involved in CpG methylation of different types of cancers. It is also important to note that some of the motifs we discovered are G–C rich. The enrichment of these motifs may be partially attributed to the G–C content bias that is found in CpG methylation data.

The DRIM motif identification process can be used not only to identify novel motifs but also to partition the data in a biologically meaningful manner. In [2] the authors used a fixed threshold on the methylation signal (p-value < 0.001 ) to partition the dataset. Consequently, they identified 135 hypermethylated promoters. A data-driven partition would be to use the threshold that yielded the maximal motif enrichment. For example, in the Caco2 cell line, we identified the same motif as in the previous work [2]. However, the motif maximal enrichment was found in the top 209 promoters (an increase of 54% in target set size).

Motif Discovery in Human ChIP–chip Data

Human TFBS tend to be longer and “fuzzier” than TFBS of lower eukaryotes, and it is important to evaluate our method's performance on such motifs. To this end, we applied DRIM to the ChIP–chip experiments of HNF1α, HNF4α, HNF6 in liver and pancreas islets [46], as well as to that of CREB [47]. For each of the TFs, we generated a list of sequences containing 1,000 bases upstream and 300 downstream of the transcription start site (TSS). We ranked the list according to the TF ChIP–chip signal and used it as input to DRIM. DRIM successfully detected the TFBS of these TFs that are reported in TRANSFAC with extremely significant p-values: HNF1α liver—GTTAMWNATT (p = 10−8), HNF4α Islets—SCGGAAR (p = 10−53), HNF6 Liver—ATCRAT (p = 10−57), and HNF6 Islets—ATCRAT (p = 10−61). In the CREB experiments we identified the palindromic motif TGACGTCA (p = 10−16), which is known to bind CREB [47].

Comparison with Other Methods

Three properties of the mHG enrichment score embodied in DRIM offer advantages over other motif discovery methods: the dynamic cutoff, the rigorous control over false positives, and the motif multiplicity model.

Dynamic versus rigid cutoffs.

Most methods use an arbitrary cutoff for set partition. For example, in previous work [25] the authors use a cutoff of p-value < 10−3 on the ChIP–chip signal in order to define the target set for motif searching. In contrast, the mHG score uses a data-driven flexible cutoff and chooses the set partition that maximizes the motif enrichment.

To more systematically investigate the advantages of using a flexible cutoff, we compared mHG with fixed set partition HG [7] by disabling the flexible cutoff feature in DRIM. The comparison was performed on ChIP–chip data of TFs whose motif binding sites are well-characterized as well as on the Aro80 binding site we identified. For each TF, we ranked the sequences according to the ChIP–chip binding signal, generated the motif occurrence vector, and computed its HG enrichment using fixed target sets containing the top 10, 100, and 1,000 sequences as well as all sequences with ChIP–chip signal <10−3. The results are summarized in Figure 5. We note that all of the scores are corrected for multiple-motif testing. The mHG score is also corrected for the multiple-cutoff testing. The mHG method yields superior results in all six cases.

thumbnail

Figure 5. Comparison between HG and mHG Enrichment

The mHG and HG methods were applied to ChIP–chip data of six TFs. The sequences were ranked according to the ChIP–chip binding signal, and the enrichment of the correct binding motif was recorded using mHG and HG with fixed target sets containing the top 10, 100, and 1,000 sequences as well as all sequences with ChIP–chip signal <10−3. All scores were corrected for multiple motif testing. The mHG score is also corrected for the multiple cutoff testing. The 10−3 and mHG cutoffs for each experiment are shown. It can be seen that the two cutoffs are significantly different and that for all the tested TFs mHG produces better results than HG in terms of enrichment of the true motif.

doi:10.1371/journal.pcbi.0030039.g005

We performed additional comparisons of the mHG versus the HG methods by applying both methods to simulations of motif occurrence vectors (see Text S7 and Figure S6). In these simulations mHG showed significantly better performance than HG.

To further investigate the issue of setting a cutoff, we compare, for a given TF and condition in the ChIP–chip dataset, the number of promoters for which the binding signal <10−3 (denoted #(10−3)) with the number of promoters at which mHG was attained (denoted n*). For 82 experiments, #(10−3) ≤ 4 and for 46 of these #(10−3) = 0. In these cases a 10−3 fixed cutoff reduces the size of the target set and limits the usability of any discovery algorithm. In Figure 6 we compare #(10−3) and n* for some of the cases at which a motif was found by mHG. Note that in a significant number of cases the mHG score identified a significantly enriched motif even when #(10−3) was very low. One extreme case is the TF SOK2 in YPD condition for which#(10−3) = 0, yet mHG found a significantly enriched motif.

thumbnail

Figure 6. Comparison of the Target Sets Sizes as Determined by the Fixed versus the mHG Flexible Cutoffs

Each dot represents a ChIP–chip experiment where the x and y coordinates are the number of promoters with p < 10 (standard cutoff) and the number of promoters as determined by the mHG cutoff, respectively. The dotted line is x = y. TF names are given in Table S4.

doi:10.1371/journal.pcbi.0030039.g006
Controlling false positives.

The second advantageous property of the mHG score is its ability to rigorously control false positives, due to calculation of an exact p-value. This attribute is best demonstrated by comparing the performance of DRIM versus other motif-finding tools on negative controls, that is, datasets whose original ranking was randomly permuted. It is clear that in these cases we should not find significantly enriched motifs. To this end we used the same benchmark on which DRIM was tested (see Results, Proof of principle). Using the same five random permutations of ChIP–chip data, we applied the algorithms AlignACE [12], MEME [8], and MDscan [33] on each of the random sets. Both AlignACE and MEME reported significant motifs with many A's, probably due to the existence of polyA tails in the intergenic regions. MDscan was used with a precompiled background from yeast intergenic regions, and therefore it did not report the polyA motifs, yet it did report motifs including repeats of TA, probably as a result of TATA boxes. In comparison, DRIM did not identify any significant motifs in any of the random sets.

Binary versus multidimensional enrichment.

The third advantageous property is the extension of the binary enrichment analysis to the multidimensional enrichment analysis (see Methods, Multidimensional mHG score). The latter forms the basis for dealing with motif multiplicity in a data-driven manner. To test this property, we compared DRIM, which uses the multi-mHG framework, with a restricted version of DRIM, which uses the standard binary enrichment framework. Out of 31 binding motifs identified by DRIM that were also identified in other literature, the restricted version was able to identify only 23. Furthermore, in some instances, both methods were able to identify the correct motif site; however, the motif significance using the multi-mHG framework was several fold more significant without incurring additional false predictions.

Discussion

In this paper we examine the problem of discovering “interesting” motif sequences in biological sequence data. While this problem has often been regarded as tantamount to discovering enriched motifs in a target set versus a background set, we point out an inherent limitation to this formulation of the problem. Specifically, in most cases, biological measurement data does not lend itself to a single, well-substantiated partition into target and background sets. It does, however, lend itself to ranking in a natural manner. Our approach exploits this natural ranking and attempts to solve challenges (c1)–(c4) (see Introduction, Open challenges in motif discovery).

To address challenge (c1), instead of choosing an arbitrary cutoff for set partition, we search for a cutoff that partitions the data in a way that maximizes the motif enrichment. We present evidence that shows that the flexible mHG cutoff outperforms the rigid cutoff. One example of this is shown in Figure 5, where the flexible cutoff yields better results for all the tested TFs. Another example of the advantage of a flexible cutoff is the two motifs detected in three TFs involved in the sulfur amino acid pathway (Met4, Met31, and Met32). Figure 7 shows the number of motif occurrences in each of the top 59 promoters that were ranked according to Met32 binding signal (data from [25]). The motifs are highly frequent in the top 18 promoters, after which a strong drop in motif frequency is observed. DRIM identifies this, and partitions the set accordingly. In comparison, relying on the standard cutoff of 10−3 results in a target set of the top 48 promoters, most of which do not contain this motif. The signal-to-noise ratio is thus diminished, which may explain why these motifs were previously overlooked.

thumbnail