Conceived and designed the experiments: EV AS OW AK. Performed the experiments: EV. Analyzed the data: EV. Wrote the paper: EV AS OW AK.
The authors have declared that no competing interests exist.
A major goal in post-genome biology is the complete mapping of the gene regulatory networks for every organism. Identification of regulatory elements is a prerequisite for realizing this ambitious goal. A common problem is finding regulatory patterns in promoters of a group of co-expressed genes, but contemporary methods are challenged by the size and diversity of regulatory regions in higher metazoans. Two key issues are the small amount of information contained in a pattern compared to the large promoter regions and the repetitive characteristics of genomic DNA, which both lead to “pattern drowning”. We present a new computational method for identifying transcription factor binding sites in promoters using a discriminatory approach with a large negative set encompassing a significant sample of the promoters from the relevant genome. The sequences are described by a probabilistic model and the most discriminatory motifs are identified by maximizing the probability of the sets given the motif model and prior probabilities of motif occurrences in both sets. Due to the large number of promoters in the negative set, an enhanced suffix array is used to improve speed and performance. Using our method, we demonstrate higher accuracy than the best of contemporary methods, high robustness when extending the length of the input sequences and a strong correlation between our objective function and the correct solution. Using a large background set of real promoters instead of a simplified model leads to higher discriminatory power and markedly reduces the need for repeat masking; a common pre-processing step for other pattern finders.
In the years following the sequencing of the human genome focus have shifted towards trying to understand how this blueprint results in the diversity of cells that we observe. Part of the answer lies in the regulation of transcription and how the proteins responsible for this recognize where they should attach to the DNA. This is a well studied problem, but most methods developed for this have a hard time dealing with the heterogeneity of the mammalian genomes. Here we present a method that greatly improves the efficiency of this search by contrasting the DNA with a large number of background DNA sequences. This enables us to handle repetitive segments of the genome that may be functional, but are usually considered intractable by most methods.
The rapid emergence of experimental techniques that can probe for functional elements at whole-genome scales
There are two main avenues used to attack this problem: i) enumerative algorithms based on word counting, such as
A typical instance of motif discovery starts with a set of upstream promoter regions of co-expressed genes suspected to be co-regulated and by extension more likely to be under control by the same regulatory machinery. This set is called the “positive set” and most methods proceed from here by locating motifs that are in some way statistically overrepresented in this set. The most successful applications of motif discovery have been in organisms whose regulatory information is densely aggregated around transcription start sites, such as
Most methods operate with some notion of a background model describing “generic DNA” against which the over-representation is measured. The model is often a multinomial or a Markov model. The choice of model is important for obtaining good results
Discriminatory motif searching is not a new idea; several methods have been developed that take advantage of a negative set
Specifically, we use conditional maximum likelihood to estimate the WMs and their thresholds such that the probability of the positive and negative sets is maximized (see
This conditional likelihood leads to a non-trivial optimization problem which is handled using simulated annealing (see
We evaluated our method by comparing its accuracy to a set of widely used motif discovery methods (MEME
In line with the recommendations of
The average site performance (lines) and the nucleotide correlation coefficient (bars) of the methods.
The relationship between our objective function and the correct solution was assessed by plotting the MoAn scores against the sensitivity obtained in all five runs on each of the 84 sets (not just the best from each run) (
All 5 runs on the 84 synthetic sets are used.
This finding is important, because it indicates that the raw score is an indication of quality independent of the motif analyzed. It also shows that choosing the best scoring run of several will often give the best result.
Aside from the problem with decreasing sensitivity as the length of the input sequences increase, repetitive sequences represent a severe problem for motif discovery, as these will often seem to be over-represented, and therefore it is common to mask these repeats. However, masking is always arbitrary, and some repeats are functional
The average site performance (lines) and the nucleotide correlation coefficient (bars) of MoAn with repeats planted in the two sets.
Evaluation of methods on real data is difficult and often a poor indication of general performance due to lack of insight into the correct solution
MoAn and four other methods were run on a collection of real data sets consisting of the binding sites of four human and mouse factors from the PAZAR database
The average site performance (lines) and the nucleotide correlation coefficient (bars) of the methods.
We speculate that the reason for this is that the background and foreground of the synthetic sets are essentially sampled from the same pool (RefSeq promoters), while we have made no effort to customize the background for the PAZAR sets. If the genomic environment of the factors differ from normal promoter sequences this could lead to a reduced performance. There are also fewer sets (7 versus 12) in this evaluation leading to a higher variability.
We report additional trials using ChIP-chip data in supplementary material (
An additional aspect of the motif finding problem is that TFs often work by forming complex interactions
The average site performance (lines) and nucleotide correlation coefficient (bars) of co-occurrence and serial runs on 5 different sets with co-occurring motifs.
In this work we have shown the value of using a large negative set instead of a pre-defined background model in motif discovery. Using raw sequences more accurately portrays the background than any general model and therefore higher discriminatory power is achieved. This method is also much less sensitive to “pattern drowning” in larger sequences, which is a bottleneck in computational analysis of mammalian regulatory regions. However, while our method takes a significant step towards routine motif discovery on large sequences, the problem cannot be considered fully solved. In particular, MoAn accuracy may be further improved by incorporating information on evolutionary constraints (phylogenetic footprinting)
In our opinion DEME is the best runner up of the methods. It often predicts the correct motif and has a high sensitivity, but often at the cost of a large number of false positives as it predicts also in those sequences not containing a site. MoAn seems to be better at balancing the sensitivity and specificity. On the other hand DEME is also given an artificial advantage by having the correct motif length as input and it is uncertain how advantageous this is.
Weeder performed surprisingly poorly given its stellar performance in a recent evaluation
A concern that might be raised is that optimizing a cutoff might lead to a conservative estimate of binding sites at the expense of weaker sites. However, assessing this is hard since experiments have their own thresholds in the post-analysis and any evaluation of MoAn's threshold will be dependant upon those. Investigations where we artificially forced the cutoff to remain low, lead to a reduction in performance (data not shown). We address this potential problem indirectly by providing a matrix that can be used to search sequences at a lower threshold.
Future improvements of MoAn will focus on the optimization algorithm, which currently is not robust enough to always produce reliable results. In our current implementation we avoid this problem by running the algorithm many times to see that the solution is stable.
Evaluation is done on both site and nucleotide levels. The statistics used are similar to those in the recent large scale evaluation
nTP Number of nts part of a site correctly predicted.
nTN Number of background nts correctly predicted.
nFP Number of background nts predicted to be part of a motif.
nFN Number of nts part of a site predicted as background.
sTP Number of real sites that share over 50% of its nts with a predicted site.
sFP Number of predicted sites that share less than 50% of its nts with a real site.
sFN Number of real sites that share less than 50% of its nts with a predicted site.
Note that we are more conservative with respect to the site prediction than
Derived from the basic statistics:
A sequence
The aim is to discriminate between two sets of sequences
For efficiency reasons, we do not calculate the score in its entirety. We assume that it is the strong sites that contribute the most to the equation and introduce a cutoff for each WM on the minimum score of a site. This enables an efficient search in the ESA. This is not without biological merit since WM scores and binding energies for known TFs are correlated, and at some point the binding energies of a TF and a poor binding sequence must be too small to matter
To find the WMs that best explain the difference in occurrence between the sets we use a discriminative objective function based on the probability of the labels
This is the logistic likelihood function for binary classification, see e.g.
We refer to this function as the (log likelihood) score,
Based on the sequence density
We observe that the prior probability of
A very high threshold will give no matches, and the probability will then be a constant given by the priors and the size of the two sets. Matches that score above the threshold in the negative set will lower the score and matches above the threshold in the positive set will increase the score, so the game is to obtain as many high-scoring matches in the positive set as possible without introducing too many matches in the negative set.
The prior is conservative in our runs in that we are strict about promoting hits in the positive set, but only moderately strict about disallowing negative hits. For a single matrix the prior on
In the evaluation we deliberately chose a probability of having a site (0.5) in a sequence very different from the model prior (0.99) to avoid giving our own method a big advantage. It shows that the method is not very sensitive to the choice of prior.
The objective function outlined above is optimized using simulated annealing
The temperature parameter is lowered for each iteration using as default an exponential cooling scheme (for details see
Candidate solutions are proposed by applying one of several steps outlined in the list below. In the case of multiple matrices, only one is changed at a time. We perform all steps on a integer “count” matrix which is then translated into a log-odds WM prior to searching the ESA, but notice that the “count” matrix does not represent actual letter frequencies in the selected sites. The steps are:
Alter the contents of the WM columns by moving counts from one random cell to another within a column. The number of counts moved is selected uniformly from 1 to the current count number for the cell.
Extend the WM in either direction. A uniformly sampled number of columns (1 to 5) is added and counts of these are decided by consulting the sequence locations of hits scoring above
Decrease the length of the WM by deleting columns. Similarly to adding columns a uniformly selected number between 1 and 5 columns are deleted.
Slide the WM across the sequences. Columns are deleted on one site and extended on the other according to the two steps above.
Alter the cutoff
Note that for the extend and decrease step there is a minimum and maximum number of columns for a motif. The default for these are 5 and 15 respectively.
The matrix is initialized with random counts and the cutoff is also selected uniformly according to the last step in the list above. Termination of the optimization is only based on the number of iterations which is by default set to a rather conservative value of 30 million iterations. Time requirements for a single run is variable depending on the set size, but was for our runs comparable to NestedMICA (single threaded) and considerably faster than Weeder's “large” run and DEME.
Source code as well as data sets is freely available at the author's web site:
Correlation of MoAn's objective function (Sc) and nucleotide correlation coefficient (nCC)
(0.01 MB EPS)
Evaluation with decoy motifs. Average site performance (lines) and the nucleotide correlation coefficient (bars) of MoAn with decoy motifs planted in the two sets.
(0.01 MB EPS)
Discriminatory power of matrices. ROC curve showing discriminatory power of matrices produced by MoAn and NestedMICA on the ESR1 data set. The line extends from the highest cutoff possible for that matrix (bottom right) to a cutoff of 0 (top left).
(0.03 MB EPS)
Performance on individual sets for MoAn. The average site performance (lines) and the nucleotide correlation coefficient (bars) on the sets.
(0.02 MB EPS)
Performance on individual sets for DEME. The average site performance (lines) and the nucleotide correlation coefficient (bars) on the sets.
(0.02 MB EPS)
Performance on individual sets for MEME. The average site performance (lines) and the nucleotide correlation coefficient (bars) on the sets.
(0.02 MB EPS)
Performance on individual sets for Weeder. The average site performance (lines) and the nucleotide correlation coefficient (bars) on the sets.
(0.02 MB EPS)
Performance on individual sets for NestedMICA. The average site performance (lines) and the nucleotide correlation coefficient (bars) on the sets.
(0.02 MB EPS)
Data set construction
(0.03 MB PDF)
Running parameters
(0.03 MB PDF)
Tompa assessment
(0.03 MB PDF)
Sequences spiked with decoy motifs
(0.02 MB PDF)
Annealing schedule
(0.03 MB PDF)
PAZAR data sets
(0.03 MB PDF)
ChIP-chip data sets
(0.04 MB PDF)
Length of upstream and downstream extensions
(0.01 MB PDF)
Motifs planted in single occurrence sets
(0.04 MB PDF)
Motifs planted in co-occurrence sets
(0.03 MB PDF)
Results on the mammalian subset of the Tompa assessment
(0.01 MB PDF)
Sizes of PAZAR data sets
(0.01 MB PDF)
Sizes of ENCODE data sets
(0.01 MB PDF)
Performance on ENCODE data sets
(0.07 MB PDF)
Thanks to Brian Parker for help with the manuscript.