Search
Advanced Search
Average Rating (0 User Ratings)
    • Currently 0/5 Stars.
    See all categories
      • Currently 0/5 Stars.
      • Currently 0/5 Stars.
      • Currently 0/5 Stars.
    Rate This Article

Open Access

Research Article

CSMET: Comparative Genomic Motif Detection via Multi-Resolution Phylogenetic Shadowing

Pradipta Ray#, Suyash Shringarpure#, Mladen Kolar, Eric P. Xing*

School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania, United States of America

Abstract

Functional turnover of transcription factor binding sites (TFBSs), such as whole-motif loss or gain, are common events during genome evolution. Conventional probabilistic phylogenetic shadowing methods model the evolution of genomes only at nucleotide level, and lack the ability to capture the evolutionary dynamics of functional turnover of aligned sequence entities. As a result, comparative genomic search of non-conserved motifs across evolutionarily related taxa remains a difficult challenge, especially in higher eukaryotes, where the cis-regulatory regions containing motifs can be long and divergent; existing methods rely heavily on specialized pattern-driven heuristic search or sampling algorithms, which can be difficult to generalize and hard to interpret based on phylogenetic principles. We propose a new method: Conditional Shadowing via Multi-resolution Evolutionary Trees, or CSMET, which uses a context-dependent probabilistic graphical model that allows aligned sites from different taxa in a multiple alignment to be modeled by either a background or an appropriate motif phylogeny conditioning on the functional specifications of each taxon. The functional specifications themselves are the output of a phylogeny which models the evolution not of individual nucleotides, but of the overall functionality (e.g., functional retention or loss) of the aligned sequence segments over lineages. Combining this method with a hidden Markov model that autocorrelates evolutionary rates on successive sites in the genome, CSMET offers a principled way to take into consideration lineage-specific evolution of TFBSs during motif detection, and a readily computable analytical form of the posterior distribution of motifs under TFBS turnover. On both simulated and real Drosophila cis-regulatory modules, CSMET outperforms other state-of-the-art comparative genomic motif finders.

Author Summary

Functional turnover of transcription factor binding sites (TFBSs), such as whole-motif loss or gain, are common events during genome evolution, and play a major role in shaping the genome and regulatory circuitry of contemporary species. Conventional methods for searching non-conserved motifs across evolutionarily related species have little or no probabilistic machinery to explicitly model this important evolutionary process; therefore, they offer little insight into the mechanism and dynamics of TFBS turnover and have limited power in finding motif patterns shaped by such processes. In this paper, we propose a new method: Conditional Shadowing via Multi-resolution Evolutionary Trees, or CSMET, which uses a mathematically elegant and computationally efficient way to model biological sequence evolution at both nucleotide level at each individual site, and functional level of a whole TFBS. CSMET offers the first principled way to take into consideration lineage-specific evolution of TFBSs and CRMs during motif detection, and offers a readily computable analytical form of the posterior distribution of motifs under TFBS turnover. Its performance improves upon current state-of-the-art programs. It represents an initial foray into the problem of statistical inference of functional evolution of TFBS, and offers a well-founded mathematical basis for the development of more realistic and informative models.

Introduction

Phylogenetic shadowing techniques based on probabilistic molecular evolution models have been widely used in various comparative genomic analyses to uncover sequence entities believed to be conserved across species [1][4]. It is nothworthy that in the literature, the term “Phylogenetic Shadowing” has sometimes been (unnecessarily) narrowed down to refer to methods tailored specifically to the case of analyzing extremely closely related species, after its successful application to functional annotation of the primate genomes [1]. Here we adopt a more general interpretation reflecting the long-standing evolutionary principles and inferential technique underlying such analysis, rather than the choice of the study subjects. It refers to the class of methods that treat evolutionarily related entities as outcomes of some stochastic processes structured as a phylogeny, whereby the relationships between the studied entities can be inferred and utilized to unravel their underlying characteristics of interest. Typically, extant phylogenetic shadowing methods employ either a nucleotide (nt) or an amino-acid (aa) substitution process to model the evolution of orthologous entities, such as genes or proteins of interest at every nt or aa site. There are two key assumptions underlying the basic form of these approaches. 1) The orthology of the sequence entities across taxa, as captured by a multiple sequence alignment, is complete in the sense that there is no functional turnover of the aligned entities (e.g., no loss or gain of a gene) in any of the taxa; so that all aligned sequences can be modeled as descendants of a common ancestor following a single evolutionary tree model unique to the function (e.g., either gene or background) of the sequence entities. 2) Every site in the same entity evolves independently. Although not realistic, such a complete and independent shadowing model can lead to efficient algorithms for scoring aligned sequences; and in practice it works well for modeling large and highly conserved functional entities such as gene coding regions in phylogenetically closely related taxa, and it has led to a number of successful comparative genomic gene finders [5][7].

Unlike genes, where functional turnover usually occurs only in distant species and the complete orthology assumption is largely satisfied when sequences are aligned across phylogenetically closely related species, short and degenerate sequence patterns such as transcription factor (TF) binding sites (i.e., motifs) exhibit frequent turnover even across closely related taxa, such as various fruit fly species [8] (Figure 1). As we will discuss shortly, the functional heterogeneity of aligned regions across different taxa due to motif turnover often renders the conventional phylogenetic shadowing models inappropriate for comparative genomic motif finding. Some recent methods combine scoring functions modified from classic molecular evolution models with more flexible heuristic partial alignment search, and exhibit better sensitivity to non-conserved motifs [9],[10], but they offer little insight into the evolutionary dynamics of motif turnover and can have substantial computational complexity. In this paper, we present a principled approach that addresses the “incomplete orthology” issue arising from either functional gain/loss such as motif turnover or imperfect sequence alignment. We propose a new algorithm for searching binding sites of given TFs in multiple genomes based on a novel multi-resolution evolutionary model named CSMET. CSMET stands for Conditional Shadowing via Multi-resolution Evolutionary Trees. It explicitly models motif turnover across species through a “low resolution” phylogeny defined by a functional substitution process. Conditioning on the motif turnover states, which specify the presence or absence of TFBS functionality in each taxon, at any given location, specific “high resolution” phylogenies defined by function-specific nucleotide substitution processes are applied to different subsets (corresponding to taxa with different turnover status) of the aligned sequences at the attendant location. The model thereby captures function-specific sequence evolution in every taxon rather than subjecting all taxa to the same phylogeny as in the conventional model (Figure 2).

thumbnail

Figure 1. A demonstration of motif turnover.

(A) Two examples of multiple alignments of Drosophila CRMs, showing functional turnover in known TFBSs. The first one (top) shows an instance of binding site loss in D. ananassae, the motif in question being Caudal, in the Hairy 6 CRM. The second one (bottom) shows more instances of TFBS loss/gain. This example depicts a turnover with only melanogaster, simulans, and sechellia retaining the binding site functionality. (B) Putative TFBSs in eve2 enhancer across 4 taxa: D. melanogaster, D. yakuba, D. erecta and D. pseudoobscura. (Extracted and modified from Figure 4 in [11].) Notice that orthologs of melanogaster motifs bcd-3 and hb can not be identified from some of the other taxa.

doi:10.1371/journal.pcbi.1000090.g001
thumbnail

Figure 2. Diagrams showing the underlying generative models underlying basic phylogenetic shadowing approaches and the CSMET approach.

(A) The basic mixture of full-phylogeny model underlying PhyloHMM and EMnEM, where functional homogeneity across aligned sequences is assumed, and all aligned taxa (i.e., rows) are either under a full motif phylogeny (when Zt = 1) or a full background phylogeny (when Zt = 0). (B) The conditional shadowing model underlying CSMET, with an explicit evolutionary model Tf for species-specific functional turnover, and partial motif or background phylogenies over subsets of taxa according to the turnover status. See the Results section for explanations of the notations.

doi:10.1371/journal.pcbi.1000090.g002

Comparative Genomic Motif Search Under Incomplete Orthology

We concern ourselves with uncovering motifs in eukaryotic cis-regulatory modules (CRM) from multiple evolutionarily related species, such as the members from the Drosophila clade. Due to high degeneracy of motif instances, and complex motif organization within the CRMs, pattern-matching-based motif search in higher eukaryotes remains a difficult problem, even when representations such as the position weight matrices (PWMs) of the motifs are given. Extant methods that operate on a single genome or simpler organisms such as yeast often yield a large number of false positives, especially when the sequence to be examined spans a long region (e.g., tens of thousands of bps) beyond the basal promoters, where possible CRMs could be located. As in gene finding, having orthologous sequences from multiple evolutionarily related taxa can potentially benefit motif detection because a reasonable alignment of these sequences could enhance the contrast of sequence conservation in motifs with respect to that of the non-motif regions, However, the alignment quality of non-coding regions is usually significantly worse than that of the coding regions, so that the aligned motif sequences are not reliably orthologous. This is often unavoidable even for the best possible local alignment software because of the short lengths and weak conservation of TFBSs. When applying a standard shadowing model on such alignments, motif instances aligned with non-orthologous sequences or gaps can be hard to identify due to low overall shadowing score of the aligned sequences (Figure 1A).

In addition to the incomplete orthology due to imperfect alignment, a more serious concern comes from a legitimate uncertainty over the actual functional orthology of regions that are alignment-wise orthologous.

A number of recent investigations have shown that TFBS loss and gain are fairly common events during genome evolution [8],[12]. For example, Patel et al [13] showed that aligned “motif sites” in orthologous CRMs in the Drosophila clade may have varying functionality in different taxa. Such cases usually occur in regions with reduced evolutionary constraints, such as regions where motifs are abundant, or near a duplication event. The sequence dissimilarities of CRMs across taxa include indel events in the spacers, as well as gains and losses of binding sites for TFs such as the bcd-3 and hb-1 motifs in the evenskipped stripe 2 (eve2) (Figure 1B). A recent statistical analysis of the Zeste binding sites in several Drosophila taxa also revealed existence of large-scale functional turnover [12]. Nevertheless, the fact that sequence similarity is absent does not necessarily mean that the overall functional effect of the CRM as a whole is vastly different. In fact, for the Drosophila clade, despite the substantial sequence dissimilarity in gap-gene CRMs such as eve2, the expression of these gap genes shows similar spatio-temporal stripe patterns across the taxa [8],[13].

Although a clear understanding of the evolutionary dynamics underlying such inter- and intra-taxa diversity is still lacking, it is hypothesized that regulatory sequences such as TFBSs and CRMs may undergo adaptive evolution via stabilizing selections acting synergistically on different loci within the sequence elements [8],[12], which causes site evolution to be non-iid and non-isotropic across all taxa. In such a scenario, it is crucial to be able to model the evolution of biological entities not only at the resolution of individual nucleotides, but also at more macroscopic levels, such as the functionality of whole sequence elements such as TFBSs over lineages. To our knowledge, so far there have been few attempts along this line, especially in the context of motif detection. The CSMET model presented in this paper intends to address this issue.

Related Work

Orthology-based motif detection methods developed so far are mainly based on nucleotide-level conservation. Some of the methods do not resort to a formal evolutionary model [14], but are guided by either empirical conservation measures [15][17], such as parsimonious substitution events or window-based nucleotide identity, or by empirical likelihood functions not explicitly modeling sequence evolution [4],[18],[19]. The advantage of these non-phylogeny based methods lies in the simplicity of their design, and their non-reliance on strong evolutionary assumptions. However, since they do not correspond to explicit evolutionary models, their utility is restricted to purely pattern search, and not for analytical tasks such as ancestral inference or evolutionary parameter estimation. Some of these methods employ specialized heuristic search algorithms that are difficult to scale up to multiple species, or generalize to aligned sequences with high divergence.

Phylogenetic methods such as EMnEM [20], MONKEY [21], and our in-house implementation of PhyloHMM (originally implemented in [1] for gene finding, but in our own version tailored for motif search) explicitly adopt a complete and independent shadowing model at the nucleotide level. These methods are all based on the assumption of homogeneity of functionality across orthologous nucleotides, which is not always true even among relatively closely related species (e.g., of divergence less than 50 mya in Drosophila).

Empirical estimation and simulation of turnover events is an emerging subject in the literature [12],[22], but to our knowledge, no explicit evolutionary model for functional turnover has been proposed and brought to bear in comparative genomic search of non-conserved motifs. Thus our CSMET model represents an initial foray in this direction. Closely related to our work, two recent algorithms, rMonkey [12]—an extension over the MONKEY program, and PhyloGibbs [9]—a Gibbs sampling based motif detection algorithm, can also explicitly account for differential functionality among orthologs, both using the technique of shuffling or reducing the input alignment to create well conserved local subalignments. But in both methods, no explicit functional turnover model has been used to infer the turnover events. Another recent program, PhyME [10], partially addresses the incomplete orthology issue via a heuristic that allows motifs only present in a pre-chosen reference taxon to be also detectable, but it is not clear how to generalize this ability to motifs present in arbitrary combination of other taxa, and so far no well-founded evolutionary hypothesis and model is provided to explain the heuristic. Non-homogeneous conservation due to selection across aligned sites has also been studied in DLESS [23] and PhastCons [24], but unlike in CSMET, no explicit substitution model for lineage-specific functional evolution was used in these algorithms, and the HMM-based model employed there makes it computationally much more expensive than CSMET to systematically explore all possible evolutionary hypotheses. A notable work in the context of protein classification proposed a phylogenomic model over protein functions, which employs a regression-like functional to model the evolution of protein functions represented as feature vectors along lineages in a complete phylogeny [25], but such ideas have not been explored so far for comparative genomic motif search.

Various nucleotide substitution models, including the Jukes-Cantor 69 (JC69) model [26], and the Felsenstein 81 (F81) model [27], have been employed in current phylogenetic shadowing or footprinting algorithms. PhyloGibbs and PhyME use an analogue of F81 proposed in [28], which is one of the simplest models to handle arbitrary stationary distributions, necessary to model various specific PWMs of motifs. Both PhyME and PhyloGibbs also offer an alternative to use a simplified star-phylogeny to replace the phylogenetic tree when dealing with a large number of taxa, which corresponds to an even simpler substitution process.

The CSMET Approach

Our CSMET model differs from these existing methods in several important ways. First, it uses a different evolutionary model based on a coupled-set of both functional and nucleotide substitution processes, rather than a single nucleotide substitution model to score every alignment block. Second, it uses a more sophisticated and popular nucleotide substitution process based on the Felsenstein84 (F84) model [29], which captures the transition/transversion bias. Third, it employs a hidden Markov model that explicitly models autocorrelation of evolutionary rates on successive sites in the genome. Fourth, it uses an efficient deterministic inference algorithm that is linear to the length of the input sequence and either exponential (under a full functional phylogeny) or linear (under a star-shaped functional phylogeny) to the number of the aligned taxa, rather than the Monte Carlo or heuristic search algorithms that require long convergence times.

Essentially, CSMET is a context-dependent probabilistic graphical model that allows a single column in a multiple alignment to be modeled by multiple evolutionary trees conditioned on the functional specifications of each row (i.e., the functional identity of a substring in the corresponding taxon) (Figure 2). When conjoined with a hidden Markov model that auto-correlates the choices of different evolutionary rates on the phylogenetic trees at different sites, we have a stochastic generative model of phylogenetically related CRM sequences that allows both binding site turnover in arbitrary subsets of taxa, and coupling of evolutionary forces at different sites based on the motif organizations within CRMs. Overall, CSMET offers an elegant and efficient way to take into consideration complex evolutionary mechanisms of regulatory sequences during motif detection. When such a model is properly trained on annotated sequences, it can be used for comparative genomic motif search in all aligned taxa based on a posterior probabilistic inference algorithm. This model can be also used for de novo motif finding as programs such as PhyloGibbs and PhyME, with a straightforward extension of the inference procedure that couples the training and prediction routines in an expectation-maximization (EM) iteration on unannotated sequence alignments. In this paper, we focus on supervised motif search in higher eukaryotic genomes.

We compare CSMET with representative competing algorithms, including EMnEm, PhyloHMM, PhyloGibbs, and a mono-genomic baseline Stubb (which uses an HMM on single species) on both simulated data, and a pre-aligned Drosophila dataset containing 14 developmental CRMs for 11 aligned Drosophila species. Annotations for motif occurrences in D. melanogaster of 5 gap-gene TFs - Bicoid, Caudal, Hunchback, Kruppel and Knirps - were obtained from the literature. We show that CSMET outperforms the other methods on both synthetic and real data, and identifies a number of previously unknown occurrences of motifs within and near the study CRMs. The CSMET program, the data used in this analysis, and the predicted TFBS in Drosophila sequences, are available for download at http://www.sailing.cs.cmu.edu/csmet/.

Results

The CSMET Model

Model for phylogenetically related motif sequences.

To motivate and explain the statistical foundation and biological rationale underlying the CSMET model, we begin with a brief description of a conventional model for phylogenetically related sequences based on the classical molecular substitution process, where functional turnover of motifs is not explicitly modeled. This model will be used as a component in our proposed model.

Consider a multiple alignment of M instances of a motif of length L. Let A denote an M × L matrix containing M rows a1,…,aM, each representing an instance of this motif, i.e., ai ≡ [ai,1,…,ai,L], where . Due to the stochastic nature of the sequence composition of TFBSs, a popular representation of a motif pattern is the position weight matrix (PWM), θ≡(θ1,…,θL), of which each column vector θl defines a multinomial probability distribution of the nucleotides observed at the lth position of instances of this motif. That is, , where is an indicator function that equals to 1 when x = y and 0 otherwise. Under a PWM, all sites in the motif are assumed to be mutually independent, thus the probability of a length-L instance is simply a product of the probabilities of nucleotides at every site: . When the motif instances in A are from different genomic locations of a single species (i.e., they are phylogenetically unrelated), the likelihood of the aligned motifs A is simply a product of the likelihoods of every instance ai, , which means all the rows in A are independent of each other (although in reality, they might not evolve independently.)

If A contains M phylogenetically related motif instances each from a different species, then a straightforward way to model the likelihood of A is to assume that the instances therein from different taxa are shadowed by a phylogenetic tree that defines a nucleotide-level substitution process from an ancestral sequence [29],[30] (Figure 2A). Our proposed method uses this model as a building block.

Formally, a phylogenetic shadowing model Tm for a motif is a tree-likelihood model specified by a four-tuple ,τ,β,λ}, where θ≡(θ1,…θL) represents the equilibrium nucleotide distributions at the root of the evolutionary tree of every site within the motif; τ≡(τ1,…,τL) denotes the (usually identical) topologies of the evolutionary trees of every site; β≡(β1,…,βL) denotes the sets of branch lengths of the evolutionary trees; and λ represents where necessary some additional evolutionary parameters of the motif depending on the specific nucleotide substitution models. Under a phylogenetic shadowing model, the probability distribution of nucleotides in any taxon that corresponds to a leaf conditioning on its predecessor in the tree can be derived based on a continuous-time Markov model of nucleotide substitution along the tree branches [30]. We employ the F84 substitution model parameterized by a given equilibrium distribution, a transition/tranversion ratio ρ, and a total substitution rate μ that can be estimated from training data [29]. Detailed derivation and explicit expressions are provided in Materials and Methods.

Typically, we can use the PWM of the motif as the equilibrium distribution of the motif phylogeny. For simplicity, one can also assume that all sites within the motif share the same topology τ and the same branch lengths β. This means that the evolutionary processes underlying each site within the motif are homogeneous. Similarly, we can define Tb ≡ {θb, τb, βb, λb} for the background. Assuming that sites within the motif evolve independently, the likelihood of M aligned L-mers can be expressed as:
(1)
where Al denotes the lth column in A, and is the marginal likelihood of the leaves under an motif-site-specific evolutionary tree for nucleotide substitution, which can be computed using Felsenstein's pruning algorithm [30], as detailed in Materials and Methods.

To model a multiple alignment of regulatory regions that is N base-pairs long and contains motifs at unknown positions, we can assume that every L-mer block in the alignment can correspond to either a motif sequence, or the background, specified by a hidden functional state Zt, where t denotes the position of the left-most column of the block in the alignment. (For simplicity, we consider only one motif type here, but the formulation readily generalizes to multiple motif types.) The state sequence ZZ1:N can be thought of as a functional annotation sequence of an ancestral regulatory region of length N. In the EMnEM model [20], the Zt's are assumed to be independently sampled from a Binomial distribution of motif and background states, similar to the classic mixture models of motif underlying MEME (Figure 2A). In a PhyloHMM originally proposed in [3] for comparative gene finding, which can be easily extended for motif search, Z1:N can follow a hidden Markov model that captures the transition probabilities between background and motifs.

Model for motif turnover.

A caveat of the phylogenetic shadowing model described above is that, at every location t, the functionality indicator Zt must apply to all the taxa (i.e., rows) in the alignment (as illustrated in Figure 2A), meaning that the aligned substrings from all taxa at this position are derived from the same evolutionary tree (either the motif or the background tree, depending on the value of Zt; when Zt is hidden, this results in a mixture of two complete trees). This is a strong orthologous assumption which insists that every row in the alignment block must have evolved from the same most recent common ancestor (MRCA) according to the same molecular evolution model. This assumption might not be valid for every region in the alignment due to abrupt functional turnover such as whole motif insertion/deletion, or due to imperfect alignment that fails to identify the true sequence orthology.

We assume that every sequence segment in an alignment block, generically referred as At where t denotes the left-most position of the alignment, has its own functionality indicator . Generalizing the molecular evolution model for base substitution, we posit that the functional annotation vector of a block of aligned segments are themselves governed by a coarser-grained evolutionary tree that models the evolution of the functionalities of the attendant segments in different taxa (Figure 2B). We refer to this evolutionary tree as a (functional) annotation tree (or, interchangeably, a functional phylogeny), denoted by Tf ≡ {α, τf, bf, λf}. In such a tree model, each leaf represents a random variable whose value reveals the functional status (i.e., being a motif, background, or more detailed function information such as motif types, etc.) of the segment from taxon i, and the root is characterized by a hypothetical ancestral functionality indicator and an equilibrium distribution α. Along the branches of this tree, the functional states evolve according to a functionality substitution model, in much the same way the nucleotides do under a molecular substitution model, except that now the model parameters Tf are fitted differently (we will return to this point in the Materials and Methods section) and the evolutionary dynamics can also have richer structures. For example, in the model proposed by [25] for protein function evolution, the evolutionary dynamics were captured by a logistic regression rather than a constant-rate continuous-time Markov process used in standard molecular substitution models. For simplicity,here we adopt a simple JC69 model for functionality substitution, which is denoted as PF(Zt|Tf) (see Equation 4 in Materials and Methods), and defer the exploration of richer models to future research. In summary, the functional phylogeny Tf models the quantum changes of functional elements (rather than the fine-grained changes at the nucleotide level) during evolution in terms of whether an entire functional element is preserved, lost, or emerged, during the course of speciation.

Conditional shadowing under motif turnover.

To capture the effect of motif turnover, we assume that conditioning on the functional states of all rows (i.e., species), which are represented as a random column vector distributed according to the functional phylogeny specified by Tf, the sequences in alignment block t admit either a marginal motif phylogeny or a marginal background phylogeny. As shown in Figure 2B, typically, for a given block, only a subset of the rows correspond to conserved instances of a motif (e.g., rows 1, 2, and 3), and therefore their joint probability is defined by a marginal phylogeny of the full motif phylogeny (i.e., the subtree highlighted by solid red lines in Figure 2B). The remaining part of the motif phylogeny (represented by the subtree in dotted red lines in Figure 2B), which corresponds to taxa where the corresponding motifs had turned-over to background sequences, needs to be marginalized out. We can efficiently compute the likelihood of the preserved motif instances under the marginal motif phylogeny , expressed as using the standard pruning algorithm. Similarly, the subset of rows corresponding to the background or merely gaps admit a marginal background phylogeny (e.g., the blue tree with leaves only correspond to rows 4 and 5 in Figure 2B). Putting these two parts together, now for every position t in the input alignment, we have the following joint probability (i.e., the complete likelihood) of the observed alignment block At, the vector of instantiated extant functional states zt, and an instantiated ancestral functional state under a conditional shadowing model with multiple evolutionary trees (aka, CSMET):
(2)
In practice, the leaf functional states zt of an alignment block starting at position t, and the ancestral functional state are not observed. Thus the likelihood score of At follows a complex mixture of marginal phylogenies defined by all possible joint configurations of functional states and the ancestral state zr, rather than a simple motif/background mixture as in extant models. The typical tasks in motif detection involves either computing the marginal conditional likelihood for all possible states of , which will be used as the emission probability in an HMM of the ancestral functional states over the entire alignment (to be detailed in the next section); or the marginal posterior P(zt|A1:T), which will be used to extract the maximum a posteriori (MAP) motif annotation of the alignment. Both tasks involve a marginalization step that sums over all joint configurations of the internal tree nodes, zr's, and zt's. This leads to an inference problem in a state space defined by the product of multiple trees and therefore can be computationally intensive. Since in practice it is unusual to encounter more than 20 or so taxa in the comparative genomic setting, inference is still feasible. In this case, one can apply a coupled-pruning algorithm described in the Materials and Methods or a standard junction tree algorithm [31] for exact inference.

For an alignment of a large number of species and/or for a problem which involves searching for a large number of motifs simultaneously, marginalization of the product space of trees can be prohibitive. In these circumstances, we can apply an approximate inference method such as the generalized mean field algorithm [32], which decomposes the coupled trees in CSMET into disjoint trees and applies iterative message-passing across these trees to obtain an approximate posterior of zt or the conditional likelihood of At. Alternatively we can replace some or all of the full phylogenetic trees for motif, background and functional evolution by star-topology phylogenies as in PhyloGibbs [9]. For simplicity of the exposition, we omit details of these generalizations.

Tree- and rate-transition along alignment.

Different sites in the genome are subject to different evolutionary constraints and therefore follow phylogenetic trees with different equilibriums, topologies and rates. The conditional phylogenetic shadowing model described above couples multiple site-specific trees of all sites within a moving window of alignment block via a functional phylogeny; but it does not explicitly model transitions between possibly different evolutionary processes as the window scans over different functional entities along the aligned sequences, for example, transitions between motifs and different background regions, and among different motifs.

We introduce a hidden Markov model to model the transitions between functional annotations along the alignment. In principle, this HMM can employ highly structured transition models such as the global HMMs used in LOGOS [33] or CISTER [34], which intend to capture sophisticated “motif grammars” underlying higher eukaryotic CRMs. In this paper, we adopt a simplistic 3-state HMM that models the length of the spacer between motifs as a geometric distribution, and allows the motifs to be on either strand of the DNA. We define the HMM over the sequence of ancestral functional states , modeling the spatial transitions of functionalities along a hypothetical ancestral regulatory sequence underlying the aligned sequences from the study species. To model TFBS on either DNA strand with opposite orientations, two functional states are needed for each type of motifs, which determine the appropriate orientation for the PWM employed by the motif tree Tm for defining the likelihood of a selected DNA substring; but these two functional states correspond to a degenerated motif state (i.e., ) at the root of the functional tree Tf in CSMET, and follow the same turnover process. Details of such an HMM is given in Materials and Methods.

Unlike the standard HMM for mono-genomic motif detection where the emission probability uses a simple conditional multinomial distribution of a single nucleotide, or a PhyloHMM for comparative-genomic motif detection ignoring motif turnover where the emission probability is defined by a conditional likelihood of a column of aligned nucleotides under a single phylogeny, to accommodate functional turnover of segments in certain species in the alignment, we define the emission model to be the CSMET conditional likelihood of an alignment block, , and thereby enable conditional shadowing over the taxa at each site. A technical issue arising from this construction is that unlike the PhyloHMM, which is still a standard 1st-order HMM, in our case we have a higher-order HMM due to the contex-dependent coupling of all the sites within a motif by the functional phylogeny Tf, which models the whole sequence segment within a window of length L as a unit. In the next section, we outline statistical inference strategies that address this technical issue.

Strategy

Posterior inference.

The incorporation of the functional phylogeny Tf to explicitly model functional turnover of entire segments (rather than individual sites) of DNA sequences in different taxa in a multiple alignment introduces not only higher-order Markov dependencies among sites, but also context-dependent dependencies among taxa. Thus CSMET is essentially a probabilistic model with context-specific independencies, which is well-known to be intractable in general [35]. Figure 3A and 3B show an example of the context-specific relationships among variables due to two different possible value-configurations of the hidden variables corresponding to ancestral and taxa-specific functional annotations (of a small chunk of the alignment). Computing the likelihood of the entire alignment requires a summation of all joint configurations of all of these hidden variables, for which no efficient exact algorithm resembling the dynamic programming algorithms applied to HMMs is available.

thumbnail

Figure 3. Context-specific relationships among variables due to two possible value configurations of hidden variables shown in (A) and (B).

Note that when the ancestral state is a motif, the segment corresponding to the TBFS evolves as a unit (as shown by the arrow from an extant functional state pointing to a multi-column segment), either retaining its functionality as a motif, or turning-over to a background segment, as illustrated in (A). When the ancestral state is a background, then every position can evolve independently as long as it is still in the background (as shown by the arrow from a functional state pointing to a single column). But when a motif emerges out of the background, as shown in (B), the segments corresponding to the TFBS start to evolve as a unit, causing even the aligned nucleotide positions to evolve under different positional constraints. (C) Outlines the idea of a block approximation of the CSMET emission probability.

doi:10.1371/journal.pcbi.1000090.g003

While it is possible to implement a Monte Carlo algorithm that performs sampling over the functional annotation space of conditioning on the observed multiple alignment, we propose an approximate algorithm for posterior inference. As illustrated in Figure 3C, we can treat an N-column alignment as a sequence of (NL+1) consecutive L-column aligned blocks. We assume each such block At is either generated from a CSMET emission model conditioning on the ancestral function of this segment being a background, i.e., , or it can be generated from a CSMET conditioning on the ancestral function being a motif, say, of type k, expressed as . We can pre-compute the emission probabilities for all the aligned blocks, plug them back into an equivalent HMM of 's on blocks rather than on columns, and then compute the posterior probabilities or Viterbi-sequence of the labels of each block using the standard dynamic programming algorithms (e.g., forward-backward) for HMMs (see Materials and Methods section for details). The approximation introduced here lies in the approximate computing of the emission probabilities for the blocks, specifically at the boundary between motifs and background. For these blocks the likelihood of the aligned sequences should be defined by two different emissions, one on the background sub-block and the other on the motif sub-block, whereas our approximation employs only a single emission—either an entirely background-derived CSMET or an entirely motif-derived CSMET . But since our approximation results in a poorer fit only for the boundary regions, we expect that the overall posterior indication of the motifs, which is primarily driven by the emission probabilities of the motif blocks, will only suffer moderate weakening of contrast at the boundaries. We refer to this approximation method as block-approximation (BA). Another more subtle approximation due to BA is the ignoring of different turnover behaviors within a block At conditioning on the ancestral function of this segment( being a motif or a background), as exemplified in Figure 3B. Unlike a motif block derived from an ancestral motif, a segment of ancestral background sites do not evolve as a whole block, thus a block At entirely originated from a ancestral background can contain rows (descendents) that are either entirely non-motif, or partially non-motif and partially motif (i.e., starting from an arbitrary position t′ in window t:t+L, the segment t′: t′+L, part of which extends out of At, in an arbitrary taxon can evolve into a motif), whereas a block At entirely originated from a motif can only contain either fully preserved motif rows or turned-over non-motif rows. A detailed discussion of this subtlety is beyond the scope of this paper, and BA simply treats each entire row in At as a homogeneous functional evolutionary unit. The computational time for BA is linear in the length of the input, with a multiplicative factor determined by the length of the motif and the number of species concerned in the alignment. In case of multiple motifs, the emission probabilities of the blocks should be computed under the unique CSMET of each motif. Since motifs can have different lengths, bookkeeping of all the emissions can be slightly more complicated due to the need to handle blocks of different lengths. But the computational cost is only increased by the order of the number of the motifs in question. For simplicity, we defer details of this generalization to a later update of the CSMET.

With the BA strategy, we arrive at an approximation to the posterior distribution of motif annotation at every position given the entire alignment, , and the posterior of the sequence of ancestral functions, . For an alignment block of which only a few taxa correspond to motifs and others are merely background, under the CSMET model, the of this block can be either motif or background. In the first case, it means that absence of motifs in some taxa is interpreted as the result of loss of ancestral motifs, whereas in the second case, the presence of motifs in some taxa is interpreted as the result of emergence of nascent motifs out of the background. As far as we are aware of, CSMET is the only motif-finding algorithm that rigorously offers a closed-form deterministic solution to the posterior probability distribution of motif annotations both in the alignment and in the ancestral sequence over the entire space of binding site configurations. PhyloGibbs [9] offers a sample-based solution to the posterior of given A1:N, but as mentioned earlier, it is not based on an explicit model of binding site turnover, and thus does not have a closed-form expression that can motivate efficient deterministic approximation.

Maximum likelihood training.

The CSMET can be trained on annotated CRM alignments. We need to learn the nucleotide phylogenetic trees for motifs and backgrounds, and the phylogenetic tree that describes the evolution of functional annotation. We use the F84 model for nucleotide substitution on the motif and background trees; for evolution of functional annotation, we use the simpler JC69 model. As detailed in Materials and Methods and Text S1, for a given tree topology, for the JC69 model all we need to estimate is the branch length on the tree, which relates to total substitution probability. For the F84 model, besides the tree topology, we need to estimate the stationary distribution, which we set to be the PWM for motif phylogenies or the background nucleotide frequencies for background phylogeny; and also two additional evolutionary parameters: the overall substitution rate per site μ and the transition/transversion ratio ρ.

Given a multiple alignment, the ground truth of functional annotation, the PWMs for motifs, and nucleotide frequency for the background, we use the following strategy for estimating the trees and the evolutionary parameters. Detailed derivation and explicit expressions are provided in Materials and Methods.

  • Find a tree topology τ and the branch lengths β by running fastDNAml [36] over the entire alignment.

  • Find a scaling factor rf over branch lengths βf of the functional tree Tf, by maximizing the likelihood of aligned functional annotations under Tf via a line-search in parameter space.

  • Find a scaling factor rm over branch lengths βm of the motif tree Tm, and the Felsenstein rate μm, by maximizing the likelihood of aligned motif sequence under Tm with the F84 model.

  • Find a tree topology τb and branch lengths b0 for background tree Tb by running fastDNAml directly over only the background sequences. The Felsenstein rate μb is then estimated by maximizing the likelihood under Tb with a simple line-search.

To compute the Felsenstein substitution rate μ, we use a fixed transition-transversion ratio of 2. If the stationary nucleotide distribution defined by the motif PWM is incompatible with this value of the transition-transversion ratio, we set it to the smallest value that is compatible with the stationary distribution as in [5].

Performance on Synthetic Data

At present, biologically validated orthologous motifs and CRMs across multiple taxa are extremely rare in the literature. In most cases, motifs and CRMs are only known in some well-studied reference taxa such as the Drosophila melanogaster; and their orthologs in other species are deduced from multiple alignments of the corresponding regulatory sequences from these species according to the positions and PWMs of the “reference motifs” in the reference taxon. This is a process that demands substantial manual curation and biological expertise; rarely are the outcomes from such analysis validated in vivo (but see [8] for a few such validations in some selected Drosophila species where the transgenic platforms have been successfully developed). At best, these real annotations would give us a limited number of true positives across taxa, but they are not suitable for a systematic performance evaluation based on precision and recall over true motif instances. Thus we first compare CSMET with a carefully chosen collection of competing methods on simulated CRM sequences, where the motif profiles across all taxa are completely known.

We choose to compare CSMET with 3 representative algorithms for comparative genomic motif search, PhyloGibbs, EMnEM, PhyloHMM; and the program Stubb, which is specialized for motif search in eukaryotic CRMs, and in our paper, set to operate in mono-genomic mode. The rationale for choosing these 4 benchmarks is detailed in the Material and Methods.

Multi-specific CRM simulator.

We developed a simulator of multi-specific CRMs with flexible TFBS turnover dynamics across taxa and realistic TFBS arrangement within CRM. The overall scheme is illustrated in Figure 4. Specifically, the input of the simulator includes: 1) topologies of the phylogenetic trees for nucleotide (e.g., in motif sites and background) and functionality substitutions; 2) prior distributions of the stationary distribution of states (i.e., nucleotide or functionalities) at the roots of the trees; 3) prior distributions of the branch lengths of the trees and the substitution rates, and other evolutionary parameters where necessary (e.g., the Felsenstein rate μ and ρ in F84 model); 4) a global HMM encoding the motif grammar in the CRMs. As detailed in the Material and Methods, during simulation, all building blocks of a CRM, such as the motif instances, background sequences, functionality states (that determines motif turnover) in different taxa, and positions of the motifs in the CRM are sampled separately as illustrated in Figure 4, and put together to synthesize an artificial CRM. This simulator can be used to simulate realistic multi-specific CRMs resulting from various nontrivial evolutionary dynamics. It is useful in its own right for consistence/robustness analysis of motif evolution models and performance evaluation of comparative genomic motif-finding programs.

thumbnail

Figure 4. An illustration of the generative scheme of a Multi-specific CRM simulator.

doi:10.1371/journal.pcbi.1000090.g004

Below, we report results of four experiments based on simulated datasets. Each experiment was based upon varying one parameter of the model, keeping all the others fixed, in order to analyze robustness of CSMET and various other methods under different conditions. Every simulated CRM alignment contained 10 taxa, and for each experiment we simulated 50 datasets. The simulated data is available at the CSMET website to allow external comparisons. Performance of all the tested programs were based on the precision, recall and their F1 score (i.e., the harmonic mean of precision and recall) [37].

Performance under varying degrees of motif turnover.

To examine the effect of motif turnover (i.e., functional conservation) in aligned regions across taxa on the motif-detection performance, we simulated CRM alignments with differing magnitudes of the evolutionary rate along the functional phylogeny. Since known motifs in the Drosophila species we are working with usually have around 75% conservation, we chose our evolutionary rates so as to achieve conservation percentages between 64%–75% (or equivalently, turnover percentages between 25%–36%) at the species-specific motif-instance level. (See Text S1 for how this is achieved.)

We find that even with increasing rates of functional turnover, the performance of CSMET and Phylogibbs remain largely stable, with CSMET consistently dominating PhyloGibbs in F1 score with a modest margin (Figure 5). The margin is statistically significant with p = 2.48×10−7 under a paired t-test. EMnEM has a high recall score, but overall its F1-scores are well below CSMET and PhyloGibbs, also it appears to be affected more by the increased turnover rates. PhyloHMM shows an interesting trend, it performs better than its non-phylogenetic cousin Stubb on data with low turnover rates, but its performance worsens when compared to Stubb on data with increasing turnover rate. This shows that a naive application of phylogenetic shadowing in multi-species alignment with high functional divergence can actually result in degraded performance compared even to just single species analysis.

thumbnail

Figure 5. Performance under varying degrees of functional conservation.

doi:10.1371/journal.pcbi.1000090.g005
Performance under varying degrees of motif/background contrast.

The difference in conservation between the motif and background sequences will have an impact on the performance of the model. However, this experiment can be performed in two different ways: changing the degree of similarity between motif and background stationary distributions; and changing the evolutionary rates of one or the other. We choose the second method and conduct the simulation as follows: we attribute the motif phylogeny with a low entropy stationary distribution resembling a PWM, and with a fixed evolutionary rate; and we let the background to have a stationary distribution similar to but with higher entropy than that of the motif, and have a variable evolutionary rate. The evolutionary rate in the background tree is changed gradually from low values to high values, by varying the scaling factor applied to the background tree from 1 to 8. This is to check how well the CSMET model may detect motifs emerging out of the background with differing degrees of sequence-level conservation with respect to the background caused by their relative evolutionary rates. The corresponding performances are shown in Figure 6.

thumbnail

Figure 6. Performance under varying degrees of motif/background contrast.

doi:10.1371/journal.pcbi.1000090.g006

We found that even under low variation between the motif and background, i.e., both following an evolutionary tree with similar stationary distribution, and the same branch lengths and scaling parameters, CSMET outperforms all the other methods. CSMET steadily improves in performance upto the scaling factor of 4, after which its performance roughly plateaus. PhyloGibbs behaviors similarly, but overall with lower F1 scores that is statistically significant (p = 1.41×10−14). EMnEM, on the other hand outperforms all other methods for scaling factors of 6 or more; meaning that when motifs are extremely highly conserved compared to the background, the advantage of modeling their turnover as in CSMET and PhyloGibbs over using a basic phylogenetic model diminishes, which is well expected. Since in real CRMs, the evolutionary rates of of the non-functional r