Search
Advanced Search
Average Rating (0 User Ratings)
    • Currently 0/5 Stars.
    See all categories
      • Currently 0/5 Stars.
      • Currently 0/5 Stars.
      • Currently 0/5 Stars.
    Rate This Article

Open Access

Research Article

Probabilistic Phylogenetic Inference with Insertions and Deletions

Elena Rivas*, Sean R. Eddy

Janelia Farm Research Campus, Howard Hughes Medical Institute, Ashburn, Virginia, United States of America

Abstract

A fundamental task in sequence analysis is to calculate the probability of a multiple alignment given a phylogenetic tree relating the sequences and an evolutionary model describing how sequences change over time. However, the most widely used phylogenetic models only account for residue substitution events. We describe a probabilistic model of a multiple sequence alignment that accounts for insertion and deletion events in addition to substitutions, given a phylogenetic tree, using a rate matrix augmented by the gap character. Starting from a continuous Markov process, we construct a non-reversible generative (birth–death) evolutionary model for insertions and deletions. The model assumes that insertion and deletion events occur one residue at a time. We apply this model to phylogenetic tree inference by extending the program DNAML in PHYLIP. Using standard benchmarking methods on simulated data and a new “concordance test” benchmark on real ribosomal RNA alignments, we show that the extended program DNAMLε improves accuracy relative to the usual approach of ignoring gaps, while retaining the computational efficiency of the Felsenstein peeling algorithm.

Author Summary

We describe a computationally efficient method to use insertion and deletion events, in addition to substitutions, in phylogenetic inference. To date, many evolutionary models in probabilistic phylogenetic inference methods have only accounted for substitution events, not for insertions and deletions. As a result, not only do tree inference methods use less sequence information than they could, but also it has remained difficult to integrate phylogenetic modeling into sequence alignment methods (such as profiles and profile-hidden Markov models) that inherently require a model of insertion and deletion events. Therefore an important goal in the field has been to develop tractable evolutionary models of insertion/deletion events over time of sufficient accuracy to increase the resolution of phylogenetic inference methods and to increase the power of profile-based sequence homology searches. Our model offers a partial answer to this problem. We show that our model generally improves inference power in both simulated and real data and that it is easily implemented in the framework of standard inference packages with little effect on computational efficiency (we extended DNAML, in Felsenstein's popular PHYLIP package).

Introduction

A fundamental task in sequence analysis is calculating the probability of a multiple alignment given a phylogenetic tree relating the sequences and an evolutionary model describing how sequences change over time. This quantity is already at the heart of phylogenetic tree inference by either maximum likelihood [1] or Bayesian approaches [2][5]. It is also desirable to integrate evolutionary modeling into the probabilistic models widely used for other sequence analysis problems such as HMMs and SCFGs for homology search and genefinding [6]. However, whereas HMMs, SCFGs, and other sequence analysis models account for insertion and deletion events in their probabilistic framework, many widely used phylogenetic models only account for residue substitution events.

The general approach of modeling residue substitution as a continuous-time Markov process was introduced by the Jukes-Cantor model of nucleotide substitution in DNA [7] and the Dayhoff pam model of amino acid substitution in proteins [8]. It has since been extensively developed both for nucleotides [9][14] and amino acids [15][18], and extended to models of more than a single residue, such as codon to codon substitutions [19],[20] and RNA basepair to basepair substitutions [21][23].

Given a substitution model and a tree, one can efficiently calculate the probability of an ungapped multiple alignment using Felsenstein's peeling algorithm [1]. The Felsenstein algorithm scales linearly with the length of the alignment and the number of sequences. Because of its economy, it is the basis of maximum likelihood methods in many practical phylogenetic inference tools, including PHYLIP [24], PAUP* [25], and others [26][33]. The Felsenstein algorithm is readily integrated with other probabilistic models for ungapped alignment analysis, including HMMs [14], [34][38] and SCFGs [21]. However, when this approach is applied to gapped multiple sequence alignments, gap characters are typically treated as missing data (an unknown residue), effectively equivalent to ignoring them.

A variety of more formal approaches for treating insertions and deletions exist [39][49]. The canonical model in this active area of research is the Thorne-Kishino-Felsenstein model (TKF91) [50]. TKF91 treats insertion and deletion events as a continuous-time process governed by explicit insertion and deletion rate parameters, allowing multiple insertions and deletions to accumulate at the same “site” over long times. TKF91 improved, for example, upon methods that parsimoniously assume no more than one change per site per branch, including the pioneering Bishop and Thompson pairwise alignment likelihood model that preceded TKF91 [51]. Many extensions of TKF91 have appeared 5256, including practical applications for pairwise alignment [57],[58] and multiple alignment [59][62]. Several approaches based on TKF91 or related models have addressed the problem of simultaneously aligning and inferring the phylogeny of a group of related sequences [63][68]. In general all algorithms in this class have difficult time complexities, worst-case exponential in the number of sequences, and at least in the case of parsimony models akin to TKF91 , the problem of inferring the optimal insertion and deletion history given a tree has been formally shown to be NP-complete [47]. This complexity is inherent to the problem. Any implementation of evolutionary models that allow insertions and deletions, TKF91-based or otherwise, seeks to make approximations that make calculations tractable.

Here we are specifically concerned with the problem of calculating the probability of a given multiple alignment and phylogeny. A TKF91-based approach for this problem [69] used a clever approach of using ungapped columns in the alignment to constrain and subdivide the solution space (a so-called “homology structure”), but (because TKF91 is not invariant under column rearrangement) a sum over all subalignments compatible with a given homology structure is still required. As a result, though time complexity exponential in the number of sequences can be avoided in the average case where not many gaps occur, the approach remains expensive in absolute terms; a Bayesian Markov chain Monte Carlo phylogenetic inference for an alignment of 10 globins was reported to require 3 CPU hours on a 1.25 GHz G4 Apple Macintosh [69]. Although TKF91 has many desirable and realistic properties as a model of evolution, it would be advantageous to have even more computationally efficient approaches, particularly for problems where a gain in efficiency might outweigh sacrificing some of the realism of the model.

An alternative is to use a continuous-time Markov process for insertions and deletions that remains fully compatible with the Felsenstein algorithm [41],[46]. This requires an unrealistic assumption of column independence, but nonetheless, it may be better than ignoring gaps altogether. Here we further explore such models. We propose a non-reversible generative model based on a Markov process that we readily incorporated into an existing phylogenetic inference application, resulting in a gain in its accuracy.

Results

Premises of the Model

The originating idea is to use an extended (K+1)×(K+1) rate matrix for K residues (4 nucleotides or 20 amino acids) plus the gap character to describe the rate of change of a residue to a residue (substitution), a residue to a gap (deletion), and a gap to a residue (insertion) [6],[41],[46]. From this Markov process, we construct a generative model of sequence evolution that includes insertions and deletions. Several consequences flow from this, which here we discuss informally by way of introduction to the rest of the paper.

This model describes the evolution of single residues in one column of a multiple alignment, given a phylogenetic tree. The total probability of the alignment is then assumed to be an independent product of each column probability. For each column, a variant of the Felsenstein peeling algorithm recursively infers the probability of ancestral characters at each tree node, where an ancestral character is either a residue or a gap. Assuming that insertion and deletion events happen one residue at a time necessarily implies a linear “gap cost”. This is a much less satisfactory model of insertion and deletion processes than models that can assume an affine or arbitrary gap cost.

It has generally been thought that models based on a gap-extended rate matrix must be conceptually flawed, because it appears necessary to assume that all ancestral and descendant sequences fit in a fixed number of columns. This fundamentally conflicts with allowing any number of insertions and deletions to occur, and it produces a so-called “memory effect” artifact [70] in which descendants “remember” how many gap characters were present in ancestral sequences, allowing insertions up to that length and precluding longer ones. A related conceptual flaw would be treating gaps like residues, assigning a probability to a gap/gap alignment (as one would do for any residue/residue alignment), rather than recognizing that a gap/gap alignment may represent no evolutionary event at all from the standpoint of just the descendant and ancestral sequence; rather, gap/gap alignments are imposed by events that occurred in other sequences. For a model to be at all satisfactory, one must be able to describe a generative evolutionary model unconditional on any fixed sequence length, in terms of substitution, insertion and deletion events that evolve one (unaligned) sequence to another, and show how that generative process relates uniquely to the column-by-column inference algorithm that one will apply to a given multiple alignment.

Here we will develop such a generative model, by borrowing terms from a Markov process for a rate matrix extended for the gap character. The rate matrix includes the gap character, but the subsequent generative evolutionary model does not treat gaps as an extra residue. Rather, it describes evolutionary insertion and deletion (birth-death) events [71], where the evolved sequences form alignments of arbitrary length. The existence of a generative evolutionary model for unaligned sequences, and the mapping of its events to the column-by-column inference procedure, is the crucial point of differentiation between our work and previous work on efficient column-based phylogenetic inference with gap-extended rate matrices [6],[41].

Similarly, one must have a consistent way of dealing with columns that are unobserved in the alignment of extant sequences – that is, places where ancestral residues have been inserted and deleted, where alignment columns would exist if all the ancestral sequences were known in addition to just the extant sequences. Therefore the likelihood we calculate for an extant multiple alignment will be its marginal likelihood, marginalized over all possible ancestral sequences including unobserved alignment columns that left no trace in the observed alignment.

Another conceptual problem of column-based models arises if one adopts the usual practice of making the substitution process reversible (in the sense that the probability of an ancestor/descendant sequence alignment is independent of the direction of time along the branch that connects them; this is mathematically convenient for applying the “pulley principle” and using the Felsenstein peeling algorithm on unrooted trees). If one assumes reversibility for a substitution process that includes gaps, one necessarily imposes a frequency of gap characters that is constant with respect to divergence time [41]; but obviously in the limit of zero divergence time, there are no gap characters in an alignment of homologous sequences. Moreover, reversibility clearly cannot hold if insertion and deletion rates are free parameters. For example, for a nonzero insertion rate and zero deletion rate, the shorter of two homologous sequences automatically must be the ancestor. A reversible insertion/deletion model would also imply that all homologous sequences have the same expected length at all divergence times. We overcome these problems by adopting a non-reversible model that states an explicit prior length distribution for unaligned ancestral sequences (as opposed to assuming that all K+1 characters including gaps occur in the ancestor at the stationary frequencies of the Markov process). Because our model is non-reversible, we must always work with rooted phylogenies.

Because we map generative evolutionary model events onto an alignment column, the column is assumed to be correctly aligned phylogenetically – all aligned residues are assumed to be homologous and related only by substitution events. This means that no more than one insertion event may occur in any given column. Enforcing this assumption requires modification of the usual Felsenstein peeling algorithm to include some extra bookkeeping.

Given alignments are unlikely to be phylogenetically correct, because humans and alignment programs tend to produce aesthetically pleasing alignments that compress columns containing few residues. Importantly, for any arbitrary tree topology and any arrangement of observed residues and gap characters in an alignment column, there exists at least one possible assignment of characters to ancestral nodes that makes all extant aligned residues homologous. Therefore the problem with using phylogenetically incorrect alignments is not that the algorithm will fail altogether (as would happen if some combinations of alignments and trees were impossible), but rather that we can expect its inference ability to be degraded by forced inference of incorrect histories in phylogenetically incorrectly aligned columns. How much the overall inference is degraded by this and by the other assumptions described above is a matter for empirical testing, which we describe in the second half of the paper, after we describe the model itself.

Solving the Markov Process for a Gap-Extended Rate Matrix Model

First, we start by solving the Markov process associated with a rate matrix extended to include a gap character. For an alphabet of K residues, probabilistic substitution models are defined by a K×K rate matrix R such that the matrix of conditional probabilities Qt(i,j) ≡ P(j|i,t) is given by
(1)

We extend this to include the gap character by augmenting the rate matrix to a (K+1)×(K+1) matrix Rε that depends on arbitrary rates of deletions μ≥0 and insertions λ≥0:
(2)
where p = (p1,…,pK) is the distribution of inserted residues , and δij stands for the Kronecker delta in the K×K subspace (valued one if i = j and zero otherwise). Since the extended rate matrix Rε has the property that each row adds up to zero, we can construct a model of evolution for the extended rate matrix defined as
(3)
At zero divergence, the probabilities of any insertion or deletion of a residue or any substitution of a residue to a different residue are all zero.

Generally, the extended conditionals can be cast into the form,
(4)
with the conditions and , for each row i, and where Mt is the K×K conditional substitution matrix (to be defined later).

In particular, for a reversible K×K rate matrix R, if we assume that the distribution of inserted residues in equation (2) is the stationary distribution associated to the reversible rate R (πiR(i,j) = πjR(j,i)) then one can derive the analytic expression for in terms of the solution for the K×K conditional matrix Qt = etR and the rates of insertion and deletion. This particular solution for the extended conditional probabilities is given by
(5)
where the gap-specific functions γt and ξt are given by
(6)

(7)
when at least one of the two rates is positive. For the particular case of no insertions and deletions (λ = μ = 0) both functions are defined as identically zero.

Finally, we have to describe how to obtain the K×K conditional substitution matrix subspace Mt in from a rate matrix R and the insertion and deletion rates. Generally, any biologically relevant K×K substitution rate matrix R must have zero as a non-degenerate eigenvalue, and at least one other negative eigenvalue. To express Mt in general form, let (0,−e1,…,−eA) represent the eigenvalues of R, for 1≤AK−1, with ea>0 for 1≤aA. Then the standard conditional substitution probabilities could be expressed as:
(8)
where Oa is a K×K matrix of real numbers specific for each unique nonzero eigenvalue. Text S1 shows how to derive the Oa matrices from the similarity transformation that relates R with its diagonal form.

Now the extended conditional matrix in the substitution-only subspace Mt is expressed as:
(9)
If λ = μ = 0, Mt(i,j) is defined as the standard substitution conditional matrix Qt(i,j) in Equation 8.

For example, the F84 model [14] is defined by the substitution rate R(ij) = βπj+αΔij, where , and where the function εij is 1 if i,j are both either purines or pyrimidines, and 0 otherwise. The F84 rate depends on two non-negative parameters α and β, and on a stationary residue distribution π. The F84 rate matrix has non-zero eigenvalues e1 = β, and e2 = α+β. The matrix O1 is given by O1(i,j) = Δijπj, and the matrix O2 is given by O2(i,j) = δij−Δij. For the gap-extended F84 model, we obtain
(10)
which for the particular case λ = μ = 0 is defined as the original F84 model for substitutions,
(11)

The conditional model of McGuire et al. [41] describes a particular solution of the extended F84 model presented here, in which the rates of insertions and deletions are constrained to satisfy the conditions λ = β and μβ.

A more detailed description of the characteristic differential equations for an arbitrary gap-augmented rate matrix and the particular solution described above is provided in Text S1. Under different assumptions, such as a reversible substitution rate matrix but using a distribution of inserted residues other than the stationary distribution for R, or a non-reversible substitution rate matrix, could still be obtained numerically [72].

A note about reversibility. For a reversible substitution rate matrix, the particular Markov process including gaps solved here is also reversible, as can be seen by using the marginal frequencies πiλ/(λ+μ) for a residue, and μ/(λ+μ) for the gap character. However, one has to distinguish between the reversibility of a Markov process and the reversibility of an evolutionary process constructed using that Markov chain. A Markov chain is said to be reversible if and only if there exists a marginal distribution that satisfies the reversibility condition [73]. However, regardless of whether the Markov process is reversible in the strict sense, when constructing an evolutionary process from it one may specify an ancestral marginal distribution other than what reversibility requires. In evolutionary models, “reversibility” is generally taken to mean that the joint probability of an ancestral and a descendant sequence is invariant regardless of which sequence is used as the ancestor and which is used as the descendant. In probabilistic inference, this latter definition of reversibility is the most relevant (for instance to invoke the pulley principle [1]). From here on in this paper we use the term reversibility in this latter (broader) sense. Thus, the evolutionary model that we construct in the next section is not reversible in this (usual) sense though the Markov chain that it is based on is reversible in the strict sense.

Up to this point, this is essentially McGuire's model [41] (with a minor generalization). That model has the conceptual problems we described in the preamble, thought to be inherent to approaches based on gap-extended rate matrices. In the next section, we show how these problems may be circumvented.

The Generative Model

We construct the generative model as an independent product of single-event (substitution, deletion or insertion) contributions. What we will do to construct the model is to borrow terms from the conditional Markov process introduced in the previous section. We describe the probability of an insertion as proportional to ξtπi in Equation 7, the probability of a deletion as proportional to γt in Equation 6, and that of a substitution as proportional to Mt(i,j) in Equation 9, but we ignore the gap to gap transition of the Markov process. This allows us to derive a generative model that describes insertions and deletions not as mere “gap character” replacements on a fixed length alignment but as true evolutionary events (births and deaths of residues).

A given sequence x = {x1xl} that evolves to another sequence y = {y1yl} corresponds unambiguously to a pairwise alignment in which a substitution is represented by a conserved or mismatched column, a deletion by an ancestral residue aligned to a gap in the descendant sequence, and an insertion is represented by an ancestral gap aligned to a residue. Let xˆ = 1L and yˆ = 1L mean the aligned sequences x and y. Specifically, the probability that y was generated from x after time t with pairwise alignment xˆyˆ that includes sl substitutions, (ls) deletions, and (l′−s) insertions {I1…Il′−s} (where the subset of residues from x are substituted by in sequence y) is constructed as
(12)
The functions γt, ξt and are given by the Markov model solutions (Equations 6, 7, an 9). The residue distribution π, is set to the stationary distribution of the substitutions rate matrix. In that way, for a reversible substitution rate matrix, this generative model is quasi-reversible (i.e. reversible in the substitutions subspace).

In equation (12), everything but the term (1−ξt)l+1 is borrowed from the Markov process. This extra term is responsible for having a normalized distribution, such that the sum of the contributions of all sequences y of all possible lengths and all possible alignments is one. The extra term accounts for the fact given a sequence x of length l, insertions can occur at (l+1) places. That is, at each of the (l+1) places that an insertion in x could occur, an insertion of length z occurs with probability (a normalized geometric distribution).

Using the generative probability distribution (Equation 12), one can calculate the expected length (in residues without gaps) of descendant sequences originated from an ancestral sequence of length l after summing to all possible patterns of substitutions, insertions and deletions, that is given by (full derivation in Text S2):
(13)

Crucially, this model does not assume that all observed sequences are generated from a preset number of aligned columns. Thus there is no “memory effect” [70]. Our model is a generative probabilistic model with a close correspondence to the birth-death description of the TKF91 model [50], (see Text S2, for more detail).

We have implemented this generative model in a computer program for evolving sequences named εRATE.

The Joint Probability of a Pairwise Ancestor/Descendant Alignment

Given the conditional probabilities of the generative model, one wants to calculate the joint probability of a multiple alignment of sequences generated with the model, given a phylogenetic tree. Before doing that, let us consider the joint probability of a pairwise ancestor/descendant alignment, as a building block for a full-fledged algorithm for phylogenetic inference on multiple alignments.

Using the same notation as in the previous section, we can calculate the joint probability of the pairwise ancestor/descendant alignment (xˆ,yˆ) after divergence time t using the expression
(14)
where Pε(x) is the probability of the ancestral sequence. This distribution is a prior, not determined by the generative model. In order to have this prior factorize effectively into columns, we will assume the length of ancestral sequences follows a geometric distribution (1−p)pl for a sequence of length l with arbitrary Bernoulli frequency parameter 0<p<1.

The joint probability of a pairwise ancestor/descendant alignment can then be factorized as a product of terms over alignment columns as:
(15)
where the length of the alignment L = s+d+i. Normalization of this joint probability requires including the extra term Pε(★|t) = (1−p)(1−ξt) that cannot be associated to any observed column. Intuitively this term can be viewed as an extra (terminating) column (★). An analogous “extra column” term will appear in the multiple alignment case with the Felsenstein pruning algorithm.

In Text S3, we use the joint probabilities in Equation 15 to calculate other related length distributions of the model such as the length distribution for descendant sequences and the length distribution of alignment length, by summing over the other variables (marginalization). Because the model is non-reversible, the distribution of descendant sequences is different from that of ancestral sequences, and depends on the divergence time (non-stationary).

From the joint alignment probability distribution, we can also calculate the expected frequencies of insertions and deletions in pairwise ancestor/descendant alignments (see Text S3 for derivation). These quantities illustrate some important properties of the model. They will also be useful in making comparisons to the properties of other reversible models (see Discussion). They are,
(16)

(17)

Because the model is non-reversible, and are different (note that we avoid the traditional term “indel” in this paper, because we are not treating them reversibly). When the insertion rate is zero (and the deletion rate is positive), the expected frequency of insertions is zero for all divergence times but the expected frequency of deletions is not, as one would expect. Similarly, when the deletion rate is zero (and the insertion rate is positive), the expected frequency of deletions is zero for all divergence times while the expected frequency of insertions is not. For more detail, see Text S3.

Felsenstein's Peeling Algorithm Extended to Gaps

Felsenstein's peeling algorithm is an efficient algorithm for calculating the probability of a multiple alignment given a tree and a Markov substitution model. Extending the Felsenstein algorithm to include insertion and deletion events with the model we propose here requires four modifications. Those are: extra bookkeeping in the Felsenstein recursions to enforce that no more than one insertion occurs per column, so that all aligned residues are homologous; including a term from the prior ancestral sequence length distribution in the calculation of each individual column likelihood; including in the overall alignment likelihood the extra normalization terms collected in the “extra column” (★); and finally, marginalizing the contributions of possible ancestral residues that have left no trace in extant sequences. Otherwise, the substitution model assumed by the Felsenstein peeling algorithm is simply replaced by the generative model described in the previous section by Equation 12 which includes substitutions, insertions and deletions.

Using the notation of [6], for a given position u in the alignment, let Pu(Lk,i) be the probability up to node k given that the character (residue or gap) at node k is i. For a residue i, Pu(Lk,i) cannot contain any insertion only substitutions and deletions. For a gap, Pu(Lk,–) has to include one and only one insertion. Thus, Pu(Lk,–) is defined to be zero if all the nodes under k are gaps for that position. These probabilities are calculated recursively starting from the leaves of the binary tree as,

If node k is a leaf, for a residue i,
(18)
for a gap,
(19)

If node k is not a leaf, for a residue i,
(20)
for a gap,
(21)
where and are the two daughters of node k, , and are the distances from node k to its left and right child respectively, and where the probabilities for the daughter nodes have already been calculated by the recursion. uk stands for the subset of leaves under node k for column u, and uk = – indicates that all leaves under node k are gaps for column u. The single-event conditional probabilities are dictated by the generative model in Equation 12 as,
(22)

(23)

(24)
for 1≤i,jK, where the functions γt, ξt and are given by the Markov model solutions (Equations 6, 7, and 9). Figure 1 shows a graphical interpretation of the Felsenstein recursions described in Equations 20 and 21.

thumbnail

Figure 1. Felsenstein's peeling algorithm extended to gaps.

Graphical description of the extended Felsenstein algorithm to calculate the probability of an alignment given a phylogenetic tree using the generative model for gaps presented in this work, given by Equations 20 and 21. For a given column in a multiple alignment, recursion A corresponds to the probability of a tree up to node k with a residue at the node (Equation 20). Recursion B corresponds to the probability of a tree up to node k with a gap at the node (Equation. 21). The algorithm needs to consider only evolutionarily correct events, thus recursion A includes substitutions and deletions but no insertions, and recursion B has to include one and only one insertion. In the case of having a residue i at node k (recursion A), the A1 term corresponds to the original substitution-only Felsenstein algorithm, where q and s are residues, and ∑q and ∑s stand for the sum to all possible residue substitutions. The terms A2 and A3 represent a deletion occurring for the right and left child respectively (and a substitution for the other child node). Because no insertion can occur in this recursion, there are no evolutionary events happening for descendants for the child that suffers the deletion. Thus, for a A2 or A3 term to have a contribution all nodes down to the leaves for the child that suffers the deletion have to be gaps. (We represent with gray the situation of no evolutionary event happening, and with an empty gray triangle with gaps at the bottom the situation of gaps at all nodes including the leaves for a given subtree.) The A4 term corresponds to a deletion for both children nodes, and no evolutionary event from there on. A A4 term contributes only if with all the leaves under node k are gaps. In the case of having a gap at node k (recursion B), the insertion might occur for the left or right child, which is represented by graphs B1 and B2 respectively. The insertion might instead be delayed to a node under the left or right child, which is represented by graphs B3 and B4 respectively. In all four cases, the child node for which no insertion occurs does not have any evolutionary events, and has to include all gaps at the internal nodes and leaves. The substitution, deletion and insertion probabilities are given by the generative model as per Equations 22–24, respectively.

doi:10.1371/journal.pcbi.1000172.g001

The second modification is due to the existence of a length distribution for ancestral sequences for this model. In order to calculate the total probability of a given aligned column P(u|T,Rε,p), we need to use the probability of the sequence at the root. The total probability of site u is given by
(25)
where p is the parameter of the geometric distribution of ancestral sequences, and π are the residue frequencies at the root.

Third, the factorization in columns of the unconditional in length alignment distribution leaves some normalization terms that we gather together into what we think of as an “extra column” (★) contribution. Thus, when calculating the total probability of a multiple alignment as the product of l individual columns, there is an additional term in the equation:
(26)
The term P(★|T,Rε,p) = (1−p)P(Lroot) is calculated by a similar peeling algorithm,
(27)
where .

Lastly, one could delete an ancestral residue leaving no trace in extant sequences—a column could exist but not be present in the observed alignment—therefore we are interested in calculating P(L0) the probability of an alignment with L0 observed columns after marginalizing the unobserved columns. The probability of an alignment with L0 observed columns and k unobserved columns is given by
(28)
where the probability of an all-gaps column ugap is a particular case of Equation. 25. Then, for a given alignment of L0 observed columns, we marginalize to all possible unobserved columns as,
(29)
This is the final expression for the probability of a given alignment of L0 columns, after marginalizing all possible unobserved ancestral residues.

This algorithm will now reproduce the results of the original Felsenstein algorithm for ungapped alignments when the parameters λ = μ = 0 (except for a geometric term which is the same for all trees, thus will not affect the maximum likelihood estimation of a tree). In addition, λ = μ = 0 are the optimal parameter choices for any ungapped multiple alignment for any given tree. Notice that for the previous statement to be true, it is crucial to have the extra term (1−ξt) in the generative process of a substitution. Once we set μ = 0, the residue-residue substitution process in the presence of gaps (described by Equation 9) reaches the same asymptotic value (described by Equation 8) for any value of λ, including both the limit λ = 0 and the limit λ = ∞. It is the extra term that renders the total probability of any finite-length alignment to zero in the case λ = ∞, and makes λ = 0 optimal on ungapped alignments.

The extended peeling algorithm has worse-case time complexity for an alphabet of size K and a multiple alignment of L columns and n sequences.

An Implementation in phylip: dnamlε

To test an application of our model, we modified the program DNAML (from the PHYLIP package version 3.66, 4 August 2006 [24]) to use our generative model. Given a nucleotide multiple alignment, the DNAML program infers a phylogenetic tree for the sequences by maximum likelihood under an F84 rate matrix model [14]. Our modified program DNAMLε uses the extended F84 rate matrix given by Equations 5, 6, 7, and 10, and implements the extended Felsenstein algorithm described in the previous section.

DNAMLε uses the same core algorithms for maximum likelihood tree inference as the original DNAML. For every DNAML function we implemented a DNAMLε counterpart. In addition, DNAMLε optimizes the gap parameters (λ, μ). After each new branch is added to the tree using DNAML's Newton-Raphson method, DNAMLε midpoint roots the tree, and then jointly optimizes both branch lengths and gap parameters using conjugate gradient descent. Midpoint rooting is an easy but simplistic rooting method; we made no attempt in DNAMLε to optimize the placement of the root, though this would be possible without a significant efficiency cost, using methods described in [74], for example.

For simplicity, in this paper's results, we approximate the average length of ancestral sequences () by the average length of sequences in the given alignment. This is in line with common practice in maximum likelihood methods for phylogenetic inference for substitutions, which approximate the prior residue distribution at the root by the observed residue frequencies in the data [24]. In Bayesian methods, the root residue distribution and other parameters of the rate matrix are determined as part of the inference process usually in combination with Markov-chain Monte Carlo (MCMC) methods as for instance in [5]; parameter p could be estimated using similar techniques.

Given an ungapped alignment, DNAMLε and DNAML are expected to produce identical results (same unrooted tree with same likelihood, same branch lengths, and confidence limits). Given a gapped alignment, comparison of the two implementations allows us to ask how much performance improves when gaps are treated by our extended model as opposed to treating gaps as missing data.

Benchmarking

We compared DNAMLε and DNAML in three types of benchmarking experiments. First, using simulated ungapped alignment data, we confirmed that the two programs give essentially identical results when no insertions and deletions are present. Second, we assessed the ability of the two programs to accurately infer phylogenetic tree topologies for simulated alignment data with insertions and deletions from arbitrarily generated trees. Third, we assessed the programs' ability to correctly infer phylogenetic trees on real ribosomal RNA data.

The problem with benchmarking phylogenetic inference methods is knowing the ground truth. Evolutionary trees are only experimentally known in a few unusual cases for short time scales and rapidly evolving organisms; all other trees have been inferred. The strength of simulated data experiments is that the tree is known, but their weakness is that the simulation must assume an evolutionary model that may be biologically unrealistic and/or too similar to the evolutionary model assumed by the inference method to be tested. On the other hand, the strength of testing on real biological alignments is that the data are realistic, and the weakness is that the true tree is unknown. To evaluate inferences on real data, we have developed a “concordance test” that does not rely on the true tree being known. Testing on both simulated and real data should help compensate for weaknesses of either approach.

Tests on simulated ungapped alignments.

We first made sure that DNAMLε and DNAML produce essentially identical inferences when no insertions or deletions are present. This is a control experiment, making sure that DNAMLε correctly infers that optimal deletion and insertion rates are 0 on ungapped alignments, and that the likelihood calculation correctly reduces to the original ungapped version.

We generated simulated 8-taxon rooted trees using the algorithm of Kuhn and Felsenstein [75], which samples a variety of branch lengths and topologies. Each tree was rescaled to a chosen average branch length.

We used the program ROSE to generate simulated alignments from these sampled phylogenetic trees [76]. We use ROSE because it allows us to implement reasonably realistic models of insertion and deletion, which we will describe and use in the next section; in these initial ungapped control experiments, we turn off the insertion and deletion parameters. We modified ROSE to use the F84 rate matrix model for residue substitution, the same model used in DNAML and DNAMLε. We used a uniform stationary distribution of 25% for each residue, and a transition/transversion ratio of 2.0 (the DNAML default).

We sampled trees for 9 different average branch lengths, ranging from 0.005 to 2.0 substitutions/site and branch in roughly 2–3× multiplicative steps (corresponding to average pairwise identities ranging from 98% down to a fully saturated 25% in the 8 sequences). (The units of evolutionary time are in principle arbitrary. Substitution rate matrices are traditionally normalized to units of substitutions/site. DNAMLε reports time in units of changes per site, where changes include substitutions, insertions, and deletions.) For each choice of average branch length, we sample 100 different trees. For each tree, we generate different alignment lengths ranging from 50 to 1000 in steps of 5. A total of 900 trees and 171900 alignments were generated (100 samples each for 9 choices of branch length and 191 choices of alignment length). For each alignment, we infer a maximum likelihood tree topology using DNAMLε and DNAML.

To evaluate the correctness of inferred tree topologies, we computed three standard measures: the fraction of correct topologies (true positives, TP), the symmetric difference distance (SDD) [77], and the branch score distance (BSD) [75]. TP simply counts an inferred topology as right or wrong (high TP is better). SDD counts the number of non-identical internal branches in the two (unrooted) trees, where “identical” means splitting the two trees into the same disjoint sets of taxa; for a comparison of trees of 8 taxa, SDD ranges from 0 for identical trees to 10 for maximally dissimilar topologies. BSD evaluates not just the topology but also the correctness of inferred branch lengths, by summing the cost assigned to each internal branch of the square of the difference to the identical branch in the other tree (if there is no identical branch in the other tree, the cost is the whole branch square). Because BSD depends on total tree branch length, in order to compare results across different branch lengths, we calculate a normalized BSD (nBSD) on trees rescaled to an average branch length of one. Both SDD and BSD were calculated using the PHYLIP program TREEDIST. Better inferences are indicated by larger TP and by smaller SDD and nBSD.

The results are shown in Figure 2. As expected, the accuracy of tree topologies inferred by the two methods is essentially identical by the TP and SDD measures. As in any phylogenetic inference method, accuracy improves with alignment length (more data is better), and shows an optimum average branch length of about 0.05, corresponding to average pairwise sequence identities of about 82% (more similar sequences have fewer substitutions and less signal, and less similar sequences are more saturated). In the best cases (0.05 branch length, alignments longer than 800 nt) both methods infer the correct tree about 78% of the time (Figure 3A). nBSD values are also essentially identical for most choices of average branch length. We do see significant differences in branch length estimation (either by the nBSD test, or by plotting average inferred branch length) at large, saturating choices of average branch lengths of 1.0 or 2.0, corresponding to almost uncorrelated random sequences. In this extreme (and not biologically relevant) regime, branch lengths make little difference in likelihood so long as they are large (and accordingly, likelihoods assigned by the two methods are not significantly different, despite the different inferred branch lengths), and the inferred branch lengths become sensitive to details of the optimization method. Thus these extreme cases appear to be identifying a biologically irrelevant weakness in DNAML's optimizer, which infers overly large branch lengths (5.7±0.5 instead of 2.0 for the largest alignments), whereas DNAMLε results, produced by a different implementation of optimizer, are closer to the correct value (1.55±0.05). Overall, with the exception of this minor difference in numerical optimization, these results indicate that DNAMLε and DNAML do produce essentially identical results on ungapped alignments, as expected.

thumbnail

Figure 2. Comparison of DNAML versus DNAMLε for ungapped alignments.

Tree reconstruction for ungapped alignments generated according to a F84 substitution model (2.0 transition to transversion ratio and equiprobable residues), for nine different time divergences, ranging from 0.005 to 2.0 substitutions per site and branch. For a given divergence value, 100 random trees with eight taxa were used. For each tree, single alignments were generated with lengths ranging from 50 to 1000 residues in 5 residue increments. For each alignment, a tree was inferred using the programs DNAML and DNAMLε. Results are displayed as a function of the length of the alignments. (A) Fraction of trees which topology was correctly inferred as a function of the alignment length. The best performance occurs for alignments that contain about 18% pairwise substitutions on average (0.05 substitutions per site and branch). In this case, detectability seems to asymptote to approximately 78% for alignments of at least 800 residues. (B,C) Corresponding results when using the SDD and nBSD measures respectively. (D) Average mean branch length for each length bin. Overall, the two methods show similar performance for ungapped alignment. We mark with an arrow some extreme cases in which the two methods perform differently when inferring the tree branch lengths. (E) Comparison of computational time performance.