Wrote the paper: RKB CD LP. Led the development of FSA, wrote most of the code base, and developed the query-specific learning method: RKB. Redesigned the sequence annealing algorithm, constituted the core development team, and managed the project: RKB CD LP. Developed the GUI: AR. Developed a preliminary version of the GUI: MS. Developed the iterative refinement technique: SJ. Developed the parellelization and database modes: JD CD. Provided advice on the dart library, including its algorithms, programming and software components: IH. Created the FSA webserver: LP.
The authors have declared that no competing interests exist.
We describe a new program for the alignment of multiple biological sequences that is both statistically motivated and fast enough for problem sizes that arise in practice. Our
Biological sequence alignment is one of the fundamental problems in comparative genomics, yet it remains unsolved. Over sixty sequence alignment programs are listed on Wikipedia, and many new programs are published every year. However, many popular programs suffer from pathologies such as aligning unrelated sequences and producing discordant alignments in protein (amino acid) and codon (nucleotide) space, casting doubt on the accuracy of the inferred alignments. Inaccurate alignments can introduce large and unknown systematic biases into downstream analyses such as phylogenetic tree reconstruction and substitution rate estimation. We describe a new program for multiple sequence alignment which can align protein, RNA and DNA sequence and improves on the accuracy of existing approaches on benchmarks of protein and RNA structural alignments and simulated mammalian and fly genomic alignments. Our approach, which seeks to find the alignment which is closest to the truth under our statistical model, leaves unrelated sequences largely unaligned and produces concordant alignments in protein and codon space. It is fast enough for difficult problems such as aligning orthologous genomic regions or aligning hundreds or thousands of proteins. It furthermore has a companion GUI for visualizing the estimated alignment reliability.
The field of biological sequence alignment is very active, with numerous new alignment programs developed every year in response to increasing demand driven by rapidly-dropping sequencing costs. The list of approximately 60 sequence alignment programs on the wikipedia compilation provides a lower bound on the number of available tools and illustrates the confusing choice facing biologists who seek to select the “best” program for their studies. Nevertheless, the ClustalW program
A key issue in understanding the popularity of ClustalW is to recognize that it is difficult to benchmark alignment programs. Alignments represent homology relationships among the nucleotides, or amino acids, of the genomes of extant species, and it is impossible to infer the evolutionary history of genomes with absolute certainty. Comparisons of alignment programs therefore rely on databases of structural alignments for proteins and RNA or on gene loci or simulated data for DNA. Each type of benchmark is vulnerable to manipulation and furthermore may not represent the problem setups which are most relevant to biologists. The result is that biologists are confronted with many programs and publications, but it is frequently unclear which approach will give the best results for the everyday problems which they seek to address.
Adding to the difficulty of selecting software tools is the blur between programs and ideas. Developers of alignment programs make choices about the objective functions to optimize, the statistical models to use, and the parameters for these models, but the relative impact of individual choices is rarely tested
In lieu of these issues, biologists have favored the conservative approach of using the tried and trusted ClustalW program, although they frequently use it in conjunction with software which allows for manual editing of alignments
We therefore approached the alignment problem with the following goals in mind:
An approach which seeks to maximize the expected alignment accuracy. Our approach seeks to find the alignment with minimal expected distance to the true alignment of the input sequences, where the true alignment is treated as a random variable, with the probability of each true alignment determined under a statistical model. Explicitly incorporating a statistically-motivated objective function, this “expected accuracy” approach to alignment allows us to visualize alignments according to estimates of different quality measures, including their expected accuracy, sensitivity, specificity, consistency and certainty. We therefore offer biologists a way to compare alignments that allows them to quantitatively judge differences in alignment quality when they are performing manual edits.
A method which is robust to variation in evolutionary parameters. We sought a method which is robust to phenomena such as sequences of differing evolutionary distances and base composition. While “phylogenetic alignment” methods seek to accomplish this by explicitly modeling alignments on trees
Robust results when faced with the wide range of alignment problems encountered today. We sought to create a single program which is capable of achieving high accuracies on protein, RNA and DNA sequences without additional input from, e.g., database homology searches. We additionally sought to make our approach fast enough for large-scale problems such as aligning many sequences or orthologous regions of genomes. (When aligning genomic-size sequences, we assume that the sequences are collinear; we do not attempt to solve the problem of resolving duplications or inversions.)
Creation of a modular code base so that future improvements in one aspect of alignment could easily be incorporated into our approach. In particular, we aimed to create a collaborative infrastructure so that “bioinformaticians with expertise in developing software for comparing genomic DNA sequences [can] pool their ideas and energy to produce a compact tool set that serves a number of needs of biomedical researchers”
The “distance-based” approach to sequence alignment, proposed in
We have implemented our approach in FSA, a new alignment program described below. We give an overview of the structure of FSA and explain the details of its components below.
The algorithms that are used in each component are highlighted in the accompanying boxes. The bold arrows show the simplest mode of use for FSA, where posterior probabilities are calculated directly using default parameters for all pairs of sequences and the optional steps of anchor finding and iterative refinement are omitted.
The
FSA first performs pairwise comparisons of the input sequences to determine the posterior probabilities that individual characters are aligned (note, however, that it does not yet actually align any sequences). While the number of possible pairwise comparisons is quadratic in the number of sequences being aligned, giving unfavorable runtimes for datasets of many sequences, FSA overcomes this problem by reducing the number of pairs of sequences that are compared. This is accomplished using a randomized approach inspired by the Erdös-Rényi theory of random graphs, thereby drastically reducing the computational cost of multiple alignment.
After obtaining pairwise estimates of homology at the single-character level, FSA uses the sequence annealing technique
The architecture of FSA reflects the inherent modularity of the distance-based approach to alignment. FSA's inference engine uses the flexible HMMoC code-generation tool
Each of these components can be improved independently of the others, allowing for rapid future improvements in distance-based alignment. For example, FSA's entire statistical model could easily be altered to incorporate position-specific features or completely replaced with a discriminative or hybrid generative-discriminative model.
The components described here correspond roughly to the simplest mode of operation for FSA, outlined in bold in
FSA accepts FASTA-format input files and produces alignments in multi-FASTA or Stockholm format. The server also outputs PHYLIP and ClustalW formatted files.
Distance-based alignment, relying on pairwise estimations of homology, operates on the posterior probability distributions that characters in two sequences are aligned. FSA uses the standard three or five-state pair hidden Markov model (Pair HMM) shown in
By default FSA uses a Pair HMM with two sets of Insert (I) and Delete (D) states to generate a two-component geometric mixture distribution. FSA can optionally use a three-state HMM, which has only one set of Insert and Delete states. M is a Match state emitting aligned characters.
The standard Forward-Backward algorithm on a Pair HMM has time complexity
After calculating the posterior probabilities of alignment for characters in all sequence pairs,
After estimating these posterior probabilities and sorting them, FSA creates a multiple alignment with the sequence annealing technique
The distance
The alignment with the minimum expected distance to the truth is equivalent to the alignment with the maximum expected accuracy,
The posterior probabilities over alignments
Using this expected accuracy as an objective function for a greedy maximization, sequence annealing begins with the null alignment (all sequences unaligned) and merges single columns (aligns characters) according to the corresponding expected increase in
“The mathematics of distance-based alignment” in
In FSA's definition of an alignment, an alignment consists solely of a specification of homology. This definition differs from the standard definition of a multiple alignment, which implies an evolutionary history of substitution and indel events. For example, FSA (internally) considers the two alignments shown in
“The mathematics of distance-based alignment” in
This is problematic for comparative genomics analyses which use an alignment to infer evolutionary parameters such as indel frequencies across a clade. In order to output a single global alignment from which such evolutionary parameters can be inferred, we choose a topological ordering of the alignment POSET which corresponds to a maximum-parsimony interpretation of indel events. FSA outputs the global alignment with the minimum number of “gap openings” across the individual sequences (the right-hand alignment in
FSA can efficiently align hundreds or even thousands of sequences. By default it performs exhaustive distance-based alignment, which requires an all-pairs comparison of the
In order to ensure a complete alignment, where no sequence is left unaligned, each sequence must be connected to every other sequence by a series of pairwise comparisons. For
We therefore chose to use a completely-randomized approach inspired by results from the theory of Erdös-Rényi random graphs. By the Erdös-Rényi theory, a random graph will almost surely be connected if the edge-creation probability satisfies
This speedup reduces the time complexity of both inference and sequence annealing. The cost of inference is reduced to
FSA can align megabase-long sequences using an “anchor annealing” strategy. Analogously to other genome alignment programs such as MAVID
FSA utilizes its distance-based approach to find a consistent set of anchors across all sequences simultaneously, thereby making maximal use of additional constraints from other sequences. This “anchor annealing” strategy is conceptually similar to the procedures used in programs for aligning long sequences such as CHAOS/DIALIGN, MAVID, Pecan and TBA, which return partially-ordered sets of anchors, thereby permitting constraints to be projected across multiple sequences.
As with sequence annealing, this “anchor annealing” can be accomplished efficiently with a greedy algorithm based on the Pearce-Kelly algorithm. FSA uses the same code for both sequence and anchor annealing, although the objective function is different: Anchor “scores” correspond to
Rather than aligning entire anchors across the multiple alignment in order to find a consistent set of anchors, FSA finds a set of anchor centroids which are consistent across all sequences and then prunes the resulting anchors at a pairwise level. The result is a set of anchors between pairs of sequences whose centroids are consistent across all sequences and which are non-overlapping between pairs of sequences. This heuristic approach permits FSA to quickly find consistent anchors across many sequences.
After finding a consistent set of anchors across the multiple alignment, FSA assumes that these anchors are correctly aligned with probability ∼1 and then aligns the regions between anchors using dynamic programming. When anchored alignment is performed, the posterior probabilities that individual characters are aligned, which FSA uses to inform construction of the multiple alignment, are conditioned on the set of anchors chosen. Therefore, if all anchors correspond to true homology then these probabilities will be correctly estimated despite the anchoring heuristic; however, if incorrect anchors are chosen, then individual probabilities of alignment can be similarly incorrect.
While FSA's restriction of candidate anchors from MUMmer to MUMs produces a very specific set of anchors, the restriction can be too stringent to obtain sensitive alignments of diverged or very long sequences, for which there are few unique exact matches. To address this potential problem, FSA can use the fast homology-search program exonerate
Because FSA uses the fast tool MUMmer to find anchors, it can rapidly align closely-related sequences, such as mitochondrial DNA, for which anchors span most of the alignment, making costly dynamic programming largely unnecessary.
The distinct functional constraints acting on biological sequences give rise to very different patterns of molecular evolution, each implying distinct parameterizations of an appropriate model for alignment. For example, if substitutions or indels are more frequent in one lineage than in the others, then using a single model for all sequences (which does not reflect these differing constraints) can give misleading results. Nonetheless, sequence alignment algorithms traditionally use a single model for all sequences.
In order to overcome these difficulties, FSA uses “query-specific learning,” wherein a different model is learned for each pairwise comparison (the “query”). This is done in a completely unsupervised framework: FSA uses an unsupervised Expectation Maximization (EM) algorithm to estimate transition (gap) and emission (substitution) probabilities of the Pair HMM on the fly.
Despite its unsupervised nature, FSA's query-specific learning needs remarkably little sequence data to effectively learn parameters. Standard alignment algorithms estimate parameters from thousands or tens of thousands of pairs of
While FSA learns distinct transition parameters for every pair of query sequences regardless of the sequence composition, it uses different learning strategies for nucleotide and amino acid emission matrices. Because a pair emission matrix over aligned nucleotides has only (42−1) = 15 free parameters, FSA can learn a different emission distribution for every pairwise comparison of all but the shortest RNAs or DNAs (this allows FSA to learn a different model for each pair of unanchored subsequences when performing anchored aligment). In contrast, emission matrices over aligned amino acids have (202−1) = 3,999 free parameters, thereby preventing FSA from learning independent models for each pair of proteins. FSA therefore learns a single emission matrix using an all-pairs comparison for protein sequences.
Because FSA uses unsupervised learning on very sparse data, overfitting is a serious concern. FSA attempts to prevent overfitting by (1) using a weak Dirichlet regularizer (prior) when estimating both transition and emission probabilities, and (2) terminating parameter learning before the EM algorithm converges. By default the Dirichlet emission priors are scaled such that total number of pseudocounts for aligned characters is equal to the number of free parameters in a symmetric pair emission matrix. As is the case for other machine-learning algorithms, it can be shown that termination before convergence of query-specific learning is equivalent to a form of regularization (likelihood penalty).
If there is insufficient sequence data for effective learning, then FSA does not estimate parameters and instead uses default parameters to construct an alignment. The default parameters values, as well as seeds used for the learning algorithm, are discussed in
While FSA can align hundreds or thousands of sequences on a single computer, it can handle truly large-scale problems by running in a parallelized mode on a computer cluster. FSA's distance-based approach to alignment gives the multiple alignment a natural independence structure: each pairwise alignment is independent of all other pairs, allowing dramatic runtime reduction by distributing the individual pairwise computations to different processors.
Two factors were considered for the parallelization of FSA : (1) communication overhead between nodes, and (2) workload distribution over the different processors. For example, distributing jobs in very small batches may reduce processor idle time but lead to high overhead; in contrast, using large batches may increase idle time but minimize overhead. FSA's parallelization mode uses the “Fixed-Size Chunking” strategy described in
While the pairwise comparisons can be naturally parallelized, sequence annealing does not have the same obvious independencies. Therefore, even when running in parallelized mode, FSA performs sequence annealing on a single node. The parallelization of the annealing step is a future aim for this project. A schematic of the current parallelization strategy is given in
For large input sizes, a disk-based database may be used to store some of the primary data structures and reduce memory usage.
As a greedy algorithm, sequence annealing is only guaranteed to find a local optima of the expected accuracy. FSA therefore uses an iterative refinement strategy after sequence annealing terminates to locally improve the alignment. For each round of iterative refinement, FSA looks at every character's position in the multiple alignment and sees whether the objective function can be improved by moving it to another position (without violating the consistency constraints of the multiple alignment). FSA assembles a list of such candidate character shifts, orders the list by the expected change in the objective function, and then performs the shifts. Iterative refinement terminates when there are no candidate shifts which improve the objective function.
FSA's included GUI permits the user to visually assess alignment quality under FSA's statistical model according to estimates of different measures, including expected accuracy, sensitivity, specificity, consistency and certainty. This permits biologists and bioinformaticians to incorporate reliability measures into downstream analyses, such as evolutionary rate measurements and phylogenetic reconstruction. Incorporating such information can produce distinctly different results. For example, over-aligned non-conserved sequence can cause a systematic bias towards long branch lengths; this can be ameliorated by incorporating the expected accuracy statistics produced by FSA into reconstruction algorithms.
FSA's alignment is colored according the expected accuracy under FSA's statistical model (top) as well as according to the “true” accuracy (bottom) given from a comparison between FSA's alignment and the reference structural alignment. It is clear from inspection that accuracies estimated under FSA's statistical model correspond closely to the true accuracies. Sequences are from alignment BBS12030 in the RV12 dataset of BAliBASE 3
FSA's GUI can color alignments according to five different measures of alignment quality, which are approximated under its statistical model. Characters
Accuracy: The expected accuracy with which
Sensitivity: The expected sensitivity with which
Specificity: The expected specificity with which
Certainty: The certainty with which
Consistency: The consistency of the many pairwise comparisons used to construct the multiple alignment and the implied optimality of the alignment of
See
The GUI also provides a visual and statistical guide when manually editing alignments.
We benchmarked FSA against databases of multiple alignments compiled from reference structural alignments, including protein databases (BAliBASE 3
Alignment programs are commonly used to detect homology among input sequences. We conducted a series of false-positive experiments to test whether commonly-used alignment programs can reliably identify homologous and non-homologous sequence. Surprisingly, we found that for most alignment programs, aligned sequences are not necessarily homologous, indicating that biologists should use caution when interpreting the multiple alignments produced by many commonly-used tools.
We additionally performed a simple test of the consistency of common programs when aligning coding sequence: We aligned 1,502 genes orthologous across seven species of yeast in both nucleotide and protein space and compared the resulting alignments. Many programs gave surprisingly discordant results, indicating that at least one of these two alignments produced by commonly-used programs is incorrect.
Program | BAliBASE 3 | BAliBASE 3+fp | SABmark 1.65 |
(Acc/Sn/PPV) | (Acc/Sn/PPV) | (Acc/Sn/PPV) | |
AMAP | 0.70/0.62/0.83 | 0.73/0.61/0.80 | 0.57/0.43/0.46 |
ClustalW | 0.66/0.63/0.62 | 0.59/0.63/0.53 | 0.38/0.44/0.30 |
DIALIGN | 0.68/0.63/0.68 | 0.68/0.62/0.63 | 0.48/0.41/0.34 |
FSA | 0.71/0.62/0.85 | ||
FSA (–maxsn) | 0.73/0.68/0.76 | 0.74/0.68/0.72 | 0.52/0.45/0.39 |
MAFFT | 0.74/0.71/0.71 | 0.68/0.71/0.61 | 0.44/0.49/0.35 |
MUMMALS | 0.74/0.70/0.73 | 0.69/0.70/0.64 | 0.49/0.52/0.38 |
MUSCLE | 0.70/0.67/0.66 | 0.63/0.66/0.57 | 0.40/0.46/0.32 |
Probalign | 0.71/0.71/0.65 | 0.49/0.50/0.37 | |
ProbCons | 0.74/0.70/0.72 | 0.69/0.70/0.64 | 0.47/0.50/0.37 |
T-Coffee | 0.72/0.67/0.71 | 0.67/0.67/0.63 | 0.45/0.46/0.35 |
SeqAn::T-Coffee | 0.73/0.69/0.70 | 0.67/0.69/0.61 | 0.43/0.47/0.34 |
Comparisons of the accuracies (Acc), sensitivities (Sn) and positive predictive values (PPV) of FSA and other alignment methods on the BAliBASE 3
In order to test the robustness of alignment programs to incomplete homology, we modified the BAliBASE 3 database such that every alignment included a single false-positive, an unrelated (random) sequence. This is a realistic setup for biologists who might want to decide whether a sequence is orthologous to a particular protein family. With the exception of FSA, the tested alignment programs suffered a false-positive rate increased by over 25% on this BAliBASE 3+fp dataset, indicating that they aligned the random sequence with the homologous set. In contrast, FSA left the random sequence unaligned and had an essentially-unchanged false-positive rate.
Program | BRAliBase 2.1 | Consan mix80 |
(Acc/Sn/PPV) | (Acc/Sn/PPV) | |
ClustalW | 0.85/0.86/0.86 | 0.65/0.65/0.68 |
DIALIGN | 0.82/0.83/0.85 | 0.76/0.75/0.82 |
FSA | 0.90/0.91/0.94 | 0.77/0.74/0.92 |
FSA (–maxsn) | ||
MAFFT | 0.90/0.91/0.91 | 0.77/0.78/0.77 |
MUSCLE | 0.90/0.91/0.90 | 0.74/0.76/0.74 |
ProbConsRNA | 0.91/0.92/0.92 | (failed to align) |
T-Coffee | 0.81/0.82/0.84 | 0.38/0.33/0.40 |
SeqAn::T-Coffee | 0.89/0.90/0.90 | (failed to align) |
Comparisons of the accuracies (Acc), sensitivities (Sn) and positive predictive values (PPV) of FSA and other alignment methods on the BRAliBase 2.1 dataset of small RNAs
BRAliBase 2.1 was assembled from the RFAM
The Consan mix80 dataset of alignments of Small and Large Subunit ribosomal RNAs, assembled from the European Ribosomal RNA database
Program | Blanchette et al. | DAWG | simgenome |
(Acc/Sn/PPV) | (Acc/Sn/PPV) | (Acc/Sn/PPV) | |
CHAOS/DIALIGN | 0.58/0.44/0.74 | 0.72/0.46/0.43 | 0.62/0.67/0.59 |
DIALIGN-TX | 0.73/0.68/0.77 | 0.72/0.51/0.44 | 0.64/0.68/0.61 |
FSA (–exonerate) | 0.86/0.82/0.93 | ||
FSA (–exonerate –maxsn) | 0.87/0.85/0.90 | 0.75/0.41/0.50 | 0.76/0.79/0.77 |
MAVID | 0.57/0.45/0.68 | 0.66/0.36/0.32 | 0.72/0.77/0.72 |
MLAGAN | 0.70/0.63/0.80 | 0.45/0.39/0.19 | 0.71/0.71/0.73 |
Pecan | 0.77/0.48/0.53 | 0.78/0.81/0.78 | |
TBA | 0.83/0.81/0.87 | 0.80/0.32/0.75 | 0.74/0.79/0.72 |
Comparisons of the accuracies (Acc), sensitivities (Sn) and positive predictive values (PPV) of FSA and other alignment methods on simulated alignments of mammalian and
The simulated alignments of nonfunctional DNA sequences from nine mammals (human, chimp, baboon, mouse, rat, cat, dog, cow, and pig) were created by Blanchette et al.
FSA's strong performance on all three datasets of simulated long DNA sequences indicate that it is a useful and accurate tool for genomic alignment.
In order to further test the appropriateness of using popular alignment programs to detect homology between sequences, we conducted a large-scale random-sequence experiment. We generated datasets of random sequences to simulate unrelated protein, short DNA, and genomic (long) DNA sequences. The results, shown in
Program | Protein | DNA |
AMAP | 14% | n/a |
ClustalW | 97% | 95% |
DIALIGN | 24% | 17% |
FSA | ||
FSA (–maxsn) | 21% | 17% |
MAFFT | 83% | 93% |
MUMMALS | 63% | n/a |
MUSCLE | 89% | 80% |
Probalign | 44% | n/a |
ProbCons | 51% | 77% |
T-Coffee | 63% | 75% |
SeqAn::T-Coffee | 74% | 78% |
Large-scale random sequence tests indicate that for most alignment programs, aligned sequences are not necessarily homologous (table shows the fraction of random sequence aligned, calculated by taking a sum-of-pairs over pairwise alignments). Even when run in maximum-sensitivity mode (–maxsn), FSA aligned only a small fraction of the random sequence. We generated 50 datasets, each with 10 random sequences, and ran all programs with default parameters. Protein sequences were 300 aa in length and DNA sequences were 1,000 nt in length. Results reported for ProbCons on DNA sequences were obtained with ProbConsRNA.
Program | Genomic DNA |
CHAOS/DIALIGN | 10% |
ClustalW | 96% |
DIALIGN-TX | 20% |
FSA (–exonerate) | 1% |
FSA (–exonerate –maxsn) | 4% |
MAVID | 17% |
MLAGAN | 30% |
Pecan | 1% |
TBA |
Large-scale random sequence tests for genomic alignment programs. As in
Biologists commonly align coding regions in both amino acid and nucleotide space, but there have been few studies of the effectiveness of alignment programs across the two regimes. We tested the consistency of alignment programs on coding sequence by aligning all 1,502 genes in
Program | Alignment similarity (average) |
ClustalW | 0.914 |
DIALIGN | 0.912 |
FSA | |
FSA (–noanchored) | |
MAFFT | 0.932 |
MUSCLE | 0.915 |
ProbCons | 0.902 |
T-Coffee | 0.897 |
SeqAn::T-Coffee | 0.905 |
We assessed the concordance between alignments obtained in nucleotide and amino acid space by aligning all 1,502 genes in
This simple test additionally indicates the effectiveness and robustness of FSA's query-specific learning. While very different learning procedures are used for amino acid and nucleotide sequence, the concordant alignments inferred by FSA indicate that our results are robust with respect to the details of the learning procedure.
We conducted an ablation analysis of FSA's components to test the effectiveness of each component and ensure that they all contributed to the accuracy of the program. As indicated by the results in
FSA options | BAliBASE 3 | BAliBASE 3+fp | SABmark 1.65 |
(Acc/Sn/PPV) | (Acc/Sn/PPV) | (Acc/Sn/PPV) | |
(default) | 0.71/0.62/0.85 | ||
–fast | 0.70/0.61/0.85 | 0.74/0.62/0.84 | 0.59/0.37/0.52 |
–nolearn | 0.72/0.65/0.81 | 0.75/0.65/0.79 | 0.56/0.44/0.44 |
–refinement 0 | 0.70/0.61/0.85 | 0.74/0.61/0.84 | 0.59/0.37/0.52 |
–noindel2 | 0.70/0.61/0.85 | 0.74/0.60/0.84 | 0.59/0.38/0.52 |
–maxsn | 0.74/0.68/0.72 | 0.52/0.45/0.39 | |
–fast –maxsn | 0.73/0.67/0.76 | 0.73/0.67/0.71 | 0.52/0.44/0.39 |
–nolearn –maxsn | 0.73/0.68/0.74 | 0.70/0.68/0.67 | 0.49/0.47/0.37 |
–refinement 0 –maxsn | 0.72/0.66/0.78 | 0.73/0.66/0.73 | 0.53/0.43/0.39 |
–noindel2 –maxsn | 0.73/0.68/0.76 | 0.72/0.68/0.70 | 0.51/0.45/0.39 |
Ablation analysis of FSA on the protein benchmarks of
FSA options | BRAliBase 2.1 | FSA options | Consan mix80 |
(Acc/Sn/PPV) | (Acc/Sn/PPV) | ||
(default) | 0.90/0.91/0.94 | ||
–fast | 0.90/0.91/0.94 | –fast | 0.77/0.74/0.92 |
–nolearn | 0.91/0.92/0.93 | –nolearn –fast | 0.77/0.74/0.93 |
–refinement 0 | 0.90/0.91/0.93 | –refinement 0 –fast | 0.73/0.69/0.94 |
–noindel2 | 0.91/0.92/0.93 | –noindel2 –fast | 0.73/0.69/0.91 |
–noanchored –fast | 0.77/0.74/0.93 | ||
–maxsn | |||
–fast –maxsn | 0.91/0.92/0.92 | –fast –maxsn | 0.78/0.78/0.86 |
–nolearn –maxsn | 0.91/0.92/0.92 | –nolearn –fast –maxsn | 0.78/0.78/0.85 |
–refinement 0 –maxsn | 0.90/0.91/0.93 | –refinement 0 –fast –maxsn | 0.74/0.70/0.92 |
–noindel2 –maxsn | 0.91/0.92/0.92 | –noindel2 –fast –maxsn | 0.74/0.73/0.84 |
–noanchored –fast –maxsn |
Ablation analysis of FSA on the RNA benchmarks of
FSA options | Blanchette et al. |
(Acc/Sn/PPV) | |
(default) | 0.53/0.32/0.93 |
–exonerate | 0.83/0.77/0.94 |
–exonerate –minscore 50 | |
–exonerate –refinement 0 | 0.82/0.76/0.93 |
–exonerate –noindel2 | 0.78/0.72/0.94 |
Ablation analysis of FSA on the simulated mammalian DNA of
FSA options | Protein | DNA |
(default) | 4% | 5% |
–fast | 4% | 5% |
–nolearn | 13% | 8% |
–refinement 0 | ||
–noindel2 | 5% | 10% |
–maxsn | 21% | 17% |
–fast –maxsn | 22% | 17% |
–nolearn –maxsn | 30% | 16% |
–refinement 0 –maxsn | 19% | 15% |
–noindel2 –maxsn | 27% | 21% |
Ablation analysis of FSA on the unrelated sequence benchmarks of
The importance of each component depends strongly upon the alignment problem. The –fast heuristic for reducing the number of sequence pairs used to construct an alignment results in little loss of accuracy, at least on the benchmarks used in this paper (
Biologists commonly perform alignments of hundreds or thousands of 16S ribosomal DNA sequences in order to elucidate evolutionary relationships and build phylogenetic trees. We performed alignments of prokaryotic 16S sequences to compare the speed of commonly-used programs (
Program | 100 | 200 | 300 | 400 | 500 seqs |
ClustalW | 1,194 s | 4,147 s | 9,110 s | 16,187 s | 27,755 s |
DIALIGN | 4,346 s | 19,449 s | 49,388 s | (fail) | (fail) |
FSA –fast | 1,513 s | 3754 s | 5,641 s | 9,767 s | 15,683 s |
FSA –fast –noindel2 –refinement 0 | 638 s | 1,495 s | 2,467 s | 3,604 s | 5,154 s |
MAFFT | |||||
MUSCLE | 351 s | 1,235 s | 1,516 s | 4,384 s | 7,552 s |
ProbConsRNA | 16,319 s | (fail) | (fail) | (fail) | (fail) |
T-Coffee | 1,362 s | 3,666 s | 7,880 s | 15,254 s | 22,085 s |
SeqAn::T-Coffee | 3,024 s | (fail) | (fail) | (fail) | (fail) |
Comparison of runtimes of FSA and other alignment methods when aligning 16S ribosomal sequences. MAFFT was faster than any other method by an order of magnitude; the next-fastest programs were MUSCLE and FSA. FSA can be made substantially faster by using a 3-state, rather than the default 5-state, HMM (with little loss of accuracy; see
The results in
FSA options | 100 | 200 | 300 | 500 | 1,000 seqs |
FSA | 6,407 s | 27,534 s | — | — | — |
FSA –parallelize 10 | 819 s | 5,713 s | 22,113 s | — | — |
FSA –fast | 1,650 s | 3,781 s | 6,207 s | 12,249 s | — |
FSA –fast –parallelize 10 | 201 s | 513 s | 924 s | 2,511 s | 15,179 s |
Runtimes for FSA in regular, –fast and –parallelize modes when aligning the 16S sequences of
1 | 5 | 10 | 15 | 20 processors | |
100 seqs | 1,650 s | 365 s | 214 s | 135 s | 105 s |
200 seqs | 3,781 s | 889 s | 506 s | 385 s | 355 s |
Runtimes for FSA in –fast –parallelize P mode as a function of the number of processors P in the computer cluster with a 3-state HMM (–noindel2) and refinement disabled (–refinement 0). Sequences and cluster specifications are same as for
In the Introduction we highlighted four design criteria which we emphasized in the development of FSA. The first goal was to find alignments with high expected accuracy, where an accurate alignment has minimal distance to the truth. This objective function is markedly different from both the maximum-likelihood approaches used by programs such as ClustalW and MUSCLE and the maximum expected sensitivity approaches used by programs such as ProbCons and Pecan. Note that while the objective function used in ProbCons is called “maximum expected accuracy” in the paper
We believe that this accuracy criterion, which gives equal weight to the correctness of all sequence positions, is a natural measure of alignment quality. Downstream analyses, such as phylogenetic reconstruction and evolutionary constraint analysis, are increasingly using indels in addition to homologous characters for more accurate estimation (e.g.,
FSA's strong performance under the accuracy criterion is due to techniques such as its iterative refinement as well as its explicit attempt to maximize the expected accuracy; programs which explicitly seek to maximize an objective function of the posterior probabilities of character alignment, such as ProbCons or Probalign, could instead seek to maximize the expected accuracy described here and, as a likely result, increase their robustness to non-homologous sequence. However, while we believe that the expected accuracy is a biologically-sensible objective function, it may not be appropriate if the user desires the most sensitive alignment. While FSA can produce the most-sensitive RNA alignments, other programs can produce more sensitive alignments of proteins and genomic sequence, albeit generally at the cost of a tendency to align non-homologous sequence (
The second goal was to create alignments which are robust to evolutionary distances and different functional constraints on patterns of molecular evolution. FSA's unsupervised query-specific learning for parameter selection frees the user from unknown systematic biases implicitly introduced by using an alignment program whose parameters were trained on a dataset whose statistics may not reflect those of the sequences to be aligned. For example, it is traditionally challenging to align sequences with unusual base composition, but FSA can easily handle such alignments by automatically learning appropriate parameters. As indicated by our ablation analysis, query-specific learning further increases FSA's robustness to non-homologous sequences beyond that offered by the minimum-distance objective function alone.
We believe that FSA's unsupervised query-specific learning is the first time a multiple alignment program has been capable of dynamically learning a complete parameterization, wherein parameters can vary for each pair of sequences to be compared, on the fly. This learning method is related to the “pre-training” option in ProbCons, which permits users to learn different models for families of homologous sequences, but does not permit parameterizations to vary between sequence pairs. We also note that the MORPH program for pairwise alignment of sequences with
The third and fourth goals, to develop a single, modular program which can address all typical alignment problems encountered by biologists, are naturally achieved within FSA's architecture. While almost all alignment programs are designed to either align many short sequences or a few long sequences, we have demonstrated that it is feasible to create a single program which can address both situations. This is made practical by FSA's modular nature, where the statistical model for pairwise comparisons, the anchoring scheme for finding homology between long sequences, and the sequence annealing procedure are entirely separate and can be individually modified and improved. For example, the parallelization of FSA was designed and developed with minimal changes to the rest of FSA's code base. Similarly, while FSA's basic anchoring relies only on exact matches from MUMmer, the anchoring scheme was easily extended to incorporate inexact matches from exonerate
FSA is a statistical alignment program insofar as it uses an explicit statistical model of alignments and a probabilistic objective function for optimization, but as discussed in “Theoretical justification of distance-based alignment” (
Supplementary Information
(0.23 MB PDF)
FSA borrows heavily from previous work, both in its code base and its intellectual foundations.
The query-specific learning method for re-estimating alignment parameters on the fly was inspired by
The iterative refinement technique is based on ideas in
The FSA algorithm was parallelized using a modification of the approach in MW
The coloring in the GUI according to posterior probabilities of alignment is inspired by the AU viewer of BAli-Phy
FSA's query-specific learning uses code created with Gerton Lunter's HMMoC compiler for HMMs
FSA's sequence and alignment representation code was inspired by similar code in the dart library