Conceived and designed the experiments: SRE. Performed the experiments: SRE. Analyzed the data: SRE. Contributed reagents/materials/analysis tools: SRE. Wrote the paper: SRE.
The author has declared that no competing interests exist.
Profile hidden Markov models (profile HMMs) and probabilistic inference methods have made important contributions to the theory of sequence database homology search. However, practical use of profile HMM methods has been hindered by the computational expense of existing software implementations. Here I describe an acceleration heuristic for profile HMMs, the “multiple segment Viterbi” (MSV) algorithm. The MSV algorithm computes an optimal sum of multiple ungapped local alignment segments using a striped vector-parallel approach previously described for fast Smith/Waterman alignment. MSV scores follow the same statistical distribution as gapped optimal local alignment scores, allowing rapid evaluation of significance of an MSV score and thus facilitating its use as a heuristic filter. I also describe a 20-fold acceleration of the standard profile HMM Forward/Backward algorithms using a method I call “sparse rescaling”. These methods are assembled in a pipeline in which high-scoring MSV hits are passed on for reanalysis with the full HMM Forward/Backward algorithm. This accelerated pipeline is implemented in the freely available HMMER3 software package. Performance benchmarks show that the use of the heuristic MSV filter sacrifices negligible sensitivity compared to unaccelerated profile HMM searches. HMMER3 is substantially more sensitive and 100- to 1000-fold faster than HMMER2. HMMER3 is now about as fast as BLAST for protein searches.
Searching sequence databases is one of the most important applications in computational molecular biology. The main workhorse in the field is the BLAST suite of programs. Since the introduction of BLAST in the 1990s, important theoretical advances in homology search methodology have been made using probabilistic inference methods and hidden Markov models (HMMs). However, previous software implementations of these newer probabilistic methods were slower than BLAST by about 100-fold. This hindered their utility, because computation speed is so critical with the rapidly increasing size of modern sequence databases. Here I describe the acceleration methods I implemented in a new, freely available profile HMM software package, HMMER3. HMMER3 makes profile HMM searches about as fast as BLAST, while retaining the power of using probabilistic inference technology.
Sequence database homology searching is one of the most important applications in computational molecular biology. Genome sequences are being acquired rapidly for an ever-widening array of species. To make maximal use of sequence data, we want to maximize the power of computational sequence comparison tools to detect remote homologies between these sequences, to learn clues to their functions and evolutionary histories. The most widely used tool for sequence comparison and database search is BLAST.
Since BLAST's introduction, some important advances have been made in the theory of sequence comparison, particularly by using probabilistic inference methods based on profile hidden Markov models (profile HMMs).
Nonetheless, despite the attractive advantages of HMMs, no implementation of fully probabilistic sequence comparison methods has yet approached the utility of BLAST. The most widely used implementations of profile HMM technology, including HMMER from my laboratory, have been slow and computationally expensive, on the order of 100- to 1000-fold slower than BLAST for a comparably sized search. In an era of enormous sequence databases, this speed disadvantage outweighs any advantage of HMM methods. Profile HMM methods have become important only in the niche of protein domain family analysis, where the speed differential is compensated by being able to use a single profile HMM to represent a family of hundreds of homologous individual sequences.
HMMER has been a target of many acceleration and optimization efforts.
Another reason for the limited impact of previous acceleration efforts is that they have almost exclusively focused on accelerating the optimal local alignment scoring algorithm (known as the Viterbi algorithm in the HMM literature) as opposed to the more desirable Forward algorithm. In part, this is because optimal local alignment algorithms are better known, and in part it is because previous versions of HMMER itself implemented Viterbi rather than Forward scoring. Forward implementations are about 3- to 9-fold slower than Viterbi implementations, and the expected statistical distribution of Forward scores for profile HMMs was not understood well enough to assign accurate E-values (expectation values). I recently described a satisfactory solution to the latter problem.
Here I describe the heuristic acceleration pipeline implemented by HMMER3, a reimplemented version of the HMMER software. In comparison to the previous version of HMMER, HMMER3 is about 100-fold faster because of the use of a new heuristic algorithm called the MSV filter, while also being significantly more powerful because it moves from optimal local Viterbi alignment to full Forward/Backward evaluation of alignment ensembles, exploiting more of the mathematical advantages of probabilistic modeling. Thus HMMER3 is now about as fast as BLAST, while extending the performance advantages of profile HMM methods.
The main algorithm that accelerates HMMER3 is called MSV, for Multiple (local, ungapped) Segment Viterbi. It was inspired by a technique used in ParAlign.
(Figure: A: Profile HMM architecture used by HMMER3.)
An MSV score is essentially analogous to BLAST's “sum score” of one or more ungapped HSPs (high scoring pairs). A difference is that MSV does not impose alignment consistency (two ungapped alignments are not required to be consistent with a single gapped alignment). In a filtering heuristic, this difference is not important. HMMER3 calculates the MSV score directly by dynamic programming, bypassing the word hit and hit extension heuristics of BLAST.
The fact that MSV essentially bypasses two of BLAST's main heuristics provides an intuitive argument why MSV scores are expected to be a more sensitive overall heuristic than BLAST's approach. However, I have not attempted to rigorously compare the performance of HMMER's MSV heuristic to other acceleration heuristics such as those in BLAST or FASTA.
The HMMER3 implementation takes advantage of several synergistic statistical and computational features of the MSV model. I summarize these features here before describing them in detail:
MSV alignment scores can be calculated efficiently using so-called “striped” vector-parallel techniques originally developed for Smith/Waterman local sequence alignment.
Because the MSV model gives predictable score distributions for nonhomologous sequences, with scores confined to a narrow range that is largely independent of query and target sequence characteristics, MSV values can be approximated with reduced precision (8 bits, in a score range of 0–255). This allows a 16-fold vector parallelism in current commodity processors with 128-bit vector registers.
The MSV model remains a full probabilistic local alignment model, so MSV scores obey conjectures about the expected Gumbel distribution of probabilistic local alignment scores.
Because we can calculate MSV P-values, we can use MSV scores as a tunable and selective sequence filter. If a target sequence has an MSV score with a P-value less than a chosen threshold, we pass the entire sequence to more accurate and computationally intensive scoring algorithms. By definition, the P-value threshold is the fraction of nonhomologous sequences expected to pass the filter.
The MSV filter is a heuristic acceleration, not guaranteed to find all high-scoring targets. Overall performance of the HMMER3 acceleration pipeline in terms of speed, specificity, and sensitivity depends on several issues and tradeoffs, including how fast the filters are, how accurately and quickly P-values can be estimated for filter scores, and whether a threshold on MSV P-values can be set to remove most nonhomologs while removing few if any true homologs that an unfiltered search would have detected. These are empirical questions, which I have addressed by benchmarking experiments.
The following sections, especially on vector parallelization and on assuring that scores can be kept in limited numeric ranges, are necessarily technical and terse. On a first reading, the reader may want to skip or skim ahead to the “HMMER3 acceleration pipeline” section to see how these technical aspects fit together into an overall scheme, and how that acceleration scheme performs.
The MSV score for a target sequence is computed with a reduced form of the profile, the MSV model. The MSV model keeps the match states and their position-specific emission probabilities, removes the insert and delete states entirely, and sets all match-to-match transition probabilities to 1.0, so that alignments to the core model are ungapped.

The null model is the same one used for full profile scores: a one-state model that emits residues with background frequencies, with a length distribution matched to the target sequence length.

The match emission scores are the same position-specific log-odds residue scores used by the full profile.

The only state transition parameters in the MSV model are those that control target sequence length modeling, the uniform local alignment fragment length distribution, and the number of hits to the core homology model per target sequence:

Target sequence length modeling: the flanking N and C states and the intersegment J state account for nonhomologous residues, with transition probabilities set as a function of the target sequence length L.

Uniform local alignment fragment length distribution: local alignment to any contiguous stretch of consecutive match states is allowed, with entry and exit probabilities set to approximate a uniform distribution over alignment fragments.

Multiple hits per target: the J state permits the model to loop back and align to the core model more than once per target sequence.
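As a concrete sketch of this parameterization (an assumption following HMMER3's standard multihit length model, not a restatement of the definitions above):

$$t_{NN} = t_{CC} = t_{JJ} = \frac{L}{L+3}, \qquad t_{NB} = t_{JB} = t_{CT} = \frac{3}{L+3}, \qquad t_{EJ} = t_{EC} = \frac{1}{2},$$

so that the N, C, and J states together emit $L$ residues in expectation, matching the observed target sequence length.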
The MSV alignment score can be calculated by a dynamic programming recursion in a two-dimensional matrix of $L$ rows (one per target residue $x_i$) and $M$ columns (one per profile match state), with additional cells per row for the special states E, J, B, N, and C. In log-odds score form, with $e(x_i, j)$ denoting the position-specific match score for residue $x_i$ at profile position $j$ and $t$ terms denoting log transition probabilities:

Initialization:
$$M_j(0) = -\infty \;\;\forall j; \qquad N(0) = 0; \qquad B(0) = t_{NB}; \qquad E(0) = J(0) = C(0) = -\infty.$$

Recursion, for rows $i = 1 \ldots L$:
$$\begin{aligned}
M_j(i) &= e(x_i, j) + \max\bigl\{\, M_{j-1}(i-1),\; B(i-1) + t_{BM} \,\bigr\}\\
E(i)   &= \max_{j} M_j(i)\\
J(i)   &= \max\bigl\{\, J(i-1) + t_{JJ},\; E(i) + t_{EJ} \,\bigr\}\\
N(i)   &= N(i-1) + t_{NN}\\
B(i)   &= \max\bigl\{\, N(i) + t_{NB},\; J(i) + t_{JB} \,\bigr\}\\
C(i)   &= \max\bigl\{\, C(i-1) + t_{CC},\; E(i) + t_{EC} \,\bigr\}
\end{aligned}$$

Termination:
$$\text{MSV score} = C(L) + t_{CT}.$$

Here $t_{BM}$ is the uniform local entry score. Log-odds ratio scoring relative to the null model is built into the calculation, in the match scores $e(x_i, j)$, rather than applied as a separate correction. Like other linear sequence alignment recursions, the algorithm requires $O(ML)$ time, and only $O(M)$ memory when just the score is needed.
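To make the recursion concrete, here is a minimal serial sketch in C (an illustration of the recursion above, not the HMMER3 implementation; parameter names mirror the transition scores of the MSV model):

```c
/* Minimal serial sketch of the MSV recursion. Scores are floats in bits;
 * esc[i][j] is the match score e(x_i, j) for 1-indexed residue i and
 * profile position j; NEG_INF stands in for -infinity. */
#include <float.h>
#include <stdlib.h>

#define NEG_INF (-FLT_MAX / 2)

static float max2(float a, float b) { return a > b ? a : b; }

float msv_score(float **esc, int L, int M,
                float tBM, float tNB, float tNN, float tEJ, float tJJ,
                float tJB, float tEC, float tCC, float tCT)
{
    float *prev = malloc((M + 1) * sizeof *prev);  /* row i-1 of M_j(i)    */
    float *cur  = malloc((M + 1) * sizeof *cur);   /* row i   of M_j(i)    */
    float N = 0.0f, B = tNB, E, J = NEG_INF, C = NEG_INF;

    for (int j = 0; j <= M; j++) prev[j] = NEG_INF;

    for (int i = 1; i <= L; i++) {
        E = NEG_INF;
        cur[0] = NEG_INF;
        for (int j = 1; j <= M; j++) {
            /* enter at B, or extend the ungapped diagonal */
            cur[j] = esc[i][j] + max2(prev[j - 1], B + tBM);
            E = max2(E, cur[j]);            /* E(i) = max_j M_j(i)        */
        }
        J = max2(J + tJJ, E + tEJ);         /* loop back for another hit  */
        C = max2(C + tCC, E + tEC);         /* after the last hit         */
        N = N + tNN;                        /* nonhomologous flank        */
        B = max2(N + tNB, J + tJB);         /* (re)entry used at row i+1  */
        float *tmp = prev; prev = cur; cur = tmp;
    }
    free(prev); free(cur);
    return C + tCT;                         /* MSV score in bits          */
}
```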
The MSV algorithm is highly amenable to vector parallelization using commodity SIMD (single instruction, multiple data) instructions, such as the Streaming SIMD Extensions (SSE) instructions on Intel-compatible systems and Altivec/VMX instructions on PowerPC systems. These vector instruction sets use 128-bit vectors to compute up to 16 simultaneous operations.
Several vector methods have been described for accelerating classical Smith/Waterman local sequence alignment.
Striping addresses a challenge in the data dependency pattern in Smith/Waterman-style dynamic programming recursions. The calculation of each cell depends on the value of the cell for the preceding model position, so if cells were packed into vectors in their natural order, each vector element would depend on a neighboring element of the same vector. In the striped pattern, each vector instead holds cells for model positions spaced Q apart (where Q is the number of vectors per row), so the preceding-position dependency aligns one whole vector against another, and only a single right-shift of one vector is needed per row.
(Figure: The top row (magenta outline) shows one row of the dynamic programming lattice for a model of length M, with cells arranged in striped vector order.)
To maximize parallelism, I implemented MSV as a 16-fold parallel calculation with score values stored as 8-bit unsigned integers restricted to the range 0‥255. This takes advantage of the fact that local alignment scores under HMMER3's probabilistic model have a narrow and predictable dynamic range, enabling a numerical stability analysis that justifies using reduced precision. (The details of this analysis are given in the next section.) This rescaling is specified by three values (base, bias, and scale), where “base” is an initial offset from zero to make MSV scores nonnegative (default: 190), “scale” is the scaling factor (default 3, so MSV scores are in units of one-third bits), and “bias” is an offset on individual residue scores, used to make all individual residue scores unsigned byte costs relative to the maximum residue score. Using the scale and bias terms, position-specific residue scores are precomputed as unsigned bytes, so the entire MSV recursion can be carried out in saturated 8-bit arithmetic.
To define MSV's SIMD recursion, I will use six pseudocode vector instructions for operations on vectors of sixteen unsigned byte elements:
Operation | Pseudocode | Definition
saturated addition | v ← a ⊕ b | each element v_z = a_z + b_z, saturating at the maximum value 255
saturated subtraction | v ← a ⊖ b | each element v_z = a_z − b_z, saturating at the minimum value 0
max | v ← max(a, b) | each element v_z = max(a_z, b_z)
assignment | v ← x | every element v_z is set to the scalar value x
right shift | v ← a ≫ 1 | elements move right one position, v_z = a_{z−1}; the vacated element v_0 is set to −∞ (byte value 0)
horizontal max | x ← hmax(a) | the scalar x is set to the maximum over all sixteen elements a_z
In this pseudocode, the vectorized MSV algorithm is the following. The row of M match cells is held in Q = ⌈M/16⌉ striped vectors, and the E, N, J, B, and C cells are kept as scalars, broadcast into vectors where needed.

Initialization: every match cell is set to −∞ (byte value 0), and the scalar special cells are initialized as in the serial recursion.

Recursion: for each row i, the last striped vector of the previous row is shifted right one element (inserting −∞) to align the M_{j−1}(i−1) dependency; then, for each q, the current vector is computed by saturated addition of the striped match scores to the element-wise max of the shifted previous-row vector and the broadcast B(i−1) value, while a running vector max accumulates candidates for E(i). At the end of the row, a horizontal max resolves E(i), and the scalar J, B, N, and C cells are updated as in the serial recursion.

Termination: the MSV score in bits is recovered from the final C(L) byte value by subtracting the base offset, dividing by the scale, and adding the constant correction discussed below.
The constant term of −4.3 bits in the termination step arises from an approximation that deals with roundoff error in counting the NN, CC, and JJ self-transitions. Each such transition contributes $\log_2 \frac{L}{L+3}$ bits, which rounds to zero in the reduced-precision representation; the total contribution over a target sequence of length $L$, $L \log_2 \frac{L}{L+3} \approx -3/\ln 2 \approx -4.3$ bits, is nearly independent of $L$, so it is restored as a single constant at termination.
This algorithm is implemented both for SSE and Altivec/VMX instructions in the HMMER3 source code in impl_{sse,vmx}/msvfilter.c::p7_MSVFilter().
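As a concrete illustration of the striped inner loop, here is a minimal sketch of one MSV row in SSE2 intrinsics. This is a simplified illustration of the approach, not the HMMER3 code itself: the bias correction on the match scores and the scalar N/J/B/C updates at the end of the row are omitted, and all names are illustrative.

```c
/* One striped MSV row in SSE2 intrinsics (simplified sketch). dp[0..Q-1]
 * holds the previous row's striped match cells on entry and the current
 * row's on exit; rsc[0..Q-1] holds striped 8-bit match scores for residue
 * x_i; xB is the byte value of B(i-1). Returns the byte value of E(i). */
#include <emmintrin.h>
#include <stdint.h>

static uint8_t
msv_row(__m128i *dp, const __m128i *rsc, int Q, uint8_t xB)
{
    __m128i xEv = _mm_setzero_si128();          /* running max for E(i)       */
    __m128i xBv = _mm_set1_epi8((char) xB);     /* broadcast B(i-1)           */
    /* "Right shift" the previous row's last vector by one element so that it
     * supplies M_{j-1}(i-1) to the first striped vector; the vacated element
     * becomes 0, which represents -infinity in the byte score system.       */
    __m128i mpv = _mm_slli_si128(dp[Q - 1], 1);

    for (int q = 0; q < Q; q++) {
        __m128i sv = _mm_max_epu8(mpv, xBv);    /* max(M_{j-1}(i-1), B(i-1))  */
        sv  = _mm_adds_epu8(sv, rsc[q]);        /* + e(x_i, j), saturated     */
        xEv = _mm_max_epu8(xEv, sv);            /* accumulate E(i) candidates */
        mpv = dp[q];                            /* previous row, next vector  */
        dp[q] = sv;                             /* store current row          */
    }
    /* horizontal max: fold the 16 bytes of xEv down to element 0 */
    xEv = _mm_max_epu8(xEv, _mm_srli_si128(xEv, 8));
    xEv = _mm_max_epu8(xEv, _mm_srli_si128(xEv, 4));
    xEv = _mm_max_epu8(xEv, _mm_srli_si128(xEv, 2));
    xEv = _mm_max_epu8(xEv, _mm_srli_si128(xEv, 1));
    return (uint8_t) (_mm_extract_epi16(xEv, 0) & 0xff);
}
```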
This section is particularly technical, and may be skipped in a first reading. In reducing the dynamic range of score calculations to small unsigned integers, we must make sure that underflow or overflow either do not occur, or have no erroneous consequences. We must also be sure that the magnitude of any accumulated roundoff error is tolerable. Because the HMMER3 acceleration pipeline (described below) uses vector-parallel, striped, reduced precision implementations of both the MSV algorithm (described above) and the standard Viterbi (optimal alignment) algorithm for the original profile model with insertions/deletions, the following analysis considers both MSV and Viterbi scores.
For underflow, we use the fact that there is a lower bound on optimal local alignment scores as a function of model length M and target sequence length L, so the representable range only needs to extend below that bound; saturation at the minimum value then has no erroneous consequences.
For managing overflow, we use the fact that we will only use a reduced-precision implementation as a filter on target sequences. Any score high enough to overflow the representable range corresponds to a P-value far below any reasonable filter threshold, so overflowed scores can simply be treated as passing the filter, with no erroneous consequences.
This range of −60 to 17 bits applies to final scores of complete target sequences; intermediate values in the dynamic programming calculation must also remain representable, which widens the required range slightly (to −61 … 21 bits, as used below).
These same bounds apply to both the original local alignment model (Viterbi alignments with insertions and deletions) and the MSV model, because no step in ascertaining these bounds required any consideration of the transition probabilities in the core model (match, insertion, and delete states).
Thus we need a dynamic range of 82 bits (−61 … 21 bits), and the maximum range of an 8-bit integer is 256 values, so scaling log-odds scores to units of 1/3 bits suffices. (Coincidentally, this is comparable to the scaling and roundoff of standard scoring matrices used by BLAST or FASTA; BLOSUM45, for example, is in units of 1/3 bits.) A “base” offset term is then used to adjust the represented value range to the range of bit scores. For unsigned 8-bit integers, a base of +190 means that values 0‥255 represent the range of −63.3 to +21.7 bits.
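As a small worked illustration of this representation (a sketch under the stated defaults of base = 190 and scale = 3; the helper names are hypothetical):

```c
/* Hypothetical helpers for the scaled 8-bit score representation described
 * above: scores in units of 1/3 bit, offset by base = 190, so byte values
 * 0..255 span -63.3 to +21.7 bits. Saturation implements -infinity (0)
 * and the "overflow means pass" convention (255). */
#include <math.h>
#include <stdint.h>

#define SCALE_B 3.0f   /* units per bit */
#define BASE_B  190    /* offset */

static uint8_t bits_to_byte(float bits)
{
    float v = roundf(bits * SCALE_B) + BASE_B;
    if (v < 0.0f)   v = 0.0f;     /* saturate low:  treated as -infinity */
    if (v > 255.0f) v = 255.0f;   /* saturate high: guaranteed to pass   */
    return (uint8_t) v;
}

static float byte_to_bits(uint8_t b)
{
    return ((float) b - BASE_B) / SCALE_B;
}
```

For example, byte_to_bits(0) = −63.3 and byte_to_bits(255) ≈ +21.7, bracketing the −61 … 21 bit range derived above.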
Rounding scores to the nearest 1/3 bit introduces a roundoff error of up to ±1/6 bit in each individual parameter. The concern is how these individual errors accumulate over the many terms summed along an alignment path.
The MSV model and its implementation use several features to reduce roundoff error to tolerable limits. First, by eliminating match, delete, and insert transitions and setting all match-match transition probabilities to 1.0 (thus zero cost), the MSV model itself has already eliminated many of the transitions that accumulate non-independent roundoff error, leaving in the core model only the match emission scores as rounded terms.
Thus the roundoff error in a reduced-precision MSV algorithm implementation consists of a systematic part, arising from treating the NN, CC, and JJ contributions (totalling about −4.3 bits) as a constant, plus the accumulated ±1/6 bit rounding errors of the individual match scores along the alignment path; the latter errors are of either sign and largely cancel.
To confirm that this expected roundoff error agrees with empirical observation, I performed simulations in which I examined the differences in MSV scores between the reduced-precision implementation (unsigned bytes, in units of 1/3 bits) and a full-precision floating-point implementation.
Previously, I conjectured that the scores of full probabilistic local alignment models follow predictable distributions: optimal alignment (Viterbi) scores follow a Gumbel distribution with a constant, known slope parameter λ = log 2, and the high-scoring tail of Forward scores is exponentially distributed with the same constant λ.

These statistical conjectures are best obeyed by models or scoring systems with high relative entropy per position (i.e., high mean expected score).
(Figure: A: example MSV score distributions for a typical Pfam model, CNP1, on random sequences.)
This analysis shows that in general, reasonably accurate P-values for MSV scores can be obtained. It also shows that the approximation is least accurate on short target sequences, where deviations from the conjectured distribution are largest.
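To make the use of these statistics concrete, here is a sketch of converting an MSV bit score into a P-value under a Gumbel distribution with constant λ = log 2 (the fixed λ and the simulation-estimated location parameter μ are assumptions carried over from the conjectures above; names are illustrative):

```c
/* P-value of an MSV bit score under a Gumbel distribution with fixed
 * lambda = log 2 and location parameter mu (estimated separately).
 * Survival function: P(S > s) = 1 - exp(-exp(-lambda * (s - mu))),
 * which is ~ exp(-lambda * (s - mu)) in the high-scoring tail. */
#include <math.h>

static double msv_pvalue(double score_bits, double mu)
{
    const double lambda = 0.693147180559945;  /* log 2 */
    double y = lambda * (score_bits - mu);
    return 1.0 - exp(-exp(-y));
}
```

A comparison passes the filter when msv_pvalue(score, mu) falls below the chosen threshold (0.02 by default, as described below).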
The MSV implementation described above is about 500-fold faster than a standard serial implementation of the full Forward algorithm. This means that a search will still be rate-limited by the speed of the computationally intensive Forward/Backward calculations. Suppose we allow the top 2% (1/50) of sequences through the MSV filter to full HMM Forward log-likelihood scoring; then Forward must be no more than 50-fold slower than MSV, or Forward will be rate-limiting. It was therefore necessary to seek significant accelerations of at least an order of magnitude in the implementations of the Forward and Backward algorithms.
Numerical underflow is a problem for implementing the Forward and Backward algorithms. The probability of a partial alignment path generally underflows the smallest representable floating-point value. In a Viterbi implementation, underflow is avoided by working in the log probability domain, replacing multiplication and maximization of probabilities with addition and maximization of their logarithms.
In both sequence analysis and speech recognition HMM applications, this problem is customarily solved by working in the log probability domain and implementing a “log-sum” operation, such that the addition of two probabilities is computed from their log values as $\log(p + q) = \log p + \log(1 + e^{\log q - \log p})$, with the second term typically obtained from a precomputed lookup table. This lookup-table operation is substantially slower than simple addition and multiplication.
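For concreteness, a sketch of a table-driven log-sum in the spirit described above (the table size and resolution here are illustrative assumptions, not HMMER3's actual constants):

```c
/* Table-driven log-sum: computes log2(2^a + 2^b) from log2 values a and b.
 * Uses log2(2^a + 2^b) = hi + log2(1 + 2^(lo - hi)), with the correction
 * term precomputed on a grid of (hi - lo) differences. */
#include <math.h>

#define LOGSUM_TBL   16000      /* table entries                 */
#define LOGSUM_SCALE 1000.0f    /* entries per bit of difference */

static float logsum_tbl[LOGSUM_TBL];

static void logsum_init(void)
{
    for (int i = 0; i < LOGSUM_TBL; i++)
        logsum_tbl[i] = log2f(1.0f + exp2f(-(float) i / LOGSUM_SCALE));
}

static float logsum(float a, float b)
{
    float hi = (a > b) ? a : b;
    float lo = (a > b) ? b : a;
    int   d  = (int) ((hi - lo) * LOGSUM_SCALE);
    return (d >= LOGSUM_TBL) ? hi : hi + logsum_tbl[d];
}
```

The data-dependent table lookup at the heart of this operation is what makes it expensive relative to the plain additions and multiplications used by the sparse rescaling approach below.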
Another approach is rescaling: as each row of the dynamic programming matrix is computed, its values are divided by a scaling constant chosen to keep them within floating-point range, and the logarithms of the accumulated scale factors are added back to recover the final log-likelihood score.
I use the following steps to make an approach that I call “sparse rescaling” work for HMMER3's SIMD vector implementations of the Forward/Backward algorithms.
First, Forward/Backward values are calculated in an odds-ratio domain rather than the probability domain, so they are naturally pre-scaled to some extent. Each match emission probability is divided by the background residue frequency, so the quantities being summed are odds ratios near 1 for nonhomologous sequence, rather than vanishingly small probabilities.
Second, I exploit the fact that the HMMER3 profile HMM is a multihit local alignment model, not a glocal alignment model (glocal means global with respect to the model, local with respect to the target). When the profile HMM is a multihit local alignment model, a rescaling approach works. For any path with two aligned regions connected by a long deletion, there must exist an alternative path that counts the same two aligned regions as two local alignments, connected by a reinitiation path (M → E → J → B → M) of equal or higher probability. Because such reinitiation paths are always available, the summed odds ratios on a row cannot decay without bound, and per-row rescaling suffices to keep all values representable.
Determining the appropriate scale factor for a row could itself add significant computation; HMMER3 avoids scanning the row by using the odds-ratio value already accumulated in the row's E state cell, which reflects the overall magnitude of the row's values.
Still, rescaling a row also requires extra computation. Here I exploit the fact that rescaling every row is unnecessary. Instead, when the E cell odds-ratio value exceeds a certain threshold, this triggers a rescaling event for that row. Other rows have an implicit scale factor of 1.0 and are left untouched; hence the rescaling is “sparse”.
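A minimal sketch of the sparse rescaling idea (the trigger threshold, names, and structure are illustrative, not the HMMER3 implementation):

```c
/* Sparse rescaling of one Forward row held as odds ratios in floats.
 * If the row's E-state value has grown past a trigger threshold, divide
 * every value in the row by it and remember log2 of the scale factor;
 * otherwise the row's implicit scale factor is 1.0 and nothing is done. */
#include <math.h>

static double rescale_row(float *row, int n, float *xE, double logscale)
{
    const float trigger = 1e4f;            /* illustrative trigger value */

    if (*xE > trigger) {
        float scale = *xE;
        for (int q = 0; q < n; q++)
            row[q] /= scale;               /* rescale the main cells     */
        *xE = 1.0f;
        logscale += log2((double) scale);  /* restored at termination    */
    }
    return logscale;
}
```

At termination, the final bit score is recovered by adding the accumulated logscale back to the log of the remaining (fully representable) odds ratio.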
I implemented Forward and Backward using sparse rescaling and striped SIMD vectors of four parallel 32-bit floats. Overall these implementations are about 16-fold faster than standard serial implementations using the log-sum lookup table operation. The overall 16-fold acceleration is likely a combination of about a 4-fold speedup from the SIMD vector parallelization with about 4-fold from replacing the log-sum operation with addition and multiplication. This makes the Forward/Backward algorithms only about 30-fold slower than the MSV filter.
The MSV and Forward/Backward methods described above are implemented in the so-called “acceleration pipeline” at the core of the HMMER3 software implementation.
(Figure: the HMMER3 acceleration pipeline. Representative calculation speeds are shown in red, in units of millions of dynamic programming cells per second (Mc/s). Default P-value thresholds for the MSV, Viterbi, and Forward filtering steps are shown in orange. The bias filter and the domain definition steps are not described in detail in this manuscript, and are shown in gray.)
The pipeline either accepts or rejects the entire comparison at each step, based on the P-value of the log-odds score. For example, by default the MSV filter passes if a comparison gets a P-value of less than 0.02 (i.e., the top-scoring 2% of random nonhomologous sequences are expected to pass the filter). I have not yet explored the more sophisticated approach of using alignment information from earlier and faster steps in the pipeline to constrain (band) subsequent dynamic programming calculations.
All P-value calculations assume that the query profile and target sequence have residue compositions close to the overall average for proteins. In some cases, a query profile has a biased composition, and this bias matches a bias found in many target database sequences. Membrane proteins, for example, are skewed towards a more hydrophobic composition, and tend to match other nonhomologous membrane proteins with scores higher than expected under a simple average-composition null hypothesis. HMMER3 has methodology for recalculating scores and P-values to compensate for biased composition, but this methodology (the so-called “null2” correction, not described here for reasons of space) is placed late in the pipeline because it is computationally intensive. At the MSV filter step, the uncorrected MSV P-value may be underestimated in biased-composition matches, which means more than the expected fraction of nonhomologous sequences may pass the MSV filter, which in some cases can be sufficient to slow the pipeline. HMMER3 inserts a “bias filter” step to reduce this problem. The bias filter step is shown in gray in the pipeline figure.
To further reduce the computational load that arrives at the full Forward step, an additional filter, the Viterbi filter, was implemented and inserted in the pipeline. The Viterbi filter is a striped SIMD vector implementation of optimal gapped alignment to the full profile, implemented in reduced precision (16-bit scores, at eight-fold vector parallelism).
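Putting the stages together, a schematic of the pipeline's accept/reject control flow (function names are hypothetical; the 0.02 MSV threshold is the stated default, while the Viterbi and Forward thresholds shown are illustrative):

```c
/* Schematic control flow of the acceleration pipeline for one
 * profile/sequence comparison. Each filter returns a P-value; the
 * comparison is rejected as soon as any stage fails its threshold. */
typedef struct cmp_s cmp_t;              /* one profile/sequence comparison */

extern double msv_filter(cmp_t *c);      /* 8-bit striped MSV               */
extern int    bias_filter(cmp_t *c);     /* biased-composition check        */
extern double viterbi_filter(cmp_t *c);  /* 16-bit striped gapped Viterbi   */
extern double forward_score(cmp_t *c);   /* full Forward score (parser)     */
extern int    postprocess(cmp_t *c);     /* Backward, domains, null2        */

int pipeline(cmp_t *c)
{
    if (msv_filter(c)     > 0.02) return 0;  /* ~2% of nonhomologs pass */
    if (!bias_filter(c))          return 0;
    if (viterbi_filter(c) > 1e-3) return 0;  /* threshold illustrative  */
    if (forward_score(c)  > 1e-5) return 0;  /* threshold illustrative  */
    return postprocess(c);
}
```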
The Forward and Backward algorithms for the pipeline are implemented in specialized memory-efficient forms called p7_ForwardParser() and p7_BackwardParser(). Each parser stores only a single row of the main dynamic programming matrix, retaining per-row values only for the special states, so memory requirements scale with the sum of the query and target lengths rather than their product.
The acceleration pipeline is memory efficient. The MSV and Viterbi filters are only concerned with scores, not alignments, so they are implemented in linear-memory form, requiring O(M) memory for a query of length M.
(Figure: Each point represents a speed measurement for one search with one query against the target sequence database.)
To measure the “typical” performance of each program, without complicating variation arising from producing the voluminous alignment output for some queries that hit large protein superfamilies, panel A (left) shows benchmarks done on random (shuffled) target sequences. The panel shows results for 76 query profiles (or representative single sequences), chosen to sample the full range of query lengths in the Pfam protein domain database from 7 to 2,217 residues. These results show that HMMER3 performance is comparable to other fast database search programs; somewhat slower (by about 2- to 3-fold) than NCBI BLAST, and somewhat faster (by about 3-fold) than WU-BLAST, for example. The speed of SSEARCH, the Smith/Waterman local alignment implementation in the FASTA package, is worth noting in this figure; SSEARCH has recently been accelerated by implementing Farrar's striped SIMD vector methods, allowing it to achieve speed comparable to the heuristic FASTA and WU-BLAST programs. HMMER3 is faster than HMMER2 by up to 140×, even though HMMER3 calculates full Forward scores whereas HMMER2 calculated faster Viterbi optimal alignment scores. Compared to SAM, which does calculate full Forward scores, HMMER3 is about 600× faster.
To measure a wider and more realistic range of real-world performance, panel B (right) shows benchmarks for 11,912 different queries (every Pfam 24 profile) on real sequences from UniProt TrEMBL. Programs that simply do a dynamic programming alignment to each target sequence, such as HMMER2 (orange points) or SSEARCH (not shown), show performance essentially independent of the properties of the query and target sequences. Programs that use heuristics and filters, however, are sensitive to how well a given search obeys the assumptions of the heuristic and/or filter thresholds. Both HMMER3 and PSI-BLAST speeds vary not only with query length but also substantially around their average for a given query length. PSI-BLAST speed in panel B varies both up (by up to about 3-fold) and down (by up to about 10-fold) from its “typical” performance in panel A, presumably reflecting variation in how many word hits and hit extensions need to be processed for a given search. HMMER3 speed tends to vary only downwards from its typical performance, by up to about 20-fold. In panel B, I highlight four of the poorest-performing HMMER3 searches, on the DivIC, DUF972, IncA, and MFS_1 models. Even with the bias filter step included in the acceleration pipeline, the dominant cause of poorer HMMER3 search performance remains biased-composition sequences (such as transmembrane proteins): more comparisons pass the fast filters than predicted by P-value calculations that assume average target sequence compositions, so more comparisons reach the compute-intensive Forward/Backward calculations.
A filtering approach will generally compromise search sensitivity to some degree, because a filter will erroneously remove true homologs at some rate. We want this rate to be negligible. To measure how much search sensitivity is attenuated by the use of the MSV filter and the HMMER3 acceleration pipeline, I performed benchmarks to compare the sensitivity/specificity of default HMMER3 hmmsearch (with the acceleration filter pipeline) to hmmsearch --max, an option that turns off all the filters and runs the full Forward scoring algorithm on every target sequence. I also benchmarked HMMER2 and several other homology search programs for comparison. These results are shown in the sensitivity/specificity benchmark figure.
For different programs, searches are performed either by constructing a single profile from the query alignment (HMMER3, HMMER2, SAM, PSI-BLAST), or by using “family pairwise search” with the individual sequences in the query alignment (BLAST, FASTA, SSEARCH).
These benchmarks are automatically and semi-randomly generated by a program (create-profmark). The program starts from a source of trusted alignments of homologous protein domains (here, Pfam 24 seed alignments), a source of typical full-length protein sequences (here, UniProt SwissProt 2011_03), and a choice of method for synthesizing nonhomologous sequence segments (such as shuffling a randomly chosen segment of a UniProt sequence). A query alignment and a set of true test domains (trusted to be homologous to the query) are created by applying single-linkage clustering by percent identity to a Pfam alignment, and using that clustering to select sequences such that no true test domain has more than 25% pairwise identity to any sequence in the query alignment, and no more than 50% pairwise identity to any other test domain. True test sequences are created by concatenating one or two test domains together with nonhomologous sequence segments, with a total sequence length sampled from the distribution of UniProt sequences. False (nonhomologous) test sequences are created by concatenating nonhomologous sequence segments, with both segment length and total sequence length sampled from the length distributions of the true test sequences. The procedure, its rationale, and some of its caveats are described in more detail in Methods.
To benchmark a profile method (HMMER, SAM, or PSI-BLAST), a profile is built from each query alignment and searched against the target database of test sequences and decoys. The results of all searches (for all different queries) are merged and sorted by E-value, and this ranked list is used to calculate a plot of fraction of true positives detected at increasing thresholds of false positives per query from 0.001 to 10.
To benchmark a single-sequence query method (BLAST, SSEARCH, FASTA) more fairly against profile methods, a family-pairwise-search (FPS) method is used: each individual sequence in the query alignment is searched against the target database, and each target sequence is assigned the best E-value it achieves in any of those searches.
The main result in both panels is that sensitivity and specificity are essentially identical for HMMER3 both with the acceleration pipeline (dark black lines) and without it (--max; red lines). There is a slight loss in sensitivity caused by the acceleration pipeline, but this loss is more than compensated by the gain in sensitivity of HMMER3 over HMMER2 (either in its default “glocal” alignment search mode or its local alignment search mode; orange circles and squares, respectively). At high specificity (low false positive rates) on more realistic biased decoys, default HMMER3 can appear to be better than unfiltered HMMER3 (hmmsearch --max), because the bias filter removes some problematic biased false-positive decoys that the supposedly more powerful biased-composition corrections applied to HMMER3 Forward scores fail to correct.
Benchmarks for other programs (such as BLAST, SAM, FASTA, and SSEARCH) are shown only for the sake of rough comparison. The intent is not to thoroughly benchmark HMMER against these programs, but to provide an additional sense of scale, putting the difference between HMMER3 with and without its acceleration pipeline in context – that is, showing that this difference is minor compared to the differences among programs. These programs were only run in their default configurations. I did not explore available options that might improve their performance on this benchmark. Although I believe the results to be fair and representative, these results must be interpreted with caution. I benchmark HMMER routinely on these benchmarks during development, and it is impossible to avoid some degree of training on the benchmark, even though the benchmarks are somewhat randomized. Nonetheless, some informative trends in these results agree with previous independent published benchmarking from other groups, so the comparisons are probably useful as a rough guide. For example, Madera and Gough published a benchmark in which SAM's search performance was superior to HMMER2's, consistent with the relative performance seen here.
These benchmarks also bear on the widespread belief that full Smith/Waterman alignment is superior to BLAST's fast heuristic approximation of it. This holds in the easier benchmark (Panel A) but not when decoy sequences include biased-composition segments (Panel B). Indeed, in Panel A all three fast sequence search programs (WU-BLAST, NCBI-BLAST, and FASTA) perform comparably (and worse than SSEARCH), whereas in Panel B, NCBI BLAST outperforms WU-BLAST, FASTA, and SSEARCH. This again likely reflects the importance of biased-composition score corrections, which have received close attention in NCBI BLAST software development.
My main conclusion (that the acceleration pipeline has a negligible impact on the sensitivity/specificity of HMMER3 compared to unaccelerated Forward scoring) is supported by a more direct experiment. I searched all 11,912 Pfam 24 profile HMMs against the 516,081 sequences in UniProt SwissProt 2011_03 using five different option settings to hmmsearch, starting with --max and then successively turning on one filter step at a time in the acceleration pipeline (MSV, bias, Viterbi, and Forward), up to the default configuration with all four filter steps on. With --max, a total of 799,893 hits were found with an E-value of 0.0001 or less. Turning on the MSV filter loses 718 hits (0.09%). Overall, the default pipeline with all filter steps loses 2,701 hits (0.3%). Differences in significant hits are not necessarily all due to true homologs. It is possible for the unfiltered search to find a false positive that one or more of the filters would remove. However, the majority of these differences appear to be true homologs that are removed by the filters. Other than the 718 hits removed by the MSV filter, the great majority of the other losses are due to the bias filter inappropriately removing sequences that have strong biased compositions, but also contain a true homology region. The most obvious problems with HMMER3 sensitivity/specificity seem to lie in its bias filter and bias composition score corrections, rather than in the use of the MSV filter as its primary acceleration.
In describing the MSV heuristic and other acceleration methods implemented in HMMER3, I have not addressed the question of whether the MSV heuristic is better or worse than other heuristics, such as those in BLAST or FASTA. In sensitivity/specificity benchmarks like the ones shown here, each program is measured as a whole, so the contribution of a heuristic filter cannot be separated from the many other differences among programs.
Although this paper is about the acceleration methods used in HMMER3, HMMER3 also appears to be more sensitive than HMMER2. The main reason for this is the adoption of “entropy-weighting”, a method introduced by the UC Santa Cruz group in the SAM profile HMM package.
On the other hand, I believe that the switch from default glocal alignment in HMMER2 to local alignment in HMMER3 has probably compromised some search sensitivity (“glocal” means global in the query, local in the target sequence: requiring a full-length domain alignment). Restoring glocal alignment to HMMER3 should improve search performance for profiles that are expected to match over their entire length, such as Pfam protein domain models. However, the fast E-value statistics for Forward and Viterbi scores (including MSV filter scores) are only valid for local alignment, and the numeric underflow analysis of the sparse rescaling technique in the Forward/Backward implementation assumes local alignment. Both problems will need to be addressed before glocal alignment is implemented usefully in HMMER3.
Here I have only described single-core performance. I have not discussed parallelization across multiple cores. HMMER3 search programs include rudimentary implementations of POSIX threads and MPI parallelization (message-passing in a cluster of computing nodes). These implementations currently scale poorly, to only modest numbers of processor cores (2–4 for multithreading, for example). Improved parallelization is a priority for future development.
HMMER3's handling of biased composition sequences is problematic. I chose to introduce an ad hoc bias filter step to keep typical search speeds high, but as described above, the bias filter accounts for most of the significant hits lost by the default pipeline relative to unfiltered searches. Developing better biased-composition score corrections, and reducing reliance on the bias filter, are priorities for future development.
This paper describes an initial baseline for HMMER3 speed performance on a single processor core. The prospects for substantial future improvements are good. There are many obvious opportunities for incremental optimizations. Bjarne Knudsen (CLCbio, Aarhus, Denmark) has already contributed an important optimization of the MSV filter that increases overall HMMER3 speed by about two-fold. The Knudsen optimization will appear in the next HMMER code release, and we will likely describe it in a future manuscript. Another optimization opportunity is to preprocess the target database sequence file into an efficient binary format, as BLAST does with its BLAST databases. HMMER3 still reads standard flatfile sequence databases, such as FASTA and UniProt text formats. Another optimization opportunity is to convert the filters in the pipeline from their current mode of filtering entire target sequences (which was easy to implement) to instead store and retrieve more information about the location of alignment probability mass, so subsequent steps (including Forward/Backward) can be done as banded dynamic programming calculations within high-probability envelopes, as opposed to reprocessing the entire query/target comparison at each pipeline step. Because of these and other straightforward optimization opportunities, I expect HMMER3 speed will surpass NCBI BLAST speed in the relatively near future.
This speed makes it feasible to apply profile HMM technology to standard sequence database searches with single sequence queries, including iterative database searches. A position-independent scoring system for single sequences is just a special case of a profile position-specific scoring system. A “profile” HMM can be built from a single sequence, using position-independent probabilities obtained from standard scoring matrices such as BLOSUM62, plus a couple of parameters for gap-open and gap-extend probabilities. The HMMER3 software package includes a program, phmmer, for searching a single query sequence against a sequence database in this way, and a program, jackhmmer, for iterative searches.
HMMER3 is implemented in POSIX ANSI/ISO C99. Vector implementations are provided for Intel-compatible (SSE) and PowerPC (Altivec/VMX) processors. A “dummy” non-vectorized implementation is provided for other processors, sufficient to enable compilation but about 100× slower than normal HMMER3 vectorized performance. HMMER3 is portable to any POSIX-compliant operating system, including Linux and Mac OS/X, and also to Windows if an optional POSIX compatibility package has been installed (such as Cygwin). It builds from source using a standard GNU configure script and UNIX make. It includes a suite of automated tests written in Perl, C, and Bourne shell. User documentation is provided as PDF and man pages. Source code is freely available under a GNU GPLv3 license. Precompiled binary distributions are available for several platforms including Intel/Linux, Intel/MacOSX, and Intel/Windows. A web interface to HMMER3-based database searches is available, including batch searches and RESTful web services, hosted on computational resources supported by the Howard Hughes Medical Institute. All these resources are available from the HMMER web site.
Software versions used include SAM 3.5 (Jul 2005).
Example sequence alignments and profile HMMs were sampled from Seed alignments and profiles in Pfam 24.
All program timings were measured in total (wall clock) time on a single execution thread (single core) of a dedicated and unloaded cluster node, where a node has eight 2.66 GHz Intel Gainestown X5550 cores and 24 GB RAM. The same search was run twice sequentially and timed in the second run, to allow filesystem caching of target databases.
For speed benchmarks of programs that take a single sequence as a query (instead of an alignment), the median length sequence was extracted from the query alignment.
Sensitivity/specificity benchmarks were created with the create-profmark program, included in the HMMER3 source code. This program allows construction of the wide range of different and randomized benchmarks used during HMMER development. My concern is that because benchmarking is repeated at every step of code development, it is nearly impossible for a developer to avoid overtraining on any in-house benchmark. Synthesizing a variety of partially randomized benchmarks helps mitigate this effect somewhat, compared to relying on a single static benchmark.
The profmark benchmarks use a set of trusted multiple alignments (such as Pfam Seed alignments) as a source of both query alignments and true-positive test domains.
Using a sequence database like Pfam instead of a 3D structure database like SCOP or CATH as a source of trusted true homology relationships has the advantage that a more challenging variety of sequences is tested. Structure databases are biased toward well-ordered globular domains. A weakness of a sequence-based benchmark is that “true homologs” are inferred by current computational sequence comparison methods, rather than being defined by an independent criterion like 3D structure comparison. In particular, my profmark benchmarks are constructed from Pfam alignments as a trusted definition of true homologs, and Pfam is itself constructed with HMMER. There may be some danger that this circularity creates a bias against other search programs. Specifically the danger is that if there are remote homologs that are undetected by profile HMM methods, but that could be detected by another method, any such sequences have been selected against in Pfam. Intuitively, I think this danger is negligible. If a search program was sufficiently powerful that it could detect homologs excluded from Pfam, it ought to be even better at detecting the closer homologs that were included and artificially separated (by profmark's clustering procedure) into challengingly dissimilar query alignment and true test domains. Empirically too, any danger has seemed negligible, because profmark benchmarks tend to be broadly concordant with other published benchmarks.
False (decoy) sequences (including the nonhomologous flanking sequence around embedded test domains in true test sequences) are created synthetically. If we are trying to find methods that detect previously undetectable homologies, no source of real biological sequences will ever be reliably known to be nonhomologous to the benchmark, and we certainly do not want to penalize a powerful method that identifies new true relationships that are currently annotated as nonhomologous “false positives”.
In detail, the profmark creation procedure is the following, starting from a source of multiple alignments (usually Pfam seeds) and a source of typical single target sequences (usually UniProt/SwissProt):
Convert all degenerate residue characters to X. (Although HMMER reads all standard degeneracy codes for protein and nucleic acid sequences, some search programs do not.)
Remove sequence fragments. By default, any sequence of length less than 70% of the mean unaligned sequence length in the alignment is defined as a fragment.
Cluster the sequences by single-linkage clustering at a default threshold of 25% pairwise identity, and partition the clusters into a query set and a test set, so that no candidate true test domain has more than 25% pairwise identity to any sequence in the query alignment.
Cluster the remaining candidate test sequences by single-linkage clustering at a default threshold of 50% pairwise identity, and choose at most one representative per cluster as a true test domain, so that no true test domain has more than 50% pairwise identity to any other test domain.
Create synthetic true test sequences by embedding one or two true test domains in a larger nonhomologous sequence. The total length of the test sequence is sampled from the length distribution of the input sequence database, conditional on being at least as long as the true test domain(s). True test domains are inserted at randomly sampled locations in the sequence. The remaining two or three nonhomologous sequence segments are synthesized as described below. Thus true test sequences are composed either of three segments (one homologous, two not) or five segments (two homologous, three not).
The program implements a choice of several different methods for generating nonhomologous sequence segments. The default is “monoshuffling”: to select a sequence segment at random from the input sequence database and shuffle its residues, preserving 0th-order residue composition and bias (a minimal sketch of this shuffle follows the list below).
Decoys (negative sequences) are created by randomly selecting a true test sequence (solely to obtain its three or five segment lengths – not its sequence) then concatenating nonhomologous segments of the same lengths. The length distribution of negative sequences, and the length distribution of potentially biased nonhomologous subsequences embedded in them, is therefore matched to the distributions for the true test sequences.
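A minimal sketch of the default monoshuffling step referenced in the list above (illustrative; a production implementation would use a better random number generator than rand()):

```c
/* Shuffle a sequence segment in place with a Fisher-Yates shuffle,
 * preserving its 0th-order residue composition (and hence its bias). */
#include <stdlib.h>

static void monoshuffle(char *seq, int n)
{
    for (int i = n - 1; i > 0; i--) {
        int  j   = rand() % (i + 1);  /* uniform index in 0..i */
        char tmp = seq[i];
        seq[i]   = seq[j];
        seq[j]   = tmp;
    }
}
```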
For the experiments in this paper, benchmarks were generated with the default settings described above.
Jeremy Buhler and Christopher Swope (Washington University, St. Louis) contributed important optimizations for the Viterbi and Forward implementations in HMMER2. Erik Lindahl (Stanford) contributed the first SIMD vector implementation of HMMER2, for PowerPC processors with VMX/Altivec instructions. Alex Coventry (Cornell), Steve Johnson (Washington Univ., St. Louis), and Elon Portugaly (Hebrew University) did exploratory work that led to the HMMER3 project. Bjarne Knudsen (CLCbio, Denmark) has contributed several optimizations and suggestions based on a careful analysis of the HMMER3 source code. Elena Rivas (Janelia Farm) contributed numerous clarifications and improvements to the source code in the course of reimplementing the core of HMMER3 for continuous emission variables for modeling mouse vocalizations. Although space permits only these brief personal acknowledgments, a great many others have contributed to the development of HMMER3, and I also thank this wide community of developers and users. Goran Ceric provided masterful support of Janelia Farm's high performance computing resources. Heidi Henning, HHMI deputy general counsel, provided invaluable legal assistance on open source issues. HMMER is a trademarked open source project supported by the Howard Hughes Medical Institute. I dedicate this paper to the memory of my colleague and friend Michael Farrar, principal software engineer on the HMMER3 project, who died unexpectedly in December 2010.