DPB, NK, and KS conceived and designed the experiments, analyzed the data, and wrote the paper. KS invented the original algorithm. DPB modified the algorithm.
The authors have declared that no competing interests exist.
Function prediction by homology is widely used to provide preliminary functional annotations for genes for which experimental evidence of function is unavailable or limited. This approach has been shown to be prone to systematic error, including percolation of annotation errors through sequence databases. Phylogenomic analysis avoids these errors in function prediction but has been difficult to automate for high-throughput application. To address this limitation, we present a computationally efficient pipeline for phylogenomic classification of proteins. This pipeline uses the SCI-PHY (Subfamily Classification in Phylogenomics) algorithm for automatic subfamily identification, followed by subfamily hidden Markov model (HMM) construction. A simple and computationally efficient scoring scheme using family and subfamily HMMs enables classification of novel sequences to protein families and subfamilies. Sequences representing entirely novel subfamilies are differentiated from those that can be classified to subfamilies in the input training set using logistic regression. Subfamily HMM parameters are estimated using an information-sharing protocol, enabling subfamilies containing even a single sequence to benefit from conservation patterns defining the family as a whole or in related subfamilies. SCI-PHY subfamilies correspond closely to functional subtypes defined by experts and to conserved clades found by phylogenetic analysis. Extensive comparisons of subfamily and family HMM performances show that subfamily HMMs dramatically improve the separation between homologous and non-homologous proteins in sequence database searches. Subfamily HMMs also provide extremely high specificity of classification and can be used to predict entirely novel subtypes. The SCI-PHY Web server at
Predicting the function of a gene or protein (gene product) from its primary sequence is a major focus of many bioinformatics methods. In this paper, the authors present a three-stage computational pipeline for gene functional annotation in an evolutionary framework to reduce the systematic errors associated with the standard protocol (annotation transfer from predicted homologs). In the first stage, a functional hierarchy is estimated for each protein family and subfamilies are identified. In the second stage, hidden Markov models (HMMs), a type of statistical model, are constructed for each subfamily to model both the family-defining and subfamily-specific signatures. In the third stage, subfamily HMMs are used to assign novel sequences to functional subtypes. Extensive experimental validation of these methods shows that predicted subfamilies correspond closely to functional subtypes identified by experts and to conserved clades in phylogenetic trees; that subfamily HMMs increase the separation between homologs and non-homologs in sequence database discrimination tests relative to the use of a single HMM for the family; and that specificity of classification of novel sequences to subfamilies using subfamily HMMs is near perfect (1.5% error rate when sequences are assigned to the top-scoring subfamily, and <0.5% error rate when logistic regression of scores is employed).
While millions of novel genes have been discovered in recent years, the
The standard protocol for functional classification of novel genes is transfer of annotation from a database hit, i.e., predicting function based on sequence similarity between an unknown gene and one whose function is (presumably) known. This concept has given rise to many functional annotation methods [
By contrast, phylogenomic inference employs phylogenetic analysis of an entire protein family in order to predict function for individual members. By overlaying experimental data on the phylogenetic tree, a biologist can identify where, in evolution, genes may have been duplicated, lost, or transferred horizontally, or adopted new functions. Phylogenomic analysis thereby enables a biologist to “fill in the blanks” with an extremely low error rate, and often with significant detail [
We present two methods useful in automating phylogenomic inference: de novo subfamily identification using SCI-PHY (Subfamily Classification in Phylogenomics) and classification of novel sequences using subfamily hidden Markov models (HMMs). These two methods form part of a computational pipeline for phylogenomic inference of gene function that was originally developed for the functional classification of the human genome at Celera Genomics [
De novo subfamily identification—partitioning of sequences in a dataset into subtypes—provides two advantages for high-throughput systems of functional classification. First, assuming at least one subfamily member has been experimentally characterized, it becomes possible to infer function for other members of the subfamily. Second, the identification and curation of known subfamilies enables biologists to use sequence-based classification methods (e.g., using profiles, HMMs [
Existing methods for de novo identification of specific subtypes fall into two camps: those that define clusters using pairwise similarity, e.g., InParanoid [
Both Secator [
SCI-PHY is a fast method of subfamily identification which uses only sequence information, in contrast to phylogenetic tree methods that require species information to resolve orthologs from paralogs for functional analysis. Therefore, SCI-PHY is especially advantageous in situations where species information is not known, such as in environmental sequences. Our experiments show that SCI-PHY subfamilies correspond closely to subtypes found by experts and also to conserved clades identified using standard phylogenetic tree analysis.
The availability of subfamily classifications enables high-throughput functional annotation: as new sequences are released to the sequence databases, sequence-based classification methods can be used to efficiently assign unknown sequences to pre-defined subtypes. Several classes of methods have been developed for this task. Profiles and HMMs are statistical models that generalize the information in a multiple sequence alignment (MSA) [
Methods designed specifically for classification of sequences to predefined subfamilies include the profile-based method of Hannenhalli and Russell [
Our approach to classification of novel sequences to functional subfamilies uses subfamily hidden Markov models and a computationally efficient scoring system [
The ability to predict novel subtypes in a protein family is extremely valuable in identifying functional shifts in newly sequenced genomes. In addition to classification of novel sequences to predefined subfamilies, we present a method of logistic regression of positive and negative examples for subfamilies. This method enables discrimination between novel sequences that can be reliably classified to an existing subfamily and those that are more likely to represent entirely different subtypes from any previously observed.
515 full alignments were selected from the PFAM resource [
A subset of 57 families from SCOP-PFAM515 contains sequences with multiple enzymatic functions based on all four fields of their Enzyme Classification (EC) [
This dataset contains five extensively curated protein families from three different resources, with additional subdivisions of two of the families to create a total of eight classifications of expert-defined subtypes. We selected the enolase and crotonase enzyme families from the Structure-Function Linkage Database (SFLD) [
We compared SCI-PHY to three other methods for protein subfamily identification that depend only on sequence information: Secator, Ncut, and CD-HIT. CD-HIT takes a user-specified minimum percent identity as a parameter for determining cluster membership; we present results for two identity cutoffs: a comparatively low value (40%) in order to identify fairly general functional groups, and a higher value (70%) that has been identified as the minimal identity required to guarantee functional similarity within subfamilies [
We used three scoring functions—purity, edit, and variation of information—to measure the agreement between the reference subtypes in each benchmark dataset and the subfamilies predicted by the methods tested. The
Finally, we also assessed agreement between SCI-PHY subfamilies and phylogenetic trees and found that SCI-PHY subfamilies typically correspond to well-supported clades within the family (
These experiments highlighted the classic tradeoff between specificity and sensitivity. High purity and low edit and VI distances for a method indicate that the subfamily decomposition achieved by that method is very similar to the reference partition. Of all the methods tested, SCI-PHY has the most consistent performance in combining both high purity and low distance to the reference partition. For instance, SCI-PHY has the best (lowest) edit and VI distances of all methods tested, ranking first for four out of eight EXPERT datasets. Secator comes next, with the best edit or VI distance for two of the eight datasets. Other methods tested have either high purity but a large distance (e.g., CD-HIT70) or sacrifice purity for a lower distance (CD-HIT40 and Secator). The performance of the Ncut method varied between families; for example, it performed very well with the crotonase family, but clustered 317 of the 328 sequences in the aminergic family, spanning multiple GPCR subtypes, into one subfamily. The single notable exception is the enolase family, for which SCI-PHY, Secator, and Ncut produced a large number of singleton clusters (52, 29, and 35, respectively), giving each a poor distance score. A detailed comparison of performance of the different methods on the adreno-receptor family of GPCRs is given in
De Novo Subfamily Identification for the EXPERT Set
Assessing de novo subfamily identification accuracy by comparison with expert-defined subtypes presents unique challenges. First, there is a wide variation in expert definitions of functional classes—some expert-defined subtypes are highly specific and span short evolutionary distances, while others cluster proteins at a much coarser level and may include highly divergent sequence pairs. Similarly, subfamily prediction methods tend to aim at different points in this spectrum. For instance, CD-HIT40 clusters at a fairly coarse level and has comparatively poor purity scores for highly specific expert-defined levels, but fairly good edit and VI distance scores. By contrast, CD-HIT70's subfamily purity is the best of the five methods tested, but it has the worst edit distance on seven out of eight classifications, and the worst VI distance on four of the eight datasets.
Comparing de novo subfamily identification methods against an expert-defined hierarchy reveals the inherent biases of these methods, as each tends to target a different level of a functional hierarchy. This is illustrated by comparison of subfamily prediction methods to the different levels in the aminergic GPCR and NHR families in the EXPERT dataset. Here, Secator's division of the NHR family is closest to the coarse level 1 classification (although Secator's purity scores are poor even at this level), while SCI-PHY and CD-HIT40 match the more specific level 2 classification. CD-HIT70 performs very well at the most specific level (NHR L3), with high purity and low VI distance. Interestingly, CD-HIT with a 50% identity cutoff (CD-HIT50) seems to give a better balance of purity and distance than CD-HIT70 (
The different scoring functions used to evaluate subfamily identification highlight the standard problem in function prediction: achieving a balance between sensitivity and specificity. The purity score measures specificity, whereas the distance functions correspond more closely to sensitivity. There are subtle differences between the two distance functions. Both the edit distance and the VI distance penalize over-division as well as mixing of subtypes, but the edit distance penalizes over-division of subtypes proportionately more than joining a few subtypes into large clusters. The edit distance thereby favors methods such as Secator and CD-HIT40 that produce fairly coarse clusterings. The VI distance takes cluster size into account, and errors in large clusters (affecting many sequences) contribute more to the distance than errors in small clusters. These effects are illustrated by the change in distance-based rank between SCI-PHY and Secator for the Secretin family. On this family, SCI-PHY had a better VI distance than Secator, but a worse edit distance. Examining the two predicted partitions relative to the expert division into subtypes shows why. The SCI-PHY subfamily prediction had high purity (only one SCI-PHY subfamily merged two different subtypes together), but somewhat over-divided expert subtypes, splitting three expert subtypes into multiple subfamilies and producing six singleton subfamilies. In contrast, Secator had low purity (three of the six subfamilies produced by Secator joined several subtypes together, placing nearly 70% of the sequences in the family into mixed subfamilies) but did not subdivide expert subtypes, and very few split operations were required to obtain the expert classification.
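The three agreement scores discussed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: purity is taken to be the fraction of predicted subfamilies whose members all share one reference subtype, the edit distance is taken to be the minimum number of split and merge operations separating the two partitions, and the VI distance is the standard variation of information computed in bits. The function names and the dictionary-based input format are hypothetical.

```python
import math
from collections import Counter

def purity(pred, ref):
    """Fraction of predicted subfamilies containing a single reference subtype."""
    subtypes = {}
    for seq, fam in pred.items():
        subtypes.setdefault(fam, set()).add(ref[seq])
    return sum(len(s) == 1 for s in subtypes.values()) / len(subtypes)

def edit_distance(pred, ref):
    """Minimum number of split/merge operations turning pred into ref.

    Split every predicted cluster into its overlaps with the reference,
    then merge the overlaps into reference clusters:
    (k - |pred|) splits + (k - |ref|) merges, k = number of non-empty overlaps.
    """
    overlaps = {(pred[s], ref[s]) for s in pred}
    n_pred = len(set(pred.values()))
    n_ref = len({ref[s] for s in pred})
    return 2 * len(overlaps) - n_pred - n_ref

def _entropy(counts, n):
    """Shannon entropy (bits) of a partition given its cluster sizes."""
    return -sum(c / n * math.log2(c / n) for c in counts)

def vi_distance(pred, ref):
    """Variation of information: H(pred|ref) + H(ref|pred) = 2H(joint) - H(pred) - H(ref)."""
    n = len(pred)
    h_pred = _entropy(Counter(pred.values()).values(), n)
    h_ref = _entropy(Counter(ref[s] for s in pred).values(), n)
    h_joint = _entropy(Counter((pred[s], ref[s]) for s in pred).values(), n)
    return 2 * h_joint - h_pred - h_ref

# Toy example: an over-divided prediction (all singletons) versus one expert subtype.
reference = {"s1": "X", "s2": "X", "s3": "X", "s4": "X"}
over_divided = {"s1": 1, "s2": 2, "s3": 3, "s4": 4}
print(purity(over_divided, reference))         # every singleton is pure
print(edit_distance(over_divided, reference))  # three merges are needed
print(vi_distance(over_divided, reference))
```

In this toy case the over-divided prediction is perfectly pure but still incurs a non-zero edit and VI distance, which mirrors the trade-off between purity and distance described in the text.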
Performance for the EC dataset (
Wilcoxon Signed Rank Tests for De Novo Subfamily Detection on the EC Dataset
HMMs and profiles are very effective at detecting distant homologies. Since primary sequence diverges more rapidly than 3-D structure, the scientific community uses the ASTRAL datasets of solved structural domains from the SCOP database [
We compared the ability of family and subfamily HMMs (for subfamilies defined by SCI-PHY) to identify remote homologs using the SCOP-PFAM515 dataset, spanning 515 PFAM families representing unique 3-D folds. Results of these experiments show that subfamily HMMs significantly increase the separation between true homologs and spurious matches by improving the scores of related sequences. For instance, at an e-value cutoff of 10^-20, subfamily HMMs (SHMMs) detected 73% of SCOP superfamily members, whereas family HMMs detected only 31%. This increase in score significance for homologous sequences comes at no cost in error rate: ROC plots of subfamily and family HMMs superpose closely, with the AUC of subfamily HMMs being slightly greater than the AUC of family HMMs (
Blue: family HMM results. Red: subfamily HMM results.
(A) Coverage (
(B) ROC curve for family and subfamily HMMs, weighted by superfamily size. Subfamily HMMs receive an AUC of 0.947; family HMMs receive 0.943.
(C) ROC curve for unweighted data. Subfamily HMMs and family HMMs have AUCs of 0.758 and 0.740, respectively. Together, these data show that while subfamily HMMs do not detect more homologs at a given false positive rate, they do find many more homologs at a given significance cutoff.
We tested subfamily classification accuracy using leave-one-out experiments. Ten sequences from each family in the SCOP-PFAM515 dataset were individually removed from the alignment, SHMM parameters were estimated without the withheld sequence, and the sequence was then scored against all SHMMs. Of the withheld sequences, 98.5% were assigned to their original subfamily, an error rate of only 1.5%. Since many of the sequences tested were highly similar to sequences present in the alignments, we also tested classification accuracy after editing the alignments to remove sequences at different levels of percent identity to the chosen sequence. We compared SHMM performance with that of BLAST and of the sub-profile method of Hannenhalli and Russell [
Error in Function Prediction Is Revealed by Clustering the Misannotated Sequence with Its Homologs Using SCI-PHY
Novel Sequence Classification on the SCOP-PFAM515 Dataset, after Removal of Sequences Similar to the Target Sequence
In these experiments, the Hannenhalli and Russell sub-profile method showed only a marginal improvement in classification accuracy over the BLAST and HMM methods, in contrast to earlier work, which showed a dramatic improvement over BLAST and HMM-based classification for more divergently related families [
Protein families naturally expand in size to accommodate additional homologs produced by genome-sequencing initiatives. Many of these new members will belong to known subtypes, but some will represent novel subtypes having distinct functions. We have developed an online algorithm to assess the likelihood that an unknown sequence represents a novel subfamily.
Since classification of sequences to existing subfamilies based on top subfamily HMM scores has an extremely low error rate, we treated this task as a binary classification problem, asking the question, “Does the test sequence belong to the top-scoring subfamily, or does it represent a novel subtype?” We used logistic regression to predict the probability of subfamily membership based on the HMM reverse score (
The logistic regression fit for an example subfamily is shown. True subfamily members (X) and other family members (+) are shown, together with the fitted curve. When the two classes cannot be completely separated, as in this case, we see a smooth transition in the probability of subfamily membership.
We tested novel subtype identification as follows. For each PFAM MSA containing at least three SCI-PHY subfamilies, we removed an entire SCI-PHY subfamily (selected at random) and re-estimated HMM parameters for the remaining subfamilies. Retained sequences were used to fit regression curves for each SHMM by all-against-all scoring within the family. Each sequence from the withheld subfamily was then scored against the new set of SHMMs, and the probability that it belonged to the top-scoring subfamily was calculated. We assessed subtype detection sensitivity for a range of membership probability thresholds (
(A) The red line shows the fraction of novel subfamilies correctly detected; the blue line shows the fraction of subfamily members correctly classified in leave-one-out experiments. Novelty detection is quite robust to the threshold setting, achieving an 80% success rate even at the lowest threshold (0.01).
(B) The fraction of sequences classified to an incorrect subfamily during leave-one-out experiments. While low to begin with, the false positive error drops dramatically with the imposition of even a small threshold. A threshold of 0.10 probability of subfamily membership seems to be optimal; the false-positive classification rate is just 0.3%, while overall subfamily classification and novel subtype detection accuracy are both 88%. The
We subtract the encoding cost of the null hypothesis (that all sequences belong in a single subfamily) from the cost of encoding the subclass alignments at each iteration of the algorithm (
We then assessed the impact of this classification protocol for the complementary task: subfamily classification of a test sequence belonging to an existing subfamily. In this case, we repeated the 5,103 leave-one-out experiments described in the previous section, this time fitting regression curves to each subfamily and calculating the probability that the withdrawn sequence was a true member of the best-scoring subfamily. At high stringency (i.e., requiring a high subfamily membership probability), the number of mis-classified sequences (false positives) is minimal, but sensitivity is reduced (
An unanticipated effect of this thresholding process was to greatly reduce the fraction of false positive classifications (
Sequence Q8S220, a singleton subfamily, was classified to its sibling subfamily, N2581. We show a comparison of the sequence as aligned in the original MSA (Q8S220-orig) and after alignment to SHMM N2581 (Q8S220-N2581). The consensus sequence for SHMM N2581 is also shown (N2581-consensus). After realignment, much of the sequence has been shifted, and several motifs now clearly match the N2581 consensus sequence (red boxes).
We have applied the methods described here in the construction of a phylogenomic HMM library, the PhyloFacts Universal Proteome Explorer, with more than 40,000 protein family “books” and more than 1.2 million HMMs to enable subfamily classification of novel sequences [
We present an illustration in which clustering sequences using SCI-PHY enables detection of existing errors in database annotations.
The GenBank sequence JC7675 (
Additional examples of effective annotation transfer and error detection are given in
Phylogenomic analysis is widely regarded as the method of choice for high-accuracy functional annotation but has had limited application due to the technical complexity of this protocol. This paper focuses on methods to automate a phylogenomic pipeline using defined subfamilies followed by construction of HMMs for these subfamilies, which can then be used to classify novel sequences.
A large fraction of phylogenomic inference tools focus on the identification of orthologs as the basis of annotation transfer, under the assumption that orthologs—related by speciation from a common ancestor—are likely to maintain the same function. Methods developed for phylogenomic inference that use species information in conjunction with phylogenetic tree analysis to identify orthologs include RIO [
Any division of a family into functional subtypes is somewhat arbitrary, as proteins have different levels of molecular function ranging from the fairly coarse (e.g., catalytic activity) to highly specific (e.g., substrate recognition). For this reason, some protein classification databases, such as the GPCRDB [
Subfamily identification methods that rely on an MSA as input, including SCI-PHY and Secator (and most phylogenetic tree construction algorithms), tend to be quite sensitive to alignment errors. We therefore recommend careful attention to the construction of the MSA for the family. Removing columns having many gap characters is analogous to alignment masking prior to phylogenetic tree construction, and is recommended. A protocol for collecting and aligning homologs is given in [
Our results show that subfamily HMMs provide high specificity of sequence classification to functional subtypes, providing a kind of automated phylogenomic inference that approximates the results achievable from a more compute-intensive phylogenetic reconstruction. The information-sharing protocol we present produces subfamily HMMs that generalize effectively to distant homologs. Information sharing leverages available training data and helps to smooth estimated amino acid distributions to prevent overly specific HMM parameters in small subfamilies. This information-sharing protocol more efficiently separates homologs from non-homologs than subfamily HMMs without information sharing, but at a slight cost in subfamily specificity (i.e., the error rate for subfamily classification without information sharing is 0.8%, while our standard information-sharing protocol has an error rate of 1.5%).
In these experiments, family and subfamily HMMs showed similar classification error rates, although subfamily HMMs produce much more significant e-values for true positives, in addition to identifying subfamily membership. This suggests a simple way to reduce the computational burden of using SHMMs, which we use in practice. Rather than scoring novel sequences against all SHMMs from all families, we screen sequences for family membership using family HMMs and then identify the appropriate subfamily by scoring the sequence only against the SHMMs of that family. Since most HMM libraries contain thousands of families, the average increase in scoring runs due to the use of SHMMs is then marginal.
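The two-stage screening strategy can be illustrated with a short Python sketch. It is only a schematic under stated assumptions: each model is represented by a hypothetical callable that returns a score for a sequence (higher meaning a better match), and `family_cutoff` is an illustrative parameter. A real pipeline would obtain these scores from an HMM package rather than from the toy motif counters used here.

```python
from typing import Callable, Dict, Optional, Tuple

Scorer = Callable[[str], float]  # hypothetical stand-in for an HMM scoring call

def classify(
    seq: str,
    family_hmms: Dict[str, Scorer],
    subfamily_hmms: Dict[str, Dict[str, Scorer]],
    family_cutoff: float,
) -> Optional[Tuple[str, str]]:
    """Screen against family models first, then score only that family's SHMMs."""
    # Stage 1: one scoring run per family.
    family, best = max(
        ((name, score(seq)) for name, score in family_hmms.items()),
        key=lambda pair: pair[1],
    )
    if best < family_cutoff:
        return None  # no family matched well enough
    # Stage 2: scoring runs restricted to the subfamilies of the chosen family.
    shmms = subfamily_hmms[family]
    subfamily = max(shmms, key=lambda name: shmms[name](seq))
    return family, subfamily

# Toy models: motif counts play the role of HMM scores (illustration only).
families = {
    "famA": lambda s: float(s.count("K")),
    "famB": lambda s: float(s.count("G")),
}
subfamilies = {
    "famA": {"A1": lambda s: float(s.count("KA")), "A2": lambda s: float(s.count("KC"))},
    "famB": {"B1": lambda s: float(s.count("GA")), "B2": lambda s: float(s.count("GC"))},
}
print(classify("KCKCG", families, subfamilies, family_cutoff=1.0))
```

With F families and an average of S subfamilies per family, this arrangement needs roughly F + S scoring runs per query instead of F × S, which is the saving the paragraph above describes.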
Logistic regression of subfamily HMM scores enables us to discriminate between sequences representing entirely novel subtypes and sequences that can be assigned to existing subtypes. This confers a unique capability to subfamily classification systems that is critical to prevent overly specific (incorrect) predictions of molecular function for novel sequences.
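As a rough illustration of this idea, the sketch below fits a one-variable logistic regression that maps a subfamily HMM score to a probability of subfamily membership, then flags a sequence as a candidate novel subtype when that probability falls below a threshold. It is a minimal sketch and not the authors' code: the fit uses Newton-Raphson updates (equivalent to iteratively re-weighted least squares for logistic regression), the small `ridge` term is an added assumption to keep the updates stable, and the training scores are invented toy values. The default threshold of 0.10 follows the value reported as working well in these experiments.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(scores, labels, n_iter=100, ridge=1e-6, tol=1e-10):
    """Fit P(member | score) = sigmoid(a + b * score) by Newton-Raphson (IRLS)."""
    a, b = 0.0, 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(scores, labels):
            p = sigmoid(a + b * x)
            w = p * (1.0 - p)        # IRLS weight for this example
            g0 += p - y              # gradient of the negative log-likelihood
            g1 += (p - y) * x
            h00 += w                 # Hessian entries
            h01 += w * x
            h11 += w * x * x
        g0 += ridge * a
        g1 += ridge * b
        h00 += ridge
        h11 += ridge
        det = h00 * h11 - h01 * h01
        da = (h11 * g0 - h01 * g1) / det   # solve the 2x2 Newton system
        db = (h00 * g1 - h01 * g0) / det
        a -= da
        b -= db
        if abs(da) + abs(db) < tol:
            break
    return a, b

def membership_probability(score, params):
    a, b = params
    return sigmoid(a + b * score)

def classify_or_flag_novel(score, params, threshold=0.10):
    """'member' if the membership probability reaches the threshold, else 'novel'."""
    return "member" if membership_probability(score, params) >= threshold else "novel"

# Toy scores (invented): true subfamily members versus other family members.
member_scores = [2.0, 4.0, 6.0, 8.0, 10.0]
other_scores = [-4.0, -2.0, 0.0, 3.0, 5.0]
params = fit_logistic(member_scores + other_scores, [1] * 5 + [0] * 5)
print(classify_or_flag_novel(12.0, params), classify_or_flag_novel(-6.0, params))
```

Because the two toy classes overlap, the fitted curve gives a smooth transition in membership probability rather than a hard cutoff, and the threshold then trades novel-subtype detection against classification sensitivity.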
All methods of constructing subfamily models as a means of classifying novel sequences will be sensitive to the inclusion of outlier sequences in a family. A single outlier or a small number of outlier sequences normally has minimal effect on a profile or HMM constructed for the family as a whole (since their contribution is typically washed out by the dominant group) and may remain undetected. However, the use of subfamily models, whether through subfamily HMMs, as outlined here, or by another method, can magnify the power of these outliers to attract and recruit their relatives. This may be desirable when outliers are actual homologs, but is generally not desirable in the case of spurious database hits. However, if non-homologous outliers can be flagged, their corresponding subfamily models can be used as decoys, differentiating true family members from those that only appear to be related.
The input to SCI-PHY is an MSA, from which a hierarchical tree and subfamily decomposition are estimated. SCI-PHY uses agglomerative (bottom-up) clustering to construct a hierarchical tree: the input objects form the leaves in the tree; similar objects are joined by edges to form subtrees, and the process is iterated until a rooted tree is obtained.
Each sequence forms a separate class (leaf in tree). For each class, construct a profile, using Dirichlet mixture densities [
While (#classes > 1) do:
1. Join the two closest classes into a new class, represented by a new node in the tree. Add edges from the new node to each daughter node.
2. Construct a profile for the new class based on the joint MSA.
3. Compute the distance between this new class and other classes (
4. Compute the encoding cost of this partition, under a Dirichlet mixture density (
Output:
1. Hierarchical tree.
2. Predicted subfamilies, corresponding to the stage in the agglomeration having the lowest encoding cost.
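The overall shape of this procedure can be sketched in Python as follows. This is a deliberately simplified illustration, not SCI-PHY itself: profiles are smoothed with a flat pseudocount rather than Dirichlet mixture densities, the distance is a symmetrized relative entropy chosen for illustration, the alignment cost uses a single symmetric Dirichlet prior, and all profiles are recomputed at every iteration for brevity. All function names and parameters are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def column_counts(seqs):
    """Amino acid counts per alignment column (gap characters are ignored)."""
    return [Counter(ch for ch in col if ch in ALPHABET) for col in zip(*seqs)]

def profile(seqs, pseudo=1.0):
    """Per-column distributions with a flat pseudocount (assumption, not a Dirichlet mixture)."""
    prof = []
    for counts in column_counts(seqs):
        total = sum(counts.values()) + pseudo * len(ALPHABET)
        prof.append([(counts[a] + pseudo) / total for a in ALPHABET])
    return prof

def distance(p, q):
    """Symmetrized relative entropy summed over columns (illustrative choice)."""
    return sum((x - y) * math.log2(x / y)
               for pc, qc in zip(p, q) for x, y in zip(pc, qc))

def log2_marginal(counts, alpha=1.0):
    """log2 probability of a column's residues under a symmetric Dirichlet prior."""
    n, k = sum(counts.values()), len(ALPHABET)
    lp = math.lgamma(k * alpha) - math.lgamma(k * alpha + n)
    lp += sum(math.lgamma(alpha + counts[a]) - math.lgamma(alpha) for a in ALPHABET)
    return lp / math.log(2)

def encoding_cost(classes, seqs):
    """Cost of subfamily labels plus cost of each subfamily's alignment columns."""
    label_cost = len(seqs) * math.log2(len(classes))
    align_cost = -sum(log2_marginal(c)
                      for members in classes
                      for c in column_counts([seqs[i] for i in members]))
    return label_cost + align_cost

def cluster(seqs):
    """Agglomerate classes bottom-up; return the merge order and the lowest-cost partition."""
    classes = [[i] for i in range(len(seqs))]
    best_cost, best_cut = encoding_cost(classes, seqs), [list(c) for c in classes]
    merges = []
    while len(classes) > 1:
        profs = [profile([seqs[i] for i in c]) for c in classes]
        _, a, b = min((distance(profs[a], profs[b]), a, b)
                      for a in range(len(classes))
                      for b in range(a + 1, len(classes)))
        merges.append((classes[a], classes[b]))  # the new tree node's two daughters
        merged = classes[a] + classes[b]
        classes = [c for k, c in enumerate(classes) if k not in (a, b)] + [merged]
        cost = encoding_cost(classes, seqs)
        if cost < best_cost:
            best_cost, best_cut = cost, [list(c) for c in classes]
    return merges, best_cut

toy_msa = ["ACDE", "ACDE", "ACDF", "WYWY", "WYWY", "WYWH"]
merges, subfamilies = cluster(toy_msa)
print(merges[-1])  # the final merge joins the two sequence groups at the root
```

The flat prior used here penalizes mixed columns far less than a Dirichlet mixture would, so the cut chosen on a toy alignment like this one should not be read as representative of SCI-PHY's behavior; the sketch is meant only to show how the merge loop and the cost-based cut fit together.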
P(
The encoding cost function has two components: the first term is the cost to encode the subfamily labels for each sequence; the second term is the cost to encode each of the subtree alignments for that stage in the agglomeration. The two terms have opposite effects. The first term is large at program commencement when the number of subfamilies is largest, and reduces at each iteration, until it reaches zero at program termination, when there is one subfamily. The second term is minimized when the sequences within each subfamily are very similar to each other. At program commencement, for an input MSA with
Sequence weighting is a standard approach in profile and HMM construction to prevent large subgroups from dominating amino acid distributions [
We estimate sequence weights for each subfamily in a two-step process. In the first step, we estimate the total number of independent counts (NIC) in the alignment, as follows. We compute for every position in the alignment the frequency of the most frequent amino acid (ignoring gaps) to derive the positional conservation propensity. We then find the average of this value over all columns having at least one amino acid to obtain the overall conservation propensity (
The input to subfamily HMM construction is an MSA and a decomposition of the alignment into subfamilies. We construct SHMMs in a multi-step process.
The component parameters
Thus, the generalization capability of subfamily HMMs is enhanced by adding in weighted counts from subfamilies having similar amino acids at corresponding positions.
Since the majority of the sequences in each of these families had no assigned EC numbers, subfamily clustering methods were run on the full (edited) alignments, but accuracy was assessed using only annotated sequences aligning over >75% of their lengths.
EXPERT: we selected the enolase and crotonase enzyme families from the SFLD [
Here,
The
In the novel subtype detection experiments, we removed up to five complete subfamilies at random from each family, ignoring families with only two subfamilies (preventing the case where regression curves would have been trained with no negative examples). Results were normalized by subfamily size. Logistic regression parameters were fit using the iteratively re-weighted least squares (IRLS, [
The authors thank Dan Kirshner for developing and maintaining the SCI-PHY Web server, Dr. Ruchira Datta for careful proofreading of the manuscript, and the creators of the resources used in these experiments: PFAM, GPCRDB, NucleaRDB, and SFLD. This work is supported by US National Institutes of Health grant R01 HG002769 and a Presidential Early Career Award in Science and Engineering from the US National Science Foundation to KS.
EC, enzyme commission
EPQ, errors per query
EVD, extreme value distribution
GO, gene ontology
GPCRDB, G-protein coupled receptor database
HMM, hidden Markov model
ML, maximum likelihood
MR, majority rule
MSA, multiple sequence alignment
NHR, nuclear hormone receptor
NIC, number of independent counts
NJ, neighbor joining
SCI-PHY, Subfamily Classification in Phylogenomics
SCOP, Structural Classification of Proteins
SFLD, Structure Function Linkage Database
SHMM, subfamily hidden Markov model
SVM, support vector machine
TRE, total relative entropy
VI, variation of information