Open Access
Research Article
Sequence Similarity Network Reveals Common Ancestry of Multidomain Proteins
1 Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, Pennsylvania, United States of America, 2 Computational Biology, Carnegie Mellon University, Pittsburgh, Pennsylvania, United States of America, 3 School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania, United States of America
Abstract
We address the problem of homology identification in complex multidomain families with varied domain architectures. The challenge is to distinguish sequence pairs that share common ancestry from pairs that share an inserted domain but are otherwise unrelated. This distinction is essential for accuracy in gene annotation, function prediction, and comparative genomics. There are two major obstacles to multidomain homology identification: lack of a formal definition and lack of curated benchmarks for evaluating the performance of new methods. We offer preliminary solutions to both problems: 1) an extension of the traditional model of homology to include domain insertions; and 2) a manually curated benchmark of well-studied families in mouse and human. We further present Neighborhood Correlation, a novel method that exploits the local structure of the sequence similarity network to identify homologs with great accuracy based on the observation that gene duplication and domain shuffling leave distinct patterns in the sequence similarity network. In a rigorous, empirical comparison using our curated data, Neighborhood Correlation outperforms sequence similarity, alignment length, and domain architecture comparison. Neighborhood Correlation is well suited for automated, genome-scale analyses. It is easy to compute, does not require explicit knowledge of domain architecture, and classifies both single and multidomain homologs with high accuracy. Homolog predictions obtained with our method, as well as our manually curated benchmark and a web-based visualization tool for exploratory analysis of the network neighborhood structure, are available at http://www.neighborhoodcorrelation.org. Our work represents a departure from the prevailing view that the concept of homology cannot be applied to genes that have undergone domain shuffling. 
In contrast to current approaches that either focus on the homology of individual domains or consider only families with identical domain architectures, we show that homology can be rationally defined for multidomain families with diverse architectures by considering the genomic context of the genes that encode them. Our study demonstrates the utility of mining network structure for evolutionary information, suggesting this is a fertile approach for investigating evolutionary processes in the post-genomic era.
Author Summary
New genes evolve through the duplication and modification of existing genes. As a result, genes that share common ancestry tend to have similar structure and function. Computational methods that use common ancestry have been extraordinarily successful in inferring function. The practice of discerning evolutionary relationships is stymied, however, by modular sequences made up of two or more domains. When two genes share some domains but not others, it is difficult to distinguish a case of common ancestry from insertion of the same domain into both genes. We present a formal framework to define how multidomain genes are related, and propose a novel method for rapid, robust characterization of evolutionary relationships. In an empirical comparison with the current state of the art, we demonstrate superior performance of our method using a large hand-curated set of sequences known to share common ancestry. The success of our method derives from its unique ability to infer evolutionary history from local topology in the sequence similarity network. This represents a departure from the view that protein family classification must be restricted to families with conserved architecture. By exploiting the structure of the sequence similarity network, our approach surmounts this limitation and opens the door to studies of the role of modularity in protein evolution.
Citation: Song N, Joseph JM, Davis GB, Durand D (2008) Sequence Similarity Network Reveals Common Ancestry of Multidomain Proteins. PLoS Comput Biol 4(5): e1000063. doi:10.1371/journal.pcbi.1000063
Editor: Christine Vogel, University of Texas at Austin, United States of America
Received: February 15, 2007; Accepted: March 18, 2008; Published: May 16, 2008
Copyright: © 2008 Song et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: This work was funded by NSF DBI-0641313, NIH grant 1 K22 HG 02451-01, and a David and Lucille Packard Foundation fellowship. These organizations have had no role in the design or execution of this study or in the preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
* E-mail: durand@cmu.edu
Introduction
Accurate identification of homologs, sequences that share common ancestry, is essential for accuracy in function prediction and comparative genomics. Homology identification is integral to the annotation of novel genes [1] and prediction of gene function by various methods, including phylogenetic clustering [2], gene fusion analysis [3],[4], phylogenomic inference [5], and genomic context [6],[7]. Homologous genes are used as markers to identify homologous chromosomal regions for comparative mapping [8],[9], analysis of whole genome duplication [10],[11], phylogenetic footprinting [12], and operon prediction [13]–[15]. Pairwise homology detection is also an integral component of clustering approaches to protein family classification ([1],[16], and work cited therein).
All of these applications exploit one or both of the following properties of homologous sequences: genes that share common ancestry tend (1) to have similar structure and function, and (2) to be located in homologous chromosomal regions, making them suitable markers for comparative genomics. Because of their prevalence and importance, it is desirable to incorporate multidomain sequences in such analyses: Multidomain proteins represent 40% of the metazoan proteome, with functional roles in signal transduction, cellular adhesion, tissue repair, and immune response [17]. However, multidomain sequences, especially those with promiscuous domains that occur in many contexts, are frequently excluded from genomic analyses due to lack of a theoretical framework and practical methods for detecting multidomain homologs. In this paper, we extend the traditional definition of homology [18] to multidomain sequences that share a common ancestral gene, providing a formalism suitable for modeling multidomain family evolution, design and validation of multidomain homology identification methods, and incorporation of multidomain sequences in genomic analyses.
The original definition of molecular homology [18] does not capture multidomain evolution. Homology traditionally refers to evolution from a common ancestor by vertical descent (e.g., gene duplication and speciation), but multidomain proteins evolve via both vertical descent and domain insertion. For example, Figure 1 depicts two genes, a and b, which share not only a homologous domain but also a common ancestral gene. In contrast, b and c are a domain-only match, a pair of sequences that share similarity due to insertion of the same domain into both sequences but are otherwise unrelated.
Figure 1. The evolution of a hypothetical multidomain family by gene duplication and domain insertion.
Genes in the a and b subfamilies share a common ancestor but do not have identical domain composition. Gene c shares a homologous domain with genes in the b subfamily, but there is no gene that is ancestral to both b and c.
doi:10.1371/journal.pcbi.1000063.g001
Beta platelet-derived growth factor receptor (PDGFRB) and cGMP-dependent protein kinase 1, beta (PRKG1B), in Figure 2A, are enzymes involved in protein amino acid phosphorylation and provide a concrete example of this situation. Phylogenomic and structural evidence [19]–[22], as well as the promiscuity of the Ig and cNMP-binding domains, supports the common ancestry of this pair (see Methods). They have a statistically significant alignment with an E-value of 2.4e−8 that covers 13% of the average of their lengths. While they share a common domain (Pkinase), the Ig domains are unique to PDGFRB and the cNMP-binding domains are unique to PRKG1B. An example of a domain-only match is shown in Figure 2B. Neural cell adhesion molecule 2 (NCAM2) and PDGFRB share two Ig domains, resulting in a significant alignment, also with an E-value of 2.4e−8, and alignment coverage of 24%. However, the genes that encode them are not homologous and they perform different functions: NCAM2 is involved in cell-cell adhesion with no enzymatic function.
Figure 2. Domain models of a pair of multidomain homologs and a pair of sequences with a domain-only match.
(A) Domain architectures of the multidomain homologs PDGFRB and PRKG1B. These sequences share a Pkinase domain, but have different auxiliary domains. (B) Domain architectures of PDGFRB and NCAM2, which have significant sequence similarity due to shared Ig domains, but do not share common ancestry.
doi:10.1371/journal.pcbi.1000063.g002
The ability to distinguish multidomain homologs from unrelated pairs that share a domain is essential to genomic analysis. The evolutionary relationship between a and b in Figure 1 supports inferences about genome evolution, organization, and function. The same inferences would not necessarily be justified by the evolutionary relationship between b and c. For example, chromosomal regions enriched with homologous gene pairs are likely to be homologous themselves. In contrast, enrichment with homologous domains does not support the inference that a pair of chromosomal regions is homologous. Heuristics based on similarity and alignment coverage (the fraction of the mean sequence length covered by the optimal local alignment) have been proposed to screen out domain insertions. Recently, approaches based on domain architecture comparison have also been proposed [23]–[26]. To our knowledge, despite the prevalence of methods based on sequence similarity and alignment coverage [27]–[37], the accuracy of these heuristics has never been systematically tested. However, the examples in Figure 2 raise doubt about the general effectiveness of these methods. Both pairs have weak sequence similarity, short alignments, and a similar combination of shared and unique domains. Setting a significance threshold to eliminate NCAM2 would also eliminate roughly 240 sequences that are related to PDGFRB, since more than a quarter of the Kinases that match PDGFRB have E-values less significant than 2.4e−8. Alignment coverage would not help distinguish these two cases: the homologous pair has a shorter alignment than the unrelated pair. Nor could we separate this case by comparing domain content, since PDGFRB and PRKG1B share one domain, while PDGFRB and NCAM2 share two. For this example, sequence similarity, the length of the shared region, and domain architecture comparison all fail to distinguish the homologous pair from the domain-only match.
To determine the extent of this problem, here we evaluate sequence similarity, alignment coverage, and domain architecture comparison on a hand-curated benchmark of 853,465 known homologous pairs. Our results show that these heuristics are all insufficient for consistent, reliable identification of multidomain homologs. Surprisingly, given its widespread use, even a modest alignment coverage requirement dramatically increased the number of mis-assigned homologs in our study. These results challenge two unstated, but widely accepted hypotheses: (1) homologous sequences share similarity along the bulk of their length and (2) the local alignment between homologous sequences usually covers a greater fraction of their mean length than the local alignments of sequences that only share a domain.
These observations suggest to us that sequences alone may not consistently contain enough information to differentiate homology from domain-only matches. We introduce a novel method, called Neighborhood Correlation, that leverages additional information contained in the weighted sequence similarity network to distinguish homologs from domain-only matches. In this network, each vertex corresponds to a sequence. Vertices whose corresponding sequences have significant similarity are connected by an edge with weight proportional to that similarity. The neighborhood of a sequence is the set of vertices adjacent to it; that is, the set of all sequences that match it above a predefined significance threshold. (In this work, “sequence neighborhood” refers to the local context of the sequence in the network and not to the region immediately surrounding it in the genome.) Our analysis demonstrates that the neighborhood structure of gene pairs related through shared domain insertions is characteristically different from that of pairs related through duplication or speciation. These differences in neighborhood organization are detectable and can be exploited to distinguish homology from domain sharing.
A homology detection method for genomic analysis must meet the following criteria: It should correctly predict homologous pairs and reject unrelated pairs, including those that share domains. With a single set of parameter values, it should perform reliably on sequences with a broad range of attributes, including single domain families, multidomain families, families with short regions of conservation, and families with weak sequence homology. Finally, it should be easy to use and fast enough for datasets comprising hundreds of millions of sequence pairs.
In an empirical evaluation, we demonstrate that Neighborhood Correlation meets these criteria. It is highly effective in classifying multidomain homologs and achieves superior performance in comparisons with sequence similarity (BLAST and PSI-BLAST), alignment coverage, and domain architecture comparison. To evaluate performance, we hand-curated a benchmark of 853,465 known homologous pairs of mouse and human sequences, drawn from twenty well-studied families. Our test set includes single-domain families, as well as multidomain families with promiscuous domains that are at risk for domain-only matches. Although comprehensive datasets are available for testing methods for predicting homology of individual domains [38],[39], we are unaware of any other gold-standard dataset of known multidomain families with variable domain architectures. We offer this validation dataset, which is based on published evidence by experts on each of the families, as a resource for future studies.
As a validation of our approach, we applied Neighborhood Correlation to all complete mouse and human sequences in SwissProt 50.9 to predict homologs. A comparison of our predictions with the euKaryotic Clusters of Orthologous Groups (KOGs) database [40] showed that the set of protein sequences with highly correlated neighborhoods includes the vast majority of pairs that share an orthologous group (i.e., have the same KOG annotation). This is consistent with the fact that orthology is a more restrictive criterion than homology. We also show that most pairs in our set of predictions share at least one domain, according to the Pfam database [41], but many sequence pairs that share a domain are excluded. This is consistent with our goal of identifying gene homology rather than domain homology.
Results
Homology has traditionally been defined in terms of families that evolve by vertical descent [18],[42]; that is, by speciation and gene duplication. However, multidomain sequences evolve by speciation, gene duplication, and acquisition of domains from outside the family [43] (Figure 1). The traditional definition of homology does not apply in this case, as previous authors have pointed out [42],[44]. In the words of Walter Fitch [42], “We must recognize that not all parts of a gene have the same history and thus, in such cases, that the gene is not the unit to which the terms orthology, paralogy, et cetera, apply.” It has been proposed that sub-genic sequence fragments should be the units of interest [44],[45]. However, there are many applications, such as ortholog detection, comparative mapping, and phylogenetic footprinting, for which it is essential to work with a definition of homology where the gene is the basic unit. Moreover, in order to study the evolution of multidomain gene families, it is necessary to focus on genes. The gene is the unit of selection. While domains confer modular function on genes, ultimately it is the functionality of those genes that drives their retention.
A Model of Multidomain Homology
Here, we propose a model of multidomain homology based on vertical descent and insertion of a sequence fragment into an existing gene. In our model, two sequences are homologous if they are encoded by genes that share an ancestral locus. The rationale for this definition is illustrated in Figure 3, which shows the evolution of genes through vertical descent and domain insertion in the context of the chromosomes in which they reside. When genomic context is taken into account, it is clear that genes g2 and g2′ are homologous, despite the fact that g2 contains a domain not present in g2′ and vice-versa. In contrast, genes g2 and g3′ are not homologous, despite the fact that they share a homologous domain, since g2 and g3′ are not located in chromosomal regions that share common ancestry. For comparative mapping applications, where homologous genes are used as markers for identifying chromosomal regions, this distinction is crucial. For example, phylogenetic footprinting [12] predicts transcription factor binding sites by identifying homologous genes and then searching their flanking chromosomal regions for conserved sequence motifs. In Figure 3, the regions upstream of g2 and g2′ have an elevated probability of sharing conserved motifs since they share common ancestry. However, there is no reason to expect an enrichment of motifs shared between the flanking regions of g2 and g3′.
Figure 3. Evolutionary history of multidomain sequences in genomic context.
(A) A hypothetical genome with two chromosomes. (B) Both chromosomes are copied through duplication or speciation, resulting in two identical copies. (C) Following sequence divergence, similarity is only retained in coding regions. (D) Two instances of the orange domain are inserted in g2 and g3′, respectively. A yellow domain is inserted in g2′. (E) Conserved genomic context shows that genes g2 and g2′ are homologous genes, although they contain unrelated domains. Similarly, genes g2 and g3′ contain homologous domains, but are not homologous genes.
doi:10.1371/journal.pcbi.1000063.g003
Our model is applicable to families that evolved through acquisition of a new domain by an existing gene. This can occur through insertion of sequence fragments into the gene or by recruitment of adjacent exons. Formation of a new gene architecture by domain loss is also consistent with our model. Several lines of evidence suggest that acquisition of an auxiliary domain by an existing gene is a relatively common mode of domain shuffling. First, a substantial number of metazoan, chordate, and vertebrate families have been identified that evolved through a pattern of duplication, insertion of domains, and further duplication, a pattern consistent with this model [46],[47]. Second, the existence of promiscuous domains that lend themselves to insertion in new chromosomal environments [48],[49] supports an insertion model. Third, domain insertion is more likely to be successful when a domain is inserted into an existing functional environment, e.g., into the intron of an existing gene. In this case, all regulatory and termination signals required for successful transcription are already present. A fourth line of evidence stems from analyses of the flanking DNA of genes that arose very recently, where traces of the particular domain shuffling mechanism that occurred can still be observed. A number of recently evolved metazoan genes have been discovered that arose through duplication of an existing gene, followed by acquisition of one or more domains by unequal crossing over or by retrotransposition [50]–[54]. Finally, a number of studies have inferred relative rates of various domain shuffling events by applying parsimony models to abstract domain architectures. Their results suggest that the most common domain shuffling scenario involves insertion or deletion of a single domain into an existing multidomain architecture [24],[55],[56].
Our model is not applicable to the case where a new domain architecture is assembled de novo from several unrelated building blocks and subsequently acquires a regulatory region. We consider such a novel architecture to be the progenitor of a new family, since it is not clear that the ancestry of any one constituent is preferred. Similarly, our model does not capture formation of new architectures through fragmentation of more complex ones. However, recent evidence suggests that both of these scenarios occur rarely [24],[55],[57].
Neighborhood Correlation
Homology detection is the problem of distinguishing between sequence pairs with different types of evolutionary histories: evolution via gene duplication or via domain insertion. Sequence similarity, alignment coverage, and domain architecture comparison have all been considered for this purpose. However, none of these distinguish the homologous pair from the domain-only match given in Figure 2. The empirical results in the following sections confirm that this is not an isolated example. Accurate classification of multidomain homologs requires additional information from another source.
The structure of the sequence similarity network provides a basis for distinguishing pairs related through vertical descent from other pairs. The local network neighborhoods of homologs and domain-only matches differ in both topology and edge weights. In particular, for homologous pairs, the shared neighborhood (i.e., the set of vertices adjacent to both members of the pair) tends to have more vertices and stronger edge weights than their unique neighborhoods (i.e., vertices adjacent to one member of the pair but not the other). This is not true for domain-only matches. We express this distinction quantitatively by the Neighborhood Correlation score of two sequences, defined to be the correlation coefficient of their respective neighborhoods:
\[ NC(x,y) = \frac{\sum_{i=1}^{N} \left(S(x,i) - \bar{S}(x)\right)\left(S(y,i) - \bar{S}(y)\right)}{\sqrt{\sum_{i=1}^{N} \left(S(x,i) - \bar{S}(x)\right)^{2}}\,\sqrt{\sum_{i=1}^{N} \left(S(y,i) - \bar{S}(y)\right)^{2}}} \qquad (1) \]
where S(x,i) is the normalized bit score [58] of the optimal local alignment of query sequence x and database sequence i, N is the number of sequences in the database, and \bar{S}(x) is the mean of S(x,i) over all sequences (see Methods). Note that NC(x,y) increases with the number, weight, and correlation of edges in the shared neighborhoods of x and y and decreases with the number and weight of edges in their unique neighborhoods.
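As an illustration of the score, the following is a minimal Python sketch that computes the correlation coefficient of two similarity profiles. It assumes each profile is a list of normalized bit scores in the same database order; the function name, the toy profiles, and the choice to return 0 for a constant profile are assumptions made here for illustration and are not taken from the authors' implementation.

```python
import math


def neighborhood_correlation(scores_x, scores_y):
    """Pearson correlation of two equal-length score profiles.

    scores_x[i] and scores_y[i] are the scores of sequences x and y
    against the same database sequence i (hypothetical input layout).
    """
    if len(scores_x) != len(scores_y) or not scores_x:
        raise ValueError("profiles must be non-empty and of equal length")
    n = len(scores_x)
    mean_x = sum(scores_x) / n
    mean_y = sum(scores_y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(scores_x, scores_y))
    var_x = sum((a - mean_x) ** 2 for a in scores_x)
    var_y = sum((b - mean_y) ** 2 for b in scores_y)
    if var_x == 0 or var_y == 0:
        # Assumption: a constant profile has no defined correlation; report 0.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


# Toy profiles over a six-sequence database (made-up scores).
x = [5.0, 4.0, 4.5, 0.0, 0.0, 1.0]   # x matches sequences 0-2 strongly
y = [4.8, 4.1, 4.4, 0.0, 1.0, 0.0]   # y matches the same three sequences
z = [0.0, 0.0, 0.0, 5.0, 4.0, 1.0]   # z matches a different set

print(neighborhood_correlation(x, y))  # close to 1: largely shared neighborhood
print(neighborhood_correlation(x, z))  # negative: mostly unique neighborhoods
```

In this toy setting, strong shared matches push the score toward one, while matches found in only one profile pull it down, which mirrors the qualitative behavior described above.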
The Neighborhood Correlation score captures properties of the sequence similarity network that are strongly influenced by the evolutionary processes of interest. The number of edges in the shared and unique neighborhoods is influenced by the rates of gene duplication and domain insertion, while edge weights depend on sequence divergence. Immediately following a gene duplication, the two resulting paralogs have identical neighborhoods. The Neighborhood Correlation score of this new pair is initially one and decreases as the sequences diverge. Additional gene duplications in the same family further increase the size of the shared neighborhood and, hence, the Neighborhood Correlation score. In contrast, if a domain is inserted into a single member of the pair, the number of edges in its unique neighborhood increases and the Neighborhood Correlation score decreases. The increase in the number of unique edges is directly related to the promiscuity of the inserted domain, while the weights of these new edges are proportional to the degree of sequence conservation in the domain superfamily. In practice, the impact of insertion of a domain into a single member on the Neighborhood Correlation score is typically small because promiscuity and sequence conservation within domain superfamilies are inversely related. For example, Pfam domains exhibit a highly significant, negative correlation between domain promiscuity (see Methods) and sequence identity (ρ = −0.21, p = 2.08e−30, Spearman test). This can be understood by observing that when a domain is inserted into a new context, it is likely to experience new selective pressures leading to rapid mutational change.
To see how these principles play out in practice, we consider the neighborhoods of PDGFRB, PRKG1B, and NCAM2 in the sequence similarity network derived from our test dataset (Figures 2 and 4). Although the homologous pair, PDGFRB and PRKG1B, and the domain sharers, PDGFRB and NCAM2, have pairwise alignments with similar properties (E-value, alignment length, number of shared domains), their neighborhoods in the weighted sequence similarity network are very different. The shared neighborhood of the Kinase homologs PDGFRB and PRKG1B is substantially larger (779 sequences) than their unique neighborhoods (183 and 142 sequences, respectively). The shared neighborhood consists almost entirely of Kinases. The unique neighborhoods are dominated by domain-only matches, due to Ig in the case of PDGFRB and the cNMP-binding domain in the case of PRKG1B. Sequence similarities within these unique neighborhoods are weak; the Pfam models for the Ig and cNMP-binding domains have average sequence identities of 20% and 18%, respectively. Thus, the edge weights (not shown) in the shared neighborhood are strong and well correlated, while the edge weights in the unique neighborhoods are weak, yielding a Neighborhood Correlation score of NC = 0.65.
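The partition into shared and unique neighborhoods used in this comparison can be sketched with plain set operations. The snippet below is a minimal illustration, assuming the network is stored as a dictionary from each sequence to the set of sequences it matches above the significance threshold; the function name and the toy identifiers are hypothetical.

```python
def split_neighborhoods(neighbors, x, y):
    """Return (shared, unique_to_x, unique_to_y) for a pair of sequences.

    neighbors maps a sequence identifier to the set of identifiers it
    matches above the significance threshold (hypothetical layout).
    """
    nx = set(neighbors.get(x, ()))
    ny = set(neighbors.get(y, ()))
    return nx & ny, nx - ny, ny - nx


# Toy network: x and y share three kinase-like neighbors and each has
# one neighbor of its own (all identifiers are made up).
toy = {
    "x": {"kin1", "kin2", "kin3", "ig1"},
    "y": {"kin1", "kin2", "kin3", "cnmp1"},
}

shared, only_x, only_y = split_neighborhoods(toy, "x", "y")
print(len(shared), len(only_x), len(only_y))
```

Set sizes capture only the topology of the neighborhoods; the Neighborhood Correlation score additionally depends on the edge weights.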
Figure 4. Differences in neighborhood structure of the sequence similarity network reflect differences in evolutionary history.
Network neighborhoods in which nodes represent sequences. Edges connect pairs with significant sequence similarity. Edge weights reflecting degree of sequence similarity are not shown. (A) The neighborhoods of the homologous pair, PDGFRB and PRKG1B. PDGFRB and PRKG1B share 779 neighbors, mostly Kinases (turquoise nodes). These are strong matches due to a shared kinase domain. PDGFRB has 183 unique neighbors, mostly due to weak matches with Ig domains (green nodes). PRKG1B has 142 unique neighbors due to weak matches with the cNMP-binding domain (red nodes). Other matching sequences are shown in yellow. (B) PDGFRB and NCAM2, a domain-only match, have 232 matches in common. PDGFRB has 730 unique neighbors and NCAM2 has 240, mostly due to Fn3 domains (dark blue nodes).
doi:10.1371/journal.pcbi.1000063.g004
Conversely, PDGFRB and NCAM2 are related through domain insertion and have significant sequence similarity due to a shared Ig domain. Their shared neighborhood is relatively small (232 sequences) and composed primarily of Ig-based matches. These contribute little to the Neighborhood Correlation score of this pair due to low sequence conservation within the Ig superfamily. In contrast, the unique neighborhood of PDGFRB is large (730 sequences), with strong edge weights. For these reasons, PDGFRB and NCAM2 have a Neighborhood Correlation score of 0.29, distinctly smaller than the score for PDGFRB and PRKG1B. Unlike sequence comparison, this clear difference in neighborhood structure can be used to recognize multidomain homology.
A Benchmark Dataset for Multidomain Homology
Evaluation of classification performance requires a trusted set of positive examples (known homologous pairs) and negative examples (pairs known not to share common ancestry). Although benchmarks are available for detection of remote homology (e.g., SCOP [38], CATH [39]), functional similarity (e.g., the Gene Ontology (GO) [59]), orthology (e.g., COGs [40]), and structural genomics ([16],[45],[60], and work cited therein), we are unaware of any gold-standard validation dataset for multidomain homology. Our benchmark is designed to be suitable for testing two classification goals: good overall performance on a large set of sequence pairs and consistent performance on individual families with varying properties. To satisfy these needs, we constructed a test set of 1577 sequences from 20 families of known evolutionary origin (Table 1). The families encompass a broad range of functional categories, summarized in Table 2. The full curation procedure is described in Methods and Text S1.
For each family, we identified two sets of sequence pairs: family (FF) pairs, where both members of the pair are in the family, and non-family (FO) pairs, where only one of the two sequences is in the family. Given a family of size k, we obtain k² FF pairs (the positive examples) and k(N−k) FO pairs (the negative examples). Individual families, which cover a range of functional properties and domain architecture complexity, can be used for family specific tests. In addition, we constructed a test set (ALL) for general performance evaluation by merging all sets of FF and, respectively, FO pairs, yielding 853,465 positive and 40,459,204 negative examples. Performance measurements obtained with this set could be biased by the Kinase family, which is much larger than the other families. We therefore also considered the set of all sequences excluding the Kinases (ALL-Kin), resulting in 32,629 positive and 17,545,558 negative examples.
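The pair counts can be checked with a short calculation. The sketch below reads the FF count as k squared (ordered pairs, self-pairs included) and the FO count as k(N−k), and applies them to the difference between the ALL and ALL-Kin totals, on the assumption that this difference is the contribution of the Kinase family alone. The resulting family size and database size are values inferred here from the totals quoted above; they are not stated in this section.

```python
import math


def pair_counts(k, n):
    """Return (FF, FO) pair counts for a family of size k in a database of n."""
    return k * k, k * (n - k)


# Totals quoted in the text.
all_ff, all_fo = 853_465, 40_459_204
allkin_ff, allkin_fo = 32_629, 17_545_558

# Assumption: ALL minus ALL-Kin isolates the Kinase family's pairs.
kinase_ff = all_ff - allkin_ff
kinase_fo = all_fo - allkin_fo

k = math.isqrt(kinase_ff)      # inferred family size, if FF = k squared
n = kinase_fo // k + k         # inferred database size, if FO = k(N - k)

print(k, n, pair_counts(k, n) == (kinase_ff, kinase_fo))
```

Both differences are reproduced exactly by a single pair of integers, which supports reading the FF count as k squared rather than as the number of unordered pairs.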
Our goal is a method that can correctly identify homologs in multidomain families without degrading performance in other types of families. We therefore devised a benchmark to test a range of homology detection challenges, involving single domain as well as multidomain families. Families with complex and varied domain architectures represent the primary challenge undertaken in this study. Such families result from duplication, domain accretion, and further duplication. Some of these families are defined by a single domain that is unique to the family (e.g., Kinase), while others are characterized by a particular combination of domains (e.g., ADAM) or by a conserved set of domains with variations in domain copy number (e.g. Laminin). Modularity in both single and multidomain families can also arise through the presence of sequence motifs, such as subcellular localization signals, transactivation sequences (e.g., Tbox), and functional components that confer substrate specificity (e.g. USP). These motifs can result in matches to unrelated sequences. In addition, promiscuous domains challenge homology identification because they can result in significant sequence similarity but carry little information about gene homology. Promiscuity can confound reliable detection of homologs even in families with conserved domain architectures.
Remote homology detection is a serious challenge that has received widespread attention. In our dataset, this challenge is represented by FGF, TNF, TNFR, and USP, families that exhibit low sequence conservation. Finally, we considered homologous pairs with short conserved regions. A minimum alignment coverage criterion is frequently imposed to eliminate domain-only matches, reflecting a widely held, but untested belief that homologous pairs have regions of similarity that cover a substantial fraction of their length. To test the robustness of homology detection methods with respect to alignment length, we included single domain families with short conserved regions such as the Tbox family.
Our selection of test families was limited to those for which it was possible to obtain evidence concerning their evolutionary history. Evolutionary evidence was obtained from published articles and/or curation by a nomenclature committee. In the best cases, direct syntenic evidence of vertical descent can be found. In other cases, indirect evidence such as conserved intron/exon structure is used. Phylogenetic evidence can confirm vertical descent, for example, if all domains in a family have consistent phylogenies. However, phylogenetic disagreement between core and auxiliary domains does not rule out homology according to our model. For each, the evidence used is described in Text S1.
Accuracy of Homolog Identification
We evaluated Neighborhood Correlation using our benchmark, and compared its performance with other methods currently in use. We considered performance on multidomain homology identification, as well as overall performance on diverse, heterogeneous datasets. We also used Neighborhood Correlation to predict novel homologous relationships.
Methods Compared
We compared the performance of Neighborhood Correlation with BLAST [61], alignment coverage [27], and PSI-BLAST [58], methods commonly used for assessing homology, as well as Domain Architecture Comparison (DAC), a recently introduced approach that compares sequences by considering their constituent domains [23]–[26],[55].
BLAST gives a measure of sequence similarity based on the optimal local alignment between two sequences. BLAST does not capture gene structure (e.g., domain organization), nor does it reflect additional information that might be derived from suboptimal local alignments. BLAST is widely used, its behavior is well understood, and its scores are easily compared with those from other studies. A great deal of attention has been devoted to tuning BLAST performance and to developing accurate statistical tests. It represents an attractive balance between rigor and speed.
A significant BLAST score is evidence of similarity greater than that expected by chance, but cannot distinguish whether that similarity stems from vertical descent or domain insertion. In order to eliminate domain-only matches, many analyses combine sequence similarity with alignment coverage to identify homologs [28]–[37]. To be considered homologous, sequence pairs must then satisfy a second criterion in addition to significant sequence similarity: the fraction of the sequence length covered by the optimal local alignment must meet a pre-specified threshold. To our knowledge, alignment coverage criteria have never been empirically evaluated. In this work, we demonstrate that such a requirement is highly detrimental to performance overall, and in nearly all tested families.
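The two-criterion filter described above can be sketched as follows. This is a minimal illustration, not the procedure used in this study: the coverage definition (aligned span divided by the length of the longer sequence), the function names, and the default thresholds are all assumptions made for the example. The definition of alignment coverage actually used here is given in Methods.

```python
def alignment_coverage(query_start, query_end, subject_start, subject_end,
                       query_len, subject_len):
    """Fraction of the longer sequence covered by the optimal local alignment.

    Coordinates are 1-based and inclusive. Dividing by the longer sequence
    is an assumption of this sketch, not necessarily the paper's definition.
    """
    query_span = query_end - query_start + 1
    subject_span = subject_end - subject_start + 1
    return min(query_span, subject_span) / max(query_len, subject_len)


def passes_filter(evalue, coverage, max_evalue=1e-10, min_coverage=0.3):
    """Accept a pair only if it is both significant and sufficiently covered."""
    return evalue <= max_evalue and coverage >= min_coverage


# Hypothetical pair: a 100-residue shared region in two 1000-residue proteins.
cov = alignment_coverage(201, 300, 451, 550, 1000, 1000)
print(cov, passes_filter(1e-30, cov))
```

Under these assumptions the pair is rejected despite its highly significant E-value. The same rule would also reject a genuinely homologous pair whose conserved region is short, which is the failure mode examined in this work.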
In the presence of high sequence divergence, BLAST is limited by the amount of information that can be derived from pairwise comparison. To address this problem, approaches based on multiple sequence alignments (MSAs) have been used to increase sensitivity. PSI-BLAST, one of the most widely used examples of this approach, constructs a Position Specific Scoring Matrix (PSSM) through iterative search and has been shown to dramatically improve sensitivity [62]. MSA-based methods are designed to detect remote homology, not multidomain homology. Since sequences with different architectures cannot be aligned, MSA-based methods are not a natural choice for multidomain homology detection. We included PSI-BLAST in our study because it is widely used as a standard for remote homology detection.
In addition to sequence based methods, we considered direct comparison of domain architectures for multidomain homology detection. Each sequence was represented by a linear sequence of Pfam domains. Linker sequences between domains were ignored, as was sequence variation between instances of a given Pfam domain family. The resulting domain architectures were compared based on their domain composition. In a previous study, we proposed and evaluated 21 different methods for comparing domain architectures [23]. These methods considered properties such as the number of shared domains, domain copy number, total number of domains in a protein, domain order, and domain promiscuity. We included the domain architecture comparison strategy that exhibited the best performance from that study in our current study. This method assigns a score to each pair based on the number of shared domains (see Methods), following the rationale that homologous pairs will have more domains in common than pairs related through domain insertion. In assessing similarity, each domain is assigned a weight inversely proportional to its promiscuity. This reflects the assumption that rare domains convey more information about homology than promiscuous domains.
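A weighted shared-domain score of the kind described above might look like the following sketch. The exact scoring function is specified in Methods; here we simply assume that each shared domain contributes a weight of 1/promiscuity and that copy number and domain order are ignored. The function name, domain labels, and promiscuity values are invented for illustration.

```python
def dac_score(arch_a, arch_b, promiscuity):
    """Sum of weights over the domains shared by two architectures.

    arch_a, arch_b: sequences of domain names (order and copy number ignored).
    promiscuity: dict mapping each domain name to a positive promiscuity value.
    The weight 1 / promiscuity is an assumed form of "inversely proportional".
    """
    shared = set(arch_a) & set(arch_b)
    return sum(1.0 / promiscuity[domain] for domain in shared)


# Hypothetical promiscuity values: "kin" is rare, "ig" and "fn3" are promiscuous.
promiscuity = {"kin": 2.0, "ig": 50.0, "fn3": 40.0}

receptor_kinase = ["ig", "ig", "kin"]
plain_kinase = ["kin"]
adhesion_protein = ["ig", "ig", "fn3", "fn3"]

print(dac_score(receptor_kinase, plain_kinase, promiscuity))      # shares the rare domain
print(dac_score(receptor_kinase, adhesion_protein, promiscuity))  # shares only "ig"
```

With these made-up weights, the pair sharing the rare domain scores far higher than the pair sharing only a promiscuous domain, which is the intended behavior of the weighting.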
Evaluation Procedure
The performance of each method was assessed via the ROC-n score (Table 3), which reflects both false positives and false negatives (see Methods). ROC-n is the area under the portion of the Receiver Operating Characteristic (ROC) curve that covers the top-ranking pairs up to the first n false positives. We used n = 100k, where k is family size, corresponding to 100 false positives per query.
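For readers who wish to reproduce this kind of score, ROC-n can be computed from a ranked list of labeled pairs. The sketch below follows the common formulation ROC_n = (1/(nT)) Σ t_i, where T is the total number of true homolog pairs and t_i is the number of true pairs ranked above the i-th false positive. It is an illustrative assumption rather than the exact procedure of this study (see Methods); the function and variable names are hypothetical, and tied scores are not treated specially.

```python
def roc_n(labels_ranked, n, total_positives=None):
    """ROC-n for labels ordered from best score to worst.

    labels_ranked: iterable of booleans, True for a homologous pair.
    n: number of false positives at which the curve is truncated.
    total_positives: T; defaults to the number of True labels in the list.
    """
    labels_ranked = list(labels_ranked)
    if total_positives is None:
        total_positives = sum(labels_ranked)
    if total_positives <= 0 or n <= 0:
        raise ValueError("need at least one positive and n >= 1")

    true_pos = 0
    false_pos = 0
    area = 0
    for is_homolog in labels_ranked:
        if is_homolog:
            true_pos += 1
        else:
            false_pos += 1
            area += true_pos          # t_i for this false positive
            if false_pos == n:
                break
    # If the list ends before n false positives are seen, each missing
    # false positive is credited with all true positives found so far.
    area += (n - false_pos) * true_pos
    return area / (n * total_positives)


ranking = [True, True, False, True, False, False]
print(roc_n(ranking, n=2))
```

For a family of size k, the truncation point used here would be passed as `n = 100 * k`.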
Table 3. ROC-100k scores for Neighborhood Correlation, BLAST, PSI-BLAST, and Domain Architecture Comparison for all families.
doi:10.1371/journal.pcbi.1000063.t003
In evaluating homology identification methods, we consider two user models. Genome-scale analyses require all-against-all comparison of a large and heterogeneous set of sequences. In order to be suitable for automated, genomic analyses, a method must be robust enough for use without human intervention, deliver consistent behavior on different types of domain architectures, and be fast and easy to use. In this case, the goal is to maximize the total number of homolog pairs that are correctly predicted. A second application is analysis of individual families, where the goal is to obtain good per-family prediction scores over a wide range of families.
To evaluate performance for both user models, we report ROC-100k scores for all pairs (ALL and ALL-Kin), as well as ROC-100k scores for each family. To show how the methods tested behave on proteins with various attributes, we also report the average ROC-100k score per family for single domain families, multidomain families with conserved architectures, and multidomain families with variable architectures.
As a visualization tool, we generated rank plots, which show the scores of all matches to a given query sequence in rank order. Rank plots provide a visual representation of the organizational structure of the network neighborhood of the query sequence, as well as organizational substructure within the family. For example, Figure 5 shows a rank plot for the query sequence PDGFRB, a protein tyrosine kinase. The break in the curve in Figure 5A at NC≈0.8 corresponds to the first match to a Serine/Threonine Kinase, the inflection point at NC≈0.75 corresponds to the first match to a Dual-Specificity Kinase, and the downward plunge at NC≈0.59 corresponds to the first Casein Kinase. Rank plots for each of the 26,197 sequences in our dataset are provided at http://www.neighborhoodcorrelation.org.
Figure 5. Rank plots for the query sequence PDGFRB.
Family and non-family matches are shown in blue and red, respectively. Matches with the Kinase PRKG1B and the non-Kinase NCAM2 are indicated by magenta and green circles. Scores of matching sequences ranked by (A) Neighborhood Correlation score, (B) BLAST score, and (C) PSI-BLAST score.
doi:10.1371/journal.pcbi.1000063.g005
Neighborhood Correlation Performance
When all considered classifiers are applied to the aggregate set of sequence pairs (ALL), Neighborhood Correlation dramatically outperforms the other three methods (Table 3, Figures S1 and S2). In the ALL-Kin dataset, Neighborhood Correlation yields better performance than BLAST and PSI-BLAST, but performs slightly worse than DAC. The superior performance of Neighborhood Correlation on the ALL and ALL-Kin datasets demonstrates that its optimal classification threshold is less sensitive to family specific properties than those of BLAST or PSI-BLAST.
When performance on individual families is considered, Neighborhood Correlation is generally more robust than the other three methods. It perfectly classifies twelve families, more than any other method. In addition, in 16 of 20 families, the discriminatory performance of Neighborhood Correlation is better than or equal to that of all other methods. In particular, Neighborhood Correlation obtains the highest average score for both conserved and variable architectures and performs much better on individual multidomain families except for Myosin and Kinesin. For families with high sequence divergence, including FGF, TNF, and USP, Neighborhood Correlation performs better than BLAST, indicating that neighborhood structure can compensate for a low signal to noise ratio in pairwise comparisons of remote homologs. PSI-BLAST also performs well in such cases.
To demonstrate why Neighborhood Correlation is more effective for complex families, we consider its performance on the Kinase family. Figure 5 shows a rank plot of the results of a query with the Kinase PDGFRB. A robust method is expected to rank all Kinase family members before non-Kinase matches. In particular, we examine pairing between the Kinase PRKG1B and the non-Kinase NCAM2, the genes depicted in Figure 2. Neighborhood Correlation exhibits no difficulty separating these pairs. The match with PRKG1B scores substantially higher than NCAM2 (indicated by magenta and green circles, respectively, in Figure 5). In contrast, the BLAST scores for these sequences are indistinguishable, and the PSI-BLAST scores for these sequences are reversed: The match to NCAM2 obtains θ = 3.65e−40, while the match to PRKG1B is much less significant (θ = 1.26e−25). How typical are these examples? As shown in Figure 6, the sequence similarity distributions of FF and FO pairs overlap completely for BLAST and partially for PSI-BLAST. In contrast, the Neighborhood Correlation score distributions for family and non-family matches are largely distinct, with only a limited overlap in the tails of the distributions.
Figure 6. Distribution of scores for all family and non-family pairs in the Kinase family.
Family and non-family matches are shown in blue and red, respectively. (A) Neighborhood Correlation scores, (B) BLAST scores, and (C) PSI-BLAST scores.
doi:10.1371/journal.pcbi.1000063.g006
Neighborhood Correlation also delivers robust performance when sensitivity (Sn) and specificity (Sp) are considered independently. For example, when matches to the query sequence PDGFRB are ranked by Neighborhood Correlation score (Figure 5A), a cutoff of NC = 0.3 results in three false positives with only ten false negatives. In contrast, a BLAST threshold of E<3e−10 results in three false positives and 630 false negatives (Figure 5B). The number of false negatives obtained with PSI-BLAST at this specificity is even greater (Figure 5C). More generally, the ROC-n curves for the Kinase family in Figure 7 demonstrate that Neighborhood Correlation achieves both higher sensitivity and higher specificity than BLAST, except at very high specificity, and always outperforms PSI-BLAST by both measures. Neighborhood Correlation simultaneously achieves Sn≈0.85 and Sp≥0.999. At this specificity, Sn≈0.7 for PSI-BLAST and Sn≈0.55 for BLAST.
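The threshold-based counts discussed above follow from the usual definitions Sn = TP/(TP+FN) and Sp = TN/(TN+FP). The sketch below shows the calculation for scores where higher means more similar, as with Neighborhood Correlation; for E-values the comparison would be reversed. The function name and the toy scores are hypothetical and are not taken from our dataset.

```python
def sensitivity_specificity(scores, is_homolog, threshold):
    """Return (Sn, Sp) when pairs with score >= threshold are called homologs."""
    tp = fp = tn = fn = 0
    for score, truth in zip(scores, is_homolog):
        predicted = score >= threshold
        if predicted and truth:
            tp += 1
        elif predicted and not truth:
            fp += 1
        elif truth:
            fn += 1
        else:
            tn += 1
    # Undefined ratios (no positives or no negatives) are reported as NaN.
    sn = tp / (tp + fn) if (tp + fn) else float("nan")
    sp = tn / (tn + fp) if (tn + fp) else float("nan")
    return sn, sp


# Toy example with made-up scores and labels.
scores = [0.9, 0.7, 0.35, 0.2, 0.1, 0.05]
labels = [True, True, False, True, False, False]
sn, sp = sensitivity_specificity(scores, labels, threshold=0.3)
print(sn, sp)
```

Sweeping the threshold over the observed score range and recording each (Sn, Sp) pair traces out a ROC curve of the kind shown in Figure 7.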
Figure 7. ROC-100k curves for the Kinase family for all classification methods tested.
ROC-100k curves of Neighborhood Correlation (blue), BLAST (red), PSI-BLAST (magenta), DAC (purple) and alignment coverage (α≥0.3: green, α≥0.6: yellow, α ≥0.8: orange).
doi:10.1371/journal.pcbi.1000063.g007
While the other methods considered have strengths specific to particular challenges, Neighborhood Correlation delivers the most reliable and consistent performance on large, heterogeneous datasets. Neighborhood Correlation is, therefore, particularly well suited to automated genome-scale analyses, which require that a single classification threshold be suitable for the vast majority of sequence pairs in a genomic dataset. Moreover, Neighborhood Correlation is robust. The distribution of Neighborhood Correlation scores for all sequence pairs in our dataset (Figure S3) has a flat trough ranging from 0.4 to 0.8. Within this range, the prediction quality will be relatively insensitive to the choice of threshold. A putative set of mouse and human homologs imposed by a threshold of NC≥0.6 on all sequence pairs in our dataset is available at http://www.neighborhoodcorrelation.org.
PSI-BLAST
As expected, PSI-BLAST excels at families with low sequence conservation, such as TNF and USP, and generally performs well on single domain families. However, PSI-BLAST falters on complex multidomain families and on sequences with promiscuous domains. PSI-BLAST's average ROC-100k scores for both conserved and variable multidomain families are inferior to those of both Neighborhood Correlation and BLAST. This is exemplified by PSI-BLAST's poor performance (Figure 5C) when querying with PDGFRB, which has two copies of the highly promiscuous Ig domain. PSI-BLAST's iterative profile construction algorithm incorporates matches to the highly promiscuous Ig domain in the growing alignment, even when a very stringent inclusion threshold (E<1e−13) is used. As a result, unrelated sequences that contain Ig domains match the resulting profile with better scores than Kinases without Ig. PSI-BLAST performs better on the Kinase family as a whole than it does on PDGFRB (Table 3) because many Kinases are single domain proteins.
When classification of heterogeneous data is considered, PSI-BLAST's performance is inferior to Neighborhood Correlation on the ALL dataset and to both Neighborhood Correlation and BLAST on the ALL-Kin dataset. This demonstrates that no single PSI-BLAST cutoff is suitable for all families. Indeed, inspection of PSI-BLAST output on individual queries (data not shown) indicates that PSI-BLAST scores tend to vary widely from family to family. PSI-BLAST introduces a clear tradeoff between sensitivity and generality, to the particular detriment of large-scale studies. Moreover, PSI-BLAST is less stable and has a longer running time than BLAST or Neighborhood Correlation.
Domain Architecture Comparison
Domain architecture comparison performs well on single domain families and on multidomain families with conserved domain architectures (e.g., DVL, Notch, Laminin, and WNT). Like PSI-BLAST, DAC can recognize distant homology because domain architectures are recognized by MSA-based models. The performance of DAC on other families is mixed, however, because it faces a number of challenges that do not arise with the other classification methods.
First, all domain architecture comparison methods are substantially restricted by the limitations of domain detection. In our dataset, 12.7% of sequences do not have domain annotations, resulting in low ROC-100k scores for many families. This explains why single domain families, such as Tbox, which have identical domain architectures, do not achieve perfect ROC-100k scores, contrary to expectations. An additional shortcoming is that domain architecture comparison methods do not capture information in linker sequences or sequence variation within a domain family. Therefore, domain architecture comparison tends to assign the same score to pairs that actually differ in sequence divergence. This explains the long plateaus in the ROC curve for DAC in Figure 7.
A particularly challenging problem for domain architecture comparison is how to effectively distinguish domains that proliferated through gene duplication from promiscuous domains that proliferated through domain shuffling. The number of domain partners, used here, is a typical measure of promiscuity, based on the assumption that this measure reflects the frequency of domain insertion [48]. This measure of promiscuity will inappropriately down-weight a domain that characterizes a family, if the domain happens to be the target of insertions of many other domains. Consider, for example, a sequence with a single domain A that sustains repeated duplication, followed by insertion of different domains into the resulting copies, yielding AB, AC, AD, and so on. Domain A will have a high promiscuity score, although it is never inserted into new contexts. As a concrete example, the Pkinase domain partners with more than 100 different domains. However, the resulting high promiscuity score may be inappropriate since Pkinase lacks many of the other characteristics of promiscuous domains, such as small size and 1-1 phase [17], and is important in defining the Kinase family. This explains why domain architecture comparison performs poorly on the Kinase family.
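The AB, AC, AD scenario can be made concrete with a short sketch that counts domain partners, the promiscuity measure discussed above. The function name and the single-letter domain labels are illustrative only; this is not the code used in our analysis.

```python
from collections import defaultdict


def partner_counts(architectures):
    """Count, for each domain, the distinct other domains it co-occurs with."""
    partners = defaultdict(set)
    for architecture in architectures:
        domains = set(architecture)
        for domain in domains:
            # A single-domain protein registers the domain with no partners.
            partners[domain].update(domains - {domain})
    return {domain: len(others) for domain, others in partners.items()}


# Domain A duplicates, then B, C, and D are each inserted into one copy.
architectures = [("A", "B"), ("A", "C"), ("A", "D")]
counts = partner_counts(architectures)
print(counts["A"], counts["B"], counts["C"], counts["D"])  # 3 1 1 1
```

Domain A receives the highest partner count even though it is the recipient of every insertion and is never itself inserted into a new context, so a weight inversely proportional to this count penalizes exactly the domain that defines the family.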
Alignment Coverage
To assess the effectiveness of alignment coverage in eliminating domain-only matches, we compared ROC-100k scores for sequence similarity alone and combined with alignment coverage (α, see Methods). We considered three alignment coverage thresholds, α≥0.3, α≥0.6, and α≥0.8, that span the range of length cutoffs used in the literature (e.g. [32],[34]). The results (Table 4) show that the addition of an alignment coverage criterion does not improve the performance of sequence similarity. For example, a cutoff of α≥0.3 reduces the ROC-100k score by 25% in the ALL dataset and 23% in the ALL-Kin dataset. When families are considered individually, a cutoff of α≥0.3 decreases the ROC-100k score by at least 10% in one-third of the families. Increasing the cutoff to α≥0.6 or α≥0.8 does not increase performance in any family. Note that although the ROC-100k score for KIR when α≥0.6 is higher than the score for sequence similarity alone, this difference is not significant (p = 0.69).