Conceived and designed the experiments: WCW FE. Performed the experiments: WCW. Analyzed the data: WCW SMS FE. Contributed reagents/materials/analysis tools: WCW. Wrote the paper: WCW SMS FE.
The authors have declared that no competing interests exist.
Large-scale genome sequencing gained general importance for life science because functional annotation of otherwise experimentally uncharacterized sequences is made possible by the theory of biomolecular sequence homology. Historically, the paradigm of similarity of protein sequences implying common structure, function and ancestry was generalized based on studies of globular domains. Having the same fold imposes strict conditions over the packing in the hydrophobic core requiring similarity of hydrophobic patterns. The implications of sequence similarity among non-globular protein segments have not been studied to the same extent; nevertheless, homology considerations are silently extended for them. This appears especially detrimental in the case of transmembrane helices (TMs) and signal peptides (SPs) where sequence similarity is necessarily a consequence of physical requirements rather than common ancestry. Thus, matching of SPs/TMs creates the illusion of matching hydrophobic cores. Therefore, inclusion of SPs/TMs into domain models can give rise to wrong annotations. More than 1001 domains among the 10,340 models of Pfam release 23 and 18 domains of SMART version 6 (out of 809) contain SP/TM regions. As expected, fragment-mode HMM searches generate promiscuous hits limited to solely the SP/TM part among clearly unrelated proteins. More worryingly, we show explicit examples that the scores of clearly false-positive hits, even in global-mode searches, can be elevated into the significance range just by matching the hydrophobic runs. In the PIR iProClass database v3.74 using conservative criteria, we find that at least between 2.1% and 13.6% of its annotated Pfam hits appear unjustified for a set of validated domain models. Thus, false-positive domain hits enforced by SP/TM regions can lead to dramatic annotation errors where the hit has nothing in common with the problematic domain model except the SP/TM region itself. 
We suggest a workflow of flagging problematic hits arising from SP/TM-containing models for critical reconsideration by annotation users.
Sequence homology is a fundamental principle of biology. It implies common phylogenetic ancestry of genes and, subsequently, similarity of their protein products with regard to amino acid sequence, three-dimensional structure and molecular and cellular function. Originally an esoteric concept, homology with the proxy of sequence similarity is used to justify the transfer of functional annotation from well-studied protein examples to new sequences. Yet, functional annotation via sequence similarity seems to have hit a plateau in recent years since relentless annotation transfer led to error propagation across sequence databases; thus, leading experimental follow-up work astray. It must be emphasized that the trinity of sequence, 3D structural and functional similarity has only been proven for globular segments of proteins. For non-globular regions, similarity of sequence is not necessarily a result of divergent evolution from a common ancestor but the consequence of amino acid sequence bias. In our investigation, we found that protein domain databases contain many domain models with transmembrane regions and signal peptides, non-globular segments of proteins having hydrophobic bias. Many proteins have inherited completely wrong function assignments from these domain models. We fear that future function predictions will turn out futile if this issue is not immediately addressed.
Following the request of a collaborator to hypothesize about the function of Eco1, an uncharacterized yeast gene at that time, the application of the full battery of sequence-based prediction tools
The theory of biomolecular sequence homology and its practical application for predicting function for uncharacterized genes by annotation transfer from well-studied homologues is one of the few achievements of theoretical biology that have significance for all fields of life science
This general theme has received two variations. The first is introduced by the notion of the protein domain
The second issue is that many segments do not have globular structures at all
Thus, sequence similarity can either be due to homology (common ancestry) or convergent evolution (common selective pressure). The criterion of sequence similarity for inferring homology is actually applicable only to globular segments, and non-globular parts should be excluded from starting sequences in homology searches. The special case of amino acid compositional bias was recognized early, and it has always been advised to exclude such segments from similarity searches when hunting for distantly related proteins. For the BLAST/PSI-BLAST suite, the SEG program was advised to suppress at least the most obvious low complexity regions
The unsupervised inclusion of transmembrane helices and signal peptide segments in homology searches is especially prone to erroneous addition of unrelated sequences to the sequence family under study since the systematic coincidence of hydrophobic positions creates the appearance of similarity in the hydrophobic pattern, otherwise the key to sequence homology among globular sequence segments
Similar precautions are generally out of scope when protein domain model libraries are applied for function prediction over query sequences, especially in a genome-wide mode. It is desirable to have systematic factors that might cause spurious annotations such as isolated similarities to signal peptides or some types of transmembrane helices be suppressed during the annotation workflow.
When checking domain databases for the inclusion of transmembrane helices and signal peptides into the domain model, we found more than a thousand domain instances in Pfam and a small number of examples even in SMART. These hidden Markov models (HMMs) can be a systematic cause of spurious similarity hits, especially if the HMM-based sequence scan is applied in the local search mode. In this work, we wish to emphasize that these domain models can also give rise to wrong hits even in the global search mode, where the high score from the membrane-helical part can mask the absence of a match for the associated globular domains. For the reader's convenience, database search results, alignments, domain library entry lists and files with “cleanup” domain models as referred to in the following text are provided as supplementary material at the associated WWW site
Since the SMART database
In brief, we recovered the full length protein sequences that contained the segments in a given alignment of SMART version 6, applied 5 TM and 2 SP predictors published in the literature and we checked overlap of predicted SP/TM regions with the alignment segments. For an alignment position to be considered part of a predicted TM or SP region, the respective residue must be included into the predicted range in a critical number of sequences and by a certain number of prediction tools determined by a statistical criterion based on the binomial distribution (significance value 0.05).
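The binomial criterion can be illustrated with a minimal sketch. It assumes a hypothetical background probability p that a single prediction includes a residue by chance; the value of p, the function names and the example numbers are illustrative assumptions and are not taken from the procedure used in this work.

```python
from math import comb


def binomial_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def critical_count(n, p, alpha=0.05):
    """Smallest k with P(X >= k) <= alpha, or None if no k <= n qualifies.

    n     : number of independent calls (e.g. sequences in the alignment)
    p     : assumed chance probability of a positive call (hypothetical)
    alpha : significance value (0.05 as stated in the text)
    """
    for k in range(n + 1):
        if binomial_tail(n, k, p) <= alpha:
            return k
    return None


# With 10 sequences and an assumed background rate of 0.1, at least
# 4 sequences must carry the prediction at this alignment position.
print(critical_count(10, 0.1))
```

The same function could be applied a second time over the number of prediction tools, with its own assumed background rate.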
For each predicted TM or SP region, we derive a score as the arithmetic mean of the logarithmic probabilities of SP/TM prediction over all alignment columns involved (
In contrast to the Pfam test described below, SMART version 6 alignments contain pleasantly few SP/TM regions. With a TM-score cutoff of ≥−12 (FP rate of 4.67%) and an SP-score cutoff of ≥−1 (FP rate of 4.02%), the numbers of predicted TM helices and signal peptides are 40 and 5, respectively. At the domain level, this translates to 13 problematic domains with TMs and 5 with SPs (
Domain name | Type | Predicted segments | Validated Segments | Comments |
SM00019 : SF_P (Pulmonary surfactant protein) | TM | 33–58 | 1–58# | The N-terminal propeptide 1–58 of NP_003009 forms a TM when induced by a Brichos domain |
SM00157 : PRP (Major prion protein) | TM | 117–140 | 112–135# | Latent transmembrane region in human prion protein BAG32277 |
SM00665 : B561 (Cytochrome B561/ferric reductase TM domain) | TM | 4–146 | N/a | Intrinsic membrane protein |
SM00714 : LITAF (LPS-induced tumor necrosis factor α factor) | TM | 38–61 | N/a | The LITAF domain appears to have a membrane-inserted motif (although without transmembrane segment) |
SM00724 : TLC (TRAM, LAG1 and CLN8 homology domains) | TM | 10–76; 216–238; 287–307 | N/a | Proof for 8 membrane-spanning segments in Lag1p (NP_011860) and Lac1p (NP_012917) |
SM00730 : PSN (Presenilin, signal peptide peptidase, family) | TM | 5–27; 113–134; 214–285; 600–649 | 4–25#; 115–133#; 214–231#; 241–257#; 260–283#; 602–621#; 628–644# | Out of 10 TM regions shown for human presenilin-1 (AAB46371), 9 are in the domain alignment out of which 7 are predicted here |
SM00752 : HTTM Horizontally transferred transmembrane domain | TM | 12–25; 75–95; 275–294; 338–357 | N/a | Domain is known to have 4 TM regions |
SM00756 : VKc (catalytic subunit of vitamin K epoxide reductase) | TM | 12–30; 104–192 | 13–32#; 142–189# | VKORC1 (Q9BQB6) is a membrane protein |
SM00780 : PIG-X (Mammalian PIG-X and yeast PBN1) | TM | 230–248 | 230–252# | PBN1 (CAA42392) is a type I transmembrane protein in the endoplasmic reticulum |
SM00786 : SHR3_chaperone (ER membrane protein SH3) | TM | 7–111; 167–186 | N/a | Shr3p (NP_010069) has 4 membrane segments |
SM00793 : AgrB (Accessory gene regulator B) | TM | 42–204 | N/a | |
SM00815 : AMA-1 (Apical membrane antigen 1) | TM | 522–527 | 515–602# | Segment missing in structure 1W81_A |
SM00831 : Cation_ATPase_N (Cation transporter/ATPase, N-terminus) | TM | 72–90 | 65–94# | Segment maps to a TM helix of the β-domain of 1KJU_A |
SM00190 : IL4_13 (Interleukin 4/13) | SP | 1–20 | 1–23# | Annotated as secreted. Segment missing in structure 1ITL_A |
SM00476 : DNaseIc (deoxyribonuclease I) | SP | 1–19 | 1–17# | Annotated as secreted. Segment missing in structure 1DNK_A |
SM00770 : Zn_dep_PLPC (Zn-dependent phospholipase C, α toxin) | SP | 4–26 | 1–64# | Annotated as secreted. Segment missing in structure 1OLP_A |
SM00792 : Agouti | SP | 1–19 | 1–89# | Annotated as secreted. Segment missing in structure 1Y7J_A |
SM00817 : Amelin (Ameloblastin precursor) | SP | 11–28 | 1–26# | Protein AAG27036 |
Both the predicted and, if explicitly available in the literature, the validated segments of TM regions or signal peptides are provided in the sequence count of the respective SMART domain alignment. In cases marked with “#”, the sequence positions are with respect to the reference sequence given in the comments.
These 18 predictions were manually validated: (i) If the respective predicted segments were indeed structural helices and not SPs/TMs, they should be part of one of the nearest globular domains in the sequence. The alignment sequences were searched against the sequences with known 3D structure from the Protein Data Bank (PDB) for any significant hits (with the generous Blast E-value≤0.1) and we checked whether the predicted SP/TM region overlaps with the segment covered by the structure. If the predicted SP/TM region was missing in the structure or if it was described as a TM helix in the structural report, we considered the prediction as validated. (ii) Without structural hits, we searched the scientific literature for topological information about membrane embedding of reference sequence segments.
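The overlap check in step (i) amounts to comparing sequence ranges. The following sketch assumes inclusive 1-based coordinates and treats a predicted region as "missing in the structure" when it shares no position with any structurally resolved segment; this zero-overlap rule, the function names and the coordinates are illustrative assumptions only.

```python
def overlap_length(a, b):
    """Number of positions shared by two inclusive 1-based ranges (start, end)."""
    start = max(a[0], b[0])
    end = min(a[1], b[1])
    return max(0, end - start + 1)


def region_missing_in_structure(predicted, resolved_segments):
    """True if no resolved structure segment overlaps the predicted SP/TM range."""
    return all(overlap_length(predicted, seg) == 0 for seg in resolved_segments)


# Hypothetical example: a predicted signal peptide at 1-19 and a structure
# that covers residues 24-129 of the same sequence.
print(region_missing_in_structure((1, 19), [(24, 129)]))
```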
As the information collated in
In SMART version 6, the total number of domains with predicted SP/TM segments peaks at 18, which makes up 2.2% of the 809 SMART domains (see top). Red triangles mark time points for the years 1998, 2002 and 2009 when the total number of domain models was 86, 600 and 809, respectively. In Pfam, the total number of problematic domains peaks at 1214, which makes up 11.8% of the 10340 Pfam domains (see bottom). Likewise, red triangles mark the years 1999, 2002 and 2008 with 1465, 3360 and 10340 Pfam entries, respectively.
Given that our SP/TM detection procedure provides statistical error measures for the prediction, it can be reasonably applied on the body of Pfam domain models. When this work was started, the available Pfam version was release 23 constructed with the HMMER2 package. About 19% (1937 out of 10340) of Pfam-A domains in release 23
The top part shows the histogram of average log probability per predicted transmembrane helix; the bottom part shows the same per predicted signal peptide. The log probability provided on the x-axis is calculated with equations 5 and 6. At the
At the domain level, this implies 1079 (10.4%) and 164 (1.6%) out of 10340 Pfam-A domains having TM or SP regions included into the domain alignment (
The top part shows the average log probability per predicted transmembrane helix calculated per domain; the bottom part shows the same per predicted signal peptide. Whereas the y-axis shows the log probability in accordance with equation 6 applied over all predicted segments for a given domain, the x-axis represents their cumulative length. At the
Among our 164 domains with SP predictions, we might expect 6.6 (∼7) wrong predictions. On average, each domain with predicted TM regions contains about 3.6 (3849/1079) TM helices, out of which 0.17 (4.67% of 3.6) represent false-positive TM helices. We might expect that about 50 domains out of the 1079 domains are wrongly included into this list. Even if we remove those values from the total number of 1214 problematic domain models (1050 TM, 135 SP and 29 concurrent TM and SP errors), Pfam-A release 23 still contains more than 1001 critical cases as claimed in the title of this article.
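The arithmetic of this paragraph can be retraced in a few lines. All counts and rates are the ones quoted in the text; reading the "about 50 domains" as 4.67% of the 1079 TM-containing domains is our interpretation for this sketch, and the variable names are illustrative.

```python
from math import ceil

# Counts quoted in the text for Pfam-A release 23
tm_only, sp_only, both = 1050, 135, 29
predicted_tm_helices = 3849
fp_rate_tm, fp_rate_sp = 0.0467, 0.0402

tm_domains = tm_only + both            # domains with predicted TM regions
sp_domains = sp_only + both            # domains with predicted SP regions
total_problematic = tm_only + sp_only + both

expected_fp_sp = sp_domains * fp_rate_sp          # about 6.6
helices_per_domain = predicted_tm_helices / tm_domains
fp_helices_per_domain = fp_rate_tm * helices_per_domain
expected_fp_tm = tm_domains * fp_rate_tm          # about 50 (assumed reading)

# Conservative remainder after subtracting the expected wrong predictions
remaining = total_problematic - round(expected_fp_tm) - ceil(expected_fp_sp)
print(tm_domains, sp_domains, total_problematic, remaining)
```

Even after this subtraction, the remainder stays above 1001.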
The domain alignments in Pfam and SMART are used for the derivation of hidden Markov models (HMMs) that, in turn, are applied for searching matches in query sequences with programs of the HMMER packages
With SP/TM regions as part of the domain alignment, the respective HMMs are no longer useful for local mode searches since a match in the TM or SP region alone without any other sequence similarity to the query sequence can be sufficient to cause a false-positive fragmentary domain hit as in the introductory case of “Alt a 1”. Further illustrative examples are provided in
We show illustrative examples for six Pfam release 23 models: Herpes_glycop_D (PF01537.9), CDC50 (PF03381.7), Cation_ATPase_N (PF00690.18), GSPII_F (PF00482.11), PAP2 (PF01569.13) and HCV_NS4b (PF01001.11). The black boxes denote the problematic domain annotations in the respective sequences. Additional material such as hmmpfam outputs and alignments is available at the associated BII WWW site for this work. Domain architecture illustrations were created with DOG 1.5
One of the referees brought up the argument that some of the sequences in
The model Herpes_glycop_D (PF01537.9) has a membrane-helix region that, together with its linkers on both sides, is the sole part of a match in the fragmented search mode for a large variety of taxonomically and functionally diverse proteins, out of which eight architectures are presented here. Similarly, the TM region (plus surrounding polar linkers) of model CDC50 (PF03381.7) significantly hits proteins with at least three different architectures in the fragmented HMM search.
For another 4 domain models, Cation_ATPase_N (PF00690.18), GSPII_F (PF00482.11), PAP2 (PF01569.13) and HCV_NS4b (PF01001.11), provided as further illustrative examples, the respective TM region hits a single TM helix segment of several seemingly unrelated proteins. In all cases, their alignment scores were above their family-wise gathering score thresholds.
Not surprisingly, the global search mode that forces a complete match of the domain model over a subsegment of the query sequence is the standard regime for running hmmsearch and hmmpfam of the HMMER2 package. Typically, a positive hit is recognized either by a score above a so-called gathering threshold (which is determined empirically by the creator of the Pfam domain model and supplied together with it) or by an E-value below a trusted limit (such as 0.1, see page 23 of the HMMER2 user guide). It is particularly worrying that a number of domain models with SP/TM regions included generate quite convincing E-values for unrelated sequences even in the global search mode. In all these cases, the match of a hydrophobic region in the query with the hydrophobic segments of these validated SP/TM regions is the reason for the elevated score, which frequently surpasses even the gathering score threshold.
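The two hit criteria can be sketched as follows. The E-value is written in the usual extreme-value form, E = N · P(S ≥ x) with P(S ≥ x) = 1 − exp(−exp(−λ(x − μ))), where μ and λ are the parameters fitted during model calibration and N is the number of database sequences. This is a simplified illustration and not a reproduction of the HMMER2 code; the parameter values and function names below are hypothetical.

```python
import math


def evd_pvalue(score, mu, lam):
    """P(S >= score) under an extreme value distribution with location mu, scale lam."""
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for very small x
    return -math.expm1(-math.exp(-lam * (score - mu)))


def evalue(score, mu, lam, n_seqs):
    """Expected number of chance hits with at least this score in n_seqs sequences."""
    return n_seqs * evd_pvalue(score, mu, lam)


def call_hit(score, e_val, gathering_threshold, e_cutoff=0.1):
    """Evaluate both criteria mentioned in the text side by side."""
    return {
        "by_gathering_threshold": score >= gathering_threshold,
        "by_evalue": e_val <= e_cutoff,
    }


# Hypothetical calibration parameters and database size
mu, lam, n_seqs = -10.0, 0.7, 1_000_000
e_val = evalue(20.0, mu, lam, n_seqs)
print(e_val, call_hit(20.0, e_val, gathering_threshold=25.0))
```

For a fixed score, the E-value grows in proportion to the database size, whereas the gathering threshold criterion does not change.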
To investigate the effects of SP/TM regions in homology searches, two separate HMM searches against the NR database were performed for each domain under study. The first run relied on an HMM using the original alignment. For the second run, we constructed a “cleanup” alignment via the removal of the predicted TM or SP segments. The two HMMs for the hmmls style of search (global with respect to the domain and local to the query sequence) were built from the alignments using the commands ‘hmmbuild -F --amino model-file alignment-file’ and ‘hmmcalibrate --seed 0 --num 5000’. When contrasting the results of the two HMM runs at E-value≤0.1, we treat all hits of the cleanup model as true positives and scrutinize all additional hits of the original model as potential false-positive hits. We screened them for potentially contradictory annotation using sequence-analytic tools
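The comparison of the two runs can be sketched as below. The sketch assumes that the alignment is available as a list of equal-length aligned strings and that each search result has been parsed into a mapping from sequence identifier to E-value; these data structures, the function names and the toy values are illustrative assumptions, not the scripts used in this work.

```python
def remove_columns(alignment, regions):
    """Drop alignment columns covered by the given inclusive 1-based ranges."""
    drop = set()
    for start, end in regions:
        drop.update(range(start - 1, end))
    return ["".join(ch for i, ch in enumerate(row) if i not in drop)
            for row in alignment]


def classify_hits(original_hits, cleanup_hits, e_cutoff=0.1):
    """Contrast the hit lists of the original and the cleanup model.

    Both arguments map sequence identifiers to E-values.
    """
    orig = {sid for sid, e in original_hits.items() if e <= e_cutoff}
    clean = {sid for sid, e in cleanup_hits.items() if e <= e_cutoff}
    return {
        "assumed_true_positive": clean,
        "potential_false_positive": orig - clean,
    }


# Toy alignment with a "predicted SP/TM segment" in columns 2-3
print(remove_columns(["ABCDEF", "GHIJKL"], [(2, 3)]))

# Toy hit lists (identifiers and E-values are made up)
result = classify_hits({"a": 1e-5, "b": 0.05, "c": 3.0}, {"a": 1e-4, "b": 2.0})
print(result)
```

Sequences in the second set are only candidates; they still need the manual scrutiny described in the text.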
Findings for nine Pfam release 23 models, PIG-P (PF08510.4), PAP2 (PF01569.13), EMP24_GP25L (PF01105.15), PTPLA (PF04387.6), Lamp (PF01299.9), MttA_Hcf106 (PF02416.8), HAMP (PF00672.17), Nodulin_late (PF07127.3) and GRP (PF07172.3), are shown. The black boxes denote the problematic domain annotations in the respective sequences. Additional material such as hmmpfam outputs and alignments is available at the associated BII WWW site for this work. Domain architecture illustrations were created with DOG 1.5
The model PIG-P (PF08510.4) includes a segment with TM helices (positions 1–91) and a hydrophilic region (positions 92–208). In the global-mode search against the non-redundant database, the first 100 alignment positions of the model (i.e., the N-terminal part with the 2 TM helices) hit a pair of C-terminal TM helices in the four protein targets listed in
The PAP2 (type 2 phosphatidic acid phosphatase) domain model (PF01569.13) hits the sequence XP_418136.2 (
The members of the EMP24_GP25L family (PF01105.15,
In the model for PTPLA (PF04387.6,
The Lamp domain (PF01299.9,
The typical architecture of MttA_Hcf106 (PF02416.8,
HAMP protein segments comprise two α-helices connected by a linker having a characteristic motif
The architecture of the Nodulin_late domain (PF07127.3,
Further, the domain model GRP (PF07172.3,
Our final example illustrates the issue with multiple TM segments. If the linkers between them differ among query and model, the gap penalties offset some part of the score accumulated by the hydrophobic position matches. The case of claudin proteins, small membrane glycoproteins with 4 TM helices and a length below 200 AA, is instructive in this respect. In a global search mode with the PMP22_claudin model (PF00822.12), the respective HMM hits numerous sequences of γ-subunits of voltage-dependent Ca-ion channels with E-values on the order of e-7. Closer inspection of the seed alignment showed that just a single channel sequence (CCG2_mouse) was included although such channel subunits are not related to the family
The decrease in specificity of domain models harboring SP/TM regions is also accompanied by a decrease in sensitivity. In general, the need to have additional good alignment scores for the SP/TM pieces can become a burden for true-positive sequences that are incompletely sequenced or that naturally lack the SP/TM region.
By contrasting the HMM runs between the original and cleanup models, potential false-negatives were identified as hits that were found only by the cleanup models. Then (see
In
It was already suggested in the literature that unsupervised annotation transfer based on spurious sequence similarities has created a myriad of false function annotations for sequences from genome projects
We explored this issue for PIR (
Domain Name | Type, validated region of model (size) | No. of retrieved sequences | No. of FP hits where | No. of annotations without hmmpfam hits (E>10) | Total No. of unjustified hits (%) |
PF00690.18 : Cation_ATPase_N (Cation transporter/ATPase, N-terminus) | TM, 66–87 (87), ref. | 3684 | 74 | 3 | 77 (2.1%) |
PF01105.15 : EMP24_GP25L (Endoplasmic reticulum and golgi apparatus trafficking proteins) | TM, 141–167 (167), ref. | 1029 | 8 | 33 | 41 (4.0%) |
PF01299.9 : Lamp (Lysosome-associated membrane glycoprotein) | TM, 304–340 (340), ref. | 164 | 2 | 12 | 14 (8.5%) |
PF01544.10 : CorA (CorA-like Mg2+ transporter protein) | TM, 341–407 (407), ref. | 2717 | 15 | 71 | 86 (3.2%) |
PF01569.13 : PAP2 (type 2 phosphatidic acid phosphatase) | TM, 102–177 (177), ref. | 5231 | 108 | 19 | 127 (2.4%) |
PF02416.8 : MttA_Hcf106 (sec-independent translocation mechanism protein) | TM, 1–19 (74), refs. | 2085 | 283 | 0 | 283 (13.6%) |
PF04387.6 : PTPLA (protein tyrosine phosphatase-like protein) | TM, 89–168 (168), refs. | 277 | 3 | 3 | 6 (2.2%) |
PF04612.4 : Gsp_M (General secretion pathway, M protein) | TM, 1–40 (165), ref. | 401 | 19 | 6 | 25 (6.2%) |
PF07172.3 : GRP (plant glycine rich proteins) | SP, 1–49 (134), ref. | 207 | 12 | 4 | 16 (7.7%) |
PF08294.3 : TIM21 (Mitochondrial import protein) | TM, 1–36 (157), ref. | 118 | 7 | 1 | 8 (6.8%) |
PF08510.4 : PIG-P (phosphatidylinositol N-acetyl-glucosaminyl transferase subunit P) | TM, 1–67 (153), ref. | 143 | 4 | 0 | 4 (2.8%) |
In the first column, we list selected Pfam domains with their accession, identifier, description and their gathering score (as in Pfam release 23) that have TM and/or SP regions included into the model. The region in the domain alignment that includes the validated SP/TM segments (together with interlinking loops as described in
Additional material such as hmmpfam outputs and alignments is available at the associated BII WWW site for this work.
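The last column of the table follows from the other three numeric columns: the total of unjustified hits is the sum of the two error counts, and the percentage is taken relative to the number of retrieved sequences. A short sketch that recomputes these values from the table entries (the dictionary layout and function name are illustrative) is given here.

```python
# identifier: (retrieved sequences, FP hits, annotations without hmmpfam hit)
table = {
    "Cation_ATPase_N": (3684, 74, 3),
    "EMP24_GP25L": (1029, 8, 33),
    "Lamp": (164, 2, 12),
    "CorA": (2717, 15, 71),
    "PAP2": (5231, 108, 19),
    "MttA_Hcf106": (2085, 283, 0),
    "PTPLA": (277, 3, 3),
    "Gsp_M": (401, 19, 6),
    "GRP": (207, 12, 4),
    "TIM21": (118, 7, 1),
    "PIG-P": (143, 4, 0),
}


def unjustified(retrieved, fp_hits, no_hit):
    """Return (total unjustified hits, percentage of retrieved sequences)."""
    total = fp_hits + no_hit
    return total, round(100.0 * total / retrieved, 1)


summary = {name: unjustified(*row) for name, row in table.items()}
percentages = [pct for _, pct in summary.values()]
print(min(percentages), max(percentages))
```

The minimum and maximum reproduce the range of 2.1% to 13.6% quoted for this set of domain models.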
The fact that signal peptide or transmembrane helix segments are of lower sequence complexity than their globular counterparts is not widespread general knowledge. To our current understanding, there is only a comment about this issue in the BAliBASE article of Bahr
In brief, we extracted all sequences from UniProt (release 14.4) with the feature keys “signal” and “transmem”. Among the single-transmembrane proteins, we placed those characterized as “anchor” into a separate group. For multi-TM region proteins, we selected those that have 5–9 annotated TM segments. Additionally, we took the experimentally verified α-helical TM regions as provided by TMPDB (release 6.3)
In our calculations, we find that only 3% of residues in α-helices in globular domains are covered by hits of the quite stringent low complexity tool SEG (parameters window 12, 2.2, 2.5)
Interestingly, the values for the UniProt sets are 30% for single transmembrane proteins, 33% for single transmembrane proteins with the region annotated as “anchor” but only 12% for multi-transmembrane proteins. Thus, the problems with non-relevant matches in hydrophobic regions are more likely to occur, as a trend, in proteins having signal peptides or only a few transmembrane segments compared with cases of multi-membrane-spanning proteins.
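The kind of measurement behind such coverage percentages can be illustrated with a strongly simplified stand-in for SEG. The sketch below only applies a trigger step: every 12-residue window whose compositional Shannon entropy is at most 2.2 bits is marked as low complexity, and the fraction of covered residues is reported. The actual SEG program additionally extends and refines the segments, so the numbers produced here are not comparable with those in the text; the function names and test sequences are illustrative.

```python
import math
from collections import Counter


def window_entropy(window):
    """Shannon entropy (bits) of the residue composition of a window."""
    n = len(window)
    counts = Counter(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def low_complexity_fraction(seq, window=12, trigger=2.2):
    """Fraction of residues covered by at least one low-entropy window."""
    if len(seq) < window:
        return 0.0
    covered = [False] * len(seq)
    for i in range(len(seq) - window + 1):
        if window_entropy(seq[i:i + window]) <= trigger:
            for j in range(i, i + window):
                covered[j] = True
    return sum(covered) / len(seq)


# A leucine/alanine-rich stretch (TM-like) versus a compositionally diverse one
print(low_complexity_fraction("LLALLLALLLLALLALLLAL"))
print(low_complexity_fraction("ACDEFGHIKLMNPQRSTVWY"))
```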
There is no substitute for computational methods in large-scale functional annotation of sequence data and sequence similarity as surrogate for homology has to remain a decisive factor for function assignment
The fundamental consideration in this article, namely the difficulty of interpreting sequence similarity that results from similarity of non-globular segments (especially signal peptides or transmembrane regions) within the current theory of sequence homology, the basis of annotation transfer, goes beyond the specific criticism of a few domain models. In this context, it appears necessary to recall what the notion of a protein domain implies. In the introduction of their article, Veretnik
In the special case of globular domains that have tertiary structure, sequence similarities imply sequence homology as well as fold and function similarity. If 3D structures are known, domains as compact (having their own hydrophobic core) and spatially distinct units of protein structures that share significant structural similarity can be grouped together (for example, in libraries such as SCOP
Although structure-based domain libraries aim at providing complete and well-defined annotation of a domain, the prerequisite of structural information and associated function restricts them to a small number of well-studied proteins. Thus, many more proteins in sequence databases remain difficult to characterize under this definition.
Meanwhile, a complementary domain definition based on the sequence homology also evolved independently. In the sequence-analytic context, domains as the basic components of proteins are families of sequence segments of minimal length (i) that are similar to each other with statistical significance, (ii) that provide for a specific biological function at the molecular level (“atom” of molecular function
It is crucial to note that similarity of sequences can either be due to homology (common ancestry) or convergent evolution (common selective pressure due to physical requirements or biological function). We wish to emphasize that generally applied sequence-statistical criteria for deducing homology have been derived from studies of globular domains. In these cases, conservation of an intricate, only apparently random hydrophobic pattern is necessary for composing the hydrophobic core and, thus, for fold conservation
This condition is generally not fulfilled for non-globular segments (e.g., transmembrane helices, signal peptides, inter-domain linker regions, segments carrying lipid-attachment sites, etc.); thus, their functional annotation requires other methods than just annotation transfer based on position-wise sequence similarity. It appears likely that many types of non-globular segments re-occurring in evolutionarily very distant proteins are the result of convergent evolution rather than common ancestry; for example, the likelihood of a
In a generalized theme, SP/TM segments are usually the result of physico-chemical constraints and do not confer the specific biological function of the protein. Therefore, a missing alignment in the SP/TM regions is less detrimental than one in the non-SP/TM regions if the membrane-embedded region is just used as a translocation signal.
To take the argument further: within the HMM framework, SP/TM and non-SP/TM regions are not demarcated when alignment scores are computed; both contribute to the same total score. This calls into question the inclusion of SP/TM regions in the HMM or, at least, makes their separate consideration necessary in the context of the homology argument.
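What a separate consideration could look like is sketched below under a strong simplifying assumption: that a per-position breakdown of the alignment score is available (a hypothetical input; standard search output reports only the total, and gap and transition terms are ignored here). A hit is flagged when the total score passes the threshold but the part outside the SP/TM columns alone would not. Function names, scores and thresholds are illustrative.

```python
def split_score(per_column_scores, sptm_columns):
    """Split a per-column score list into (SP/TM part, remaining part).

    per_column_scores : list of score contributions, model column 1 first
    sptm_columns      : set of 1-based model columns annotated as SP/TM
    """
    sptm = sum(s for i, s in enumerate(per_column_scores, start=1)
               if i in sptm_columns)
    return sptm, sum(per_column_scores) - sptm


def flag_hit(per_column_scores, sptm_columns, threshold):
    """True if the hit is significant only because of the SP/TM columns."""
    sptm, rest = split_score(per_column_scores, sptm_columns)
    return (sptm + rest) >= threshold and rest < threshold


# Hypothetical hit: columns 1-3 are TM positions and carry nearly all of the score
scores = [2.0, 2.0, 2.0, 0.5, -0.5, 0.5]
print(split_score(scores, {1, 2, 3}), flag_hit(scores, {1, 2, 3}, threshold=5.0))
```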
Our arguments raise the question whether position-specific scoring matrices (PSSM), HMMs or profiles are indeed the appropriate tool to classify all kinds of non-globular segments with regard to sequence homology. Matching the hydrophobic pattern alone is recognized as insufficient for inferring homology among proteins with transmembrane helices. In previous reports
The case of SPs/TMs is of special importance since their hydrophobic stretches can create the false appearance of similarity to the respective hydrophobic core of the target template based on a hydrophobic pattern match. Alignments with many hydrophobic residues in the same columns generate high scores; thus, a SP/TM match can elevate an otherwise mediocre HMM score into the range of significance. The inclusion of a SP/TM into the domain model can compromise the selectivity of HMMs towards specific families and create hits not only to neighboring sequence families within the superfamily but also beyond. Whereas errors of the first kind might be considered not dramatic, we show with examples in
Thus, the reliability in homology inference is greatly influenced by the amount of non-globular content in such domain library entries. We find that, even in the very well curated SMART domain collection (version 6), there are 18 domain models (out of 809) that include TMs or SPs. Based on our conservative approach, we find that clearly more than 1000 domains harbor SP/TM segments in Pfam release 23 (out of 10340 entries). To make matters worse, we observe a growing trend of addition of SP/TM region-containing domain models in Pfam and especially in SMART during the recent years (
In the
Therefore, our finding might suggest the mandatory removal of SPs/TMs from domain models. We do not recommend this at this stage. Such a strategy is not easy to implement due to several reasons. The required editing of domain libraries given their current status would be quite laborious and appears impractical in the short term. Then, there is also the issue with some multi-TM region protein domain models where there is little or no soluble globular component. Further, the biological significance of sequence similarity of proteins with TM regions and its relationship to homology has been studied only in a few cases
Notably with regard to signal peptides, the Pfam team has informed us that signal peptides will be removed from most domain models in future releases (Alex Bateman, personal communication). Similarly, it appears reasonable to remove TM regions from models where they are not integral parts of the globular domain and, especially, where the domain also occurs outside the TM region context. An excellent match between SP/TM regions of non-relevant proteins is possible just because of their uniform hydrophobicity, and this match will elevate scores in alignments. Often, this might be insufficient to overcome thresholds of significance but, in our experience, it can happen and it happens systematically for some types of models. Most likely, the problems arise with domains having one or very few TM regions, which are the majority of cases in Pfam (366 with 1 TM helix, 170 with 2 TM helices, 127 with 3 TM helices and 416 with 4 or more TM helices according to our conservative estimates; these counts sum to the 1079 domains with predicted TM regions). As we have seen, the trend to low sequence complexity is especially strong for protein segments representing a signal peptide or a single-TM anchor. Excluding both signal peptides and transmembrane helix anchors from domain models would remove the bulk (but not all) of the problems described in this article. Among all SP/TM regions, signal peptides, signal anchors and single TM regions show a considerably more pronounced trend to low sequence complexity than TM regions in multi-TM proteins (see
In addition, we propose two other possible workarounds: First, one might process each query sequence with tools recognizing non-globular segments including those for SP/TM regions and mask them with X-runs before comparing the query with domain libraries. Yet, this would not exclude cases such as SPs in HMMs hitting structural helices (see the GRP example CAL51691.1 from
Whereas this work explores the issue of SPs/TMs in domain models mainly based on an analysis of HMMER2 and Pfam release 23, both have concurrently been updated to HMMER3 and Pfam release 24
As a remedy, switching from the E-value guided hit finding to gathering score thresholds is proposed. This is problematic from several viewpoints. The HMM concept has the beauty of a rigorous probabilistic formulation that allows a natural treatment for substitutions and gaps in the same formalized framework. Further, the introduction of E-values provides a handle to compare various types of predictions that hit the same sequence region. Unfortunately, the gathering score concept (an expert-defined domain-specific score threshold for homologous hit selection) brings in an arbitrary component into the prediction process.
Firstly, the determination of a gathering score is not guided by a fundamental consideration but, instead, depends on the data and literature situation at the time of seed alignment collection. Regardless of how carefully a gathering score is selected by the expert, it remains a subjective decision. The sequence with a true model hit with lowest score (as well as the false hit with the highest score) critically depends on the size of the non-redundant protein database, the variety of sequences therein and the quality of the seed alignment at the time of model construction. Sequence databases grow strongly due to increasingly cheaper sequencing. With time, our biological knowledge grows and we know more about previously uncharacterized sequences. Not surprisingly, gathering thresholds have an inherent trend to be increased with time even if the underlying seed alignments do not change.
For example, in the case of PF00583 (Acetyltransferase) in the introductory Eco1 example, the gathering scores have evolved the following way: Pfam5 (1999) with 6.5 (global-mode/gm) and 6.5 (fragment-mode/fm), Pfam6 (2000) with 15 (gm) and 15 (fm) (with some shortening of the alignment compared with Pfam5), Pfam7 (2001) with 18.2 (gm) and 16.3 (fm). The reader is invited to return to
Secondly, gathering scores hide the problem of balance between false-negative and false-positive hits. Although increasing gathering scores (as there is a trend in Pfam releases) reduce false-positive hit rates, this approach excludes a growing number of true hits and, thus, also limits the extrapolation power of domain models into the space of uncharacterized sequences. On the contrary, an E-value gives insights into the orders of magnitude of error rates when assuming the annotation transfer to be correct. The user of a gathering-threshold-guided assignment has the illusion of dealing with ultimately correct hits; in contrast, an E-value provides a quantitative and typically non-zero statistical measure for annotation error.
Thirdly, gathering thresholds do not relate well with the statistics of hit distribution in the non-redundant database. In the HMMER2 manual, Sean Eddy says on page 22: “Calibrated HMMER E-values tend to be relatively accurate. E-values of 0.1 or less are, in general, significant hits”. Further, on page 43, he writes: “The best criterion of statistical significance is the E-value. The E-value is calculated from the bit score. It tells you how many false positives you would have expected to see at or above this bit score. Therefore a low E-value is best; an E-value of 0.1, for instance, means that there's only a 10% chance that you would've seen a hit this good in a search of non-homologous sequences.”
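How an E-value follows from a bit score can be illustrated with a minimal sketch. It assumes that scores of non-homologous sequences follow a Gumbel-type extreme-value distribution with a location parameter mu and a scale parameter lambda, as is the case for calibrated HMMER2 models; the numerical values of mu and lambda below are hypothetical, whereas the database size is the NR size used in this article. The function names are ours and not part of any HMMER interface.

```python
import math


def evd_tail_probability(score, mu, lam):
    """P(S >= score) for a Gumbel-type extreme-value distribution."""
    # -expm1(-y) equals 1 - exp(-y) but stays accurate for very small y.
    return -math.expm1(-math.exp(-lam * (score - mu)))


def e_value(score, mu, lam, db_size):
    """Expected number of chance hits at or above score in db_size sequences."""
    return db_size * evd_tail_probability(score, mu, lam)


def score_for_e_value(target_e, mu, lam, db_size):
    """Invert e_value(): the score at which the E-value equals target_e."""
    p = target_e / db_size
    return mu - math.log(-math.log1p(-p)) / lam


def prob_at_least_one_chance_hit(e):
    """Poisson approximation for observing one or more chance hits."""
    return -math.expm1(-e)


mu, lam = -40.0, 0.25   # hypothetical calibration parameters
db_size = 7365651       # NR database size used in this article
threshold = score_for_e_value(0.1, mu, lam, db_size)
```

Under the Poisson approximation, an E-value of 0.1 corresponds to a probability of about 0.095 of seeing at least one chance hit, which is the "10% chance" of the quotation. Applying `e_value` to a gathering score yields the E-value threshold that this score implies for the given model and database size.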
Whereas the E-values in the order of 0.1 are generally considered being below the significance threshold (and they are for many good domain models as we observed in our practice), we find actually no general relationship between domain-specific gathering scores and E-value thresholds for Pfam release 23 (
Whereas the y-axis shows the gathering score threshold (GA) for the global-mode search, the x-axis shows the corresponding E-value threshold (in decimal log scale) calculated with the domain-specific extreme-value function with parameters provided in the corresponding HMM file (for an NR database size of 7365651 sequences) for this score. The upper plot represents the distribution for 9126 domains without detected SP/TM region, the middle part shows the same for the 1214 domains with SP/TM problems. Effectively, there is no clear correlation between gathering score and E-value threshold. If E-values close to 0.1 are considered significant, all dots should be close to the “−1” line (horizontal dashed lines) in this graph and, indeed, there is some agglomeration of data points in that area; yet, there are numerous outliers. Note that the E-values are computed using the equation
Lastly, E-values are comparable since they are a statistical measure but gathering score thresholds are not and, therefore, scores calculated from different domain models or prediction tools cannot be compared. This makes decisions among domain models and other prediction tools hitting the same segment in the query difficult. For example, the sequence XP_001939830.1 (
We do not want to create the impression that we wish to nail down the Pfam team on, maybe, some unfortunately selected thresholds for previous releases. Also, the specific examples (rather the existence of such examples) are not relevant for the conclusions in this paper. We have to live with some error rate. In contrast, it is important that the theoretical fundamentals are reliable, that systematic causes for possibly questionable annotations are increasingly suppressed and that, together with the Pfam team, the community develops the theory.
It is difficult to assess the total amount of wrong annotations currently persisting in public sequence databases since most of the protein sequences have never been a target of experimental study. With regard to theoretically derived function descriptions, the individual teams contributing to sequence databases, apparently, apply criteria with differing stringency and rigor. It appears that unrestrained annotation transfer justified by spurious sequence similarities is a major cause for annotation errors
In their analysis of database annotations for 37 enzyme families, Schnoes
Thus, the criteria for sequence homology in their present form appear not directly applicable to non-globular segments. SPs/TMs as part of domain models lead to pollution of database annotations as our PIR iProClass v3.74 analysis demonstrates. As a matter of fact, it is very difficult to prove wrong annotation for experimentally uncharacterized sequences otherwise than by detecting logical contradictions. Whereas the examples in
To conclude, sequence similarity among non-globular protein segments does not necessarily imply homology. Since matching of SPs/TMs creates the illusion of alignable hydrophobic cores, the inclusion of SPs/TMs into domain models without precautions can give rise to wrong annotations. We find that clearly more than 1001 domains among the 10340 models of Pfam release 23 suffer from this problem, whereas the issue is of relatively low importance for domains of SMART version 6 (18 out of 809). As expected, fragment-mode HMM searches generate promiscuous hits limited to solely the SP/TM part among clearly unrelated proteins for these models. More worryingly, we show explicit examples that the scores of clearly false-positive hits even in global-mode searches can be elevated into the significance range just by matching the hydrophobic runs. In the PIR iProClass database v3.74, we find that between 2.1% and 13.6% of its annotated Pfam hits appear unjustified for a set of validated domain models. We suggest a workflow of flagging problematic hits arising from SPs/TMs-containing models for critical reconsideration by annotation users. On the other hand, we have also seen that the inclusion of SP/TM regions into domain models can give rise to false negatives by imposing the need to have good scores over these regions in the query sequences when the actual domain occurs without the SP/TM context.
It is well known that the problem of transmembrane helix prediction is not so much the detection of true hits as the suppression of false-positives
In the general case, domain models are characterized by both seed and full alignments. We think that, in our context, operating with seed alignments is preferable since they are manually validated and are supposed to have lower levels of inclusions of unrelated sequences.
For a given domain model alignment, each sequence was subjected to sets of transmembrane (TM) and signal peptide (SP) segment predictors. We have used the following TM predictor tools – DASTM
For each predictor
To ensure that columns of domain alignments with an unequal number of sequences and/or gap instances are treated comparably, a hypothesis testing step is introduced
We assume the null hypothesis is rejected at a significance level of
In practice, some of the predicted transmembrane helices and signal peptides can be fragmented due to small gaps in the alignment. In the case of signal peptide fragments, it is reasonable to assume that all the fragments come from a single signal peptide. Consequently, the average logarithm probability of SP prediction per domain is simply calculated using (5) summing over the smallest region that contains both the N-terminal alignment position and the C-terminal boundary of the most C-terminal predicted segment.
However, for the case of the fragmented TM helices, the situation can be complicated by occurrences of multiple transmembrane segments within the alignment. As an indicator of which fragments to unite into one segment, we use the raw TM predictions. The indicator function
We have used our algorithm also to find SP/TM regions in α- and membrane proteins classified by SCOP
For the TM prediction problem, only the individual TM helix has been defined so far. To define a TM region that is composed of one or more TM helices, adjacent TM helices separated by less than 40 amino acid residues are concatenated to form a region. The choice of 40 amino acids is based on the current knowledge that the smallest known globular domains such as Zinc fingers
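The concatenation rule for TM helices can be sketched as a simple interval merge. The sketch assumes that helices are given as 1-based inclusive (start, end) alignment positions and that the separation is the number of residues lying strictly between two helices; both conventions, as well as the function name and the example coordinates, are our assumptions for illustration.

```python
def merge_helices_into_regions(helices, max_gap=40):
    """Join TM helices separated by fewer than max_gap residues into regions.

    helices: iterable of 1-based inclusive (start, end) tuples.
    Returns a list of (start, end) tuples, sorted by start position.
    """
    regions = []
    for start, end in sorted(helices):
        # Residues strictly between the previous region and this helix.
        if regions and start - regions[-1][1] - 1 < max_gap:
            regions[-1][1] = max(regions[-1][1], end)
        else:
            regions.append([start, end])
    return [tuple(region) for region in regions]


# Hypothetical helix predictions: the first two lie 14 residues apart,
# the third follows after 54 residues.
helices = [(10, 30), (45, 65), (120, 140)]
regions = merge_helices_into_regions(helices)
```

With these inputs, the first two helices form one region and the third remains a region of its own, since a 54-residue spacer would be long enough to accommodate a small globular domain.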
For the SP prediction problem, it is relevant that the actual N-terminus might be missing in the domain alignment. Thus, two rounds of SP predictions are necessary. After the initial round, the domain sequences with positive SP predictions are subjected to blastp runs (with parameters ‘-M BLOSUM62 -G 11 -E 1 -F F -I T’) against the NR database to retrieve their full sequence data. Only the full sequence data with percent identity ≥95% and Blast E-value ≤0.01 are then subjected to SP predictions. Finally, only overlapping SP predictions that are confirmed in both rounds are retained for further processing.
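The two-round confirmation can be sketched as a filter followed by an overlap test. The record type, the function names and the example values are hypothetical; the sketch also assumes that SP predictions from both rounds have already been mapped onto the same coordinate system (here, 1-based inclusive positions in the full-length sequence), a step that is not shown.

```python
from dataclasses import dataclass


@dataclass
class BlastHit:
    subject_id: str
    percent_identity: float
    e_value: float


def acceptable_full_length_hits(hits, min_identity=95.0, max_e_value=0.01):
    """Keep hits with percent identity >= 95% and Blast E-value <= 0.01."""
    return [
        hit for hit in hits
        if hit.percent_identity >= min_identity and hit.e_value <= max_e_value
    ]


def intervals_overlap(a, b):
    """True if two inclusive (start, end) intervals share a position."""
    return a[0] <= b[1] and b[0] <= a[1]


def sp_confirmed(first_round, second_round):
    """An SP is retained only if both rounds predict overlapping segments."""
    if first_round is None or second_round is None:
        return False
    return intervals_overlap(first_round, second_round)


# Hypothetical hits for one domain sequence.
hits = [
    BlastHit("seqA", 99.2, 1e-50),
    BlastHit("seqB", 87.0, 1e-30),   # rejected: identity below 95%
    BlastHit("seqC", 96.0, 0.5),     # rejected: E-value above 0.01
]
kept = acceptable_full_length_hits(hits)
confirmed = sp_confirmed((1, 22), (1, 24))
```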
The appropriate cutoffs for predicted TM and SP segments in domain alignments have been determined with the help of the SCOP v1.75
TM prediction hits among SCOP α class proteins are false-positives since the database contains predominantly structural helices. On the other hand, the membrane class contains mostly TM helices that make up the true-positive hits for these predictors.
The top (average log probability per predicted transmembrane helix for SCOP v1.75 α-proteins class) and bottom (average log probability per predicted transmembrane helix for SCOP v1.75 membrane protein class) histograms represent the false-positive and true-positive distributions for TM predictions respectively. The total number of predicted structural and membrane helices is 2293 and 5592 respectively.
Average log probability of TM prediction | No. of FP | FP rate (%) | No. of FN | FN rate (%) |
≥−6 | 21 | 0.91 | 4519 | 80.81 |
≥−7 | 37 | 1.61 | 3401 | 60.82 |
≥−8 | 45 | 1.96 | 2520 | 45.06 |
≥−9 | 47 | 2.04 | 1593 | 28.49 |
≥−10 | 72 | 3.14 | 910 | 16.27 |
≥−11 | 84 | 3.66 | 526 | 9.41 |
≥−12 | 107 | 4.67 | 418 | 7.48 |
≥−13 | 125 | 5.45 | 381 | 6.81 |
≥−14 | 206 | 8.98 | 362 | 6.47 |
The first column gives the various cutoffs for the average log probability of TM helix prediction (refer to equations 5 and 6). The next two columns denote the number and percentage of false-positive TM helices with respect to 2293 predicted helices from SCOP α-proteins based on the corresponding cutoff rate. Similarly, the last two columns describe the number and percentage of false-negative TM helices with respect to 5592 predicted helices from SCOP membrane proteins.
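The counting behind such a cutoff table can be sketched as follows: a helix from the α-protein set that scores at or above the cutoff counts as a false positive, and a helix from the membrane set that scores below it counts as a false negative. The score lists are hypothetical; only the totals and the count used in the final line are taken from the table.

```python
def error_counts(negative_scores, positive_scores, cutoff):
    """Return (false positives, false negatives) for a score cutoff.

    negative_scores: average log probabilities for structural helices.
    positive_scores: average log probabilities for true TM helices.
    """
    false_positives = sum(1 for s in negative_scores if s >= cutoff)
    false_negatives = sum(1 for s in positive_scores if s < cutoff)
    return false_positives, false_negatives


def rate_percent(count, total):
    """Express count as a percentage of total."""
    return 100.0 * count / total


# Hypothetical scores for a handful of helices in each set.
structural = [-15.2, -13.5, -9.1, -5.0]
membrane = [-4.0, -6.5, -7.5, -12.0]

counts_by_cutoff = {
    cutoff: error_counts(structural, membrane, cutoff)
    for cutoff in (-6, -8, -14)
}

# Rate for the 4519 false negatives among 5592 membrane helices (cutoff -6).
fn_rate_at_minus_6 = rate_percent(4519, 5592)
```

Lowering the cutoff admits more structural helices (more false positives) while missing fewer TM helices (fewer false negatives), which is the trade-off visible across the rows of the table.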
In the case of the signal peptide prediction, both α- and membrane SCOP classes will deliver false-positive hits while the domain models from SMART with signal peptide are true positive hits.
The top (average log probability per predicted signal peptide for SCOP v1.75 α- and membrane protein class) and bottom (average log probability per predicted signal peptide for SMART version 6) histograms represent the false-positive and true-positive distributions for the SP predictions respectively. The total number of predicted signal peptides for SCOP α- and membrane proteins is 193 and 379 respectively, while the total number for SMART is 45. All except SM00817 Amelin (no available structure) were validated against their respective PDB entries.
Average log probability of SP prediction | No. of FP | FP rate (%) | No. of FN | FN rate (%) |
≥−0.5 | 20 | 3.50 | 8 | 17.78 |
≥−1 | 23 | 4.02 | 1 | 2.2 |
≥−2 | 38 | 6.64 | 1 | 2.2 |
≥−3 | 38 | 6.64 | 1 | 2.2 |
≥−4 | 44 | 7.69 | 1 | 2.2 |
The first column gives the various cutoffs for the average log probability of SP prediction (refer to equation 5). The next two columns denote the number and percentage of false-positive SP with respect to 572 predicted SP from SCOP α- and membrane proteins based on the corresponding cutoff rate. Similarly, the last two columns describe the number and percentage of false-negative SP with respect to 45 predicted SP in seed sequences from SMART version 6 alignments.
In the following, the reader is assumed to be familiar with chapter three of
Here, equation (12) that denotes the total score
For a set of sequences with a common problematic domain annotation, each sequence score can be represented by
We find that our derivation for
PF00583 hits leading to the Eco1 function discovery.
(0.03 MB PDF)
False-positive hit of PF00497 in Alt a 1.
(0.01 MB PDF)
Mini-site with supplementary information, archive created with WinRAR (to be downloaded from
(32.83 MB WinRAR)
Summary of selected sequence hits with problematic domain annotations (fragment-mode search).
(0.04 MB PDF)
Summary of selected sequence hits with problematic domain annotations (global-mode search).
(0.05 MB PDF)
Summary of selected false-negative sequence hits with problematic domain annotations (global-mode search).
(0.05 MB PDF)
The authors are grateful to Birgit Eisenhaber for critically reading this manuscript. It is also acknowledged that this work has been made available to the teams of Pfam (via Alex Bateman) and SMART (via Peer Bork) prior to publication. As a consequence, a considerable number of domain model revisions have been made, leading, for example, to the exclusion of signal peptides from models in future Pfam releases.