Analyzed the data: RP CM. Wrote the paper: RP CM HO. Conceived and designed the study: RP CM HO. Contributed analysis tools: RP CM. Processed the databases and gathered the data: RP.
The authors have declared that no competing interests exist.
Knotted proteins, because of their ability to fold reversibly into the same topologically entangled conformation, are the object of an increasing number of experimental and theoretical studies. The aim of the present investigation is to assess, on the basis of presently available structural data, the extent to which knotted proteins are isolated instances in sequence or structure space, and to use comparative schemes to understand whether specific protein segments can be associated with the occurrence of a knot in the native state. A significant sequence homology is found among a sizeable group of knotted and unknotted proteins. In this family, knotted members occupy a primary sub-branch of the phylogenetic tree and differ from unknotted ones only by additional loop segments. These “knot-promoting” loops, whose virtual bridging eliminates the knot, are found in various types of knotted proteins. Valuable insight into how knots form, or are encoded, in proteins could be obtained by targeting these regions in future computational studies or excision experiments.
Out of the tens of thousands of known protein structures, only a few hundred are knotted. The latter epitomize, better than unknotted proteins, the degree of coordinated motion of the backbone required to fold reversibly into a specific native conformation, which indeed must contain a precise knot in a specific protein region. In the present work we search for salient features associated with protein “knottedness” through a systematic sequence and structure comparison of knotted and unknotted protein chains. A significant sequence relatedness is found within a sizeable group of knotted and unknotted proteins. Their tree of sequence relatedness suggests that the knotted entries all diverged from a specific evolutionary event. The systematic structural comparison further indicates that the knottedness of several different types of proteins is likely ascribable to the presence of short “knot-promoting” loops. These segments, whose bridging eliminates the knot, are natural candidates for future experimental/computational studies aimed at clarifying whether the global knotted state of a protein is influenced by specific regions of the primary sequence.
Since the early 90's, when the first crystal structures of knotted proteins became available, the number of known knotted protein chains has increased to comprise several hundred PDB
Even before the discovery of knotted proteins, the possible existence of non-trivial topological entanglements, or lack thereof, in proteins was a matter of debate
In support of this view it should be stressed that proteins differ from globular flexible polymers not only in terms of the low incidence of knots but especially because, in the absence of any specific cellular machinery, the same knot type is formed reversibly and reproducibly in the same protein location
These considerations have stimulated an increasing number of experimental and theoretical studies aimed at understanding the kinetic and thermodynamic processes leading to knot formation in proteins or the implications for the molecular mechanical stability
The present work aims at complementing the insight offered by these studies through a systematic quantitative comparative investigation of knotted and unknotted proteins.
Our first aim is to assess, on the basis of available PDB entries, the level of sequence and structure discontinuity between knotted and unknotted proteins. The question is tackled by means of a systematic search of significant sequence- and structure-based correspondences between knotted and unknotted protein pairs. The second aim is to obtain clues about the possible mechanisms leading to the formation of knotted native states by searching for salient systematic differences between knotted/unknotted protein pairs.
Indeed, the PDB-wide sequence and structural comparison indicates that various types of protein knots are associated with the presence of loop segments that are absent from sequence-homologous or structurally-similar unknotted proteins. The removal (virtual bridging) of these segments, which include a region of a knotted transcarbamylase previously identified by Virnau
Based on these observations, it can be expected that valuable insight into the way that knots form, or are encoded, in proteins could be obtained by targeting these regions in future
The 1.2
The set of all knotted proteins found in the PDB is highly redundant; for example, as many as 194 of the 229 knotted proteins are carbonic anhydrases. The primary sequence comparison of the entries revealed that fewer than 50 chains are non-identical in sequence. The dataset was hence processed to achieve a uniform, minimally redundant coverage of sequence space. The culling procedure returned 11 representative knotted chains, which are listed in
Name | PDB | Knot type | CATH | EC | Knotted Region |
hypothetical protein | 2efvA | | | | 6–86 |
plasmid pTiC58 VirC2 | 2rh3A | | | | 82–194 |
N-succinyl-L-ornithine transcarbamylase (SOTCase) | 2fg6C | | 01:3.40.50.1370 | | 149–257 |
methyltransferase (MT) domain of human TAR (HIV-1) RNA binding protein (TARBP1) | 2ha8A | | | | 83–167 |
alpha subunit of human S-adenosyl-methionine synthetase (SAM-S) | 2p02A | | 01:3.30.300.10 02:3.30.300.10 03:3.30.300.10 | 2.5.1.6 | 38–328 |
human carbonic anhydrase II (CA2) | 5cacA | | 3.10.200.10 | 4.2.1.1 | 11–260 |
acetohydroxyacid isomeroreductase | 1qmgA | | 01:3.40.50.720 | 1.1.1.86 | 302–553 |
photosensory core domain of aeruginosa bacteriophytochrome (PaBphP) | 3c2wH | | | | 5–302 |
ubiquitin carboxy-terminal hydrolase (UCH) | 2etlA | | 3.40.532.10 | 3.4.19.12 | 1–233 |
group I haloacid dehalogenase | 3bjxB | | | 3.8.1.10 | 46–288 |
ribosomal 80S-eEF2-sordarin complex | 1s1hI | | | | 78–125 |
List of the knotted protein representatives. CATH
The simplest knot type,
Among the trefoil representatives in
The more complex knot types,
It is interesting to observe a parallel between the chronological succession of the first PDB release of the various types of protein knots and the complexity of the knots. In fact, the first structures containing
This qualitative consideration is supported by the fact that, in compact flexible polymers, the abundance of the simplest knot types decreases with knot complexity
Finally, we discuss the extent to which knots of different handedness occur among knotted proteins. Apart from the
The investigation of the handedness in this latest dataset, where sequence redundancy has been removed, provides a novel context for examining the problem. As reported in
Simulations of the protein folding of knotted proteins, based on simplified steered dynamics targeted towards the known native state, have reported a much lower degree of efficiency in reaching the native state from an extended conformation compared to unknotted proteins
This consideration is here taken as the motivation for a systematic survey of whether, and to what extent, knotted proteins are discontinuously related by sequence and structure to unknotted ones.
In this section we tackle one facet of the problem. Specifically, we shall examine how primary-sequence similarities reverberate in relatedness of the knotted/unknotted topological state. To this purpose, for each of the 11 representatives in
The BLAST queries were run with a stringent E-value threshold (0.1) for returned matches, so that false positives are not expected to occur appreciably among the returned entries. For only three protein chains, namely 5cacA, 2fg6C and 2ha8A, was the number of significant matches larger than or equal to 10. Incidentally, we mention that, consistent with the probable artifactual origin of the knot in entry 1s1hI, all 10 significant BLAST matches of 1s1hI were unknotted protein chains.
All the returned matches for the 5cacA human carbonic anhydrase and the 2ha8A methyltransferase domain of the human TAR RNA binding protein (TARBP1-MTd) consisted exclusively of a dozen knotted proteins, all with the same knot type. These matches are therefore not informative for the purpose of understanding if and how differences in sequence reverberate into differences of knotted state.
On the contrary, the BLAST matches of the trefoil-knotted N-succinyl-ornithine transcarbamylase (SOTCase), associated to the PDB entry 2fg6C
To advance the understanding of the precise type of sequence relatedness of the SOTCase and its knotted and unknotted homologs, the matching BLAST sequences were used as input for a CLUSTALW multiple sequence alignment
The phylogenetic tree for the SOTCase is represented in
(a) The phylogenetic tree was obtained by applying a neighbor joining algorithm
Among the knotted and unknotted entries, the average level of sequence identity is about 20%, with a standard deviation of 7%. Interestingly, a few knotted/unknotted pairs have a mutual sequence identity even larger than that of some knotted pairs. For example, the knotted chain 2g68A has a sequence identity of 33% with 1js1X (knotted) and of 38% with 1pvvA (unknotted).
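The pairwise identity figures quoted above can be illustrated with a minimal sketch. It assumes that two rows of a multiple sequence alignment are available as equal-length strings with `-` as the gap character, and that identity is measured over the columns where both sequences carry a residue. The function name `percent_identity` and this choice of denominator are assumptions made here for illustration; other conventions exist and the one used in the original analysis is not specified in the text.

```python
def percent_identity(aln_a, aln_b, gap="-"):
    """Percent identity between two aligned sequences.

    Assumption: only columns where neither sequence has a gap are compared.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    matches = 0
    compared = 0
    for a, b in zip(aln_a, aln_b):
        if a == gap or b == gap:
            continue  # skip columns containing a gap in either sequence
        compared += 1
        if a == b:
            matches += 1
    # Return 0.0 when no column can be compared, to avoid a division by zero.
    return 100.0 * matches / compared if compared else 0.0
```

Applying such a function to every pair of rows in the alignment would yield the matrix of identities from which an average and a standard deviation can be computed.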
As, to the best of our knowledge, no previous study had pointed out meaningful relationships of knotted and unknotted proteins, the present results offer a novel insight into the possible mechanisms that have led to the appearance of knotted proteins.
In particular, the phylogenetic tree structure suggests the existence of a simple evolutionary lineage between the sets of knotted and unknotted proteins shown in
Further clues about the biological rationale behind the evolutionary pathways that have led to the emergence/conservation of the knotted structures in
Valuable insight into the fundamental similarities and differences in the entries appearing in the tree of
To this purpose we used the MISTRAL
The proteins appearing in the phylogenetic tree can be all simultaneously structurally-aligned. Their aligned core consists of as many as 192 amino acids, which is a substantial fraction of the full proteins (which have an average length of about 310 a.a.). Over the core region, the average RMSD of any pair of matching amino acids is less than 2 Å. The good structural superposability of the protein set (which we recall includes protein pairs with average mutual sequence identity of about 20%) is exemplified in
The detailed pairwise structural comparison indicates that members of the two knotted branches admit a good structural superposition over the full protein length (and, in particular, over the knotted region).
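For reference, the RMSD used to quantify structural superposability can be sketched as below. This is an illustration under stated assumptions, not the MISTRAL procedure: the two coordinate lists are taken to be already optimally superposed and to list matched residues (for example their Cα atoms) in the same order. The function name `rmsd_superposed` is hypothetical.

```python
import math

def rmsd_superposed(coords_a, coords_b):
    """Root-mean-square deviation between matched, already superposed points.

    Assumption: coords_a[i] and coords_b[i] are the 3D positions of the
    i-th pair of aligned residues, in Angstrom.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared_sum = 0.0
    for a, b in zip(coords_a, coords_b):
        squared_sum += sum((a[k] - b[k]) ** 2 for k in range(3))
    return math.sqrt(squared_sum / len(coords_a))
```

In a full pipeline the optimal rigid-body superposition would have to be computed first; that step is deliberately left out of this sketch.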
To highlight the salient differences between the knotted and unknotted entries in the tree we analysed all the pairwise structural superpositions of the knotted SOTCase with the unknotted homologs. This investigation generalises the structural comparative inspection of two specific instances of knotted and unknotted carbamylases carried out in ref.
The results are best illustrated considering the closest matching pair, namely the SOTCase and PDB entry 1ortA.
In spite of their limited mutual sequence identity, which is about 25%, these proteins admit a very good structural superposition, see
SOTCase (a) is shown in cartoon representation; the knot-promoting loop segments are highlighted in orange and purple. The MISTRAL alignment with unknotted entry 1ortA is shown in panel (b): aligned residues are colored in blue and red, respectively, while non-aligned residues are correspondingly colored in cyan and pink. Knotted protein TARBP1-MTd is shown in panel (c) with the knot-promoting loop segment highlighted in purple. The MISTRAL alignments of TARBP1-MTd with the unknotted proteins 1b93A and 1hdoA are shown in panels (d) and (e), respectively.
The case is different for two regions of the SOTCase: the proline-rich segment comprising amino acids 174–182, and the segment 235–255; both regions are located in proximity of the active site (residues 176–178, 252). As shown in
We remark that Virnau
The results discussed in the previous section indicate that knotted proteins appear to be sparsely distributed in sequence space. In fact, only for one of the representatives in
Here we investigate whether, irrespective of the level of primary sequence relatedness, there exist meaningful structural similarities between knotted and unknotted proteins.
The search was performed by carrying out MISTRAL structural alignments of each of the knotted representatives in
Hereafter we focus on a limited number of cases which, regardless of their ranking in alignment quality, can be aptly used to highlight interesting relationships between knotted and unknotted pairs. In particular, they might possibly be used to shed light on important kinetic or thermodynamic mechanisms that guide or otherwise favor the formation of knots in naturally occurring proteins.
In particular, we start by discussing the limited number of cases where the alignment suggests the presence of knot-promoting loop segments, analogously to the case of the SOTCase and chain 1ortA. These segments are identified using two main criteria: (i) the segment ends must be sufficiently close that they could be virtually bridged by very few amino acids; (ii) the bridging/excision operation should lead to an unknotted conformation.
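The two criteria can be sketched as follows. This is a minimal illustration and not the automated search actually used: the function names, the minimum segment length and the 10 Å end-to-end cutoff are hypothetical choices, and the knot detector `is_knotted` is a user-supplied callable that is not implemented here.

```python
import math

def bridgeable_segments(ca_coords, min_len=5, max_end_dist=10.0):
    """Criterion (i): index pairs (start, end) whose end points are close in space.

    min_len and max_end_dist (in Angstrom) are illustrative values only.
    """
    hits = []
    n = len(ca_coords)
    for i in range(n):
        for j in range(i + min_len, n):
            if math.dist(ca_coords[i], ca_coords[j]) <= max_end_dist:
                hits.append((i, j))
    return hits

def excise(ca_coords, start, end):
    """Virtual bridging: drop the residues strictly between start and end."""
    return ca_coords[:start + 1] + ca_coords[end:]

def knot_promoting_segments(ca_coords, is_knotted, **kwargs):
    """Criterion (ii): keep candidates whose excision leaves an unknotted chain.

    is_knotted is a hypothetical callable taking a coordinate list and
    returning True when the chain is knotted.
    """
    return [(i, j) for i, j in bridgeable_segments(ca_coords, **kwargs)
            if not is_knotted(excise(ca_coords, i, j))]
```

Separating the geometric filter from the topological check keeps the expensive knot detection restricted to the few segments that could plausibly be bridged.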
The automated search for such segments returned positive matches for three representatives. One of them was the same SOTCase chain, which we discussed in previous sections. The other chains were the aforementioned TARBP1-MTd and the photosensory core module of
TARBP1-MTd aligns well with two unknotted protein representatives that have very different overall structural organization. Despite the differences, discussed hereafter, the alignments consistently indicate that loop 101–123 is a knot-promoting loop for chain A of TARBP1-MTd.
The alignment against the unknotted protein chain 1b93A
The topologically-important role of the segment is further highlighted by the alignment with the 1hdoA chain. At variance with the case of 1b93A, the good alignment does not involve regions that have the same succession, along the primary sequence, in the two proteins. This is readily ascertained by the inspection of the structural diagram of
The secondary and tertiary organization of the knotted TARBP1-MTd (PDBid 2ha8A) (a) and unknotted protein chain 1hdoA (b), which admit a significant structural superposability, see
The “figure-of-eight” knot in protein PaBphP
The alignment singles out the segment of amino acids 203 to 256 as a knot-promoting loop. Indeed, while the knot length is very large, the knot appears to result from the “threading” of the N-terminal domain through the above mentioned loop. As for SOTCase, the hydrophobicity profile (see
The removal of the loop, as readily seen from
PaBphP (a) and its alignment with the unknotted chain 2b18A (b). In the knotted structure the knot-promoting loop is highlighted in purple, while the N-terminal domain, which threads through the loop, is shown in green. In the bottom panel, the aligned residues of knotted and unknotted proteins are colored in blue and red, respectively, while non-aligned residues are correspondingly colored in cyan and pink. The N-terminal PAS domain (green) and C-terminal PHY domain (cyan) are well separated by the aligned region, which instead covers almost completely the central GAF domain of the PaBphP photosensory core module.
The above analysis was based on the identification of knot-promoting regions suggested by significant alignments of the knotted representatives in
Yet, it is interesting to point out that for two other representatives, namely chains 2etlA (ubiquitin carboxy-terminal hydrolase, UCH) and 2p02A (alpha subunit of human S-adenosylmethionine synthetase, hereafter
The two examples are shown in
UCH (a) and its alignment with the unknotted chain 1aecA (b). The aligned residues of the knotted and unknotted protein are colored in blue and red, respectively, while unsaturated colors (cyan and pink) are used for non-aligned residues.
Analogous considerations hold for the alignment of
In this study we presented a database-wide comparative analysis of pairs of knotted and unknotted proteins. The study was aimed at understanding if, and to what extent, the rare instances of known knotted proteins are discontinuously related in sequence or structure space to unknotted proteins.
The analysis proceeded by first identifying minimally-redundant sets for the
In order to understand what type of primary sequence relatedness exists between knotted and unknotted proteins, a PDB-wide BLAST
The insight offered by the sequence comparative investigation was finally complemented by one based on pairwise structural alignments. At variance with the sequence case, the structural one revealed several significant knotted/unknotted correspondences. In an appreciable number of instances, these correspondences involved a substantial fraction of the region where the knot is accommodated. Also in these cases, knotted proteins appeared to differ from the unknotted partner by the presence of knot-promoting segments analogous to those identified in the alignments involving the SOTCase. The results therefore point to the key role that these specific, local, protein segments play for the global knotted topology of the folded protein.
These regions might represent ideal candidates for mutagenesis or excision experiments aimed at monitoring their impact on the process of knot formation.
The PDB database as of December 2009 contained 6.2
Detecting and characterizing the presence of knots in proteins requires a suitable generalization of the mathematical notion of knottedness
In such contexts, at variance with the case of linear open-ended polymers such as proteins, knots cannot be untied by any manipulation preserving the connectivity and self-avoidance of the circular chain. The mathematical concept of knottedness can be extended to protein chains whenever a simple, non-ambiguous way exists to bridge the two termini, such as by prolonging them into an arc that does not intersect the protein hull. Such virtual circularization procedures are actually possible for most protein chains because the N and C termini are usually exposed at the protein surface.
The closure algorithm applied here first performs the identification of those chains with both termini exposed on the surface: this condition is satisfied if one can pass a plane through each terminus, such that all other residues occupy only one of the two subspaces created by the plane. In these cases the chain can be unambiguously closed by adding a segment connecting the termini “at infinity” without intersecting the protein chain.
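The exposure condition for a terminus can be sketched as below. This is an illustration rather than the implementation used in the study: the function name `is_terminus_exposed`, the strict one-sided condition and the perceptron-style search for the plane normal are all assumptions introduced here. A return value of True is backed by an explicit plane normal, whereas False only means that no such plane was found within the iteration budget.

```python
import math

def is_terminus_exposed(terminus, others, max_passes=1000):
    """Search for a plane through `terminus` with all `others` strictly on one side.

    Uses perceptron updates on the unit vectors from the terminus to the
    other residues; max_passes is an illustrative iteration budget.
    """
    dirs = []
    for q in others:
        d = [q[k] - terminus[k] for k in range(3)]
        norm = math.sqrt(sum(x * x for x in d))
        if norm == 0.0:
            continue  # ignore points coinciding with the terminus
        dirs.append([x / norm for x in d])
    w = [0.0, 0.0, 0.0]  # candidate plane normal
    for _ in range(max_passes):
        updated = False
        for d in dirs:
            # A direction on the wrong side (or on the plane) corrects the normal.
            if sum(w[k] * d[k] for k in range(3)) <= 0.0:
                w = [w[k] + d[k] for k in range(3)]
                updated = True
        if not updated:
            return True  # every direction has a positive projection on w
    return False
```

When both termini pass this test, each can be prolonged along the outward normal of its plane, away from the side containing the other residues, and the two prolongations can be joined far from the protein, which corresponds to the closure “at infinity” described above.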
As many as 6.4 10
The dataset of the 4.5 10
Only 247 protein chains were found to have nontrivial topology. These two sets are affected by a large sequence redundancy, which was removed at the stringent 10% sequence identity level using the web tool developed by Cedric Notredame and available at
The large set of unknotted proteins was processed with the UniqueProt
The publicly-accessible MISTRAL multiple structural alignment tool
All pairwise structural alignments between the representatives of the unknotted and knotted proteins were computed. Among those with a
Hydrophobicity profile of the knotted protein 2fg6C.
(0.24 MB PDF)
Hydrophobicity profiles for the knotted protein 2ha8A.
(0.29 MB PDF)
Hydrophobicity profile for the knotted protein 3c2wH.
(0.18 MB PDF)
List of knotted protein chains.
(0.06 MB PDF)
Top ranking MISTRAL alignments of knotted and unknotted representatives.
(0.06 MB PDF)
We are indebted to Enzo Orlandini, Giulia Rossetti, Luca Tubiana and Peter Virnau for fruitful discussions. We acknowledge support from the DEMOCRITOS CNR-IOM.