The authors have declared that no competing interests exist.
Conceived and designed the experiments: IK JMB. Performed the experiments: IK. Analyzed the data: IK JMB. Contributed reagents/materials/analysis tools: IK JMB. Wrote the paper: IK JMB.
The spliceosome is a molecular machine that performs the excision of introns from eukaryotic pre-mRNAs. In human cells, this macromolecular complex comprises five RNAs and over one hundred proteins. In recent years, many spliceosomal proteins have been found to exhibit intrinsic disorder, that is, to lack a stable native three-dimensional structure in solution. Building on the existing body of proteomic, structural and functional data, we have carried out a systematic bioinformatics analysis of intrinsic disorder in the proteome of the human spliceosome. We discovered that almost half of the combined sequence of proteins abundant in the spliceosome is predicted to be intrinsically disordered, at least when the individual proteins are considered in isolation. The distribution of intrinsic order and disorder throughout the spliceosome is uneven, and is related to the various functions performed by the intrinsic disorder of the spliceosomal proteins in the complex. In particular, proteins involved in the secondary functions of the spliceosome, such as mRNA recognition, intron/exon definition and spliceosomal assembly and dynamics, are more disordered than proteins directly involved in assisting splicing catalysis. Conserved disordered regions in spliceosomal proteins are evolutionarily younger and less widespread than ordered domains of essential spliceosomal proteins at the core of the spliceosome, suggesting that disordered regions were added to a preexistent ordered functional core. Finally, the spliceosomal proteome contains a much higher amount of intrinsic disorder predicted to lack secondary structure than the proteome of the ribosome, another large RNP machine. This result agrees with the currently recognized different functions of proteins in these two complexes.
In eukaryotic cells, introns are spliced out of protein-coding mRNAs by a highly dynamic and extraordinarily plastic molecular machine called the spliceosome. In recent years, multiple regions of intrinsic structural disorder have been found in spliceosomal proteins. Intrinsically disordered regions lack a stable native three-dimensional structure in solution, which makes them structurally flexible and/or able to switch between different conformations. Hence, intrinsically disordered regions are ideal candidates for conferring the spliceosome's plasticity. Intrinsically disordered regions are also frequently the sites of post-translational modifications, which have also been shown to be important in spliceosome dynamics. In this article, we describe the results of a structural bioinformatics analysis focused on intrinsic disorder in the spliceosomal proteome. We systematically analyzed all known human spliceosomal proteins with regard to the presence and type of intrinsic disorder. Almost half of the combined sequence of these spliceosomal proteins is predicted to be intrinsically disordered, and the type of intrinsic disorder in a protein varies with its function and its location in the spliceosome. The parts of the spliceosome that act earlier in the process are more disordered, which corresponds to their role in establishing a network of interactions, while the parts that act later are more ordered.
In eukaryotic cells and certain viruses that infect them, the coding sequences (exons) of most protein-coding genes are interrupted by noncoding regions (introns). Following the transcription of an entire gene into a precursor messenger RNA (pre-mRNA), the introns are excised and the exons are spliced together to form a functional mRNA. The splicing reaction is catalyzed by a large macromolecular ribonucleoprotein (RNP) machine termed the spliceosome. The most common form of the spliceosome is composed primarily of five small nuclear RNA (snRNA) molecules: U1, U2, U4, U5 and U6, and 45 proteins, arranged into snRNP particles. Seven mutually related Sm proteins are common to all spliceosomal snRNP apart from the U6, which contains a set of related “like-Sm” (Lsm) proteins
Apart from the snRNP proteins, approximately 80 proteins are abundant in the human spliceosome and reported to be essential to the process of spliceosome-dependent splicing
A rare class of introns exists (<1% of all introns in human) that are excised by the so-called minor spliceosome
The primary activity of the spliceosome, i.e. the excision of introns and ligation of exons, requires the correct working of several additional functionalities of the spliceosomal machinery: recognition of the 5′ and 3′ splice sites (intron/exon definition), mutual recognition of spliceosome subunits and correct spliceosome assembly, spliceosome remodeling and regulation (review:
Splicing has been associated with intrinsic protein disorder
As they lack tertiary structure under many or all conditions, IDRs are more flexible and plastic than the rigid structures of globular domains. Disorder may increase the speed of intermolecular binding and unbinding and make interactions weaker
The subject of intrinsic disorder of the spliceosome has not yet been systematically analyzed for the entirety of the spliceosomal proteome. As an essential step towards broadening our understanding of the functioning of the spliceosome, we have carried out a bioinformatics analysis of intrinsic disorder within the human spliceosomal proteome. We discovered that almost half of the residues within the human spliceosomal proteins are disordered, and that the distribution of intrinsic disorder is uneven across the spliceosome. The spliceosome is divided into three layers: a rigid inner core that performs the precise operations required to effect splicing catalysis, a middle layer of disorder that acquires structure in spliceosome-bound proteins, and a fluid outer layer of disordered regions that do not acquire structure and that are responsible for the establishment of a matrix of weak interactions in the initial stages of the splicing process.
Initially, we predicted the average intrinsic disorder content of 122 core proteins of the major human spliceosome, including all abundant proteins
An intrinsic disorder content estimate of 44.0% is twice the average value for all human proteins as calculated on the basis of genome-based predictions, which is 21.6%
To determine whether disorder content varies among the complexes that form the spliceosome at different stages of the splicing reaction, we analyzed the fraction of predicted intrinsic disorder for different groups of proteins of the spliceosome complex. For this analysis, we divided the spliceosome proteins in our dataset into several groups based on proteomics data, and also included eight proteins of the U11/U12 di-snRNP of the minor spliceosome (
Different groups of spliceosome proteins differ in their predicted disorder content (
Values for all proteins of the snRNP subunits of the major spliceosome (“snRNP proteins, major spl.”) and for all the proteins of the major spliceosome (“all proteins, major spl.”) are marked in deeper shades. The orange line indicates means calculated per protein (the disorder fraction was calculated for each protein first, and these fractions were then averaged), while the green line indicates means calculated per residue (the number of all disordered residues in a protein group divided by the total length of proteins in the group). Per-residue means are indicated above the line. Spliceosome protein groups are ordered according to per-residue means.
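The two averaging schemes described in this caption can be illustrated with a minimal Python sketch. The function name `disorder_means` and the two-protein example group are hypothetical and are not taken from the dataset; the sketch only shows why per-protein and per-residue means diverge when protein lengths differ.

```python
from typing import Sequence, Tuple


def disorder_means(proteins: Sequence[Tuple[int, int]]) -> Tuple[float, float]:
    """Return (per-protein mean, per-residue mean) of the disorder fraction.

    Each item is a (disordered_residues, protein_length) pair.
    """
    if not proteins:
        raise ValueError("at least one protein is required")
    # Per-protein: compute each protein's fraction first, then average the fractions.
    per_protein = sum(d / n for d, n in proteins) / len(proteins)
    # Per-residue: pool all disordered residues and divide by the combined length.
    per_residue = sum(d for d, _ in proteins) / sum(n for _, n in proteins)
    return per_protein, per_residue


# Hypothetical group: a short, mostly disordered protein and a long, mostly ordered one.
example_group = [(80, 100), (100, 1000)]
per_protein_mean, per_residue_mean = disorder_means(example_group)
print(f"per-protein: {per_protein_mean:.3f}, per-residue: {per_residue_mean:.3f}")
```

In this hypothetical group the short protein dominates the per-protein mean, whereas the long protein dominates the per-residue mean, which is why both lines are useful in the plot.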
As no external standardized annotation scheme was available for IDRs in the spliceosomal proteins, we developed a classification based on their predicted primary and secondary structure features. We divided the spliceosomal IDRs into three classes: regions with consistently predicted secondary structure (SS) elements (henceforth “disorder with SS” or “IDR with SS”), long (≥25 residues) compositionally biased IDRs without predicted secondary structure elements (henceforth “compositionally biased disorder/IDR”), and other IDRs, which we omitted from further analyses (
Having annotated the IDRs, we analyzed the distribution of different types of disorder across different groups of human spliceosome proteins. Different groups of spliceosome proteins are predicted to differ in the type of disorder they contain (
Compositionally biased disorder (Y-axis) vs. disorder with SS (X-axis). Datapoints are colored according to predicted total per-residue disorder content. Groups of all proteins of the major spliceosome and all proteins of the snRNP subunits of the major spliceosome are indicated in bold.
Among different types of compositionally biased disorder, RS-like IDRs are found in all groups of early proteins, while poly-P/Q and miscellaneous noncharged IDRs are predicted to be concentrated mainly in the U1, U2, U11/U12 and U2-related proteins. Domain-length (≥100 residues) hnRNP-type G-rich regions are found only in hnRNP proteins, but short (<100 residues) hnRNP-like G-rich regions are found, in addition to SR and Sm proteins, in A-complex and U2-related proteins (
In contrast to early proteins, proteins of the later stages of splicing are often predicted to contain high amounts of disorder with SS. These include proteins of the U5 snRNP and U4/U6 di-snRNP, proteins specific to the U4/U6.U5 tri-snRNP entity, hPrp19/CDC5L, step 2 catalytic factors, as well as B, B-act and C-complex proteins. Most of these protein groups are also predicted to be relatively ordered. In particular, for the isolated proteins of the U5 snRNP, which is predicted to be the least disordered of all the snRNP subunits, over half of the disordered residues are predicted to be in IDRs with SS. We suggest that, in the case of proteins of larger complexes, disorder with SS may acquire structure as the individual proteins of the complex come together. If so, the U5 snRNP may be almost completely ordered when the proteins come together in the complex. For the highly disordered U4/U6.U5 tri-snRNP-specific proteins, high disorder content coupled with a high content of disorder with SS suggests a high potential for structural variability. We suggest that this potential is exercised upon the assembly and disassembly of the tri-snRNP. Among compositionally biased IDRs, only RS-like domains are commonly found in the late proteins. Among proteins of the U4/U6.U5 tri-snRNP, step 2 catalytic factors and the abundant B, B-act and C complex stage-specific proteins, we identified 12 RS-like IDRs, including a single RS-like IDR in the central part of the U4/U6 di-snRNP protein U4/U6-90K and the RS-like IDR at the N terminus of the U5 snRNP protein U5-100K
We repeated our IDR analysis for 122 additional proteins consistently found in the results of proteomics analyses of the major spliceosome (
For most protein groups, adding non-abundant proteins changed IDR content values by less than 10% of the respective lengths of proteins involved (
Blue bars indicate values of intrinsic disorder content for core proteins, green bars for both core and additional spliceosome proteins. The blue and green lines indicate means for given protein groups, calculated per residue. Values for all core (blue) and all (green) proteins associated with the major spliceosome are shown in deeper shades.
Some auxiliary proteins, such as the two RS-like IDR-rich splicing coactivators SRm160/300, are both extremely long and extremely disordered (SRm300: 2752 residues, predicted 98.1% disorder content). In this particular case, the SRm160/300 proteins are thought to form a matrix promoting interactions between splicing factors
We next considered the association of post-translational modifications (PTMs) of human spliceosomal proteins with intrinsic disorder. To do so, we compared our data on IDR distribution throughout the human spliceosomal proteome with PTM data from UniProt
82.6% of all PTMs of spliceosomal proteins found in UniProt are phosphorylations (
Modification | Structural order | Disorder with SS | RS-like | Poly-P/Q | hnRNP-like G-rich | Noncharged | Charged | Other disorder | Total | Percent |
Phosphorylation (*) | 158 | 326 | 572 | 137 | 82 | 43 | 49 | 412 | 1779 | 82.6% |
Lysine N-acetylation | 127 | 30 | 12 | 4 | 6 | 0 | 3 | 27 | 209 | 9.7% |
Other N-acetylation (**) | 14 | 20 | 1 | 0 | 1 | 2 | 2 | 44 | 84 | 3.9% |
Arginine methylations (***) | 5 | 2 | 13 | 4 | 42 | 2 | 0 | 6 | 74 | 3.4% |
Lysine methylations (****) | 3 | 0 | 2 | 0 | 0 | 0 | 0 | 1 | 6 | 0.3% |
Cysteine methyl ester | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.0% |
(*) S,T and Y phosphorylation.
(**) N-terminal acetylation of MGASTV.
(***) Includes the keywords “dimethylarginine”, “asymmetric dimethylarginine”, “omega-N-methylarginine”.
(****) Includes the keywords “N6-methyllysine”, “N6, N6-dimethyllysine”, “N6, N6, N6-trimethyllysine”.
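The totals and percentages in the table above can be re-derived from the per-column counts. The following Python sketch does this and additionally reports, for each modification type, the share of sites that fall outside the “Structural order” column. The variable and function names are hypothetical, and treating every column other than “Structural order” as disordered is an assumption made for this illustration.

```python
# Counts copied from the table above; the column order follows the table header.
COLUMNS = [
    "Structural order", "Disorder with SS", "RS-like", "Poly-P/Q",
    "hnRNP-like G-rich", "Noncharged", "Charged", "Other disorder",
]
PTM_COUNTS = {
    "Phosphorylation": [158, 326, 572, 137, 82, 43, 49, 412],
    "Lysine N-acetylation": [127, 30, 12, 4, 6, 0, 3, 27],
    "Other N-acetylation": [14, 20, 1, 0, 1, 2, 2, 44],
    "Arginine methylations": [5, 2, 13, 4, 42, 2, 0, 6],
    "Lysine methylations": [3, 0, 2, 0, 0, 0, 0, 1],
    "Cysteine methyl ester": [0, 1, 0, 0, 0, 0, 0, 0],
}


def summarize(counts):
    """Return the grand total and per-modification totals and percentages."""
    grand_total = sum(sum(row) for row in counts.values())
    summary = {}
    for name, row in counts.items():
        total = sum(row)
        ordered = row[0]  # first column is "Structural order"
        summary[name] = {
            "total": total,
            "percent_of_all": 100 * total / grand_total,
            # Assumption: all remaining columns count as predicted disorder.
            "percent_in_disorder": 100 * (total - ordered) / total,
        }
    return grand_total, summary


grand_total, summary = summarize(PTM_COUNTS)
for name, stats in summary.items():
    print(f"{name}: {stats['total']} sites, "
          f"{stats['percent_of_all']:.1f}% of all PTMs, "
          f"{stats['percent_in_disorder']:.1f}% in predicted disorder")
```

The grand total obtained this way matches the 2153 modification sites reported in the Methods.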
To further analyze the possible roles of disorder that may acquire structure in the human spliceosome, we considered three sources of information: data from experimentally determined structures available in the Protein Data Bank (PDB)
We examined the experimentally determined structures of spliceosomal protein complexes to identify which regions predicted to be disordered in isolation were found to be ordered in a complex. Short disordered ligand peptides (<30 residues) that acquire structure upon binding larger partners are called Molecular Recognition Features (MoRFs)
Region name | Type | Protein | Residues | Protein group | Partner (*) | Predicted ordered/disordered status in isolation | Structure | Reference |
N-U1snRNP70_N | MoRF | U1-70K | 8–22 | U1 snRNP | U1-C (zf-U1) | disordered, next to ordered helix | 3CW1 | |
C-U1snRNP70_N | short, RNA-binding | U1-70K | 63–89 | U1 snRNP | U1 snRNA | disordered | 3CW1 | |
ULM (**) | MoRF | SF3b155 | 333–342 | U2, SF3B | SPF45 (UHM) | disordered | 2PEH | |
ULM | MoRF | U2AF65 | 90–112 | U2 snRNP-related | U2AF35 (UHM) | disordered | 1JMT | |
ULM | MoRF | SF1 | 13–25 | A-complex (***) | U2AF65 (UHM) | disordered | 1O0P | |
SF3b1 | MoRF | SF3b155 | 377–415 | U2, SF3B | SF3b14a/p14 (RRM) | partially ordered | 2F9D | |
SF3a60_bindingd | Domain-length | SF3a60 | 71–106 | U2, SF3A | SF3a120 (Surp) | partially ordered | 2DT7 | |
PRP4 | Domain-length | U4/U6-60K | 107–137 | U4/U6 di-snRNP | U4/U6-20K | partially ordered | 1MZW | |
PRP4 (****) | Domain-length | Prp18 | 77–115 | step 2 factors | | ordered | 2DK4 | |
Btz | Domain-length | MLN51 | 169–196, 215–230 | EJC | EIF4A3 | disordered, next to ordered helix | 2J0S | |
(*) Domain names in brackets.
(**) ULMs correspond to the ELM motif LIG_ULM_U2AF65_1, defined by the pattern [KR]{1,4}[KR]-x{0,1}-[KR]W-x{0,1}.
(***) Non-abundant A-complex protein.
(****) The PRP4 region of Prp18 is ordered and its structure in isolation was solved. It is included in the table since the PRP4 region of U4/U6-60K is predicted to be partially disordered.
Other recognition regions (U1snRNP70_N, SF3a60_bindingd, SF3b1, PRP4, Btz, all of which we labeled after PFAM regions) are found in complexes present at various stages of the splicing reaction. Notably, the U1snRNP70_N region encompasses two subregions; the C-terminal subregion is the only predicted disordered region shown through an experimental structure to bind RNA. Via a profile search, we found two additional candidate regions for the Btz motif and one additional candidate PRP4 region. The candidate Btz regions are found in TRAP150, an abundant A-complex protein, and its paralog BCLAF1, a low-abundance pre-mRNA/mRNA-binding protein that has been implicated in a wide range of processes
To find other potential domain-length recognition motifs in spliceosomal proteins, we considered the PFAM domains that mapped to predicted IDRs. We found 51 such PFAM domains (
Notably, when we compared the list of disordered PFAM domains with the list of the most disordered proteins in the spliceosomal proteome, we found that this group includes two out of three U4/U6.U5 tri-snRNP-specific proteins (U4/U6.U5-27K and 110K), as well as several conserved proteins associated with the B, B-act and C complex (e.g. MFAP1, RED, GCIP p29) that are also abundant in the human spliceosomal proteome
Abundance | Protein | Disorder fraction | PFAM domains | Group |
Abundant | SPF30 | 80.3% | SMN | U2 snRNP-related |
Abundant | U4/U6.U5-110K | 87.9% | SART-1 | U4/U6.U5 tri-snRNP |
Abundant | U4/U6.U5-27K | 76.8% | DUF1777 | U4/U6.U5 tri-snRNP |
Abundant | CCAP2 | 78.2% | Cwf_Cwc_15 | hPrp19/CDC5L |
Abundant | TRAP150 | 100.0% | | A-complex |
Abundant | MFAP1 | 79.3% | MFAP1_C | B-complex |
Abundant | RED | 79.5% | RED_N, RED_C | B-complex |
Abundant | MGC23918 | 100.0% | cwf18 | B-act complex |
Abundant | HSPC220 | 84.8% | Hep_59 | C-complex |
Abundant | GCIP p29 | 93.0% | SYF2 | C-complex |
Non-abundant | U11/U12-59K | 91.1% | | U11/U12 |
Non-abundant | Npw38BP | 93.8% | Wbp11 | hPrp19/CDC5L |
Non-abundant | MLN51 | 100.0% | Btz | EJC |
Non-abundant | pinin | 92.3% | Pinin_SDK_N, Pinin_SDK_memA | EJC |
Non-abundant | MGC13125 | 93.5% | Bud13 | RES |
Non-abundant | C19orf43 | 88.6% | | A-complex |
Non-abundant | FLJ10154 | 100.0% | | A-complex |
Non-abundant | CCDC55 | 100.0% | DUF2040 | B-complex |
Non-abundant | CCDC49 | 100.0% | CWC25 | B-complex |
Non-abundant | PRCC | 100.0% | PRCC_Cterm | B-act complex |
Non-abundant | DGCR14 | 86.1% | Es2 | C-complex |
Non-abundant | DKFZP586O0120 | 100.0% | DUF1754 | C-complex |
Non-abundant | FLJ22626 | 100.0% | SynMuv_product | C-complex |
Non-abundant | LENG1 | 100.0% | Cir_N | C-complex |
Non-abundant | BCLAF1 | 100.0% | | pre-mRNA/mRNA-binding |
Entries in this table simultaneously fulfill two conditions: they have a predicted disorder content >75% and contain no PFAM domains that correspond to ordered structural domains.
As spliceosomal proteins found in human are typically conserved throughout eukaryotes
The majority of both ordered and disordered PFAM domains were present in LECA (
 | Ordered domains, all proteins | Ordered domains, abundant proteins | Ordered domains, U4/U6.U5 tri-snRNP (*) | Disordered domains, all proteins | Disordered domains, abundant proteins | Disordered domains, U4/U6.U5 tri-snRNP |
all domains | 124 | 86 | 29 | 46 | 24 | 5 |
domains found in LECA | 121 | 86 | 29 | 36 | 22 | 5 |
domains found in prokaryotes (**) | 47 (37.9%) | 34 (39.5%) | 19 (65.5%) | 1 (0.0%) | 0 (0.0%) | 0 (0.0%) |
(*) Including the LSM domain present in Sm and Lsm proteins.
(**) In >100 copies.
As the final step of our analysis, we compared the fractions and distributions of intrinsic disorder in the proteomes of the subunits of the human major spliceosome and the human and the
Our comparison revealed a number of similarities and differences between the proteins of the human snRNP subunits and both ribosomes (
Feature | Ribosome, | Ribosome, human | Major spliceosome, snRNP subunits, human |
Number of proteins | 54 | 80 | 45 |
Maximum protein length (aa) | 557 (S1) | 427 (L4) | 2335 (U5-220K/hPrp8) |
Mean protein length (aa) | 132 | 170 | 453 |
Fraction of predicted disorder (% of the combined lengths of proteins) | 37.7% | 47.0% | 34.1% |
Number of proteins with at least one IDR ≥30 residues | 28 | 61 | 28 |
Number of proteins with at least one IDR ≥70 residues | 1 | 19 | 23 |
Mean IDR length (aa) | 28 | 39 | 93 |
Fraction of predicted disordered residues with secondary structure (% predicted disorder) | 66.6% | 64.0% | 41.9% |
Number of non-PSE IDRs ≥70 residues | 0 | 3 | 15 |
Fraction of predicted disordered residues found in the crystal structure of the complex (% of predicted disorder) | 98.9% | — | <10% (U1 snRNP) |
Minimal and maximal fractions of predicted disordered residues for individual subunits | 34.8% (small subunit) - 40.0% (large subunit) | 39.1% (small subunit) - 52.2% (large subunit) | 20.1% (U5 snRNP) - 65.5% (U1 snRNP) |
Maximum RNA length (nt) | 2904 (23S) | 5070 (28S) | 188 (U2 snRNA)(*) |
RNA fraction of total weight (% total weight) | 65.2% | 60.3% | 8.2% |
(*)
The inspection of crystal structures confirms the predicted differences. 98.9% of predicted disordered residues of 51
As described in the
Although, in percentages, both the ribosomes and the spliceosome contain a similar amount of SS disorder, so far, there is very little structural evidence for the “mortar” function of the proteins of the spliceosome. We found only one predicted disordered region confirmed to bind RNA in all experimental structures of the spliceosome (C-terminal part of the U1snRNP70_N region,
The spliceosome has been called a “molecular machine”
In this work, we made multiple predictions regarding individual regions of human spliceosomal proteins as well as systematically analyzed the fraction, distribution and types of disorder across the various spliceosomal components. Summarizing, we found that the spliceosome, far from being a uniformly ordered machine, can be divided into three layers:
An inner layer, which best fits the definition of a “machine”. It includes the ordered cores of U2 snRNP SF3B, U4/U6 di-snRNP and U5 snRNP, as well as the Sm proteins of U1 snRNP and ordered C termini of the catalytic helicases. This layer also includes snRNAs. Proteins from this layer mainly assist the catalysis of the splicing reaction, and publications regarding this layer stress relatively precise mechanisms, such as kinetic proofreading
A middle layer, which is associated mostly with “structured” disorder (disorder with SS). It contains an abundance of domain-length disordered recognition motifs, disorder with predicted secondary structure that can act as, e.g., preformed structural elements and/or dual personality disorder, and long, highly disordered proteins with conserved disordered regions. Spatiotemporally, this layer is associated with U4/U6.U5 tri-snRNP-specific proteins, and B, B-act and C-complex non-snRNP proteins. Functionally, this layer is associated with spliceosome assembly, catalytic activation and dynamics. Many of these regions are phosphorylated. In addition to disorder with SS, this layer is also associated with some RS-like IDRs that function in splicing dynamics, such as
An outer layer, which is associated with mostly “unstructured” disorder. It is enriched in regions of long, compositionally biased disorder that may function as sensors that the spliceosome extends to the surrounding environment. These regions contain interaction sites such as RS-like IDRs, hnRNP-like G-rich regions, polyproline regions and ULMs. They may interact with each other, or with small ordered structural domains such as the Tudor domain (bound by hnRNP-like G-rich regions) and GYF domain (bound by polyproline regions). On the other hand, small RNA-binding domains present in this layer, such as RRM (RNA Recognition Motif) and PWI, may aid in the binding of the substrate pre-mRNA. The function of this layer is regulated by phosphorylation (e.g. in RS-like IDRs) and methylation (e.g. in hnRNP-like G-rich regions). Spatiotemporally, this layer is associated with early (A-complex, U1, U2 SF3A, U11/U12, U2-related) proteins, with SR, hnRNP proteins, and SRm160/300 proteins, and with RES complex proteins. Functionally, this layer is associated with early recognition, intron/exon definition, and alternative splicing regulation processes.
Full understanding of spliceosome activity requires information about each of its elements, at different functional stages
We provide the proteins and positions of all types of compositionally biased disordered regions in spliceosomal proteins. Based on the colocation of two types of disordered regions (RS-like and G-rich), we suggest that these regions may interact with each other. As these two types of disordered regions are found in multiple proteins throughout the human spliceosomal proteome, we also suggest the possibility that many more human spliceosomal proteins interact nonspecifically with each other and the RNAs than previously suggested. Large-scale deletions of compositionally biased regions may reveal essential subsystems of this interaction network;
We found that arginine methylation in spliceosomal proteins is associated with intrinsically disordered regions. We also suggest that arginine methylation and serine phosphorylation act in step to regulate the interaction network based on compositionally biased disordered regions. The elucidation of the effect of post-translational modifications, such as conformational transitions and molecular interactions that depend on the introduction or removal of particular modifications, can also lead to an improved understanding of regulatory mechanisms;
We provide candidate ULM sequences that can bind known and predicted UHM domains throughout the early stages of splicing. These sequences may participate in the regulation of particular instances of splicing;
We suggest several abundant conserved proteins found in the later stages of splicing that may function as “hub” proteins (e.g. MFAP1, GCIP p29, U4/U6.U5 tri-snRNP proteins). Targeted deletions of ordered motifs within these proteins may reveal regions responsible for the formation of particular spliceosomal complexes, their rearrangements, and interactions with regulatory factors.
Our prediction that more than one-third of the residues of the snRNPs are disordered has significant implications for the structural studies of the spliceosome. While much progress has been achieved in the determination of global shapes of various spliceosomal assemblies by cryoEM
Spliceosome proteins with GI identifiers supplied in
Initial predictions of intrinsic disorder were carried out using the GeneSilico MetaDisorder server (
In disorder with SS, the disordered region is predicted to contain one or both types of canonical α and β SS elements. The predicted secondary structure may be either pre-formed in the disordered state or appear only upon the formation of a stable structure, e.g. upon binding to another molecule. This type of disorder also at times contains short ordered regions (
IDR class | Description | Number of regions | Mean length | Compositional bias |
disorder with SS | contains secondary structure | 95 (predicted to contain coiled coils), 115 (other types) | 64 aa (predicted to contain coiled coils), 55 aa (other types) | RKDE with additional MQW (predicted to contain coiled coils), no rule (other types) |
compositionally biased, RS-like | biased towards arginine and serine residues | 35 | 65 aa | RS |
compositionally biased, polyP/Q | noncharged with poly-P/Q repeats (P/Q(n), n≥3) | 17 | 138 aa | PQMGVWA |
compositionally biased, hnRNP G-rich | contains RGG and related repeats ([RSY]GG, R[AGT][AGTFIVR]) (*) | 4 (hnRNP proteins), 10 (other proteins) | 145 aa (hnRNP proteins), 56 aa (other proteins) | GRY |
compositionally biased, noncharged | biased towards noncharged residues | 16 | 45 aa | PQMGVWA |
compositionally biased, charged | biased towards charged residues | 9 | 57 aa | RKDE |
(*)
We defined regions of disorder with SS (predicted intrinsic disorder with predicted secondary structure elements) as regions for which simultaneously the majority of intrinsic disorder prediction methods on the MetaServer gateway yielded predictions of disorder and the majority of secondary structure prediction methods yielded predictions of secondary structure elements. Multiple closely spaced secondary structure elements (connected by loops <20 residues) in a predicted disordered region were treated as elements of a single IDR with SS. If an IDR was predicted to contain α-helical elements and coiled-coil prediction methods aggregated on the MetaServer also yielded a prediction, the IDR was classified into the special class of disorder with coiled coils.
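The two rules in this definition, a majority vote across predictors and the merging of secondary structure elements connected by loops shorter than 20 residues, can be sketched in Python as follows. The data layout (Boolean votes per method, inclusive start/end residue positions) and the function names are hypothetical; the actual MetaServer output format is not described here.

```python
from typing import List, Tuple

Segment = Tuple[int, int]  # inclusive (start, end) residue positions


def majority(votes: List[bool]) -> bool:
    """True if strictly more than half of the methods voted positively."""
    return sum(votes) * 2 > len(votes)


def merge_elements(elements: List[Segment], max_loop: int = 20) -> List[Segment]:
    """Merge SS elements separated by loops shorter than max_loop residues."""
    merged: List[Segment] = []
    for start, end in sorted(elements):
        # Loop length is the number of residues between two consecutive elements.
        if merged and start - merged[-1][1] - 1 < max_loop:
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


# Hypothetical predicted elements: the first two are 14 residues apart, the third is 29 away.
example_elements = [(10, 20), (35, 50), (80, 95)]
example_regions = merge_elements(example_elements)
print(example_regions)

# Hypothetical votes from three disorder predictors for one region.
print(majority([True, True, False]))
```

A tie between methods is not counted as a majority in this sketch; whether ties were handled this way in the analysis is not stated.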
In compositionally biased disorder, the amino acid composition of the region deviates highly from the usual. We estimated compositional bias based on the absolute frequencies of occurrence of residues, compared to their usual frequency in vertebrates, as reported on the website
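One simple way to express such a deviation is the ratio of the observed residue frequency in a region to a background frequency. The Python sketch below illustrates this idea only. The uniform background is a placeholder and not the vertebrate frequencies referred to above, the example sequence is invented, and the enrichment ratio is an assumed measure, since the exact statistic is not specified here.

```python
from collections import Counter
from typing import Dict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def enrichment(region: str, background: Dict[str, float]) -> Dict[str, float]:
    """Return observed/background frequency ratios for the 20 standard residues."""
    if not region:
        raise ValueError("region must not be empty")
    counts = Counter(region)
    length = len(region)
    return {aa: (counts[aa] / length) / background[aa] for aa in AMINO_ACIDS}


# Placeholder background: every residue equally frequent (NOT real vertebrate values).
uniform_background = {aa: 1 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}

# Hypothetical RS-like fragment.
example_region = "RSRSRSRSPG"
ratios = enrichment(example_region, uniform_background)
top_two = sorted(AMINO_ACIDS, key=lambda aa: ratios[aa], reverse=True)[:2]
print(top_two)
```

With real background frequencies substituted for the placeholder, residues whose ratios are well above 1 would indicate the bias class of a region (for example R and S for RS-like regions).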
For several types of compositionally biased IDRs with a previous description in literature, we sought to define relevant standard IDR subclasses within our classification (
RS-like: IDRs that are rich in arginine and serine residues. These regions were shown to be intrinsically disordered
polyP/Q: IDRs that contain repeats of proline or glutamine residues. polyP/Q regions are capable of generating type II poly-P or poly-Q helices
hnRNP-like G-rich: IDRs that contain RGG and related repeats ([RSY]GG, R[AGT][AGTFIVR]) and that can be classified as short (<100 residues) or long (≥100 residues). These regions are predicted to have low solvent accessibility (
We also developed two additional subclasses of compositionally biased IDRs to complement these classes:
“noncharged” disorder, which is rich in noncharged residues (PQMGVWA);
“charged” disorder, which is rich in charged residues (RKDE). The “charged” compositionally biased disorder is similar to a type of disorder with SS that has predictions for coiled-coil secondary structure.
Site identifiers of 2153 known or possible post-translational modifications, including 720 modifications of the 122 core proteins, were downloaded from UniProt
Assignment of boundaries for hnRNP-like G-rich regions and for positions of candidate ULMs was based on pattern analysis. For hnRNP-like G-rich regions, the following patterns were used: [RSY]GG-x{1,50}-[RSY]GG-x{1,50}-[RSY]GG; R[AGT][AGTFIVR]-x{1,25}-RGG-x{1,25}-R[AGT][AGTFIVR]. For ULMs, the following pattern was used: [RK]{1,}-[RK]-x{0,1}-[RK]{1,}-x{0,1}-W-x{0,2}-[DE]{1,}. The ULM consensus pattern was based on the sequences of known ULMs found in experimentally determined structures of ULM complexes. This stringent pattern does not retrieve all of the
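The patterns quoted above translate directly into regular expressions. The following Python sketch shows one such translation, reading `x{m,n}` as m to n arbitrary residues. The constant names, the helper `find_regions` and the test sequences are hypothetical, and the greedy, non-overlapping matching of Python's `re` module may assign region boundaries differently from the procedure used in the analysis.

```python
import re
from typing import List, Tuple

# Translations of the patterns given in the text ("x" = any residue).
G_RICH_RGG = re.compile(r"[RSY]GG.{1,50}[RSY]GG.{1,50}[RSY]GG")
G_RICH_MIXED = re.compile(r"R[AGT][AGTFIVR].{1,25}RGG.{1,25}R[AGT][AGTFIVR]")
ULM = re.compile(r"[RK]+[RK].?[RK]+.?W.{0,2}[DE]+")


def find_regions(pattern: "re.Pattern[str]", sequence: str) -> List[Tuple[int, int, str]]:
    """Return (start, end, match) for each non-overlapping hit.

    Positions are 0-based and the end is exclusive, as in Python slicing.
    """
    return [(m.start(), m.end(), m.group()) for m in pattern.finditer(sequence)]


# Invented test sequences, not taken from any spliceosomal protein.
ulm_hits = find_regions(ULM, "AAKRKRSRWDEAA")
rgg_hits = find_regions(G_RICH_RGG, "RGGAAASGGAAAYGG")
mixed_hits = find_regions(G_RICH_MIXED, "RGAAARGGAARTV")
print(ulm_hits, rgg_hits, mixed_hits)
```

Because `finditer` reports non-overlapping matches only, overlapping candidate regions in a real sequence would need a lookahead-based scan or manual inspection.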
PFAM IDs were assigned on the PFAM website
Although a crystal structure of a eukaryotic ribosome has been recently determined, many amino acid residues within this structure are unassigned
Disorder and binding disorder plots were generated using the ANCHOR server (
We thank Łukasz Kozłowski for help with his software, Adam Godzik and Christian Zmasek for the list of LECA domains, Ben Blencowe and Christos Ouzonis for help with RS domains. IK thanks Peter Tompa for the kind gift of his book on protein disorder. We thank Reinhard Lührmann, Elżbieta Purta, Anna Czerwoniec, Łukasz Kozłowski, Joanna Kasprzak, and Marcin Magnus for critical reading of the manuscript, useful comments and suggestions.