Accurate, high-coverage assignment of in vivo protein kinases to phosphosites from in vitro phosphoproteomic specificity data

Abstract

Phosphoproteomic experiments routinely observe thousands of phosphorylation sites. To understand the intracellular signaling processes that generated this data, one or more causal protein kinases must be assigned to each phosphosite. However, limited knowledge of kinase specificity typically restricts assignments to a small subset of a kinome. Starting from a statistical model of a high-throughput, in vitro kinase-substrate assay, I have developed an approach to high-coverage, multi-label kinase-substrate assignment called IV-KAPhE (“In vivo-Kinase Assignment for Phosphorylation Evidence”). Tested on human data, IV-KAPhE outperforms other methods of similar scope. Such computational methods generally predict a densely connected kinase-substrate network, with most sites targeted by multiple kinases, pointing either to unaccounted-for biochemical constraints or significant cross-talk and signaling redundancy. I show that such predictions can potentially identify biased kinase-site misannotations within families of closely related kinase isozymes and they provide a robust basis for kinase activity analysis.

Author Summary

Proteins can pass around information inside cells about changes in the environment. This process, called intracellular signaling, helps to trigger appropriate cellular responses to environmental changes. One of the main ways information is passed to proteins is through chemical “tagging,” called phosphorylation, by enzymes called protein kinases. We can measure the phosphorylation state of practically all proteins in a cell at any moment. Starting from known cases of phosphorylation by a kinase, many computational methods have been developed to predict if the kinase might tag a certain spot on another protein or if an observed tag was attached by the kinase, with different models for each kinase. I have developed a new method that instead uses a single model to assign one or more kinases to each observed tag, built from the latest large-scale experimental data. This change in focus and unbiased training data allows my method to be significantly more accurate than past methods. I also explored useful applications for my method. For example, I used it to show that much of our knowledge about which kinase is responsible for each tag is probably inaccurately biased towards the commonly studied ones.

This is a PLOS Computational Biology Methods paper.

Introduction

Protein phosphorylation is the most common form of post-translational modification and it plays a central role in intracellular signaling. Diverse protein kinases catalyze the binding of a phosphate group to a substrate acceptor residue, typically serines (S), threonines (T) or tyrosines (Y) in eukaryotes. The active sites of kinases’ enzymatic domains exhibit phosphoacceptor-residue specificity, which can be broadly classified in eukaryotes as serine/threonine (S/T)-specific, tyrosine (Y)-specific, or so-called “dual-specificity” kinases. Kinase substrate-specificity is further determined by the protein primary and secondary structural contexts around the phosphoacceptor residue, as well as by allosteric structural mediation of docking [1, 2].

The sequence contexts around known phosphorylation sites (“phosphosites”) have been widely used in computational approaches to predict new substrate sites of a protein kinase. Numerous methods have been developed to achieve this, employing, for example, scoring matrices [3–9], neural networks [10–12], support vector machines [13, 14], sequence clustering [15–17], kinase structure [18], or grammatical inference [19]. With the emergence of phosphoproteomics by liquid chromatography and tandem mass spectrometry (LC-MS/MS) enabling the routine detection of thousands of phosphorylation sites in a single experiment [20], there is now little need to predict new phosphorylation sites in silico. The problem has instead changed to one of kinase-phosphosite assignment. Accordingly, new methods have emerged, based on models such as support vector machines [21, 22], multiple kernel learning [23], Naïve Bayes [24], networks [25, 26], and knowledge graphs [27]. Classical scoring matrices are often still used within these or related methods to model kinase substrate specificity [23, 25, 26, 28].

Two major challenges face modern kinase-substrate assignment methods. First, dependence on literature-derived annotations for model training is subject to biases towards more commonly studied protein kinases [29]. This strongly limits and biases the kinases for which assignments can be made, whereas phosphoproteomic data requires unbiased, kinome-scale assignment. Many phosphosite-prediction methods resolve this imbalance by making predictions at the level of kinase families; however, these predictions will still be biased towards the features of the well-studied family members [29]. Given the increasing evidence of distinct functional roles even among closely related kinase isozymes (see, e.g., refs. [30–33]), it is exigent that phosphorylation events be resolved at the level of individual kinases. The second challenge is that many sites can be phosphorylated by more than one kinase [34]. However, a lack of complete, “all-versus-all” kinase-site training sets has prevented assignment methods from being constructed and evaluated in such a multi-label setting. As a result, the predictive performance for promiscuous or strongly specific kinases can dominate validation metrics calculated across all kinase assignments.

Here, I describe improvements in multi-label, kinase-phosphosite assignment over previous methods for a large subset of the human kinome. I achieve this without sophisticated machine-learning methods by focusing on the above considerations and building on the latest proteome-scale, hypothesis-free data. The result is a nested model called IV-KAPhE which is built on models of in vitro kinase specificity and in vivo functional association. First, I describe the approach used to construct the model. Then I consider the hypotheses produced by it and other computational methods about the topology of the human phosphorylation network and the utility of these methods in inferring protein kinase activity from quantitative phosphoproteomic data.

Materials and methods

In vitro protein kinase specificity models

For detailed, mathematical descriptions of specificity model construction, see Supplemental Methods in S1 Text. They are described here in brief. All scoring-matrix models were built from 15-residue sequence windows around the phosphoacceptor (+/-7 residues). To reduce the influence of highly similar sequences, which likely reflect shared evolutionary history between substrates rather than kinase specificity, substrate sequences were weighted before counting [35]. Position-specific pseudocounts were added [36] and supplemented according to missing tail residues if sites localized to the 5’ or 3’ tail of the substrate protein. Column weights were calculated as the relative entropy versus background frequencies (position-frequency matrix (PFM) and phosphoproteome-backed position-specific scoring matrices (PSSMs): position-specific residue frequencies from the full set of observed sequence windows; proteome-backed PSSMs: proteomic residue frequencies). PFM and PSSM scores were min-max normalized by calculating the theoretical minimum and maximum scores produced by the PSSM [8].
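The scoring-matrix steps above can be illustrated with a minimal Python sketch. It is not the motif-kit implementation: the function names are hypothetical, a flat pseudocount stands in for the position-specific pseudocounts [36], sequence weighting [35] is omitted, and applying the relative-entropy column weight by multiplying each log-ratio is an assumption.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def position_frequencies(windows, pseudocount=1.0):
    """Per-position residue frequencies (a PFM) with a flat pseudocount (assumed)."""
    pfm = []
    for pos in range(len(windows[0])):
        counts = {aa: pseudocount for aa in AMINO_ACIDS}
        for window in windows:
            if window[pos] in counts:  # ignore gap characters at protein termini
                counts[window[pos]] += 1.0
        total = sum(counts.values())
        pfm.append({aa: c / total for aa, c in counts.items()})
    return pfm

def relative_entropy(column, background):
    """Relative entropy (in bits) of one PFM column versus its background."""
    return sum(p * math.log2(p / background[aa]) for aa, p in column.items())

def build_pssm(pfm, background):
    """Log-ratio scores, each column scaled by its relative entropy (assumed scheme)."""
    pssm = []
    for column, bg in zip(pfm, background):
        weight = relative_entropy(column, bg)
        pssm.append({aa: weight * math.log2(column[aa] / bg[aa]) for aa in AMINO_ACIDS})
    return pssm

def normalized_score(pssm, window):
    """Min-max normalize a raw score by the PSSM's theoretical score range."""
    raw = sum(column.get(aa, 0.0) for column, aa in zip(pssm, window))
    lowest = sum(min(column.values()) for column in pssm)
    highest = sum(max(column.values()) for column in pssm)
    if highest == lowest:
        return 0.0
    return (raw - lowest) / (highest - lowest)
```

Because the minimum and maximum are taken from the matrix itself rather than from observed scores, every kinase's scores land on the same 0 to 1 scale regardless of how many substrates were used to build its matrix.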

The multi-label Naïve Bayes model was built using four components: the PFM constructed from a kinase’s substrates; the PFM of all the substrates not phosphorylated by the kinase; and the prior probabilities of a site being phosphorylated by the kinase or by any other kinase. The priors were empirically estimated as the fraction of substrates in the experiment that were phosphorylated by the kinase or not, respectively. For determining the posterior probability that a kinase K phosphorylates a site with sequence window S, given PFM scores $s_K$ for K and $s_{\bar{K}}$ for all other kinases and prior probabilities $P(K)$ and $P(\bar{K})$, the following was calculated:

$$P(K \mid S) = \frac{s_K \, P(K)}{s_K \, P(K) + s_{\bar{K}} \, P(\bar{K})}$$

A kinase was assigned to a site if P(K|S)>0.5.
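The assignment rule can be sketched in Python as follows, assuming the standard two-class form of Bayes' rule in which the two PFM scores act as likelihoods. The function names and the numeric values in the usage example are hypothetical and chosen only for illustration.

```python
def posterior(s_k, s_not_k, prior_k):
    """P(K|S) from the kinase and non-kinase PFM scores and the kinase prior."""
    numerator = s_k * prior_k
    denominator = numerator + s_not_k * (1.0 - prior_k)  # P(not K) = 1 - P(K)
    return numerator / denominator if denominator > 0 else 0.0

def assign_kinases(scores, priors, cutoff=0.5):
    """Return every kinase whose posterior for the site exceeds the cutoff.

    scores: {kinase: (s_K, s_notK)} for one sequence window (hypothetical layout).
    """
    return sorted(
        kinase
        for kinase, (s_k, s_not_k) in scores.items()
        if posterior(s_k, s_not_k, priors[kinase]) > cutoff
    )

# Illustrative values only: one site scored against two kinases.
example_scores = {"AKT1": (1e-8, 1e-11), "FYN": (1e-14, 1e-10)}
example_priors = {"AKT1": 0.01, "FYN": 0.005}
assigned = assign_kinases(example_scores, example_priors)
```

Each kinase is evaluated independently against the same cutoff, so a site can receive zero, one, or several kinases, which is what makes the assignment multi-label.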

The “Naïve Bayes+” model further incorporated Bernoulli-distributed features into the PFM-based likelihood function. Probability parameters of substrates being direct or indirect interaction partners, of carrying domains enriched among a kinase’s substrates or interaction partners, or of substrate sites being within a predicted protein domain were estimated empirically from the training data. Domain enrichment was calculated among substrates or interaction partners via Fisher’s hypergeometric test and p-values were adjusted for false-discovery rate before being tested at a critical value of 0.05. To facilitate the scoring of new human phosphosites, the Naïve Bayes+ model for human kinases, as generated and applicable by the motif-kit software (see “Implementation”, below), as well as the domain enrichment and kinase interaction data have been archived on Zenodo (doi: 10.5281/zenodo.6325198).
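The domain-enrichment step can be sketched as below. The text specifies Fisher's hypergeometric test and a false-discovery-rate adjustment at 0.05. The one-sided (over-representation) tail and the Benjamini-Hochberg procedure used here are assumptions, and the function names are hypothetical.

```python
from math import comb

def hypergeom_enrichment_p(k, n, K, N):
    """One-sided P(X >= k) under the hypergeometric distribution.

    k: kinase substrates carrying the domain, n: all substrates of the kinase,
    K: background proteins carrying the domain, N: all background proteins.
    """
    upper = min(n, K)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        index = order[rank - 1]
        running_min = min(running_min, pvals[index] * m / rank)
        adjusted[index] = running_min
    return adjusted

def enriched_domains(pvals_by_domain, alpha=0.05):
    """Domains whose adjusted p-value falls below the critical value."""
    domains = sorted(pvals_by_domain)
    adjusted = benjamini_hochberg([pvals_by_domain[d] for d in domains])
    return [d for d, q in zip(domains, adjusted) if q < alpha]
```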

Multi-label cross-validation.

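For multi-label performance evaluation, I performed 10 iterations of 10-fold cross-validation, restricting to kinases with models trained on at least 20 substrates. Both kinase substrate-domain enrichment and sequence specificity models were recalculated from each fold’s training subset. Performance was evaluated using macro-averaged (averaged over the set of all kinases, $\mathcal{K}$) precision, recall and F1 scores:

$$\mathrm{Precision}_{\mathrm{macro}} = \frac{1}{|\mathcal{K}|} \sum_{K \in \mathcal{K}} \frac{TP_K}{TP_K + FP_K}$$

$$\mathrm{Recall}_{\mathrm{macro}} = \frac{1}{|\mathcal{K}|} \sum_{K \in \mathcal{K}} \frac{TP_K}{TP_K + FN_K}$$

$$F1_{\mathrm{macro}} = \frac{1}{|\mathcal{K}|} \sum_{K \in \mathcal{K}} \frac{2\,TP_K}{2\,TP_K + FP_K + FN_K}$$

where $TP_K$, $FP_K$, and $FN_K$ are the true-positive, false-positive, and false-negative counts for kinase K, respectively. If a score was undefined due to division by 0, it was set to 0.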
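A minimal Python sketch of these macro-averaged metrics follows. It assumes that macro-F1 is the mean of the per-kinase F1 scores (rather than the harmonic mean of macro-precision and macro-recall), and the data layout of (kinase, site) pairs is hypothetical.

```python
def macro_scores(true_pairs, predicted_pairs, kinases):
    """Macro-averaged precision, recall, and F1 over a list of kinases.

    true_pairs, predicted_pairs: sets of (kinase, site) tuples.
    Undefined per-kinase scores (division by zero) are set to 0.
    """
    precisions, recalls, f1s = [], [], []
    for kinase in kinases:
        truth = {site for k, site in true_pairs if k == kinase}
        predicted = {site for k, site in predicted_pairs if k == kinase}
        tp = len(truth & predicted)
        fp = len(predicted - truth)
        fn = len(truth - predicted)
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
        f1s.append(2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0)
    n = len(kinases)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n
```

Under this convention, a kinase that receives no assignments at a stringent cutoff contributes a precision of 0, so it lowers the macro-average instead of dropping out of it.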

Implementation.

Kinase specificity model training and scoring were implemented in and performed using a bespoke, free and open-source toolkit called motif-kit (https://www.gitlab.com/brandoninvergo/motif-kit). Version 1.0, used here, is archived at Zenodo (doi: 10.5281/zenodo.6325136). The code was written in ANSI C99 for POSIX systems and is dependent only on the GNU Scientific Library (https://www.gnu.org/software/gsl) and the HDF5 library (https://www.hdfgroup.org/solutions/hdf5/), with unit tests further depending on the Check testing framework (https://libcheck.github.io/check/). All other methods (domain enrichment, multi-label CV, etc.) were implemented using R (version 4.1.0).

IV-KAPhE model

Model construction.

The IV-KAPhE model was built via the Random Forest method, as implemented in the R package “ranger” (version 0.13.1; [37]). The models were built with 500 trees. The variable importance mode (parameter “importance”) was set to “impurity” (the Gini index) and the splitting rule (parameter “splitrule”) was set to “gini”. The model was trained to classify a given kinase-phosphosite pair as “true” (the kinase phosphorylates the site) or “false” (it does not) and set to provide a probability for each class. Feature selection was performed via variable-importance analysis, as implemented in ranger. The final list of features retained for model construction was: Naïve Bayes+ posterior probability, GO BP and CC semantic similarity, STRING coexpression and experimental scores, whether the substrate protein is itself a kinase, the kinase type (S/T or Y), and the site class (S/T or Y). The model was also tested using a Support Vector Machine instead of Random Forest, as implemented in the R package “e1071” (version 1.7–9) using the default settings for classification.

Cross-validation.

In vivo kinase-substrate relationships identified in the PhosphoSitePlus database [34] were used as true positives in the training set for cross-validation. The same kinases from the true positive set (including multiple occurrences) were randomly assigned to other sites from the human phosphoproteome [38] to form a negative set. To this end, S/T kinases were randomly assigned to S/T sites and Y kinases were assigned to Y sites. Additionally, any S/T or Y kinases that were annotated as phosphorylating the opposite site type in the true positive set were randomly assigned to a similar proportion of such sites in the negative set. In all cases, the proportion of substrate kinases observed in the positive set was maintained in the random negative set. Candidate sites excluded any site found in the true positive training set or the external testing set (see below). 10-fold cross-validation was performed and evaluated, restricted to kinases with Naïve Bayes+ models trained on at least 20 substrates, via multi-label precision, recall, and F1 as described above. Cross-validation was performed ensuring that any kinase present in a fold had at least one positive and one negative site.
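The core of the negative-set construction can be sketched in Python as below. This simplified sketch keeps kinase multiplicity and matches kinase type to phosphoacceptor type, but it omits the opposite-type assignments and the preservation of the substrate-kinase proportion described above. The function name, identifiers, and data layout are hypothetical.

```python
import random

def sample_negatives(positives, candidate_sites, kinase_type, seed=0):
    """Pair each positive-set kinase occurrence with a random type-matched site.

    positives: list of (kinase, site_id) true-positive pairs.
    candidate_sites: {site_id: "ST" or "Y"} drawn from the phosphoproteome.
    kinase_type: {kinase: "ST" or "Y"}.
    """
    rng = random.Random(seed)  # fixed seed for a reproducible negative set
    used_sites = {site for _, site in positives}
    pools = {"ST": [], "Y": []}
    for site, acceptor in sorted(candidate_sites.items()):
        if site not in used_sites:  # never reuse a true-positive site
            pools[acceptor].append(site)
    return [
        (kinase, rng.choice(pools[kinase_type[kinase]]))
        for kinase, _ in positives  # one negative per positive occurrence
    ]
```

Drawing one negative per positive occurrence yields a balanced set in which frequently annotated kinases are equally frequent among the negatives.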

External validation.

IV-KAPhE was trained using the full PhosphoSitePlus and random kinase-site pair training set as described above. An external evaluation set was prepared by identifying kinase-substrate relationships inferred via the ProtMapper method [39] which were not present in PhosphoSitePlus (in vivo or in vitro). These sites were accompanied by further random negative kinase-site pairs as described above. Predictions made by the model on this testing set were evaluated via multi-label precision, recall, and F1.

I furthermore evaluated the assignments for these kinase-site pairs made by phosphoproteome-backed PSSMs, Naïve Bayes+, and three other, previously published tools with similar kinomic scope or model architecture: NetworKIN 3.0 [40], GPS 5.0 [17], and LinkPhinder [27]. NetworKIN and GPS were run in-house with their default settings, whereas the LinkPhinder scores produced by the authors were used. For further evaluation, the published stringent LinkPhinder cutoff of 0.672 [27] and the nominal NetworKIN cutoff of 1.0 [40] were used.

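The test set was then re-evaluated with all methods, now allowing all-versus-all kinase-site assignments, restricted to sites that had at least one true kinase assigned. The rate of novel assignments was estimated via the macro-averaged false discovery rate (FDR) and compared to the macro-averaged true positive rate (TPR; identical to recall), each averaged over the set of kinases $\mathcal{K}$:

$$\mathrm{FDR}_{\mathrm{macro}} = \frac{1}{|\mathcal{K}|} \sum_{K \in \mathcal{K}} \frac{FP_K}{TP_K + FP_K}, \qquad \mathrm{TPR}_{\mathrm{macro}} = \frac{1}{|\mathcal{K}|} \sum_{K \in \mathcal{K}} \frac{TP_K}{TP_K + FN_K}$$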
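The all-versus-all re-evaluation can be sketched as follows. Setting an undefined per-kinase FDR to 0 mirrors the convention stated for precision and recall but is an assumption here, as are the function name and data layout.

```python
def macro_fdr_tpr(true_pairs, predicted_pairs, kinases):
    """Macro-averaged FDR and TPR, restricted to sites with a known kinase.

    true_pairs, predicted_pairs: sets of (kinase, site) tuples.
    """
    annotated_sites = {site for _, site in true_pairs}
    # Drop predictions on sites that have no annotated kinase at all.
    kept = {(k, s) for k, s in predicted_pairs if s in annotated_sites}
    fdrs, tprs = [], []
    for kinase in kinases:
        truth = {s for k, s in true_pairs if k == kinase}
        predicted = {s for k, s in kept if k == kinase}
        tp = len(truth & predicted)
        fp = len(predicted - truth)  # novel, unannotated assignments
        fn = len(truth - predicted)
        fdrs.append(fp / (tp + fp) if tp + fp else 0.0)
        tprs.append(tp / (tp + fn) if tp + fn else 0.0)
    n = len(kinases)
    return sum(fdrs) / n, sum(tprs) / n
```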

“False positive” (FP) is used here by convention although it is a misnomer in the present case, as we do not know that the assignments are false.

Kinase assignment distributions

IV-KAPhE and Naïve Bayes+ were applied to perform all-versus-all assignments for kinases with specificity models built from at least 20 substrates against a union of the PhosphoSitePlus human phosphosite set [34] and a high-confidence set of human phosphosites [38]. The numbers of kinases assigned to each phosphosite and the median number of kinases assigned to sites on each substrate protein were analyzed via histograms. All literature-derived assignments in PhosphoSitePlus and all assignments provided by the authors of LinkPhinder at a cutoff of 0.672 were analyzed similarly.

Kinase activity analysis

Previously published quantitative phosphoproteomic measurements from a multi-inhibitor experiment [41] were filtered to remove missing data and measurements taken under serine/threonine-protein phosphatase 2A inhibition. If a phosphosite was observed on multiple peptides in the data, the peptide with the greatest dynamic range between conditions was retained. Protein kinases that are regulated downstream of the kinases targeted for inhibition were retrieved from two sources: Omnipath, a meta-database of protein-protein regulatory relationships [42], keeping only kinase-kinase regulatory relationships with a consensus sign (activating or inhibiting); and a set of computational predictions of signed kinase-kinase regulatory relationships [28], with a stringent posterior-probability cutoff of 0.75. Multi-label kinase-substrate assignment was then performed for all target kinases and each of their regulatory-substrate kinases using IV-KAPhE, NetworKIN 3.0, and LinkPhinder at their nominal cutoffs as described above and using GPS 5.0 at its “medium” stringency setting. Furthermore, in vivo kinase-substrate annotations were retrieved from PhosphoSitePlus for the sites.

For each kinase-substrate assignment source, kinase activities were inferred as follows. For a given kinase and inhibition condition, the log2 fold-changes of any of the kinase’s assigned substrates were tested for significant difference from the mean log2 fold-change for that condition via a two-sided Z-test. The final activity was inferred as $-\operatorname{sgn}(\bar{x}) \log_{10}(p)$, where sgn is the sign function, $\bar{x}$ is the mean fold-change of the kinase’s assigned substrates, and p is the p-value of the Z-test. For example, if the kinase’s assigned substrates have a significantly lower distribution than the full sample, the inferred activity will take a large negative value.
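The activity inference can be sketched in Python as below. Several details are assumptions rather than facts from the text: the Z-test uses the mean and population standard deviation of all log2 fold-changes in the condition, the logarithm is base 10, and the p-value is floored to avoid taking the logarithm of zero. The function name is hypothetical.

```python
import math
from statistics import mean, pstdev

def kinase_activity(substrate_lfc, condition_lfc):
    """Signed -log10 p-value of a two-sided Z-test on substrate fold-changes.

    substrate_lfc: log2 fold-changes of the kinase's assigned substrates.
    condition_lfc: log2 fold-changes of all sites measured in the condition.
    """
    n = len(substrate_lfc)
    substrate_mean = mean(substrate_lfc)
    standard_error = pstdev(condition_lfc) / math.sqrt(n)
    z = (substrate_mean - mean(condition_lfc)) / standard_error
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail probability
    p = max(p, 1e-300)  # floor to keep log10 finite (assumed safeguard)
    sign = (substrate_mean > 0) - (substrate_mean < 0)
    return -sign * math.log10(p)
```

Substrates that are, on average, strongly down-regulated relative to the condition give a small p-value and a negative sign, and hence a large negative activity.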

Results

I developed IV-KAPhE over three stages. First, I sought the best sequence-based model to represent in vitro kinase sequence specificity. Next, I incorporated physical-interaction and structural factors that mediate phosphorylation under the in vitro, context-free conditions of the training data. Finally, I nested the results of the in vitro model with features for predicting kinase-substrate functional associations in a predictor of in vivo phosphorylation (Fig 1).

Fig 1. IV-KAPhE is a multi-label kinase-phosphosite assignment method that nests a multi-label model of in vitro assignment in a model of in vivo functional association.

Naïve Bayes+ consists of sub-models for each kinase, trained from kinase-specific, high-throughput in vitro kinase-substrate relationships. These sub-models together comprise a final, multi-label Naïve Bayes model. IV-KAPhE is a monolithic, multi-kinase Random Forest model trained from all literature-derived kinase-substrate annotations and random pairs as negative cases.

https://doi.org/10.1371/journal.pcbi.1010110.g001

Naïve Bayes is more appropriate than PSSM models for building a multi-label assignment method

To construct specificity models for a large fraction of the human kinome, I used results from a recent phosphoproteomic, in vitro assay of kinase specificity [43]. In this experiment, protein extracts from HeLa cells were first treated with a thermo-sensitive protein phosphatase and then spiked with a recombinant protein kinase. The phosphorylated extracts were digested and subjected to phosphoproteomic analysis by LC-MS/MS. This provided in vitro substrates for 349 protein kinases, ranging from 1 to 1672 substrates per enzyme, with 322 kinases having at least 20 substrates. As a benefit of using a single cell line, each kinase was exposed to approximately the same set of potential substrates, detectable within the limits of random sampling by shotgun proteomics.

With this data, I sought to identify the best-performing specificity model on which to build a kinase-substrate assignment method. I first considered three scoring matrix-based specificity models: the position-frequency matrix (PFM) alone, the position-specific scoring matrix (PSSM) with log-likelihood ratios of the PFM to proteomic residue frequencies, and the log-likelihood ratio PSSM backed by position-specific phosphoproteomic residue frequencies. I hypothesized that the phosphoproteome-backed PSSM would be most appropriate for the task. Next, I took advantage of the fact that the design of the experiment provides true-negative kinase-site assignments to reformulate PFM-based specificity as a multi-label Naïve Bayes model [44]. I expected that the additional prior probability information would further strengthen assignments over PSSM models.

Breaking from convention (e.g. ref. [8]), for performing multi-label kinase-substrate assignment I retained the central residue in scoring. This allowed me to assign S/T and Y kinases simultaneously and to handle dual-specificity kinases cleanly. To strengthen the distinction between assignment to S/T and Y kinases, I incorporated a position-weighting scheme into the models to provide greater scoring weight to highly resolved positions. I opted to use relative entropy for weighting as it provides better separation between well-defined and degenerate positions than information content does (Fig A in S1 Text). To simplify multi-label assignment, I aimed to use a single score cutoff for all kinases. Because each kinase’s PFM is unique, the scores produced by PFMs or PSSMs will have different theoretical and empirical ranges for each kinase (Fig B in S1 Text). As a result, an effective score cutoff for one kinase might be outside the theoretical range for another kinase. To overcome this, I min-max normalized the PFM and PSSM scores to be between 0 and 1 [8].

I evaluated the models via 10-fold cross-validation. As the goal is to have good performance across all kinases, I chose macro-averaged precision and recall (i.e. averaged across kinases) as evaluation metrics. I avoided the Receiver Operating Characteristic (ROC) analysis commonly used in single-label prediction because the strong imbalance between positive and negative cases per kinase would de-emphasize false positives and inflate the area under the curve [45]. Using these metrics, I found that raw PFMs performed poorly, while PSSM-based methods and Naïve Bayes showed overall similar performance (Fig 2a). Nevertheless, phosphoproteome-backed PSSMs out-performed proteome-backed ones, confirming that a phosphoproteome background is preferable for kinase-substrate assignment. They also performed slightly better than Naïve Bayes, against my expectations.

Fig 2. Naïve Bayes models of in vitro kinase-phosphosite assignment have important performance differences from PSSM-based methods.

a) PSSM methods and Naïve Bayes perform similarly in cross-validation of multi-label kinase-substrate assignment via macro-averaged precision versus recall. The expanded Naïve Bayes+ model outperforms the other methods. Points indicate the scores at the cutoff that maximizes the macro-F1 score. Black error bars showing 95% confidence intervals at these points are indiscernible in most cases, indicating highly robust performance across cross-validation folds. b) The macro-averaged F1 scores behave differently with score/probability cutoff for scoring matrix-based models versus Naïve Bayes. PSSM and PFM-based models require a strictly defined cutoff. Naïve Bayes+ again outperforms the others and retains the same flat relationship with cutoff as basic Naïve Bayes. Points indicate the maximum value. Bands indicate the 95% confidence interval. Color assignments are the same as in (a). c) Example score distributions for a S/T kinase (AKT1) and a Y kinase (FYN) from one round of cross-validation. For S/T kinases, Naïve Bayes probabilities are largely distributed close to 0.0 and 1.0 while PSSM scores take more intermediate values, notably including scores for Y sites. Y kinases show better separation for both methods. d) Left: Logistic curves relating phosphoproteome-backed PSSM scores to Naïve Bayes probabilities. Each curve represents a fitted logistic function for each kinase. The color of the curve represents the number of kinase substrates used to fit each specificity model. Right: The fitted logistic curve parameters versus number of substrates. S/T and Y kinases have negative relationships between inflection point and numbers of substrates. e) Min-max normalization of PSSM scores does not produce a stable inflection point independent of the number of substrates.

https://doi.org/10.1371/journal.pcbi.1010110.g002

The shapes of the precision-recall curves (Fig 2a) may appear strange compared to standard single-label curves. They can be explained first by the strong imbalance between positive and negative cases per kinase, producing very low precision at low cutoffs (i.e. at high recall). Second, at more stringent cutoffs (low recall) a growing subset of kinases is no longer assigned to any sites, in which case precision is undefined and set to 0 for each kinase so affected. Thus, the “hump” in the macro-averaged precision-recall curves (Fig 2a) represents the point at which further gains in precision from more stringent cutoffs are offset by not only lower average recall but also a reduction in effective kinome coverage.

Turning to the macro-averaged F1 score (macro-F1), which assesses the balance between precision and recall, we see different relationships between macro-F1 and cutoff (Fig 2b). Notably, macro-F1 scores of scoring-matrix methods peak at specific cut-offs before dropping precipitously, whereas Naïve Bayes gives a consistent score across much of the cutoff range. This can be explained by observing the distributions of scores or probabilities produced by the different methods (Fig 2c). While the scoring-matrix methods produce many intermediate scores (particularly for S/T kinases), Naïve Bayes mostly assigns probabilities close to 0 or 1 (Fig 2c). Moreover, the Naïve Bayes model better rejects inappropriate phosphoacceptors (e.g. Y phosphoacceptors for S/T kinases) in this way (Fig 2c). Thus, the PFM and PSSM-based models are particularly sensitive to the chosen score cutoff.

Naïve Bayes has the convenient property of a well-defined probability cutoff for assignment for all kinases (P > 0.5; see Supplemental Methods in S1 Text). No such definition exists for the PSSM model. Given the PSSM models’ sensitivity to cutoff selection, I sought to determine a robust selection method. As illustrated in Fig 2c, there is a sigmoidal relationship between phosphoproteome-backed PSSM scores and Naïve Bayes posterior probabilities, which could possibly be used to produce a PSSM score cutoff analogous to 0.5 posterior probability. Indeed, the phosphoproteome-backed PSSM score can be approximated by a logit transformation of the Naïve Bayes posterior probability (see Supplemental Methods in S1 Text). This approximation includes a dependency on the number of substrates used to fit a model, such that the 0.5-equivalent cutoff decreases with increasing number of substrates. In other words, because of varying training-set sizes, the log-likelihood PSSM score at which the foreground evidence effectively outweighs the background evidence also varies, whereas this is stabilized through normalization against total probability in Naïve Bayes.

To verify this, for each kinase I fit a logistic curve to its PSSM scores versus Naïve Bayes posterior probabilities calculated on a high-confidence set of human phosphosites [38] (Fig 2d). The fitted inflection point provides the PSSM score that is equivalent to a posterior probability of 0.5. As predicted, I observed a strong dependence between this inflection point and the number of substrates used to fit the models, which differed with kinase type. I then checked whether min-max normalization of the PSSM scores remedies the problem (Fig 2e). For S/T kinases, the inflection points were generally around 0.75, which is close to the observed macro-F1-maximizing cutoff of 0.803 (Fig 2b). However, both kinase types still had a decreasing inflection point with increasing number of substrates.

Together, these results suggest first that a raw PFM (as used in, e.g., ref. [22]) is the weakest model. PSSMs perform best with a position-specific phosphoproteome background, but they require kinase-specific cutoffs, even after normalization, for multi-label assignment. The Naïve Bayes model offers good performance and a stable and universal cutoff. It is thus a better foundation for incorporating other features.

Physical-interaction and structural features improve in vitro Naïve Bayes predictive performance

Kinases can be optimized to bind with their substrates by physical interactions outside the enzymatic kinase domain’s active site, and structural features of the substrate might further impact these interactions. These would remain relevant in the in vitro experimental environment and thus may be exploited to improve the model. Given the Naïve Bayes assumption of feature independence, it is trivial to incorporate other features. Thus, I constructed a second Naïve Bayes model (“Naïve Bayes+”) with added features based on: proteins physically interacting, directly or indirectly, with the kinase; proteins carrying a domain that is enriched among the kinase’s substrates or interactors; and the phosphosite being within a protein domain. These features were modeled as Bernoulli-distributed, and each showed an overall difference across the kinome in empirical probability among substrates versus non-substrates (Fig 3). As with calculating residue frequencies in substrate sequences, these empirical probability estimations may be impacted by stochastic detection of low-abundance substrates by LC-MS/MS, particularly for kinases with few substrates.

Fig 3. Additional predictive features of in vitro kinase substrates can be discriminated in the training data.

Distributions of empirical probabilities are represented by the kernel density estimate of the kinases’ respective Bernoulli probability parameters. Each feature trends towards higher probability in in vitro substrates than in other sites.

https://doi.org/10.1371/journal.pcbi.1010110.g003

In cross-validation, the Naïve Bayes+ model produces higher macro-averaged precision and recall than the other methods (Fig 2a). Its F1 score is similarly higher, while exhibiting the same flat behavior as the sequence-only Naïve Bayes model (Fig 2b). Given this improvement in performance, I carried the Naïve Bayes+ model forward for nesting into the in vivo model.

IV-KAPhE produces accurate multi-label assignment of in vivo kinases to phosphosites

I constructed an in vivo kinase-substrate assignment method using the Random Forest model on five features for protein-protein functional association (Fig 1): the kinase-specific Naïve Bayes+ posterior probability for the site; the semantic similarities between the Gene Ontology (GO) “biological process” (BP) and “cellular component” (CC) annotations of kinase and substrate; and coexpression and high-throughput experimental scores between kinase and substrate from the STRING database [46, 47]. Other features from STRING were discarded either by feature importance analysis (gene fusion, genome co-occurrence, conserved neighborhood) or to avoid potential circularity with the training set (database imports, text-mining, combined score). I added three further features to address possible discrepancies in performance (Fig 1): whether or not the substrate is itself a kinase, necessary because the functional association scores are symmetrical and can therefore produce false positives if the enzymatic roles are, in fact, reversed; whether the kinase is a S/T or Y kinase, to account for kinase-type differences in Naïve Bayes+ probability distributions; and the phosphoacceptor type (S/T or Y), for similar reasons. Because there are relatively few features and none of these require any special consideration, other machine learning models could be substituted for in vivo prediction. Here, I chose Random Forest for its simplicity and for its facility in the analysis of feature importance.

For brevity, I will call this model “IV-KAPhE” (“In Vivo-Kinase Assignment for Phosphorylation Evidence”). For training IV-KAPhE, I took advantage of the fact that all kinase-specific substrate specificity information was used when training the underlying Naïve Bayes+ model. I thereby trained IV-KAPhE as a monolithic, non-kinase-specific model on the 7322 human in vivo kinase-substrate relationships annotated in the PhosphoSitePlus database [34] and on an equally sized set of random assignments of kinases to human phosphorylation sites [34, 38] as negative cases. During training, feature importance analysis revealed the Naïve Bayes+ posterior probability, the STRING “experimental” score, and GO BP semantic similarity to be the most important (Fig 4a). The features accounting for substrate kinases, S/T versus Y kinases, and S/T versus Y sites carried low importance but their omission worsened performance, particularly for Y kinases.

Fig 4. IV-KAPhE performs well in cross-validation and significantly outperforms previously published methods on an external validation set.

a) Naïve Bayes+ posterior probability, GO BP semantic similarity, and STRING experimental score had the greatest importance when training the Random Forest models. Error bars show standard error across cross-validation runs. b) Predictive performance of individual quantitative features, as assessed by average macro-precision and macro-recall across 10 folds of the training data, reveals GO BP semantic similarity and STRING experimental score as being the most predictive individual features. c) Cross-validation evaluated via macro-averaged precision, recall and F1 all reflect strong performance by IV-KAPhE. d) IV-KAPhE’s coverage of the external test data set is similar to LinkPhinder’s but is lower than that of GPS 5.0. e) Kinase-specific F1 scores reveal IV-KAPhE’s consistently strong performance across most kinases, with similar performance for S/T and Y kinases, compared to other methods. f) IV-KAPhE outperforms the simpler PSSM-based and Naïve Bayes+ methods as well as other previously published methods in kinase-substrate assignment of an external validation set. Points indicate the scores for simple assignments (GPS) or the scores at nominal cutoffs for quantitative predictions (cutoffs—IV-KAPhE: 0.5, PSSM: 0.75, Naïve Bayes+: 0.5, LinkPhinder: 0.672 [27], NetworKIN 3.0: 1.0 [40]). Error bars show the 95% confidence intervals at these points. g) IV-KAPhE has a higher macro-averaged F1 score than the other methods. Points and color assignments are as in (e). Bands indicate the 95% confidence interval. h) IV-KAPhE similarly outperforms the other methods in Receiver Operating Characteristic (ROC) curve analysis for this balanced test set. Points and color assignments are as in (e). Error bars show 95% confidence intervals. i) Focusing on multi-label assignment for sites in the test set with known kinases, the macro-averaged false discovery rate (FDR; i.e. rate of novel assignments) dominates the average true positive rate (TPR). 
The curves are similar for most methods. At its nominal cutoff, IV-KAPhE has the second-highest FDR, but it is matched by the highest TPR.

https://doi.org/10.1371/journal.pcbi.1010110.g004

While IV-KAPhE was built as a multi-label method, the training set does not represent a full, all-versus-all compendium of kinase-phosphosite assignments among the represented kinases and sites. This prevents full evaluation of the method in a true multi-label setting. I nevertheless evaluated it and the individual features using multi-label metrics as described above. Assessing the quantitative features’ individual predictive performances via macro-precision and macro-recall on 10 folds of the training data, GO BP semantic similarity and the STRING “experimental” score showed the highest performance (Fig 4b). Overall, the full model showed strong performance in cross-validation, reaching an average precision of 0.713 and recall of 0.679 at its nominal probability cutoff of 0.5 and outperforming the individual features (Fig 4c). The macro-F1 score at this cutoff, 0.679, was near-maximal (Fig 4c). Relaxing the cutoff slightly would improve recall, and thus F1, without significant loss of precision. To verify that the choice of machine learning model does not strongly impact performance, I repeated the cross-validation analysis, substituting a Support Vector Machine for the Random Forest model. Performance was overall quite similar (macro-precision: 0.676; macro-recall: 0.620; macro-F1: 0.626), confirming that the choice of in vivo model is indeed flexible.
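The macro-averaged metrics used here are computed per kinase and then averaged with equal weight for each kinase. A minimal sketch follows, assuming predictions are supplied as (kinase, truth, prediction) triples and that an undefined precision or recall is scored as 0; both the input format and that convention are assumptions made for this example.

```python
from collections import defaultdict


def macro_metrics(records):
    """Return macro-averaged (precision, recall, F1) over kinases.

    records: iterable of (kinase, is_true_relationship, is_assigned) triples.
    """
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for kinase, truth, assigned in records:
        tally = counts[kinase]
        if truth and assigned:
            tally["tp"] += 1
        elif assigned:
            tally["fp"] += 1
        elif truth:
            tally["fn"] += 1
    precisions, recalls, f1_scores = [], [], []
    for tally in counts.values():
        tp, fp, fn = tally["tp"], tally["fp"], tally["fn"]
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1_scores.append(f1)
    n = len(counts)
    return sum(precisions) / n, sum(recalls) / n, sum(f1_scores) / n


records = [
    ("K1", True, True), ("K1", True, False), ("K1", False, True),
    ("K2", True, True), ("K2", False, False),
]
macro_precision, macro_recall, macro_f1 = macro_metrics(records)
```

Note that the macro-F1 is the mean of per-kinase F1 scores, so it need not equal the harmonic mean of the macro-precision and macro-recall.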

I next evaluated IV-KAPhE’s performance on an external data set. To achieve this, I collected kinase-substrate relationships identified using the ProtMapper method from databases and text-mining [39]. I omitted any relationships present in the PhosphoSitePlus database, from in vivo or in vitro experiments, to avoid validating on sites used in training IV-KAPhE or other methods. I then matched these 6199 previously unseen relationships with an equal number of random kinase-site relationships.

I compared IV-KAPhE to the in vitro Naïve Bayes+ and phosphoproteome-backed PSSM models, as well as to other previously published methods with similar or better kinome coverage (Fig C in S1 Text). To the best of my knowledge, the methods that meet that criterion are GPS 5.0 [17] and LinkPhinder [27], both of which have greater coverage than IV-KAPhE. I also compared it to NetworKIN 3.0 [12, 40], which has much lower coverage but shares an in vitro/in vivo hierarchical structure similar to that of IV-KAPhE. It must be stressed that none of these previously published methods was explicitly developed for multi-label assignment. Nevertheless, in order to compare their performance to IV-KAPhE, they are evaluated here under a multi-label paradigm. Assignments from GPS were as selected by the software’s default, “medium”-stringency behavior, because each kinase requires a different score cutoff. For LinkPhinder and NetworKIN, I evaluated performance over a range of cutoffs, with a focus on LinkPhinder’s published high-stringency cutoff of 0.672 [27] and NetworKIN’s nominal likelihood-ratio cutoff of 1.0 [40]. Each method covered a different, incomplete subset of the kinases in the test set (Fig 4d), with GPS 5.0 having the greatest coverage and NetworKIN having significantly smaller coverage than the other methods. In order not to penalize models for coverage, each one was evaluated only on the subset of kinases that it could assign.

Like the training set, this test set is likely incomplete and cannot be fully evaluated in an all-versus-all multi-label sense. Therefore, only those relationships explicitly annotated in the test set were evaluated. I first looked at per-kinase F1 score performance, which underlies the macro-averaged metric (Fig 4e). From this view, it is clear that IV-KAPhE produces largely consistent, high F1 performance across both S/T and Y kinases compared to the other methods. NetworKIN, GPS, and LinkPhinder all exhibit highly varied performance, with GPS notably showing weaker performance for Y kinases. PSSMs and Naïve Bayes+ likewise show varied performance and weak Y-kinase performance. Note that kinases with few test-case sites tend to cluster near 0 and 1 due to lack of resolution in calculating precision and recall.

In macro-averaged precision and recall, LinkPhinder and GPS 5.0 performed only as well as the simpler, in vitro PSSM and Naïve Bayes+ models (Fig 4f). NetworKIN and IV-KAPhE together showed the best precision, but IV-KAPhE provided it with superior recall and averaged across a much larger portion of the kinome (Fig 4f). Returning to the macro-F1 score, we similarly see that IV-KAPhE better balances precision and recall than the other methods (Fig 4g). As this test set is balanced for each kinase between positive and negative cases, a ROC analysis is feasible (Fig 4h). Comparing the macro-averaged false-positive and true-positive rates provides further evidence that IV-KAPhE (AUC = 0.833) outperforms the other methods (AUC: LinkPhinder = 0.705; NetworKIN 3.0 = 0.673; Naïve Bayes+ = 0.599; PSSM = 0.572; as a range of cutoffs was not tested for GPS 5.0, no AUC could be calculated).
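To make the ROC comparison concrete, the area under a ROC curve can be computed from scores alone as the probability that a positive case outranks a negative one. The sketch below averages per-kinase AUCs, which is a simplification assumed here; the analysis above instead macro-averages the false-positive and true-positive rates across kinases before computing the area, so the two quantities are related but not identical.

```python
def roc_auc(positive_scores, negative_scores):
    """Rank-based AUC: fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    wins = 0.0
    for pos in positive_scores:
        for neg in negative_scores:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


def mean_auc_over_kinases(scores_by_kinase):
    """Average the per-kinase AUCs, skipping kinases that lack either class."""
    aucs = [
        roc_auc(pos, neg)
        for pos, neg in scores_by_kinase.values()
        if pos and neg
    ]
    return sum(aucs) / len(aucs)


scores_by_kinase = {
    "K1": ([0.9, 0.8], [0.1, 0.85]),
    "K2": ([0.6], [0.6]),
}
mean_auc = mean_auc_over_kinases(scores_by_kinase)
```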

I next evaluated how many novel assignments are generated by the methods when performing all-versus-all kinase-site assignment. Considering only the kinases assignable from the test set and only the test-set sites for which at least one true kinase had been assigned, I compared the macro-averaged false discovery rate (FDR) and true positive rate (TPR) across the different methods. Here “false discovery rate” is a misnomer, as we do not know whether these assignments are true or false. For all models, FDR dominates the TPR: while the fraction of known cases correctly assigned may be high, the fraction of assignments that are novel is much higher (Fig 4i). IV-KAPhE’s high FDR is nevertheless matched by a higher TPR than the other models. Thus, although IV-KAPhE also produces many novel assignments, we have a greater expectation of precision in its predictions.
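The two rates compared in this all-versus-all setting can be sketched for a single kinase as follows; macro-averaging then takes the mean of each rate over kinases. The dictionary-based input format and the convention of returning 0 for an undefined rate are assumptions made for this example.

```python
def novel_rate_and_tpr(known, assigned, kinase):
    """Return (FDR, TPR) for one kinase over sites with at least one known kinase.

    known, assigned: dicts mapping a site to the set of kinases annotated to it
    or assigned to it. "FDR" here is the fraction of the kinase's assignments
    that are novel, i.e. not annotated; they are not known to be false.
    """
    tp = fp = fn = 0
    for site, true_kinases in known.items():
        if not true_kinases:
            continue  # sites without any known kinase are excluded
        is_known = kinase in true_kinases
        is_assigned = kinase in assigned.get(site, set())
        if is_known and is_assigned:
            tp += 1
        elif is_assigned:
            fp += 1
        elif is_known:
            fn += 1
    fdr = fp / (tp + fp) if tp + fp else 0.0
    tpr = tp / (tp + fn) if tp + fn else 0.0
    return fdr, tpr


known = {"s1": {"A"}, "s2": {"B"}, "s3": {"A"}, "s4": set()}
assigned = {"s1": {"A", "B"}, "s2": {"A", "B"}, "s3": set(), "s4": {"A"}}
rates = {kinase: novel_rate_and_tpr(known, assigned, kinase) for kinase in ("A", "B")}
```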

Computational assignments hypothesize widespread signaling cross-talk and redundancy

The topology of the human phosphorylation network is largely unresolved and biased towards commonly studied kinases [29, 34]. One open question is to what degree sites are phosphorylated by few kinases, as illustrated in canonical signaling pathway maps, versus multiple kinases through signaling noise, redundancy, or cross-talk between pathways. By applying a near kinome-scale kinase-site assignment model to a set of phosphosites representative of the phosphoproteome, we can produce hypotheses for such questions. Accordingly, I used Naïve Bayes+ and IV-KAPhE to generate all-versus-all kinase-substrate assignments for a set of 271432 unique human phosphosites (P > 0.5 assignments are archived at doi:10.5281/zenodo.6325198), derived from the union of the entire PhosphoSitePlus human phosphosite set [34] and a high-confidence human phosphoproteome [38]. I compared these to literature-derived assignments from PhosphoSitePlus and to LinkPhinder’s assignments for human phosphosites in PhosphoSitePlus (high-stringency cutoff).

Most sites with a causal kinase annotated in PhosphoSitePlus have a single kinase annotated to them, and very few have more than four annotated kinases (Fig 5a). Thus, from the literature, we would suppose that multiple kinases rarely phosphorylate the same site. However, this resource contains little data for understudied kinases such as isozymes of more commonly studied ones, which tend to have similar sequence specificities [29, 48].

Fig 5. Phosphoproteome-wide kinase assignments suggest more widespread multi-kinase phosphorylation than existing literature annotations do.

a) Histograms of the number of kinases associated with sites in the phosphoproteome reveal different views of the phosphorylation network. Literature annotations in PhosphoSitePlus suggest most sites are regulated by one or two kinases. In vitro Naïve Bayes+ predicts some S/T sites are “hubs” and all Y sites can be phosphorylated by most Y kinases. LinkPhinder and IV-KAPhE, in contrast, predict a long tail of hub sites. b) Histograms of the median number of kinases assigned per site for all proteins likewise show different predictions for hub proteins. Literature annotations suggest most proteins are phosphorylated by one kinase at each site. The computational methods all hypothesize multiple kinases per site, with some substrate proteins being very promiscuous at all their sites.

https://doi.org/10.1371/journal.pcbi.1010110.g005

In contrast, Naïve Bayes+ predicts that most S/T sites are phosphorylated by multiple kinases, but generally fewer than 10 (Fig 5a). However, it also predicts that a subset of S/T sites can be phosphorylated by around 20 different kinases. Conversely, the method assigns most tyrosine kinases to each Y site, pointing to a clear technical shortcoming of the in vitro model: cellular context plays the major role in determining Y kinase specificity. LinkPhinder and IV-KAPhE, on the other hand, both produce long-tailed distributions, with some sites having over 100 kinases assigned to them (Fig 5a). Reassuringly, the multi-modal distribution of Naïve Bayes+ is smoothed out by the biological context incorporated into IV-KAPhE, and it assigns fewer kinases to Y sites than Naïve Bayes+.

We can similarly ask whether specific proteins act as signaling hubs between pathways, with multiple kinases phosphorylating each of their phosphosites. To answer this, I compared the median number of kinases assigned to sites for each substrate protein (Fig 5b). Again, literature-based assignments in PhosphoSitePlus suggest that most proteins are phosphorylated by a single kinase at each site. All three computational methods, on the other hand, propose the hypothesis that many proteins can be phosphorylated by multiple kinases at each of their sites (Fig 5b), i.e. that functional hubs on the protein-protein functional association network encounter many kinases, each with potentially similar, degenerate sequence specificity to the others.
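The two summaries behind Fig 5, the number of kinases per site and the median of that number across each protein's sites, amount to simple grouping operations. A minimal sketch, assuming assignments arrive as (protein, site, kinase) triples; the input format is an assumption, and sites with no assigned kinase are simply absent from the counts.

```python
from collections import defaultdict
from statistics import median


def kinases_per_site(assignments):
    """Count distinct kinases assigned to each (protein, site)."""
    site_kinases = defaultdict(set)
    for protein, site, kinase in assignments:
        site_kinases[(protein, site)].add(kinase)
    return {key: len(kinases) for key, kinases in site_kinases.items()}


def median_kinases_per_protein(site_counts):
    """Median number of assigned kinases across each protein's sites."""
    by_protein = defaultdict(list)
    for (protein, _site), count in site_counts.items():
        by_protein[protein].append(count)
    return {protein: median(counts) for protein, counts in by_protein.items()}


assignments = [
    ("P1", "S10", "K1"), ("P1", "S10", "K2"), ("P1", "S10", "K1"),
    ("P1", "T20", "K1"),
    ("P2", "Y5", "K3"),
]
site_counts = kinases_per_site(assignments)
protein_medians = median_kinases_per_protein(site_counts)
```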

One possible technical explanation of this for IV-KAPhE is that hub proteins’ strong functional association scores may override low Naïve Bayes+ probabilities via some branches in the Random Forest model. This could arise from low, false-negative Naïve Bayes+ probabilities in the training set. As a result, IV-KAPhE would produce false positives for proteins occupying central positions in the network. For sites on such proteins, a post-hoc, stringent filter could be applied to select only assigned kinases with high Naïve Bayes+ scores.
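Such a post-hoc filter could look like the sketch below. The record keys, the cutoff of 0.9, and the way hub proteins are identified (here, a set supplied by the caller) are all hypothetical; the text does not specify any of them.

```python
def filter_hub_assignments(assignments, hub_proteins, nb_cutoff=0.9):
    """Keep assignments on hub proteins only if the Naive Bayes+ score is high.

    assignments: list of dicts with keys "protein", "site", "kinase" and
    "nb_posterior" (illustrative names). Assignments on non-hub proteins
    pass through unchanged.
    """
    return [
        record
        for record in assignments
        if record["protein"] not in hub_proteins
        or record["nb_posterior"] >= nb_cutoff
    ]


assignments = [
    {"protein": "HUB1", "site": "S10", "kinase": "K1", "nb_posterior": 0.95},
    {"protein": "HUB1", "site": "S10", "kinase": "K2", "nb_posterior": 0.20},
    {"protein": "P2", "site": "T7", "kinase": "K2", "nb_posterior": 0.20},
]
kept = filter_hub_assignments(assignments, hub_proteins={"HUB1"})
```

Restricting the filter to hub proteins leaves the rest of the network untouched, so in vivo evidence can still rescue sites with weak sequence scores elsewhere.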

IV-KAPhE predictions identify possible misannotations of kinase isozyme activity

The composition of the human kinome is the result of extensive duplication events and is thus defined by families of kinases with highly similar sequence specificities [48]. Among closely related kinases, frequently treated as isozymes (and as which I will refer to them for brevity), often only one or two receive significant research attention. This leads to an imbalance in kinase-substrate annotations among them [29]. Taking advantage of IV-KAPhE’s broad kinome coverage, which is less biased in composition than literature annotations, I investigated patterns of substrate assignment among closely related isozymes.

First, I considered ribosomal protein S6 kinase alpha (S6K-α) isozymes, among which isozymes 1 and 3 are the most commonly studied [29]. Using the full phosphoproteome assignments described in the previous section, I collected IV-KAPhE assignments of S6K-α isozymes for any site annotated in PhosphoSitePlus as being phosphorylated, in vitro or in vivo, by at least one of them (S1 Table; Fig 6a). The bias towards annotations for S6K-α-1 and -3 is plainly visible. There are a number of sites that IV-KAPhE predicts as being putative substrates of all six isozymes, pointing to multi-kinase phosphorylation, as well as some that are not well predicted for any isozyme.

Fig 6. IV-KAPhE can identify instances where literature-derived kinase-phosphosite assignments are probably assigned in error to a commonly studied isozyme, instead of a lesser-studied one.

a) IV-KAPhE assignments of ribosomal protein S6 kinase alpha isozymes for all sites annotated in PhosphoSitePlus as being phosphorylated by at least one of the isozymes. Red colors indicate assignments predicted as likely by IV-KAPhE. Highlighted sites, discussed in the text, are examples that IV-KAPhE predicts are more likely to be phosphorylated by a different isozyme than the one annotated. b) IV-KAPhE assignments of calcium/calmodulin-dependent protein kinase type II subunits to annotated sites, as described in (a).

https://doi.org/10.1371/journal.pcbi.1010110.g006

More interestingly, there are sites for which a different isozyme has a stronger probability of phosphorylating the substrate than the annotated isozyme(s), supporting the argument that multi-label assignment should be carried out for kinases themselves rather than for kinase families or higher levels of classification. For example, site S103 on serum response factor (SRF; Uniprot: P11831 / SRF_HUMAN) is annotated as a substrate of isozyme 5, whereas isozyme 1 has the strongest evidence (Fig 6a). This annotation was derived from an in vitro study, in which the isozyme used was not specified [49]. Two sites on Rho GTPase-activating protein 31 (ARHGAP31; Uniprot: Q2M1Z3 / RHG31_HUMAN), S1106 and S1178, are annotated as substrates of isozyme 1, whereas IV-KAPhE gives a stronger probability to isozyme 2 (Fig 6a). In this case, a S6K-α-1 gene construct was used to induce phosphorylation under controlled conditions, while siRNAs targeting isozymes 1 and 3 were used to validate endogenous phosphorylation [50]. Furthermore, site S1178 was not, in fact, tested, but rather merely identified as a putative site by S6K-α sequence-motif analysis [50]. Finally, site T929 on protein KIBRA (WWC1; Uniprot: Q8IX03 / KIBRA_HUMAN) is annotated to isozymes 1 and 3; however, IV-KAPhE assigns low probabilities to both of these, instead favoring isozyme 5 (Fig 6a). Here, the annotations are based on an in vitro analysis using recombinant isozymes 1 and 3 [51]. By incorporating in vivo information, IV-KAPhE proposes the more likely causal isozyme. I note that in many of these cases, IV-KAPhE assigns multiple kinases to the sites, albeit at varying degrees of probability, so the original assignments may be correct but incomplete.

I then carried out a similar analysis for calcium/calmodulin-dependent protein kinase type II (CaMK-II) subunits, which can form homo- or heteromultimeric holoenzymes, potentially complicating kinase assignment. As with S6K-α, some sites are assigned with similar probabilities to multiple subunits, while others point to possible misannotations (S2 Table; Fig 6b). For example, sites S252, S257, S282, and S285 on transcription factor C-ets-1 (ETS1; Uniprot: P14921 / ETS1_HUMAN) are all annotated to the most commonly studied subunit, α, whereas IV-KAPhE indicates that the evidence supports the least commonly studied subunit, β [29], as the causal protein kinase. In the associated study, phosphorylation was tested in vitro using the α subunit, while it was tested in vivo through expression of a β-γ construct [52]. Interestingly, while PhosphoSitePlus only features the in vitro assignment to the α subunit, the ProtMapper corpus includes an assignment to the β subunit. As another example, site S2808 on ryanodine receptor 2 (RYR2; Uniprot: Q92736 / RYR2_HUMAN) is annotated to subunit α, whereas IV-KAPhE most strongly assigns it to the δ subunit (Fig 6b). In this case, the subunit or subunits used in the original, in vitro experiment are not specified [53]. Finally, in a similar case, sites S165 and T154 on RING finger and CHY zinc finger domain-containing protein 1 (RCHY1; Uniprot: Q96PM5 / ZN363_HUMAN) are annotated to subunit α, whereas IV-KAPhE assigns a very low probability to this subunit, strongly preferring subunit δ. This annotation was derived from the in vitro use of a CaMK-II inhibitor (Autocamtide-2 Related Inhibitory Peptide II, for which subunit specificity has not been described) and a recombinant rat kinase, the subunit of which was not specified [54].
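The comparison underlying both isozyme analyses can be expressed as a simple rule: flag a site when an unannotated family member receives a higher probability than every annotated one. The sketch below uses made-up site labels and probabilities purely for illustration, and the optional `margin` parameter is a hypothetical addition for requiring a minimum gap.

```python
def flag_possible_misannotations(probabilities, annotated, margin=0.0):
    """Map each flagged site to the unannotated isozyme that scores highest.

    probabilities: dict site -> dict isozyme -> predicted probability.
    annotated: dict site -> set of isozymes annotated in the literature.
    """
    flagged = {}
    for site, probs in probabilities.items():
        annotated_probs = [probs[i] for i in annotated.get(site, set()) if i in probs]
        if not annotated_probs:
            continue  # nothing to compare against
        best_isozyme = max(probs, key=probs.get)
        if (
            best_isozyme not in annotated[site]
            and probs[best_isozyme] > max(annotated_probs) + margin
        ):
            flagged[site] = best_isozyme
    return flagged


# Illustrative values only; these are not results from the study.
probabilities = {
    "siteA": {"iso1": 0.9, "iso2": 0.4, "iso3": 0.3},
    "siteB": {"iso1": 0.8, "iso2": 0.7, "iso3": 0.2},
}
annotated = {"siteA": {"iso3"}, "siteB": {"iso1"}}
flagged = flag_possible_misannotations(probabilities, annotated)
```

A flagged site is a candidate for re-examination of the source literature rather than a confirmed misannotation, since the original assignment may be correct but incomplete.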

IV-KAPhE better enables high-coverage kinase activity analysis

A primary goal in performing a phosphoproteomic experiment is to deduce the signaling events that generated the data. Given quantitative phosphoproteomic measurements and a list of known substrates, the relative activity of a kinase can be inferred by a variety of methods, such as a Z-test, gene set enrichment analysis, or multiple linear regression models [55, 56]. Typically, literature-derived kinase-substrate relationships are used, because false-positive substrate assignments from in silico methods introduce large variance into the pool of measurements of the substrates [56]. However, doing so inherently limits the numbers of kinases for which relative activity can be inferred.

To test whether IV-KAPhE permits more accurate inference of kinase activity over past kinase-substrate prediction or assignment methods, I performed a kinase-activity analysis on a quantitative phosphoproteomics data set in which 20 different protein kinase inhibitors were applied to MCF7 cells [41]. For each condition, we expect the protein kinases targeted by the chemical inhibitor (Table A in S1 Text) to exhibit decreased activity. Furthermore, any protein kinases that are enzymatically activated by a target kinase should also exhibit decreased activity, while kinases whose activity is negatively regulated by a target kinase should show increased activity. These secondary expectations are tempered by the possibility of compensatory regulatory activity by other protein kinases. To account for these secondary inhibition effects, I identified all kinases that are likely to be regulated by each target kinase by integrating a literature-derived, signed kinase regulatory network from the Omnipath service [42] with a computationally predicted, signed kinase-kinase regulatory network [28].

For each putatively affected kinase under each condition, I calculated Z-score-based kinase activity scores [56], based on kinase-phosphosite assignments from each of the in silico methods assessed above (Fig 7a). Similarly, I calculated kinase activities using in vivo literature-derived annotations from PhosphoSitePlus, which provide the gold standard in kinase activity inference [56] (Fig 7a). Surprisingly, PhosphoSitePlus annotations enabled very few kinases to be inferred as having altered activity, although these were generally targets of inhibition. LinkPhinder similarly enabled few inferences of altered activity. NetworKIN 3.0, GPS 5.0, and IV-KAPhE all produced substantially more inferences; however, these also included inferences of unexpected activities (e.g. up-regulation of a target kinase). Such errors could be attributable to, for example, false-positive substrate assignments, false-positive regulatory relationships, technical noise in phosphosite quantification, or compensatory phosphorylation of true-positive targets by other kinases.
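A Z-test-based activity score of the kind used here can be sketched as below. The exact formulation is an assumption: this version compares the mean log fold change of a kinase's assigned substrates with the mean of all quantified sites, uses the population standard deviation and a two-sided normal p-value, and reports the signed negative log10 p-value. The cited method [56] may differ in these details.

```python
import math
import sys
from statistics import mean, pstdev


def kinase_activity(substrate_values, all_values):
    """Signed -log10(p) of a Z-test on a kinase's substrate measurements.

    substrate_values: log fold changes of sites assigned to the kinase.
    all_values: log fold changes of every quantified site in the condition.
    """
    m = len(substrate_values)
    z = (mean(substrate_values) - mean(all_values)) * math.sqrt(m) / pstdev(all_values)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    p_value = max(p_value, sys.float_info.min)  # avoid log10(0) for extreme z
    return math.copysign(-math.log10(p_value), z)


all_values = [-2.0, -1.0, 0.0, 1.0, 2.0]
down_score = kinase_activity([-2.0, -2.0], all_values)
up_score = kinase_activity([2.0, 2.0], all_values)
```

Under this convention a negative score indicates substrates that are less phosphorylated than the background, as expected for an inhibited kinase, and the magnitude grows with both the shift and the number of assigned substrates.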

Fig 7. IV-KAPhE enables more robust and consistent kinase-activity inference.

a) Target kinases and their downstream substrate kinases in a multi-inhibitor quantitative phosphoproteomics experiment [41] are expected to have altered enzymatic activities. Assignments derived from NetworKIN 3.0, GPS 5.0, and IV-KAPhE make stronger inferences than those from LinkPhinder or in vivo literature-derived annotations from PhosphoSitePlus; however, these methods also erroneously predict increased activity in some target kinases and downstream substrate kinases that they enzymatically activate. Each column represents a different kinase inhibitor condition (see Table A in S1 Text), in which green dots are direct targets of the inhibitor, orange triangles are kinases that are enzymatically activated by a target kinase, and violet squares are kinases that are enzymatically inhibited by a target kinase. Gray, dashed lines indicate activity levels of -2.5 and 2.5, corresponding to Z-test p-values of 10^−2.5. b) IV-KAPhE provides more consistent inference of negative activity in target kinases and substrate kinases that they enzymatically activate than other computational methods as well as in vivo literature-derived annotations from PhosphoSitePlus. Point colors and shapes, as well as the gray dashed lines, are as described for panel (a).

https://doi.org/10.1371/journal.pcbi.1010110.g007

Focusing on kinases that are expected to be down-regulated, ideal activity inferences would be strictly negative. By pooling these kinases from all of the conditions (noting that some kinases may appear multiple times) and observing the distribution of inferred activities, we can compare the general performance of each of the methods (Fig 7b). In pairwise comparisons, IV-KAPhE infers significantly lower activities than all of the other methods (one-sided Wilcoxon signed-rank test on matched kinase pairs, with Benjamini-Hochberg p-value correction for false discovery rate; IV-KAPhE vs. NetworKIN 3.0: n = 175, V = 4768, p = 6.3 × 10^−6; vs. GPS 5.0: n = 262, V = 12272, p = 2.7 × 10^−5; vs. LinkPhinder: n = 273, V = 13474, p = 3.1 × 10^−5; vs. PhosphoSitePlus: n = 122, V = 1717, p = 1.0 × 10^−7). The mean difference in inferred activities for matched pairs between IV-KAPhE and the other assignment sources ranged from -0.614 (LinkPhinder) to -0.943 (PhosphoSitePlus).
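The statistical procedure named above, a one-sided Wilcoxon signed-rank test on matched pairs followed by Benjamini-Hochberg correction, can be sketched with the standard library alone. This version uses the normal approximation without tie or continuity corrections, which is a simplification assumed here; a statistics package would normally be preferred.

```python
import math


def wilcoxon_less(x, y):
    """One-sided signed-rank test that x tends to be lower than y.

    Returns (V, n, p): V is the sum of ranks of positive differences x - y,
    n the number of non-zero differences, p a normal-approximation p-value.
    """
    diffs = sorted((a - b for a, b in zip(x, y) if a != b), key=abs)
    n = len(diffs)
    v = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and abs(diffs[j]) == abs(diffs[i]):
            j += 1
        average_rank = (i + 1 + j) / 2  # mean of ranks i+1 .. j for tied values
        v += average_rank * sum(1 for d in diffs[i:j] if d > 0)
        i = j
    expected = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (v - expected) / sd
    return v, n, 0.5 * math.erfc(-z / math.sqrt(2))


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda k: p_values[k])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


v_stat, n_pairs, p_less = wilcoxon_less([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
adjusted = benjamini_hochberg([0.01, 0.04, 0.03])
```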

The kinase activity score used here is based on the log10-transformed p-value of the Z-test. These results indicate that the scores derived from IV-KAPhE assignments correspond to p-values that are, on average, up to an order of magnitude smaller than those produced from the other assignment sources. Taking appropriate precautions concerning the possibilities of false-positive assignments, then, the use of IV-KAPhE kinase-substrate assignments can provide more confident activity inferences, on average, than even in vivo literature-derived annotations and it does so with larger kinome coverage than literature-derived sources.
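The "order of magnitude" reading follows directly from the log scale. As a worked example, assuming both scores in a matched pair are negative so that each score equals log10 of its p-value, a score difference of d corresponds to a p-value ratio of 10^d:

```python
def p_value_ratio(score_difference):
    """Ratio of p-values implied by a difference in log10-scale activity scores."""
    return 10 ** score_difference


# Mean matched-pair differences reported in the text.
ratio_linkphinder = p_value_ratio(-0.614)      # roughly 0.24
ratio_phosphositeplus = p_value_ratio(-0.943)  # roughly 0.11
```

A ratio near 0.11 means p-values about nine times smaller, which is the sense in which the improvement approaches one order of magnitude.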

Discussion

With the widespread availability of phosphoproteomics, methods are needed for confidently assigning protein kinases to observed phosphosites, accounting for the possibility of multiple causal kinases. Although many kinase-substrate prediction or assignment methods have been produced in the past, to the best of my knowledge no method has been specifically developed for multi-label assignment of kinases to phosphosites to meet this need. Indeed, evaluating past, performant methods in a multi-label setting, for which they were not designed, herein revealed a tendency for low average performance across their full sets of covered kinases. By being built around hypothesis-free phosphoproteomic data and avoiding, where possible, functional annotations biased towards commonly studied kinases [29], IV-KAPhE exhibits stronger average performance across the kinome, making it more suitable than past methods for the modern task of multi-label assignment of kinases to phosphoproteomic data.

Kinase-substrate annotations are not available for most species to the same level as for humans or model species. Thus, kinase-substrate assignment methods that depend on such annotations cannot be applied to those species. The IV-KAPhE method presented here requires a high-throughput, phosphoproteomic kinase-substrate assay; high-throughput physical interaction data; Gene Ontology annotations, which often have good coverage by orthology; and STRING scores, which are available for many species. This makes it a suitable method for kinase-site assignment in non-model species. While the first two requirements are, indeed, non-trivial and expensive, they are less onerous and time-consuming than low-throughput assays of individual kinase-substrates at a kinomic scale. Furthermore, the generality of the training data for the in vivo part of IV-KAPhE means that it may be possible to use kinase-substrate annotations from humans or model species as functional-association training data for non-model species, a possibility that remains to be explored.

Even stringently assessed, the human phosphoproteome consists of over 100,000 different sites on at least 12,000 different proteins [38], of which only a fraction have literature-derived kinase assignments. These assignments are furthermore biased towards well-studied protein kinases [29]. Thus the functional roles of many human protein kinases, including closely related isozymes of more commonly studied kinases, remain unknown. Applying IV-KAPhE predictions revealed the perils of these biases. On one hand, researchers use commonly studied isozymes for in vitro or artificial in vivo analysis, whereas the endogenous causal kinase may be a different isozyme. On the other hand, when isozymes are not adequately specified in literature, annotators may inappropriately default to assigning the most commonly studied isozyme to a site. As a result, the network mapping kinases to each human phosphosite remains not only largely unresolved, but its topology cannot be accurately extrapolated from existing literature-derived annotations.

Due to poor sequence conservation, many phosphosites are expected to be “off-target” and without function [38, 57, 58]. Nevertheless, the methods presented here confidently assign multiple kinases to most sites. They collectively posit, based on the best available data, that this off-target noise is not due to low-probability events on non-optimized substrate sequences, as suggested by signaling dynamics [59]. They thereby propose that the kinase-substrate network is densely connected. Incorporating further biochemical constraints into future models may reduce this apparent density and reject the computational hypothesis. Otherwise, the results will equally suggest that on-target, functional phosphorylation also can generally be catalyzed by multiple kinases, raising the question of how kinase functional specialization is maintained across the human kinome.

Supporting Information

S1 Table. IV-KAPhE assignments of S6K-α isoforms for any site annotated in PhosphoSitePlus as being phosphorylated by at least one of these isoforms in vivo or in vitro.

https://doi.org/10.1371/journal.pcbi.1010110.s001

(XLSX)

S2 Table. IV-KAPhE assignments of CaMK-II subunits for any site annotated in PhosphoSitePlus as being phosphorylated by at least one of these subunits in vivo or in vitro.

https://doi.org/10.1371/journal.pcbi.1010110.s002

(XLSX)

S1 Text. Supplemental methods, figures and tables.

S1 Text contains the following: Supplemental Methods: Additional details describing data sources, computational methods, and mathematical derivations of specificity models; Fig A: Using relative entropy as a PFM or PSSM column weight provides greater discrimination of columns than using information content; Fig B: Differing theoretical and empirical score ranges in scoring matrix-based models hinder the selection of a universal score cut-off; Fig C: A Venn diagram showing total coverage of the human kinome supported by four kinase-substrate prediction or assignment methods; and Table A: Protein kinases targeted by the chemical inhibitors used in the quantitative phosphoproteomic experiment of Wilkes et al. [41].

https://doi.org/10.1371/journal.pcbi.1010110.s003

(PDF)

Acknowledgments

The author would like to thank Pedro Beltrao for helpful comments on a draft of this manuscript.

References

  1. Ochoa D, Bradley D, Beltrao P. Evolution, Dynamics and Dysregulation of Kinase Signalling. Current Opinion in Structural Biology. 2018;48:133–140. pmid:29316484
  2. Bradley D, Viéitez C, Rajeeve V, Selkrig J, Cutillas PR, Beltrao P. Sequence and Structure-Based Analysis of Specificity Determinants in Eukaryotic Protein Kinases. Cell Rep. 2021;34(2):108602. pmid:33440154
  3. Yaffe MB, Leparc GG, Lai J, Obata T, Volinia S, Cantley LC. A Motif-Based Profile Scanning Approach for Genome-Wide Prediction of Signaling Pathways. Nature Biotechnology. 2001;19(4):348–353. pmid:11283593
  4. Obenauer JC, Cantley LC, Yaffe MB. Scansite 2.0: Proteome-Wide Prediction of Cell Signaling Interactions Using Short Sequence Motifs. Nucleic Acids Res. 2003;31(13):3635–3641. pmid:12824383
  5. Miller ML, Jensen LJ, Diella F, Jørgensen C, Tinti M, Li L, et al. Linear Motif Atlas for Phosphorylation-Dependent Signaling. Sci Signal. 2008;1(35):ra2–ra2. pmid:18765831
  6. Jung I, Matsuyama A, Yoshida M, Kim D. PostMod: Sequence Based Prediction of Kinase-Specific Phosphorylation Sites with Indirect Relationship. BMC Bioinformatics. 2010;11(Suppl 1):S10. pmid:20122181
  7. Safaei J, Maňuch J, Gupta A, Stacho L, Pelech S. Prediction of 492 Human Protein Kinase Substrate Specificities. Proteome Sci. 2011;9(Suppl 1):S6. pmid:22165948
  8. Wagih O, Reimand J, Bader GD. MIMP: Predicting the Impact of Mutations on Kinase-Substrate Phosphorylation. Nat Methods. 2015;12(6):531–533. pmid:25938373
  9. Krystkowiak I, Manguy J, Davey NE. PSSMSearch: A Server for Modeling, Visualization, Proteome-Wide Discovery and Annotation of Protein Motif Specificity Determinants. Nucleic Acids Res. 2018;46(W1):W235–W241. pmid:29873773
  10. Blom N, Gammeltoft S, Brunak S. Sequence and Structure-Based Prediction of Eukaryotic Protein Phosphorylation Sites. Journal of Molecular Biology. 1999;294(5):1351–1362. pmid:10600390
  11. Blom N, Sicheritz-Pontén T, Gupta R, Gammeltoft S, Brunak S. Prediction of Post-Translational Glycosylation and Phosphorylation of Proteins from the Amino Acid Sequence. PROTEOMICS. 2004;4(6):1633–1649. pmid:15174133
  12. Linding R, Jensen LJ, Ostheimer GJ, van Vugt MATM, Jørgensen C, Miron IM, et al. Systematic Discovery of In Vivo Phosphorylation Networks. Cell. 2007;129(7):1415–1426. pmid:17570479
  13. Kim JH, Lee J, Oh B, Kimm K, Koh I. Prediction of Phosphorylation Sites Using SVMs. Bioinformatics. 2004;20(17):3179–3184. pmid:15231530
  14. Dou Y, Yao B, Zhang C. PhosphoSVM: Prediction of Phosphorylation Sites by Integrating Various Protein Sequence Attributes with a Support Vector Machine. Amino Acids. 2014;46(6):1459–1469. pmid:24623121
  15. Zhou FF, Xue Y, Chen GL, Yao X. GPS: A Novel Group-Based Phosphorylation Predicting and Scoring Method. Biochemical and Biophysical Research Communications. 2004;325(4):1443–1448. pmid:15555589
  16. Xue Y, Liu Z, Cao J, Ma Q, Gao X, Wang Q, et al. GPS 2.1: Enhanced Prediction of Kinase-Specific Phosphorylation Sites with an Algorithm of Motif Length Selection. Protein Eng Des Sel. 2011;24(3):255–260. pmid:21062758
  17. Wang C, Xu H, Lin S, Deng W, Zhou J, Zhang Y, et al. GPS 5.0: An Update on the Prediction of Kinase-specific Phosphorylation Sites in Proteins. Genomics Proteomics Bioinformatics. 2020;18(1):72–80. pmid:32200042
  18. Brinkworth RI, Breinl RA, Kobe B. Structural Basis and Prediction of Substrate Specificity in Protein Serine/Threonine Kinases. PNAS. 2003;100(1):74–79. pmid:12502784
  19. Datta S, Mukhopadhyay S. A Grammar Inference Approach for Predicting Kinase Specific Phosphorylation Sites. PLoS One. 2015;10(4):e0122294. pmid:25886273
  20. von Stechow L, Francavilla C, Olsen JV. Recent Findings and Technological Advances in Phosphoproteomics for Cells and Tissues. Expert Rev Proteomics. 2015;12(5):469–487. pmid:26400465
  21. Zou L, Wang M, Shen Y, Liao J, Li A, Wang M. PKIS: Computational Identification of Protein Kinases for Experimentally Discovered Protein Phosphorylation Sites. BMC Bioinformatics. 2013;14:247. pmid:23941207
  22. Yang P, Humphrey SJ, James DE, Yang YH, Jothi R. Positive-Unlabeled Ensemble Learning for Kinase Substrate Prediction from Dynamic Phosphoproteomics Data. Bioinformatics. 2016;32(2):252–259. pmid:26395771
  23. 23. Wang M, Wang T, Li A. ksrMKL: A Novel Method for Identification of Kinase-Substrate Relationships Using Multiple Kernel Learning. PeerJ. 2017;5:e4182. pmid:29340231
  24. 24. Ayati M, Wiredja D, Schlatzer D, Maxwell S, Li M, Koyutürk M, et al. CoPhosK: A Method for Comprehensive Kinase Substrate Annotation Using Co-Phosphorylation Analysis. PLoS Comput Biol. 2019;15(2). pmid:30811403
  25. 25. Wagih O, Sugiyama N, Ishihama Y, Beltrao P. Uncovering Phosphorylation-Based Specificities through Functional Interaction Networks. Mol Cell Proteomics. 2016;15(1):236–245. pmid:26572964
  26. 26. Ma H, Li G, Su Z. KSP: An Integrated Method for Predicting Catalyzing Kinases of Phosphorylation Sites in Proteins. BMC Genomics. 2020;21(1):537. pmid:32753030
  27. 27. Nováček V, McGauran G, Matallanas D, Vallejo Blanco A, Conca P, Muñoz E, et al. Accurate Prediction of Kinase-Substrate Networks Using Knowledge Graphs. PLoS Comput Biol. 2020;16(12):e1007578. pmid:33270624
  28. 28. Invergo BM, Petursson B, Akhtar N, Bradley D, Giudice G, Hijazi M, et al. Prediction of Signed Protein Kinase Regulatory Circuits. Cell Syst. 2020;10(5):384–396.e9. pmid:32437683
  29. 29. Invergo BM, Beltrao P. Reconstructing Phosphorylation Signalling Networks from Quantitative Phosphoproteomic Data. Essays Biochem. 2018;62(4):525–534. pmid:30072490
  30. 30. Stambolic V, Woodgett JR. Functional Distinctions of Protein Kinase B/Akt Isoforms Defined by Their Influence on Cell Migration. Trends Cell Biol. 2006;16(9):461–466. pmid:16870447
  31. 31. Linnerth-Petrik NM, Santry LA, Petrik JJ, Wootton SK. Opposing Functions of Akt Isoforms in Lung Tumor Initiation and Progression. PLoS One. 2014;9(4):e94595. pmid:24722238
  32. 32. Hinz N, Jücker M. Distinct Functions of AKT Isoforms in Breast Cancer: A Comprehensive Review. Cell Commun Signal. 2019;17(1):154. pmid:31752925
  33. 33. Higgins CA, Nilsson-Payant BE, Kurland AP, Adhikary P, Golynker I, Danziger O, et al. SARS-CoV-2 Hijacks P38ß/MAPK11 to Promote Viral Protein Translation; 2021.
  34. 34. Hornbeck PV, Zhang B, Murray B, Kornhauser JM, Latham V, Skrzypek E. PhosphoSitePlus, 2014: Mutations, PTMs and Recalibrations. Nucleic Acids Research. 2015;43(Database–issue):D512–20. pmid:25514926
  35. 35. Henikoff S, Henikoff JG. Position-Based Sequence Weights. Journal of Molecular Biology. 1994;243(4):574–578. pmid:7966282
  36. 36. Henikoff JG, Henikoff S. Using Substitution Probabilities to Improve Position-Specific Scoring Matrices. Comput Appl Biosci. 1996;12(2):135–143. pmid:8744776
  37. 37. Wright MN, Ziegler A. Ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R. Journal of Statistical Software. 2017;77(1):1–17.
  38. 38. Ochoa D, Jarnuczak AF, Viéitez C, Gehre M, Soucheray M, Mateus A, et al. The Functional Landscape of the Human Phosphoproteome. Nat Biotechnol. 2020;38(3):365–373. pmid:31819260
  39. 39. Bachman JA, Gyori BM, Sorger PK. Assembling a Phosphoproteomic Knowledge Base Using ProtMapper to Normalize Phosphosite Information from Databases and Text Mining. bioRxiv. 2019; p. 822668.
  40. 40. Horn H, Schoof EM, Kim J, Robin X, Miller ML, Diella F, et al. KinomeXplorer: An Integrated Platform for Kinome Biology Studies. Nat Methods. 2014;11(6):603–604. pmid:24874572
  41. 41. Wilkes EH, Terfve C, Gribben JG, Saez-Rodriguez J, Cutillas PR. Empirical Inference of Circuitry and Plasticity in a Kinase Signaling Network. Proc Natl Acad Sci USA. 2015;112(25):7719–7724. pmid:26060313
  42. 42. Türei D, Korcsmáros T, Saez-Rodriguez J. OmniPath: Guidelines and Gateway for Literature-Curated Signaling Pathway Resources. Nat Methods. 2016;13(12):966–967. pmid:27898060
  43. 43. Sugiyama N, Imamura H, Ishihama Y. Large-Scale Discovery of Substrates of the Human Kinome. Scientific Reports. 2019;9(1):10503. pmid:31324866
  44. 44. Zhang ML, Peña JM, Robles V. Feature Selection for Multi-Label Naive Bayes Classification. Information Sciences. 2009;179(19):3218–3229.
  45. 45. Davis J, Goadrich M. The Relationship between Precision-Recall and ROC Curves. In: Proceedings of the 23rd International Conference on Machine Learning—ICML’06. Pittsburgh, Pennsylvania: ACM Press; 2006. p. 233–240.
  46. 46. von Mering C, Jensen LJ, Snel B, Hooper SD, Krupp M, Foglierini M, et al. STRING: Known and Predicted Protein–Protein Associations, Integrated and Transferred across Organisms. Nucleic Acids Res. 2005;33(Database Issue):D433–D437. pmid:15608232
  47. 47. Szklarczyk D, Gable AL, Lyon D, Junge A, Wyder S, Huerta-Cepas J, et al. STRING V11: Protein-Protein Association Networks with Increased Coverage, Supporting Functional Discovery in Genome-Wide Experimental Datasets. Nucleic Acids Res. 2019;47(D1):D607–D613. pmid:30476243
  48. 48. Bradley D, Beltrao P. Evolution of Protein Kinase Substrate Recognition at the Active Site. PLoS Biol. 2019;17(6):e3000341. pmid:31233486
  49. 49. Rivera VM, Miranti CK, Misra RP, Ginty DD, Chen RH, Blenis J, et al. A Growth Factor-Induced Kinase Phosphorylates the Serum Response Factor at a Site That Regulates Its DNA-binding Activity. Mol Cell Biol. 1993;13(10):6260–6273. pmid:8413226
  50. 50. Ben Djoudi Ouadda A, He Y, Calabrese V, Ishii H, Chidiac R, Gratton JP, et al. CdGAP/ARHGAP31 Is Regulated by RSK Phosphorylation and Binding to 14-3-3β Adaptor Protein. Oncotarget. 2018;9(14):11646–11664. pmid:29545927
  51. 51. Yang S, Ji M, Zhang L, Chen Y, Wennmann DO, Kremerskothen J, et al. Phosphorylation of KIBRA by the Extracellular Signal-Regulated Kinase (ERK)—Ribosomal S6 Kinase (RSK) Cascade Modulates Cell Proliferation and Migration. Cell Signal. 2014;26(2):343–351. pmid:24269383
  52. 52. Liu H, Grundström T. Calcium Regulation of GM-CSF by Calmodulin-Dependent Kinase II Phosphorylation of Ets1. Mol Biol Cell. 2002;13(12):4497–4507. pmid:12475968
  53. 53. Rodriguez P, Bhogal MS, Colyer J. Stoichiometric Phosphorylation of Cardiac Ryanodine Receptor on Serine 2809 by Calmodulin-dependent Kinase II and Protein Kinase A *. Journal of Biological Chemistry. 2003;278(40):38593–38600.
  54. 54. Duan S, Yao Z, Hou D, Wu Z, Zhu Wg, Wu M. Phosphorylation of Pirh2 by Calmodulin-dependent Kinase II Impairs Its Ability to Ubiquitinate P53. The EMBO Journal. 2007;26(13):3062. pmid:17568776
  55. 55. Casado P, Rodriguez-Prados JC, Cosulich SC, Guichard S, Vanhaesebroeck B, Joel S, et al. Kinase-Substrate Enrichment Analysis Provides Insights into the Heterogeneity of Signaling Pathway Activation in Leukemia Cells. Sci Signal. 2013;6(268):rs6. pmid:23532336
  56. 56. Hernandez-Armenta C, Ochoa D, Gonçalves E, Saez-Rodriguez J, Beltrao P. Benchmarking Substrate-Based Kinase Activity Inference Using Phosphoproteomic Data. Bioinformatics. 2017;33(12):1845–1851. pmid:28200105
  57. 57. Landry CR, Levy ED, Michnick SW. Weak Functional Constraints on Phosphoproteomes. Trends Genet. 2009;25(5):193–197. pmid:19349092
  58. 58. Levy ED, Michnick SW, Landry CR. Protein Abundance Is Key to Distinguish Promiscuous from Functional Phosphorylation Based on Evolutionary Information. Philosophical Transactions of the Royal Society B: Biological Sciences. 2012;367(1602):2594–2606. pmid:22889910
  59. 59. Kanshin E, Kubiniok P, Thattikota Y, D’Amours D, Thibault P. Phosphoproteome Dynamics of Saccharomyces Cerevisiae under Heat Shock and Cold Stress. Mol Syst Biol. 2015;11(6):813. pmid:26040289