I have read the journal's policy and have the following conflicts: GS acts as a scientific consultant to the pharmaceutical industry and is a co-founder of inSili.com LLC, Zürich, and AlloCyte Pharmaceuticals Ltd, Basel. The authors have declared that no other competing interests exist.
Conceived and designed the experiments: PW GF JAH GS. Performed the experiments: CPK AMP MP NKT. Analyzed the data: CPK AMP MP PW GF JAH GS. Contributed reagents/materials/analysis tools: MP NKT. Wrote the paper: CPK AMP GS. Designed and set up the software server used in analysis: CPK NKT GS.
Designed peptides that bind to major histocompatibility protein I (MHC-I) allomorphs bear the promise of representing epitopes that stimulate a desired immune response. A rigorous bioinformatical exploration of sequence patterns hidden in peptides that bind to the mouse MHC-I allomorph H-2Kb is presented. We exemplify and validate these motif findings by systematically dissecting the epitope SIINFEKL and analyzing the resulting fragments for their binding potential to H-2Kb in a thermal denaturation assay. The results demonstrate that only fragments exclusively retaining the carboxy- or amino-terminus of the reference peptide exhibit significant binding potential, with the N-terminal pentapeptide SIINF as shortest ligand. This study demonstrates that sophisticated machine-learning algorithms excel at extracting fine-grained patterns from peptide sequence data and predicting MHC-I binding peptides, thereby considerably extending existing linear prediction models and providing a fresh view on the computer-based molecular design of future synthetic vaccines. The server for prediction is available at
Future success in vaccine development will critically depend on identifying potent epitopes with reduced side effects. Among such candidate molecules, immunogenic peptides binding to major histocompatibility protein I (MHC-I) represent a preferred class of biomolecules for vaccine design. Computational models assist in the selection of the best candidate peptides by providing a mathematical rationale for antigen recognition by MHC-I. Here we present a machine-learning model that was trained to recognize features of known MHC-I binding and non-binding peptide sequences with sustained accuracy. We were able to biochemically validate the computational predictions in a direct binding assay measuring complex formation between synthesized candidate peptides and MHC-I. Strong correspondence between the predictions and the experimentally determined binding potential corroborates the machine-learning model as viable for future antigen design. Thus, our study provides a concept for rapidly finding innovative MHC-I binding peptides with limited experimental effort.
Artificial induction of immunity (immunization) is achieved by priming the immune system with a specific antigen (epitope) bearing the potential of activating the adaptive immune response
Numbers within dashed boxes correspond to sequence positions in the respective epitope. Red boxes and associated amino acid codes indicate anchor positions and preferred amino acid composition respectively. The yellow box indicates the secondary anchor at position 3, according to the H-2Kb canonical sequence motif.
In this study we developed a cascaded machine learning approach to learn patterns from the available MHC-I binding information in the Immune Epitope Database (IEDB)
We extracted peptide data from version 7/2012 of the IEDB
The Immune Epitope Database (IEDB) served as data source
The core training set was split into a 10-fold cross-validation set and an external validation set by a ratio of 4∶1 in order to retain two evaluation scenarios (
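The split described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' procedure: it assumes a plain random shuffle without stratification by binding class, and the names `split_core_set` and `cv_rounds` as well as the `seed` parameter are hypothetical.

```python
import random

def split_core_set(peptides, ratio=(4, 1), n_folds=10, seed=0):
    """Shuffle, hold out an external validation set, and build CV folds.

    Assumption: simple random split; the source does not state whether
    the original split was stratified.
    """
    rng = random.Random(seed)
    shuffled = list(peptides)
    rng.shuffle(shuffled)
    # A 4:1 ratio keeps 80 % for cross-validation and holds out 20 %.
    n_cv = len(shuffled) * ratio[0] // sum(ratio)
    cv_set, external = shuffled[:n_cv], shuffled[n_cv:]
    # Deal the cross-validation peptides round-robin into n_folds folds.
    folds = [cv_set[i::n_folds] for i in range(n_folds)]
    return folds, external

def cv_rounds(folds):
    """Yield (training, test) pairs; each fold serves once as the test set."""
    for i, test in enumerate(folds):
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        yield train, test
```

Keeping the external set apart from the folds means that no peptide used for model selection during cross-validation also appears in the final external evaluation.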
ANN: feed-forward artificial neural network, SVM: support vector machine. AAFREQ, BINAATYPE, BINPEP, PEPCATS, PPCA and PPCALI correspond to the utilized peptide descriptors (
The 12 trained base classifiers were subsequently fed with the same data they were originally trained on to compute the input (
The final jury (
The model delivers a prediction score from the interval [0,1[, with high values indicating MHC-I H-2Kb binding.
All octapeptides containing at minimum tripeptide fragments of the positive reference binder SIINFEKL were categorically grouped according to the respective fragment (
The heptapeptide fragment-containing groups (xIINFEKL, SIINFEKx) comprising 20 peptides each exhibited the highest score distributions (
Tetrapeptide fragment-containing groups were categorized into low-median groups xxxNFEKx, xxINFExx and SIINxxxx (0.01) and, in contrast, high-median groups xIINFxxx (
Pentapeptide fragment-containing groups were divided into a high-median category containing xxxNFEKL (0.98) and SIINFxxx (0.99), and the low-median groups xIINFExx (0.16) and xxINFEKx (0.02), while the hexapeptide fragment-containing groups all exhibited high medians: xIINFEKx (0.95), SIINFExx (0.98), and xxINFEKL (0.99).
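The group patterns discussed above (for example xIINFEKL or SIINFxxx) can be enumerated programmatically. The following sketch assumes that 'x' stands for any of the 20 standard amino acids, including the residue found in the reference peptide, and that a peptide may match more than one pattern; the function names are hypothetical and the grouping rules used in the study may differ in detail.

```python
from itertools import product

REFERENCE = "SIINFEKL"
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues

def fragment_patterns(reference=REFERENCE, min_len=3):
    """Return patterns that keep one contiguous fragment in its position.

    Positions outside the fragment are marked 'x' (arbitrary residue).
    Only proper fragments (shorter than the reference) are listed.
    """
    n = len(reference)
    patterns = []
    for length in range(min_len, n):
        for start in range(n - length + 1):
            end = start + length
            patterns.append("x" * start + reference[start:end] + "x" * (n - end))
    return patterns

def expand(pattern):
    """Yield every peptide matching a pattern ('x' = any residue)."""
    choices = [AMINO_ACIDS if c == "x" else c for c in pattern]
    for combo in product(*choices):
        yield "".join(combo)
```

Under these assumptions `fragment_patterns()` returns 20 patterns, one per tri- to heptapeptide fragment of SIINFEKL, and expanding a heptapeptide pattern such as xIINFEKL yields 20 peptides, in line with the group size stated in the text.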
Examining groups with the same first non-arbitrary (non ‘
To test the predictions made by our machine-learning model, we synthesized all SIINFEKL fragments and measured MHC-I H-2Kb binding in a thermal denaturation assay (
Peptide | p-value | Relative binding |
EKL | 0.9 | - |
FEKL | 0.5 | - |
FEK | 1 | - |
NFEKL | 1 | - |
NFEK | 0.9 | - |
NFE | 0.9 | - |
INFEKL | 0.001 | 7% (2%) |
INFEK | 0.4 | - |
INFE | 1 | - |
INF | 0.7 | - |
IINFEKL | 0.001 | 22% (1%) |
IINFEK | 0.8 | - |
IINFE | 0.9 | - |
IINF | 0.9 | - |
IIN | 0.5 | - |
SIINFEK | 0.001 | 5% (1%) |
SIINFE | 0.001 | 4% (0%) |
SIINF | 0.001 | 8% (3%) |
SIIN | 0.9 | - |
SII | 0.9 | - |
SIINFEKL | 0.001 | 100% (1%) |
0.5 | - |
The null hypothesis for the Welch-test
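For orientation, a Welch test compares the means of two groups without assuming equal variances. The sketch below computes only the test statistic and the Welch-Satterthwaite degrees of freedom using the Python standard library; the function name `welch_t` is hypothetical, and obtaining a p-value would additionally require the cumulative distribution function of Student's t distribution, which is left out here.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and Welch-Satterthwaite degrees of freedom.

    Each sample needs at least two values and a non-zero combined variance.
    """
    na, nb = len(sample_a), len(sample_b)
    # Squared standard errors of the two sample means.
    se2_a = variance(sample_a) / na
    se2_b = variance(sample_b) / nb
    t = (mean(sample_a) - mean(sample_b)) / sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (na - 1) + se2_b ** 2 / (nb - 1))
    return t, df
```

In an assay of this kind, `sample_a` and `sample_b` could hold replicate measurements for a peptide-loaded well and a control well; which groups were actually compared is not specified in the surviving text.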
It is noteworthy that the pentapeptide SIINF (8%) exhibited significantly higher (
The results of this study demonstrate the potential of coupling cascaded machine-learning models for predicting MHC-I antigen presentation to a rapid thermal denaturation assay for validation of direct binding to MHC. Moreover, inferring the direct-binding behavior of actual fragments from the prediction distributions of SIINFEKL fragment-containing peptides appears to be feasible. The newly established model allows for a fine-grained grasp of sequence motifs, suggesting that sequence length, as defined in previous studies by an optimum of 8–10 amino acids for the H-2Kb allele, plays a crucial role in the binding mechanism
Concerning the classic canonical sequence motif (
A further prospective analysis of the entire sliced-and-diced mouse proteome revealed about 1.75% of octapeptides as confidently (
Ensemble models have shown their general usefulness in increasing predictive performance in comparison to their individual base classifiers
It must be kept in mind though that for some pathogens, an adequate immune response may require activation of not only the cell-mediated MHC-I supported CD8+ T cell response but also the assistance of MHC-II facilitated CD4+ T cell responses or antibody-driven humoral responses
We used WEKA
Thermal denaturation studies were conducted using a StepOnePlus real-time PCR system (Applied Biosystems) with MicroAmp optical 96-well plates (Applied Biosystems cat. no. N8010560). Wells were loaded with 10 µl of H-2Kb:IgG fusion protein solution (protein conc. = 0.5 mg×ml−1; BD Biosciences cat. no. 550750), or with 10 µl of PBS buffer (pH 7.36) for ligand-only controls, together with 2 µl of peptide or peptide fragments, 2 µl of 10× SYPRO Orange (Sigma-Aldrich cat. no. S5692) and 6 µl (8 µl for negative controls) to yield a total volume of 20 µl per well. Final concentrations were 1 µM for the H-2Kb:IgG fusion protein, 100 µM for the peptides/peptide fragments and 1× for SYPRO Orange. Fluorescence intensity was measured using the Applied Biosystems ROX preset, with excitation/emission maxima at 587/607 nm, while heating the wells continuously from 25°C to 99°C with a ramp rate of 1% (temperature increase of 1.5°C per minute). Results were recorded by StepOne 2.2.2 software and analyzed by identifying the local minimum of the derivative of the melting curve for the segment relevant for denaturation of the peptide-binding superdomain (α1,2) of the MHC-I heavy chain.
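Two steps of this protocol lend themselves to a short sketch: the dilution arithmetic behind the final concentrations and the search for the derivative minimum of a melting curve. The code below is an illustration under stated assumptions, not the StepOne software's algorithm: the function names and the `window` parameter (the temperature segment to search) are hypothetical, the derivative is a simple central difference, and the curve is assumed to be supplied in the orientation in which the transition appears as a minimum of the derivative.

```python
def final_concentration(stock_conc, volume_added, total_volume):
    """Concentration after dilution (same units as stock_conc)."""
    return stock_conc * volume_added / total_volume

def derivative(temps, signal):
    """Central-difference derivative at the interior temperature points."""
    return [
        (signal[i + 1] - signal[i - 1]) / (temps[i + 1] - temps[i - 1])
        for i in range(1, len(temps) - 1)
    ]

def transition_temperature(temps, signal, window):
    """Temperature of the derivative minimum inside window = (low, high).

    Assumption: `window` brackets the transition of interest; the source
    does not state which temperature segment was used.
    """
    deriv = derivative(temps, signal)
    interior = temps[1:-1]
    candidates = [
        (d, t) for d, t in zip(deriv, interior) if window[0] <= t <= window[1]
    ]
    return min(candidates)[1]
```

For example, `final_concentration(10, 2, 20)` reproduces the 1× SYPRO Orange stated above (2 µl of a 10× stock in 20 µl), and a final peptide concentration of 100 µM from 2 µl in 20 µl implies a 1 mM stock.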
DMF (dimethylformamide), DCM (dichloromethane), diisopropylether, piperidine and TIPS (triisopropylsilane) were purchased from Sigma-Aldrich. NMM (4-methylmorpholine) and TFA (2,2,2-trifluoroacetic acid) were acquired from Fisher Scientific; Fmoc-protected Wang-resins, Fmoc-protected amino acids, and HCTU (
Sarah Haller is thanked for technical assistance.