The author has declared that no competing interests exist.
The collection of mass spectra obtained at different elution times forms an LC-MS run, as shown in the figure below.
(A) LC-MS run. Features in the LC-MS space are peptide ions; their intensity is related to peptide abundance. (B) MS/MS spectrum. The spectrum is obtained by fragmenting the peptide ion isolated from an LC-MS peak. The peaks are fragment ions; distances between peaks are used for peptide sequence determination.
Peak intensity is related to the abundances of peptides, and can be used for relative
quantification. With the label-free approach, a separate LC-MS run is obtained for
each biological sample, and peaks are quantified and compared across runs. In the
stable isotope labeling workflow, samples from different groups are labeled metabolically
(e.g., in SILAC, where stable isotopes are included in the growth medium of an
organism), or chemically (e.g., in ICAT or iTRAQ, where reactive chemical labels are
applied after tryptic digestion). Several samples (e.g., one from each group) are
then mixed, and their peaks are identified and quantified within the same run.
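As a minimal sketch of the label-free comparison, suppose features have already been located, quantified, and matched across runs; the feature identifiers and intensities below are hypothetical, and each run is represented as a mapping from feature to intensity:

```python
import math

def log2_fold_changes(runs_a, runs_b):
    """Average log2 ratio of feature intensities between two groups of runs.

    runs_a, runs_b: lists of dicts mapping feature id -> intensity,
    one dict per LC-MS run (label-free: one run per biological sample).
    Only features observed in every run are compared.
    """
    shared = set.intersection(*(set(r) for r in runs_a + runs_b))
    result = {}
    for feature in shared:
        mean_a = sum(r[feature] for r in runs_a) / len(runs_a)
        mean_b = sum(r[feature] for r in runs_b) / len(runs_b)
        result[feature] = math.log2(mean_a / mean_b)
    return result
```

Restricting to features observed in all runs sidesteps the missing-value problem, which real label-free tools must handle explicitly.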
Finally, a targeted workflow, based for example on selected reaction monitoring
(SRM), quantifies a predefined set of peptides with high sensitivity and reproducibility.
The design of proteomic experiments, and subsequent analysis of the spectra, involves extensive computation and requires expertise at the intersection of computer science, engineering, and statistics. It presents exciting opportunities for both methodological and applied computational research.
Experimental design specifies how biological samples are selected and allocated in space and time during spectral acquisition. For example, a biomarker discovery project can produce biased conclusions if patients from different groups have different characteristics (such as prior medication), or their spectra are acquired under different conditions. Moreover, sample selection and allocation can be inefficient, and can undermine the ability to uncover the true differences between groups.
Statistical experimental design avoids bias and optimizes efficiency by using
replication, randomization, and blocking, and by choosing an appropriate type and
number of replicates.
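A minimal sketch of blocked randomization, assuming equal group sizes and hypothetical sample identifiers, allocates one sample per group to each acquisition block and randomizes the run order within blocks (a randomized complete block design):

```python
import random
from collections import defaultdict

def blocked_randomization(samples, seed=0):
    """Allocate samples to acquisition blocks so that group membership is
    not confounded with acquisition time.

    samples: list of (sample_id, group) tuples with equal counts per group.
    Returns a list of blocks; each block holds one sample per group,
    in random order.
    """
    rng = random.Random(seed)
    by_group = defaultdict(list)
    for sid, grp in samples:
        by_group[grp].append(sid)
    for ids in by_group.values():
        rng.shuffle(ids)                  # randomize which replicate lands in which block
    n_blocks = len(next(iter(by_group.values())))
    blocks = []
    for i in range(n_blocks):
        block = [(by_group[g][i], g) for g in by_group]
        rng.shuffle(block)                # randomize acquisition order within the block
        blocks.append(block)
    return blocks
```

Acquiring the blocks sequentially then spreads every group evenly over instrument time, so a drift in instrument sensitivity cannot masquerade as a group difference.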
After spectral acquisition, the first computational task is to extract and store peak
information. Unfortunately, most mass spectrometer vendors have their own
proprietary formats. An advance has been made by implementing open XML-based formats
(such as mzXML), and the associated converters and validators, to store this
information and to make the subsequent analysis vendor-neutral.
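The idea can be illustrated with Python's standard XML parser on a simplified, unnamespaced mzXML-like snippet; real mzXML files declare an XML namespace and store the peak lists base64-encoded, so this is only a sketch of reading scan metadata:

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified snippet; real mzXML is namespaced and
# carries base64-encoded peak data inside each scan.
MZXML_SNIPPET = """\
<msRun scanCount="2">
  <scan num="1" msLevel="1" retentionTime="PT120.5S" peaksCount="3"/>
  <scan num="2" msLevel="2" retentionTime="PT121.0S" peaksCount="5"/>
</msRun>"""

def scan_index(xml_text):
    """Collect (scan number, MS level, retention time in seconds) triples."""
    root = ET.fromstring(xml_text)
    index = []
    for scan in root.iter("scan"):
        rt = float(scan.attrib["retentionTime"].strip("PTS"))  # e.g. "PT120.5S"
        index.append((int(scan.attrib["num"]), int(scan.attrib["msLevel"]), rt))
    return index
```

Such an index separates MS1 scans (used for quantification) from MS/MS scans (used for identification) without any vendor-specific code.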
An MS/MS spectrum such as the one shown above is identified by assigning it a peptide sequence, as outlined in the figure below.
(A) Identification of MS/MS spectra. Experimental spectra are compared to peptides in a database, and the best-scoring PSMs are reported while controlling the FDR. Protein sequences are identified from the peptides. (B) Label-free quantification. Features in LC-MS runs (shown with circles) are located, quantified, and aligned across runs. (C) LC-MS features are annotated with peptide sequences when identifications are available (shown with filled circles). The annotations are used to optimize the alignment of features across runs. The list of quantified, identified, and aligned features is then subjected to transformation, normalization, and summarization. (D) The list of features is used as input to machine learning, functional annotation, and data integration steps.
Protein sequence databases now exist for many organisms. One can digest the
sequences in silico into peptides, and construct a theoretical spectrum for each
peptide. Alternatively, one can use a library of peptides with associated
consensus experimental spectra derived from previous identifications.
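The in silico steps can be sketched as follows; the residue mass table is truncated to a few amino acids for brevity, and only singly charged b- and y-ions are computed:

```python
# Monoisotopic residue masses (Da) for a subset of amino acids;
# a full table covers all twenty residues.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "L": 113.08406, "K": 128.09496, "R": 156.10111,
    "E": 129.04259, "D": 115.02694, "F": 147.06841,
}
WATER, PROTON = 18.01056, 1.00728

def tryptic_digest(protein):
    """Cleave after K or R, except when the next residue is P (trypsin rule)."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        if aa in "KR" and protein[i + 1:i + 2] != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides

def b_y_ions(peptide):
    """Singly charged b- and y-ion m/z values of a theoretical spectrum."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b = [sum(masses[:i]) + PROTON for i in range(1, len(peptide))]
    y = [sum(masses[i:]) + WATER + PROTON for i in range(1, len(peptide))]
    return b, y
```

Real search engines additionally handle missed cleavages, modifications, and higher charge states.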
Scoring functions quantify the quality of a candidate peptide–spectrum match
(PSM). A typical two-stage procedure filters out PSMs with incompatible peptide
and precursor ion masses, and scores plausible PSMs using counts of shared MS/MS
peaks. Newer scores incorporate additional characteristics, e.g., peak intensity
(for spectral libraries) and empirical peptide detectability.
For each observed spectrum, the algorithm scores its similarity to every
candidate peptide and returns the best-scoring PSM. Since typical experiments
produce hundreds of thousands of MS/MS spectra, development of efficient search
algorithms is an active area of research. Improvements include clustering the
observed spectra using a similarity metric, and only searching the resulting
consensus spectra.
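The two-stage procedure can be sketched as follows, with a shared-peak-count score and hypothetical mass tolerances; the candidate peptides and their fragment masses are assumed to come from an in silico digested database:

```python
def shared_peaks(observed, theoretical, tol=0.5):
    """Count theoretical fragment m/z values matched by an observed peak."""
    return sum(
        any(abs(o - t) <= tol for o in observed) for t in theoretical
    )

def best_psm(spectrum, precursor_mass, candidates, mass_tol=2.0):
    """Two-stage search: filter candidates by precursor mass, then score
    the survivors by shared peak count and return the best-scoring PSM.

    spectrum: list of observed MS/MS peak m/z values.
    candidates: list of (peptide, peptide_mass, fragment_mz_list).
    """
    best = None
    for peptide, pep_mass, fragments in candidates:
        if abs(pep_mass - precursor_mass) > mass_tol:   # stage 1: mass filter
            continue
        score = shared_peaks(spectrum, fragments)        # stage 2: peak counting
        if best is None or score > best[1]:
            best = (peptide, score)
    return best
```

The quadratic peak comparison would be replaced by sorted-list or indexed lookups in an efficient implementation, which is exactly where the algorithmic research mentioned above comes in.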
Due to the stochastic variation in the spectra, deficiencies of the scoring schemes, and possible incompleteness of the database, only a fraction of best-scoring PSMs are typically correct. There is thus a need for a statistical measure of “confidence” in a reported list of PSMs, and for an inferential procedure that distinguishes “confident” PSMs from noise.
An accepted statistical measure is FDR, defined as the expected proportion of
incorrect identifications in a list of PSMs with scores above a cutoff. To
determine FDR-controlled lists of PSMs, the target–decoy strategy is commonly used: the spectra are also searched against a database of reversed or shuffled (decoy) sequences, and the number of decoy matches above the cutoff estimates the number of incorrect target matches.
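A minimal sketch of target–decoy FDR estimation, assuming higher scores are better and that target and decoy PSM scores have already been collected from separate searches:

```python
def target_decoy_fdr(target_scores, decoy_scores, cutoff):
    """Estimate the FDR among target PSMs with score >= cutoff: the decoy
    matches above the cutoff estimate the number of incorrect targets."""
    n_target = sum(s >= cutoff for s in target_scores)
    n_decoy = sum(s >= cutoff for s in decoy_scores)
    return n_decoy / n_target if n_target else 0.0

def score_cutoff_at_fdr(target_scores, decoy_scores, fdr=0.01):
    """Lowest score cutoff whose estimated FDR stays at or below the target."""
    for cutoff in sorted(set(target_scores)):
        if target_decoy_fdr(target_scores, decoy_scores, cutoff) <= fdr:
            return cutoff
    return None
```

Reporting all target PSMs above the returned cutoff yields an FDR-controlled identification list.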
Confidently identified peptides can be grouped to infer the protein components of
the mixture. This is nontrivial due to ambiguous mappings of peptides to
proteins, and to the insufficient discrimination of some proteins by the
identified peptides.
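A simplified sketch of the grouping, which flags proteins supported by at least one unique (unshared) peptide; real protein inference tools use more elaborate parsimony or probabilistic models:

```python
def infer_proteins(peptide_to_proteins):
    """Separate proteins with distinct peptide evidence from ambiguous ones.

    peptide_to_proteins: dict mapping an identified peptide to the set of
    database proteins containing it.
    Returns (proteins with at least one unique peptide, remaining proteins).
    """
    protein_peptides = {}
    for pep, prots in peptide_to_proteins.items():
        for p in prots:
            protein_peptides.setdefault(p, set()).add(pep)
    distinct = {
        p for p, peps in protein_peptides.items()
        if any(len(peptide_to_proteins[pep]) == 1 for pep in peps)
    }
    ambiguous = set(protein_peptides) - distinct
    return distinct, ambiguous
```

Proteins in the ambiguous set cannot be discriminated by the observed peptides and are typically reported as a group rather than individually.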
Extensive spectral databases, e.g., the Peptide Atlas, are publicly available.
Quantitative proteomics monitors peptide and protein abundance across samples of
multiple types. The goals are similar to other high-throughput experiments such as
gene expression microarrays.
Quantitative workflows require signal processing beyond spectral identification.
Features in the spectra must be located and quantified, annotated, when possible,
with peptide sequence information, and aligned across runs. A variety of tools
have been implemented for these tasks.
The biological effects are multiplicative in nature, and a logarithm transform of
intensities is frequently recommended. Feature intensities are further
normalized across runs, e.g., using quantile normalization.
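A minimal implementation of quantile normalization, assuming every run reports the same features with no missing values (production tools must handle missing features):

```python
def quantile_normalize(runs):
    """Quantile-normalize feature intensities across runs: every run
    receives the same reference distribution (the mean of the sorted
    intensities at each rank), while ranks within a run are preserved.

    runs: list of equal-length intensity lists, one per run.
    """
    n = len(runs[0])
    ranked = [sorted(range(n), key=run.__getitem__) for run in runs]
    # Reference distribution: mean of the k-th smallest intensity over runs.
    ref = [sum(sorted(run)[k] for run in runs) / len(runs) for k in range(n)]
    out = [[0.0] * n for _ in runs]
    for run_idx, order in enumerate(ranked):
        for rank, feat_idx in enumerate(order):
            out[run_idx][feat_idx] = ref[rank]
    return out
```

After normalization, the intensity distributions of all runs are identical, so between-run technical differences no longer mimic biological ones.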
Statistical and machine learning tools are then applied, for example, to detect differentially abundant proteins and to classify samples into groups.
Database technologies connect the proteins to their annotations, e.g., from the
Gene Ontology or from disease databases. The annotations can confirm the
plausibility of the identifications, and can enable tests for over-represented
functional categories in the protein list.
Recent studies combine proteomic measurements with gene expression and
metabolomic profiles, and/or known biochemical networks, with the general goal
of protein function determination.