The author has declared that no competing interests exist.
Carbohydrates are considered the third class of information-encoding biological macromolecules. “Glycomics,” the scientific attempt to characterize and study carbohydrates, is a rapidly emerging branch of science, for which informatics is just beginning. Glycomics requires sophisticated algorithmic approaches. Several algorithms and models have been developed for glycobiology research in the past several years. This tutorial will provide a brief introduction to the field of glycome informatics, which will include a primer on glycobiology as well as descriptions of the algorithms and models that have been developed in this field.
The four essential molecular building blocks of cells are nucleic acids, proteins, lipids, and carbohydrates, often referred to as glycans. Nucleotide and protein sequences are at the heart of nearly all bioinformatics applications and research, whereas glycan and lipid structures have been widely neglected in bioinformatics. However, glycans are the most abundant and structurally diverse biopolymers formed in nature. Bound to proteins, as glycoproteins, they are known to affect the functions of proteins. More than half of all protein sequences deposited in the SWISS-PROT databank include potential glycosylation sites and thus may be glycoproteins. Based on an analysis of well-annotated and characterized glycoproteins in SWISS-PROT, it was concluded that more than half of all proteins are glycosylated
The development and use of informatics tools and databases for glycobiology and glycomics research has increased considerably in recent years. However, the general development in this field can still be considered as being in its infancy when compared to the genomics and proteomics areas. In terms of bioinformatics in glycobiology, there are several paths of research that are currently in progress. The development of algorithms to reliably support the characterization of glycan structures for high-throughput applications is the most immediate demand of the glycomics community. Additionally, several major glyco-related projects (Consortium for Functional Glycomics
Complex carbohydrates are chains of monosaccharides, often called glycans, and are often found attached to proteins (to form glycoproteins) and lipids (glycolipids, glycosphingolipids, etc.). Glycoproteins are usually on the cell surface, where they are recognized by bacteria, viruses, and other proteins, such as lectins, in order to facilitate various crucial functions. It is also known that glycans are involved in a variety of biological processes including protein folding and signalling events.
The structural complexity of glycans has been a bottleneck in structure determination and thus in the accumulation of glycan structure data. This is compounded by the complex biosynthetic pathways of glycans. It is known that glycan-specific diseases called CDGs (congenital disorders of glycosylation) are caused by defects in these pathways
Complex carbohydrates are composed of monosaccharides that are covalently linked by glycosidic bonds, either in the α or β form. Unlike DNA and proteins, however, monosaccharides may be linked to one or more other monosaccharides, such that they form a branched tree structure. In order to formulate a standardized notation for glycans, the Consortium for Functional Glycomics (CFG) proposed a standard symbolic representation for those monosaccharides that are found most in nature, which has been employed in
Carbohydrates are most classically drawn as a tree in a two-dimensional plane, with the root monosaccharide placed at the right-most position and children branching out toward the left. Each node represents a monosaccharide, and each edge represents a glycosidic linkage, which includes the carbon numbers that are bound and the conformation. An example of an N-linked glycan is given in
Although the two-dimensional notation is visually intuitive, it is not suitable for storage in a database, let alone for bioinformatic analysis. The IUPAC–IUBMB (International Union of Pure and Applied Chemistry–International Union of Biochemistry and Molecular Biology) has specified the “Nomenclature of Carbohydrates” to uniquely describe complex oligosaccharides based on a three-letter code to represent monosaccharides (e.g., “gal” for galactose and “man” for mannose). Each monosaccharide code is preceded by the anomeric descriptor and the configuration symbol. The ring size is indicated by an italic
However, as we discuss in the next section, it is not always possible to obtain a full and exact representation of carbohydrates due to the difficulties in sequencing them. Currently, the most popular method for complex carbohydrate sequencing is mass spectrometry (MS). However, this process is often incomplete and error-prone. For example, unless one uses tandem MS, it is nearly impossible to distinguish between isomeric monosaccharides (e.g., glucose, galactose, and mannose are all hexoses with the same mass). Tandem MS is itself a rather tedious process, even for a single carbohydrate structure. Thus, for those developing databases, the notation for carbohydrates must be flexible enough to capture all the data at hand while also accounting for ambiguities.
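The mass ambiguity among isomeric monosaccharides can be illustrated with a short sketch. The monoisotopic residue masses are standard values, but the function name `glycan_mass`, the `MASS_CLASS` mapping, and the list-of-residues representation are assumptions made only for this illustration; they do not correspond to any tool described in this tutorial.

```python
# Monoisotopic masses (Da) of monosaccharide residues (free sugar minus one water).
RESIDUE_MASS = {
    "Hex": 162.0528,     # glucose, galactose, mannose
    "HexNAc": 203.0794,  # GlcNAc, GalNAc
    "dHex": 146.0579,    # fucose
    "NeuAc": 291.0954,   # N-acetylneuraminic acid
}
WATER = 18.0106

# Hypothetical mapping from named monosaccharides to their mass class.
MASS_CLASS = {
    "Glc": "Hex", "Gal": "Hex", "Man": "Hex",
    "GlcNAc": "HexNAc", "GalNAc": "HexNAc",
    "Fuc": "dHex", "Neu5Ac": "NeuAc",
}

def glycan_mass(residues):
    """Neutral monoisotopic mass of a free glycan given its residue names."""
    return sum(RESIDUE_MASS[MASS_CLASS[r]] for r in residues) + WATER

# Two different disaccharide compositions that a single MS stage cannot separate.
same = glycan_mass(["Glc", "Glc"]) == glycan_mass(["Gal", "Man"])
```

Because only the mass class enters the sum, any exchange of one hexose for another leaves the computed mass unchanged, which is why the notation must be able to record a residue as an unspecified hexose.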
Several different notations for carbohydrates are currently in use, having developed out of the construction of some major databases during a time when no standard notation for carbohydrates existed. Briefly, these notations are the KEGG Chemical Function (KCF) format, which represents glycans using a connected graph; LINUCS (Linear Notation for Unique Description of Carbohydrate Sequences), which provides a unique and linear notation for glycans; and Linear Code by GlycoMinds, which provides a commercial complex carbohydrate database
As of the time of this writing, there are three major databases for complex carbohydrates, Glycosciences.de, KEGG GLYCAN, and the database developed by the Consortium for Functional Glycomics (CFG). All three databases are based on the CarbBank database developed in the 1990s by the Complex Carbohydrate Research Center (CCRC) at the University of Georgia
Database Name | Description | URL | Reference |
Glycosciences.de | Database of glycan structures and mass spectral data, based at the German Cancer Research Center | | |
KEGG GLYCAN | A part of the KEGG database containing glycan structures extracted from CarbBank and subsequently linked with the GENES and PATHWAY information in KEGG. Glycosyltransferases and glycan binding protein data have also been organized in KEGG BRITE | | |
CFG | Developed by the Bioinformatics Core of the CFG, this database contains structures from CarbBank and a seed database provided by GlycoMinds. They have been subsequently linked with tissue and cell data, glycan array information, and glycans specifically synthesized by the CFG. | | |
The major issue facing the glyco-informatics community was that each of these databases represented its glycan structures in a different format. Glycosciences.de uses the LINUCS format, KEGG the KEGG Chemical Function (KCF) format, and CFG the IUPAC format. In September 2006, a workshop was held at the National Institutes of Health (NIH), United States, where glycobiologists and glyco-informaticians gathered to discuss a standard exchange format for carbohydrate structures. At this meeting, the GLYDE-II XML format for glycans and glycoconjugates, developed by the CCRC, was agreed upon as the standard format for exchanging carbohydrate data
Along with the development of these glycan databases over the past few years, bioinformatic methods for analyzing glycan structures have also appeared. In general, these can be classified into the following six categories: glycosylation analysis, glycomics, glycan biomarker prediction, glycan structure analysis, glyco-gene expression analysis, and glycan structure mining.
The first three categories (glycosylation analysis, glycomics, and glycan biomarker prediction) may be of most interest to biologists, whereas the latter three are currently active areas of research in the informatics community. Thus, the literature is rich in research in the former areas, and it is hoped that the latter areas will develop and produce more interesting results as these technologies advance. In any case, these areas are all covered equally in this section.
Since the methods in this section have been summarized nicely in two previous reviews
As one form of post-translational modification, glycosylation affects the function of the modified protein. Thus, many methods have been developed to predict glycosylation sites based on the amino acid sequence. These methods have been summarized in
Name | Description | URL |
Big-PIPredictor | GPI-anchor prediction | |
GlyProt | In-silico glycosylation | |
GlySeq | Statistical analysis of glycosylation sites | |
GPI-SOM | Identification of GPI-anchor signals using a Self Organizing Map (SOM) | |
NetNGlyc | N- and O-glycosylation prediction; also available as SOAP-based web services | |
NetCGlyc | C-mannosylation site prediction from mammalian proteins | |
YinOYang | Neural network predictions for O-β-GlcNAc binding sites in eukaryotic proteins, using predicted phosphorylation sites | |
The statistical analysis of amino acids surrounding glycosylation binding sites has been an active area of research by the German Cancer Research Center. One of their tools called GlySeq
In addition to analyzing the surrounding sequence, a tool called GlyVicinity performs a statistical analysis of a PDB entry by computing the frequency of amino acids within a user-definable distance of up to 10 Å from carbohydrate residues. This tool operates on the data in GlyVicinityDB, which contains distance information of the amino acids in the spatial vicinity of carbohydrate residues in PDB entries
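The kind of distance-based counting described here can be sketched in a few lines. This is only an illustration of the idea and not the GlyVicinity implementation; the function name `vicinity_counts`, the input layout, and the default cutoff of 4.0 Å are assumptions.

```python
import math
from collections import Counter

def vicinity_counts(carb_atoms, protein_atoms, cutoff=4.0):
    """Count amino acid residues with at least one atom within `cutoff` Å
    of any carbohydrate atom.

    carb_atoms:    list of (x, y, z) coordinates
    protein_atoms: list of (residue_id, residue_name, (x, y, z))
    """
    close = {}
    for res_id, res_name, xyz in protein_atoms:
        if any(math.dist(xyz, c) <= cutoff for c in carb_atoms):
            close[res_id] = res_name  # each residue is counted once
    return Counter(close.values())

# Toy coordinates (hypothetical, not taken from any PDB entry).
carb = [(0.0, 0.0, 0.0)]
protein = [
    (1, "TRP", (3.0, 0.0, 0.0)),
    (1, "TRP", (9.0, 0.0, 0.0)),
    (2, "ASN", (0.0, 3.5, 0.0)),
    (3, "GLY", (10.0, 10.0, 10.0)),
]
counts = vicinity_counts(carb, protein)
```

Keying the intermediate dictionary by residue identifier prevents a residue with several nearby atoms from being counted more than once.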
In other work at Johns Hopkins University, a model to mathematically formulate N-glycosylation was developed
The field of glycomics can be defined as the technology to determine carbohydrate sequences (structures) using mass spectral data. This area of research has been in the greatest demand within the glycobiology community because of the tedious process traditionally used to characterize glycans and glycoproteins. In particular, each mass peak was manually annotated by experts, resulting in months of analysis for one mass spectrum.
This problem was conventionally addressed by developing a database of theoretical mass spectra corresponding to known glycan structures. Newly produced MS data could then be compared with the theoretical spectra to find the most similar one, providing a clue as to the structures behind the new spectra
More recently, as a result of the large volumes of MS data being produced by the CFG, the Cartoonist program was developed to automatically annotate N-glycans in MALDI-MS data
In an attempt to predict any type of glycan structure from mass spectra, the GLYCH method was developed to use a dynamic programming method and a listing of all possible fragment types of glycans
Many glycan motifs are known to be involved in a variety of diseases including cancer
In glycome informatics, the layered-trimer kernel was first developed and used to verify the utility of using kernels for glycan biomarker prediction
Taking advantage of the fact that the glycan substructures at the leaves are more prone to be recognized compared to the root structures attached to proteins, a weighting scheme was employed that differentiated substructures based on their “depth” or the “layer” of the substructure, the number of glycosidic linkages between the substructure and the root. Furthermore, it is known that glycosyltransferases interact with three monosaccharides on average. Thus, glycan structures were decomposed into trimers. This produced a feature vector of trimers distinguished by layer, which was tested using a dataset of glycans related to different blood components as well as to leukemic cells. These annotations were retrieved from the original CarbBank database.
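The decomposition into layer-labelled trimers can be sketched as follows. This is a simplified illustration under stated assumptions and not the published implementation: a glycan is represented as nested `(name, children)` tuples, linkage information is omitted, only linear parent-child-grandchild paths are treated as trimers, and the layer of a trimer is taken to be the depth of its node nearest the root. The name `layered_trimers` is hypothetical.

```python
from collections import Counter

def layered_trimers(tree):
    """Count (layer, trimer) features of a glycan tree.

    tree: (monosaccharide_name, [child_tree, ...]), with the root first.
    A trimer is a parent-child-grandchild path; its layer is the number of
    glycosidic linkages between the glycan root and the trimer's parent node.
    """
    features = Counter()

    def visit(node, depth):
        name, children = node
        for child_name, grandchildren in children:
            for grandchild_name, _ in grandchildren:
                features[(depth, (name, child_name, grandchild_name))] += 1
        for child in children:
            visit(child, depth + 1)

    visit(tree, 0)
    return features

# The N-glycan core: two GlcNAc residues followed by a branching mannose.
core = ("GlcNAc", [("GlcNAc", [("Man", [("Man", []), ("Man", [])])])])
features = layered_trimers(core)
```

For the core structure, this yields one GlcNAc-GlcNAc-Man trimer at layer 0 and two GlcNAc-Man-Man trimers at layer 1. A layer-dependent weight can then be applied to each feature before two glycans are compared.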
The kernel was defined using a weighting parameter for the layer of each glycan substructure, according to the following equation. Given the feature vectors for two glycans
Using this kernel on the leukemia dataset described above, the model was able to extract a feature that was highly characteristic of leukemia, which was corroborated by experimental evidence.
This method extended the layered-trimer kernel in order to account for potential glycan biomarkers that were smaller or larger than trimers, without the use of layers, since it was assumed that layer information could be subsumed by the wider distribution of features. As a result, the
Finally, to more efficiently handle the large number of features required by the
The tree structure of glycans has been a topic of interest especially for bioinformaticians interested in trees. Traditionally, RNA structures and phylogenetic analyses have been the focus of tree-based algorithms. However, these structures result in trees with information at the leaves, with internal nodes representing relationships between the leaves. In contrast, glycans provide a structure where internal and external nodes all represent the same type of object: monosaccharides. As a result, glycan structure alignment using tree alignment algorithms and glycosidic linkage score matrices has been developed and analyzed.
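To make the idea of comparing glycans as trees concrete, the following sketch computes the size of the largest common subtree that includes both roots. It is a deliberately simplified illustration and not the algorithm of any tool discussed here: linkages are ignored, node matches score 1, and child assignments are found by brute force, which is feasible only because monosaccharides have few children. The name `tree_match_score` and the nested-tuple representation are assumptions.

```python
from itertools import permutations

def tree_match_score(a, b):
    """Number of nodes in the largest common top-down subtree of two rooted
    glycan trees, where children are unordered.

    Each tree is (monosaccharide_name, [child_tree, ...]).
    """
    name_a, kids_a = a
    name_b, kids_b = b
    if name_a != name_b:
        return 0
    # Fix the shorter child list and try every assignment from the longer one.
    if len(kids_a) <= len(kids_b):
        short, longer = kids_a, kids_b
    else:
        short, longer = kids_b, kids_a
    best = 0
    for perm in permutations(longer, len(short)):
        best = max(best, sum(tree_match_score(s, l) for s, l in zip(short, perm)))
    return 1 + best

t1 = ("GlcNAc", [("GlcNAc", [("Man", [("Man", []), ("Man", [])])])])
t2 = ("GlcNAc", [("GlcNAc", [("Man", [("Man", [("GlcNAc", [])]), ("Gal", [])])])])
shared = tree_match_score(t1, t2)
```

Here `shared` is 4: the two trees share the GlcNAc-GlcNAc-Man trunk and one of the terminal mannose branches. A practical method would replace the unit match score with a score that depends on the monosaccharides and linkages being aligned.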
The first application of tree-structure alignment using dynamic programming applied to glycans was the algorithm called KEGG Carbohydrate Matcher, or KCaM
This algorithm may now be used to analyze monosaccharide similarity, as in amino acid similarity, as represented by amino acid substitution matrices such as PAM
Once the appropriate classes of glycans are defined, the KCaM alignment results can be used to calculate the frequency of alignment of glycosidic linkages, which includes the full linkage information (carbon numbers and conformation), as well as the two monosaccharide names which are linked (hereafter called “links”). This score matrix of links is thus the log odds score of the expected frequency of alignment of link pairs
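A log-odds score matrix of this kind can be sketched in the same way that amino acid substitution matrices are commonly derived. The exact normalization used for the link matrix is not given here, so the formula below (log base 2 of observed over expected pair frequency, with pairs counted in both orientations) is an assumption, as are the function name `log_odds_matrix` and the string labels used for links.

```python
import math
from collections import Counter

def log_odds_matrix(aligned_pairs):
    """Compute log2-odds scores from a list of aligned (link_a, link_b) pairs."""
    pair_counts = Counter()
    single_counts = Counter()
    for a, b in aligned_pairs:
        # Count both orientations so that the resulting matrix is symmetric.
        pair_counts[(a, b)] += 1
        pair_counts[(b, a)] += 1
        single_counts[a] += 1
        single_counts[b] += 1
    total_pairs = sum(pair_counts.values())
    total_singles = sum(single_counts.values())
    scores = {}
    for (a, b), n in pair_counts.items():
        observed = n / total_pairs
        expected = (single_counts[a] / total_singles) * (single_counts[b] / total_singles)
        scores[(a, b)] = math.log2(observed / expected)
    return scores

# Hypothetical alignment counts for two kinds of links.
pairs = (
    [("Gal b1-4 GlcNAc", "Gal b1-4 GlcNAc")] * 3
    + [("Gal b1-3 GlcNAc", "Gal b1-3 GlcNAc")] * 3
    + [("Gal b1-4 GlcNAc", "Gal b1-3 GlcNAc")] * 2
)
scores = log_odds_matrix(pairs)
```

In this toy input, identical links align more often than chance and receive positive scores, whereas the mixed pair aligns at half the expected frequency and scores -1.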
In an attempt to overcome one of the major issues in glycomics, glycan structure characterization through MS, a bioinformatic method to predict glycan structures in a particular cell through the gene expression profiles was developed
This method was further improved such that (i) the database of glycans was augmented with new glycans that should exist and (ii) the prediction score for glycans used the expression values directly as opposed to using binary values. The first step was performed by analyzing the database of glycans and finding pairs that differed by more than one link. That is, since glycosyltransferases typically catalyze only one link at a time, if two similar glycans exist in the database but differ by, say, two to four links, then the “intermediate” glycans that would be produced in the process of synthesizing the larger structure should also exist, and these “intermediate” glycans are added to the database.
Since Entry 2 contains just two more nodes than Entry 1, and since in almost all cases glycosidic linkages are synthesized one by one, we can assume that the New Entry exists and can be added as a new structure.
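The augmentation step can be sketched as an enumeration of all connected structures lying strictly between a smaller and a larger database entry. This is an illustrative reading of the description above rather than the published procedure: each glycan is represented as a set of node paths from the root, and the function name `intermediates` and the example entries are hypothetical.

```python
from itertools import combinations

def intermediates(small, large):
    """Enumerate structures strictly between `small` and `large`.

    Each glycan is a set of node paths; a path is a tuple of residue labels
    from the root down to that node.
    """
    small, large = frozenset(small), frozenset(large)
    if not small < large:
        return []
    extra = sorted(large - small)
    found = []
    for k in range(1, len(extra)):
        for added in combinations(extra, k):
            candidate = small | set(added)
            # Every non-root node needs its parent, so the tree stays connected.
            if all(len(p) == 1 or p[:-1] in candidate for p in candidate):
                found.append(candidate)
    return found

entry1 = {("GlcNAc",), ("GlcNAc", "Gal")}
entry2 = entry1 | {
    ("GlcNAc", "Gal", "GlcNAc"),
    ("GlcNAc", "Gal", "GlcNAc", "Gal"),
}
new_entries = intermediates(entry1, entry2)
```

With two extra nodes on a single chain there is exactly one intermediate, the structure with only the first extra residue attached. Sibling residues with the same name would need labels that include the linkage so that their paths remain distinct.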
Lectins are known to recognize specific glycan structures, and these binding events trigger signalling processes. However, the specific structures being recognized are often unknown. For example, siglecs are suspected to recognize patterns not only at the leaves of glycans but also deeper in the chain
In order to retrieve the learned patterns directly from the model, a profile version of these models, called ProfilePSTMM, was subsequently developed to add insertion and deletion states in addition to the original match state. This model was tested on binding affinity data of galectins, which are known to recognize galactose residues but had not been analyzed for longer patterns. In this experiment, a dimer structure was found to appear frequently in the data, which was corroborated by experimental results
This tutorial briefly described several different bioinformatic methods for glycome research. With the further development of data resources and standards for data exchange, we hope that even better and newer methods to help understand the functioning of the glycome can be developed.
The author would like to dedicate this tutorial to Dr. Claus-Wilhelm von der Lieth of the German Cancer Research Center, with whom this tutorial was first presented at ISMB 2007 in Vienna, Austria. Dr. von der Lieth passed away in November 2007, leaving behind many great contributions to the field of glycomics.