PKS, LJJ, and PB conceived and designed the experiments. PKS performed the experiments. PKS and SB analyzed the data. PKS contributed reagents/materials/analysis tools. PKS, LJJ, and PB wrote the paper.
The authors have declared that no competing interests exist.
Transcript diversity generated by alternative splicing and associated mechanisms contributes heavily to the functional complexity of biological systems. The numerous examples of the mechanisms and functional implications of these events are scattered throughout the scientific literature. Thus, it is crucial to have a tool that can automatically extract the relevant facts and collect them in a knowledge base that can aid the interpretation of data from high-throughput methods. We have developed and applied a composite text-mining method for extracting information on transcript diversity from the entire MEDLINE database in order to create a database of genes with alternative transcripts. It contains information on tissue specificity, number of isoforms, causative mechanisms, functional implications, and experimental methods used for detection. We have mined this resource to identify 959 instances of tissue-specific splicing. Our results in combination with those from EST-based methods suggest that alternative splicing is the preferred mechanism for generating transcript diversity in the nervous system. We provide new annotations for 1,860 genes with the potential for generating transcript diversity. We assign the MeSH term “alternative splicing” to 1,536 additional abstracts in the MEDLINE database and suggest new MeSH terms for other events. We have successfully extracted information about transcript diversity and semiautomatically generated a database, LSAT, that can provide a quantitative understanding of the mechanisms behind tissue-specific gene expression. LSAT (Literature Support for Alternative Transcripts) is publicly available at
Given the functional complexity of higher eukaryotes, the relatively small number of genes in the human and other mammalian genomes came as a surprise to the scientific community. Later it was discovered that the majority of genes are subject to alternative splicing (“cutting and pasting”) or associated mechanisms that ultimately increase the diversity of transcripts that code for proteins. Studies exploring transcript diversity are currently dominated by high-throughput experiments and computational methods; however, the quality of such data should be assessed against a reliable reference set based on single-gene studies. Unfortunately, the latter type of information is scattered throughout the scientific literature. The authors have thus developed a computational approach for extracting information on alternative transcripts from MEDLINE abstracts and used it to create a database, LSAT. LSAT (Literature Support for Alternative Transcripts) provides information for more than 4,000 genes from about 14,000 abstracts. This database can provide a quantitative understanding of the mechanisms behind tissue-specific gene expression based on single-gene studies, which we show agrees well with EST-based studies (these studies involve tissue-specific splicing detected by the analysis of libraries of expressed sequence tags [ESTs]). These results indicate that mechanisms like alternative splicing, alternative promoters, and alternative polyadenylation work in concert to generate and regulate transcript diversity. More generally, information extraction of complex biological processes seems feasible and can also complement large-scale data generation in other areas to assign functions to genes.
Although many model organisms have now been completely sequenced, we are still very far from understanding cellular function from genome sequence. One complicating factor is the expression of multiple alternative mRNA transcripts from a single gene using different mechanisms. Alternative promoters that are active in different tissues or at different developmental stages often regulate the expression of different mRNA isoforms, either directly through different transcription start sites or indirectly by promoter-directed exon inclusion in concert with alternative splicing (AS) [
Generation of multiple alternative transcripts is important for the complexity and evolution of eukaryotic organisms [
At present, high-throughput experiments and computational analyses dominate the mapping of the alternative transcript universe [
Manual curation of experimentally determined biological events (physical interactions, AS, disease phenotypes, etc.) to generate trustworthy knowledge bases is slow compared to the rapid increase in the body of knowledge represented in the literature. Natural language processing tools thus play an increasingly important role in transferring information from free-form biomedical text to structured databases (see reviews [
IR can be performed at the level of full articles, pertinent paragraphs, or sentences. As current IE methods operate at the sentence level, it may be appropriate to perform IR at the same level. Support vector machines have become the method of choice for IR tasks because of their ability to learn patterns and generalize well while handling large sets of input features, a common attribute of the text data [
Here we present the benchmark and the results of a new extraction procedure that combines an SVM classifier with rule-based extraction of semantic patterns. The extracted knowledge about TD was stored in a database and subsequently used to quantify the amount of TD in different tissues. We discuss applications of our work for the assignment of MeSH terms (from the National Library of Medicine's Medical Subject Headings thesaurus), providing functional annotations to genes and to the transcript variants generated by computational methods.
To extract information about TD and associated spatiotemporal information scattered throughout MEDLINE, we devised a two-step procedure (
A database of physiologically occurring AS events can be generated in two steps. Each step may involve machine learning or rule-based methods. The first step involves the identification of sentences from scientific text. These sentences can be parsed in a second step to extract frequently occurring semantic patterns.
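The two-step procedure can be illustrated with a minimal Python sketch. It is not the method used in this study: a simple cue-phrase check stands in for the trained SVM classifier in the first step, and a single regular expression stands in for the rule-based semantic patterns in the second step. The names `TD_CUES`, `ISOFORM_PATTERN`, `is_td_sentence`, `extract_pattern`, and `mine` are hypothetical and were chosen only for this illustration.

```python
import re

# Step 1 stand-in (assumption): cue phrases replace the trained SVM classifier.
TD_CUES = (
    "alternative splicing",
    "alternatively spliced",
    "alternative promoter",
    "splice variant",
    "isoform",
)

# Step 2 stand-in (assumption): one hand-written pattern of the form
# "<number> isoforms|transcripts|variants ... in [the] <tissue>".
ISOFORM_PATTERN = re.compile(
    r"\b(?P<count>two|three|four|\d+)\s+(?:isoforms|transcripts|variants)\b"
    r".*?\bin\s+(?:the\s+)?(?P<tissue>\w+)",
    re.IGNORECASE,
)


def is_td_sentence(sentence):
    """Step 1: decide whether a sentence is likely to describe transcript diversity."""
    lowered = sentence.lower()
    return any(cue in lowered for cue in TD_CUES)


def extract_pattern(sentence):
    """Step 2: pull the isoform count and tissue out of a retrieved sentence."""
    match = ISOFORM_PATTERN.search(sentence)
    if match is None:
        return None
    return {
        "count": match.group("count").lower(),
        "tissue": match.group("tissue").lower(),
    }


def mine(sentences):
    """Run both steps: retrieve candidate sentences, then parse each one."""
    retrieved = [s for s in sentences if is_td_sentence(s)]
    return [(s, extract_pattern(s)) for s in retrieved]


if __name__ == "__main__":
    examples = [
        "Alternative splicing of this gene generates three isoforms in the brain.",
        "The mutation was found in two patients.",
    ]
    for sentence, facts in mine(examples):
        print(sentence, facts)
```

Keeping retrieval and extraction separate means that the extraction rules only ever see sentences that already passed the first filter, so the rules can be written with fewer safeguards against unrelated text.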
Finally, we mapped each abstract with information about alternative transcripts (retrieved by the SVM classifier) to entries in Swiss-Prot [
We identified eight different semantic categories describing biologically relevant data in the sentences describing TD, among which are event mechanism, species, tissue specificity, and experimental methods (
Extraction of Semantic Patterns
Our SVM classifier retrieved 31,123 putative TD-containing sentences from the MEDLINE database (12,948,515 abstracts). After false positives were removed by manual curation, 20,549 TD-containing sentences in 13,892 abstracts were left, corresponding to a precision of 66%. Details on the training set and SVM training procedure are described in
We determined the recall of the classifier using manually curated AS annotations from MEDLINE and Swiss-Prot for annotations on human, mouse, rat, and
Recall of the SVM Classifier
From the sentences retrieved by the SVM classifier, we extracted instances of eight semantic categories (see
Annotators at the National Library of Medicine have manually assigned the MeSH term “alternative splicing” to 8,133 abstracts. During the IE step, we identified 1,536 additional abstracts that mention AS but lack the MeSH term “alternative splicing,” corresponding to a 19% increase in annotation. We also identified DP and AP in 874 and 219 abstracts, respectively, for which we propose the new MeSH terms “alternative promoters” and “alternative polyadenylation” (
We also quantified the number of Ensembl genes for which we can propose new annotations for AS (see
The majority of vertebrate multi-exon genes undergo AS [
The extent to which various mechanisms are utilized for increasing TD may vary across different anatomical systems. To study this, we mapped all vertebrate tissue information to anatomical systems using the MeSH anatomy terms and counted the number of nonredundant events extracted for each mechanism in each system (
Nonredundant instances of AS, DP, and AP are plotted against anatomical systems in which expression was found. The color of each square in the top panel signifies the ratio of the number of events detected for the system to the highest number of events within the row. Total number of nonredundant instances for each mechanism is on the left. The bottom panel shows the negative logarithm of
The information about alternative promoter usage linked with specific gene names and tissues extracted in this study is, to our knowledge, the largest such collection available. We expect that it will provide a reliable dataset for developing computational methods for predicting tissue-specific promoter usage.
AS has been shown to play an important role in creating functional specialization of tissues and development stages [
To study the extent of tissue-specific AS, we mapped tissues and organs to respective systems as described in the previous section and plotted the results (
The figure shows the body system distribution of differential/specific splicing. The instances were obtained from literature mining (left panel) and analysis of EST data ([
The knowledge extracted from the literature confirms EST-based studies [
Sometimes experimental biologists speculate about the mechanism responsible for the multiple transcripts observed with a limited number of experiments but the corresponding transcripts are not deposited in GenBank. For example, work by Pisarra et al. [
This figure shows a database entry that derives very little functional annotation from sequence databases. Text extraction rules were successful in identifying gene name, tissue, and event mechanism for the
On the other hand, various methods, including those based on aligning EST and other sequence data to genomic regions, are currently in use for detecting AS on a large scale. The function of the isoforms thus generated is largely unknown [
Using the heaviest bundling algorithm [
We successfully extracted information about the genes that express multiple transcripts and associated spatiotemporal information using state-of-the-art methods in natural language processing and utilized it for function annotations. The extracted information far exceeds current manual curation efforts and generates reliable results. Our results indicate that mechanisms like AS, DP, and AP work in concert for the generation and regulation of TD. They also suggest that the nervous system preferentially relies on AS over other mechanisms to express the largest set of tissue-specific transcripts. In contrast, genitalia and the digestive system more frequently make use of alternative promoter regions. The knowledge stored in the database about synergy and preference for TD-generating mechanisms across tissues will be integrated with high-throughput data in the future. More generally, IE of complex biological processes seems feasible and can also complement large-scale data generation in other areas to assign function.
A set of 4,240 sentences describing physiological TD and 13,520 negative sentences were selected as a training corpus from article titles and abstracts. Sentences describing mutations, clinical studies involving patients, nucleotide transversions, and splicing mechanisms were considered negative sentences. Sentences describing natural gene expression, gene paralogs, and aberrant transcripts showed word usage similar to those describing TD, making the classification task more challenging. Description of the learning corpus can be found in
The text in all the abstracts was split into sentences using the Oak system (S. Sekine, unpublished data;
The procedure of inductive learning (see
The classifier was trained to extract only natural TD from the written text, as opposed to aberrant transcripts caused by genetic changes. For consistency, we removed 2,767 of the 8,133 MEDLINE entries with the MeSH term “alternative splicing”: those that also had the MeSH term “mutation,” had no abstract text, or had been erroneously assigned the MeSH term “alternative splicing.”
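The three exclusion criteria can be sketched as a simple filter in Python. Removing 2,767 of the 8,133 entries leaves 5,366 entries. The record layout used here (`pmid`, `mesh_terms`, `abstract`, and a manually set `misassigned` flag) is hypothetical and serves only as an illustration; it is not the format used in this study.

```python
def keep_entry(entry):
    """Return True if a MEDLINE entry passes all three exclusion criteria."""
    if "Mutation" in entry["mesh_terms"]:
        return False  # entry also carries the MeSH term "mutation"
    if not entry["abstract"].strip():
        return False  # entry has no abstract text
    if entry.get("misassigned", False):
        return False  # curator flagged the "alternative splicing" term as erroneous
    return True


def filter_entries(entries):
    """Keep only the entries that pass every criterion."""
    return [entry for entry in entries if keep_entry(entry)]


if __name__ == "__main__":
    # Made-up records; the identifiers are placeholders, not real PMIDs.
    entries = [
        {"pmid": "A", "mesh_terms": ["Alternative Splicing"], "abstract": "Two isoforms ..."},
        {"pmid": "B", "mesh_terms": ["Alternative Splicing", "Mutation"], "abstract": "A mutation ..."},
        {"pmid": "C", "mesh_terms": ["Alternative Splicing"], "abstract": ""},
        {"pmid": "D", "mesh_terms": ["Alternative Splicing"], "abstract": "Text", "misassigned": True},
    ]
    print([entry["pmid"] for entry in filter_entries(entries)])
    print(8133 - 2767)  # entries remaining after removal
```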
Precision and recall are used in IR to measure the performance of methods and are defined as follows:
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
where TP, TN, FP, and FN denote true-positive, true-negative, false-positive, and false-negative elements according to a classification criterion.
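As a worked check, the standard definitions (precision is TP divided by TP plus FP; recall is TP divided by TP plus FN) can be applied to the sentence counts reported for the SVM classifier: 31,123 retrieved sentences, of which 20,549 remained after manual curation. The short Python sketch below reproduces the reported precision of 66%. The function names are illustrative, and no recall value is computed because the required false-negative count is not stated in this section.

```python
def precision(tp, fp):
    """Fraction of retrieved items that are correct: TP / (TP + FP)."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Fraction of relevant items that were retrieved: TP / (TP + FN)."""
    return tp / (tp + fn)


# Counts taken from the text: sentences retrieved by the classifier and
# sentences confirmed as TD-containing after manual curation.
RETRIEVED = 31123
TRUE_POSITIVES = 20549
FALSE_POSITIVES = RETRIEVED - TRUE_POSITIVES

if __name__ == "__main__":
    p = precision(TRUE_POSITIVES, FALSE_POSITIVES)
    print(round(p, 2))  # rounds to 0.66
```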
An event or a scenario is described in a sentence via the combination of a predicate (normally a verb) and its arguments [
We constructed semantic patterns similar to those described in the PASBio database of predicate–argument structure [
The success in assigning gene, species, and event mechanisms to abstracts is as follows (
To quantify the gain in gene annotation, first we mapped sequence information to the MEDLINE identifiers from the SVM classification using literature entries in Swiss-Prot, RefSeq, and GenBank. Second, we mapped sequence-containing entries for human, mouse, and rat genes present in our results and in those databases to Ensembl gene identifiers using the EnsMart system. Then we compared our annotations to those of Swiss-Prot and RefSeq to identify genes that were missed during the manual curation of AS. Special care was taken to avoid annotations that may have arisen because of a single literature entry mapping to multiple database entries. Hence, these annotations were highly significant.
The significance of the association of each TD-generating mechanism with each organ system was evaluated using the hypergeometric distribution. We corrected these
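A hypergeometric test of this kind can be sketched in Python with the standard library alone. The mapping of the parameters to the data is an assumption made for illustration: `population` is the total number of nonredundant events, `successes` is the number of events for one mechanism, `draws` is the number of events in one anatomical system, and `observed` is the number of events of that mechanism in that system. The function name `hypergeom_upper_tail` is hypothetical, and no multiple-testing correction is shown.

```python
from math import comb


def hypergeom_upper_tail(observed, population, successes, draws):
    """P(X >= observed) when `draws` items are taken without replacement
    from `population` items, `successes` of which belong to the class of interest."""
    upper = min(successes, draws)
    total = comb(population, draws)
    favourable = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(observed, upper + 1)
    )
    return favourable / total


if __name__ == "__main__":
    # Made-up counts, for illustration only: 10 events in total, 5 of them for
    # the mechanism, 5 in the system, and all 5 system events use the mechanism.
    print(hypergeom_upper_tail(5, 10, 5, 5))
```

A small upper-tail probability indicates that the mechanism occurs in the system more often than expected if events were distributed across systems independently of mechanism.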
The authors would like to thank Yi Xing and Dr. Christopher Lee for providing the code for SPLICE-POA and the isoform generation algorithm.
AP: alternative polyadenylation
AS: alternative splicing
DP: differential promoter usage
IE: information extraction
IR: information retrieval
SVM: support vector machine
TD: transcript diversity