Conceived and designed the experiments: JW WSN CL. Performed the experiments: IM. Analyzed the data: IM JW. Wrote the paper: IM JW WSN CL.
The authors have declared that no competing interests exist.
Virtually every molecular biologist has searched a protein or DNA sequence database to find sequences that are evolutionarily related to a given query. Pairwise sequence comparison methods—i.e., measures of similarity between query and target sequences—provide the engine for sequence database search and have been the subject of 30 years of computational research. For the difficult problem of detecting remote evolutionary relationships between protein sequences, the most successful pairwise comparison methods involve building
Searching a protein or DNA sequence database to find sequences that are evolutionarily related to a query is one of the foundational problems in computational biology. These database searches rely on pairwise comparisons of sequence similarity between the query and targets, but despite years of method refinements, pairwise comparisons still often fail to detect more distantly related targets. In this study, we adapt recent work from natural language processing to exploit the global structure of the data space in this detection problem. In particular, we borrow the idea of a semantic embedding, where by training on a large text data set, one learns an embedding of words into a low-dimensional semantic space such that words embedded close to each other are likely to be semantically related. We present the ProtEmbed algorithm, which learns an embedding of protein sequences into a semantic space where evolutionarily-related proteins are embedded in close proximity. The flexible training algorithm allows additional pieces of evidence, such as 3D structural information, to be incorporated in the learning process and enables ProtEmbed to achieve state-of-the-art performance for the task of detecting targets that have remote evolutionary relationships to the query.
Using sequence similarity between proteins to detect evolutionary relationships—protein homology detection—is one of the most fundamental and longest studied problems in computational biology. A protein's function is strongly correlated with its 3D structure, and due to evolutionary pressure, protein structures diverge much more slowly than primary sequences. Because protein sequence data will always be far more abundant than high-quality 3D structural data, the computational challenge is to infer evolutionarily conserved structure and function from subtle sequence similarities. When the evolutionary distance is large and the sequence signal faint—so-called
Stated in purely computational terms, remote homology detection involves searching a protein database for sequences that are evolutionarily related (even remotely) to a given query sequence. Most work in this area has focused on developing more sensitive pairwise comparisons between the query and target sequences, including sequence-sequence local alignments (BLAST
In the current study, we are motivated by large-scale learning of language models in recent work in natural language processing (NLP)
Here, we present an algorithm called P
We show that P
The main idea of our approach is to learn a mapping of protein domain sequences into a vector space that captures their “semantic similarity”, i.e. closeness in the semantic space should reflect homology relationships between sequences.
In order to learn an embedding of protein sequences into a semantic space, we need to define (i) a feature representation for proteins, (ii) a training signal that determines whether a given pair of training sequences are similar and should be pushed together by the algorithm, or dissimilar and should be pulled apart, and (iii) an algorithm that learns an appropriate embedding.
Let us denote the set of proteins in the database as
Next, we again use a surrogate protein alignment algorithm, this time as a
Given the feature vectors and the training tuples, our aim is to learn a feature embedding that performs well for protein ranking and classification tasks. We will learn an embedding function
After training, given a query protein
The training objective employs the margin ranking loss
This optimization problem is solved using stochastic gradient descent
One can exploit the sparsity of
After training, we precompute the embedding
In general, recognizing remote homology relationships among protein structures is easier than recognizing remote homologies based only on protein sequences. Although structural information is available for only a subset of the proteins in the database, we would like to ensure that our embedding captures this structural information in addition to the sequence-based information provided by PSI-BLAST. We consider two sources of structural information: (1) category labels for a given protein and (2) similarity scores between pairs of proteins. For the the category labels, we use the Structural Classification of Proteins (SCOP)
We incorporate this auxiliary information using the framework of multitask learning: in addition to the main embedding task, we simultaneously learn models to solve additional tasks using appropriate subsets of the training data. The tasks share internal representations learned by the algorithm, in this case, the embedding function
For labeled data—namely, proteins with structural category labels and 3D structures from which to compute pairwise similarity scores—we used proteins from the SCOP v1.59 protein database. We used ASTRAL
For unlabeled data, i.e. protein domain sequences without category labels or structural information, we used sequences from the ADDA domain database version 4
We ran PSI-BLAST version 2.2.8 on the combined SCOP+ADDA database using the default parameters, allowing a maximum of 6 iterations. For a second and more powerful pairwise sequence similarity method based on HMM-HMM comparisons, we also ran HHSearch version 1.5.0, using default parameters. HHPred/HHSearch is considered a leading method for remote homology detection
We first trained embeddings on SCOP+ADDA (with SCOP test sequences held out) using PSI-BLAST or HHSearch as the pairwise sequence comparison method to serve as “teacher” for producing
As an initial proof-of-concept test of the P
Visualization based on training P
To investigate the ability of P
Before training our embedding, we ran a series of cross-validation experiments within the training set to select
Each node corresponds to a homology detection algorithm, and the value associated with each node is the mean ROC
In evaluations of remote homology detection algorithms, some researchers prefer to ignore members of the same family as the query, since these family members are presumably easy to identify
Each panel shows a collection of ROC curves, produced by sorting into a single ranked list the results from all 97 test queries. Each series corresponds to a different algorithm. The panel on top (A) includes algorithms based on PSI-BLAST; the panel on the bottom (B) includes algorithms based on HHSearch.
Next, we evaluated how well P
The experiment reported in
To be useful, a homology detection algorithm must provide scores with well defined semantics. For example, PSI-BLAST reports an expectation value, or E-value, that corresponds to the number of scores as good or better than the observed score that are expected to occur in a random database of the given size
We cannot use these
For many users of alignment tools such as PSI-BLAST, the multiple alignment produced with respect to a given query is as useful as the rankings and accompanying E-values, because the multiple alignment provides an explanation of the ranking. However, a method like P
To illustrate how effective such a visualization can be, we systematically generated 2D maps of the neighborhood for all 97 test set domains, using a
Here, we focus on a single example.
(A) To visualize the effect of the embedding, we first show a query, the C-terminal domain of Staphylococcal enterotoxin B (PDB ID 3seb), mapped into a metric space according to the PSI-BLAST based feature representation used to initialize the embedding algorithm. (B) The query is now mapped into the final embedding space. In both panels, the query is labeled and indicated with a black circle. All members of the same SCOP family (superantigen toxins, C-terminal domain; SCOP 1.75 ID d.15.6.1), indicated with green triangles, are now in a tight cluster around the query and disambiguated from a distinct but functionally related SCOP superfamily (bacterial enterotoxins; SCOP 1.75 ID b.40.2), indicated with blue squares. Unrelated superfamilies are well separated from the query in the embedding space; members of unrelated SCOP superfamilies are indicated by various colored shapes, as labeled in the right panel.
We have shown that P
While alignment-based pairwise sequence similarity scores are used as features for calculating the embedding, P
The P
Protein sequence analysis is one of the oldest subfields of computational biology, with mature and specialized tools designed to describe the
Supplementary methods and results.
(1.69 MB PDF)