CV, SAT, and XdlC conceived and designed the experiments and wrote the paper. DT, CV, and MO performed the experiments. DT, CV, MO, SAT, and XdlC analyzed the data.
The authors have declared that no competing interests exist.
Alternative splicing (AS) and gene duplication (GD) are both processes that diversify the protein repertoire. Recent examples have shown that sequence changes introduced by AS may be comparable to those introduced by GD. In addition, the two processes are inversely correlated at the genomic scale: large gene families are depleted in splice variants and vice versa. Altogether, these data strongly suggest that the effects of the two phenomena are interchangeable. Here, we tested the extent to which this applies with respect to various protein characteristics. The amounts of AS and GD per gene are anticorrelated even when accounting for different gene functions or degrees of sequence divergence. In contrast, the two processes appear to be independent in their influence on variation in mRNA expression. Further, we conducted a detailed comparison of the effect of sequence changes in both alternative splice variants and gene duplicates on protein structure, in particular the size, location, and types of sequence substitutions and insertions/deletions. We find that, in general, alternative splicing affects protein sequence and structure in a more drastic way than gene duplication and subsequent divergence. Our results reveal an interesting paradox between the anticorrelation of AS and GD at the genomic level, and their impact at the protein level, which shows little or no equivalence in terms of effects on protein sequence, structure, and function. We discuss possible explanations that relate to the order of appearance of AS and GD in a gene family, and to the selection pressure imposed by the environment.
Alternative splicing (AS) and gene duplication (GD) followed by sequence divergence constitute two fundamental biological processes contributing to proteome variability. The former reflects the ability of many genes to express different products, while the latter results in several copies of the same gene that are similar but not identical. In spite of these obvious differences, recent computational studies as well as anecdotal experimental evidence suggested that AS and GD produce functionally interchangeable protein variants. We provide a detailed study of the differences between alternative splicing and gene duplication and discuss potential interchangeability with respect to variation in expression, protein structure, and function. In general, the contribution of these two processes to proteome variability is substantially different, and we propose some explanations that may resolve this apparent contradiction and contribute to our understanding of the evolution of complex eukaryotic proteomes.
Alternative splicing (AS) and gene duplication (GD) are two main contributors to the diversity of the protein repertoire with enormous impact on protein sequence, structure, and function [
(A) The alignment shows an example of molecular equivalence between the effects of AS and GD. The human U2AF35 gene has two known splice variants, Hs_U2AF35a and Hs_U2AF35b, that differ along the region marked with a red box. The fugu orthologue Fr_U2AF35-a does not have known splice variants, but instead has a paralogue, Fr_U2AF35-b [
(B) We compared the characteristics of two types of sequence changes, indels and substitutions, between AS (both shown in dark blue) and GD (shown in dark and light blue). On top, we illustrate an indel event (the deleted stretch is highlighted in red, and two dotted lines denote its location); at the bottom, we illustrate substitution events (red lines represent residue matches between sequences, linked by dotted lines; the continuous lines between alternative splice isoforms represent the boundaries of the interchanged stretches).
(C) We used this protocol in all sequence comparisons between AS and GD. Changes between alternative splice isoforms are obtained after comparing the SwissProt [
Further, the changes introduced to a sequence are constrained by the need to preserve a stable and functional three-dimensional (3-D) fold [
Here, we first tested the anticorrelation between AS and GD with respect to sequence divergence, function, and gene expression. Second, we studied the interchangeability hypothesis at the protein structure level and asked to what extent AS and GD introduce changes to the sequence that are equivalent in their nature and effect on structure and function. To this end, we conducted a large-scale comparison of the effects of AS and GD on human and mouse proteins (
In accordance with recent findings [
(A) The diagram shows the uneven distribution of AS amongst GD families of different sizes for the human genome. Information on AS has been taken from the AltSplice database [
(B) The cartoons illustrate that alternative splice isoforms and gene duplicates may be expressed in the same number and/or types of tissues. Here, we compared the extent of coexpression amongst alternative splice variants (AS coexpression) and gene duplicates (GD coexpression).
(C) Coexpression levels amongst gene duplicates (GD coexpression) are estimated as the average pairwise PC between expression patterns of all genes within a GD family. GD coexpression amongst duplicates of >40% seq.id. (white diamonds) is more similar to the overall AS coexpression (red line indicating the value displayed in
As this dataset [
(D) Coexpression levels amongst alternative splice variants (AS coexpression) are estimated as average pairwise PC between the expression patterns of all exon junctions of a gene. High PC indicates little variation (high coexpression), and vice versa. The figure shows average AS coexpression across all genes in the dataset [
The anticorrelation between AS and GD could arise from the preferential duplication or introduction of AS in genes of particular function. In general, genes with AS have similar distributions across functional categories as genes with GD (see
When removing from our dataset the 855 and 293 proteins predicted to be G protein–coupled receptor-like or ribosomal proteins [
We further characterized the relationship between AS and GD by comparing their patterns of expression among different tissues, which reflect corresponding regulatory processes. A previous study reported that AS and general transcription regulation act independently on different groups of genes with tissue-specific expression [
The level of coexpression was measured using the Pearson correlation coefficient (PC) between the expression patterns of two isoforms among a set of tissues, in the case of AS, or of two duplicates, in the case of GD. When more than two isoforms (or duplicates) were available, we averaged the PCs resulting from all the possible comparisons between them (see Material and Methods and
We directly compared the coexpression in alternative splice variants (AS coexpression) and in gene duplicates (GD coexpression) using data from the same microarray experiment [
We also explored whether we can observe, at the expression level, an anticorrelation in analogy to that found at the sequence level [
We note that the coexpression analysis is at present still limited by the amount of data available. Future availability of large-scale datasets suited for expression comparison of AS and GD will help refine our results.
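The coexpression measure used in this section, the average pairwise PC between expression patterns, can be illustrated with a minimal sketch. The function names and the expression values below are hypothetical and serve only to show the calculation; the sketch assumes each isoform (or duplicate) is represented by one vector of expression values across the same ordered set of tissues.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient (PC) between two equal-length vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("vectors must have equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("PC is undefined for a constant vector")
    return cov / sqrt(var_x * var_y)


def average_pairwise_pc(profiles):
    """Average PC over all pairs of expression profiles.

    Each profile is one row: an isoform (AS) or a duplicate (GD),
    measured across the same tissues.
    """
    pairs = list(combinations(profiles, 2))
    if not pairs:
        raise ValueError("need at least two profiles")
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)


# Hypothetical expression values across four tissues (illustrative only).
family = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [4.0, 3.0, 2.0, 1.0],
]
coexpression = average_pairwise_pc(family)
```

In this toy family, the first two profiles are perfectly correlated and the third is anticorrelated with both, so the average is pulled well below 1, which under the definition above would be read as low coexpression.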
To test whether AS and GD are interchangeable at the structural and thus functional level, we compare sequence changes between duplicate proteins to those between alternative splice isoforms (
We focus on gene families defined using two seq.id. thresholds: 80% and 40%. The former was chosen because the anticorrelation at the genomic level is stronger (
First we examine substitutions, i.e., the extent and nature of amino acid changes and the length of the substituted region. In general, substitutions amongst GD range from a small number of amino acid replacements in recent homologues to a large number of replacements in proteins as divergent as haemoglobin, for instance [
Global seq.id. is the seq.id. along the whole alignment of two sequences, and it can be used to assess the overall degree of function conservation between proteins [
In
AS data were obtained by querying SwissProt [
(A) Global seq.id. The seq.id. in GD families depends on the cutoff used for clustering, e.g., GD40 (dark red) or GD80 (light violet). The global seq.id. between alternative splice isoforms (light green) is very high (>90% seq.id.), reflecting the underlying nature of AS changes.
(B) Local seq.id. in alternative splice isoforms (dark green) is measured between substituted stretches, usually arising from mutually exclusive exons. The local seq.id. between gene duplicates is obtained using a moving window (GD80: light violet, GD40: dark red) and reporting the seq.id. observed in all possible window positions.
(C) Local seq.id. in AS and GD at equivalent positions. The graph compares local seq.id. found in alternative splice variants of a gene with the local seq.id. of a duplicate of the same gene. The AS local seq.id. was computed between substituted sequence stretches. For GD, we mapped the sequence positions of the AS event to the aligned GD, and computed the seq.id. between the GD, considering only the aligned positions within that region. The comparison is shown for AS and GD40 (red) and GD80 (blue), respectively.
The diagonal separates the plot into two halves: the upper half corresponds to the region for which GD seq.id. is higher than that for AS; the lower half corresponds to the opposite. For both types of gene families (GD40 and GD80), most substitutions show higher seq.id. amongst gene duplicates than amongst alternative splice variants, and this bias is significant (GD80: 111 of 142, χ² test
(D) Local seq.id. in AS−/GD+ and AS+/GD− substitutions. To compute local seq.id. in AS−/GD+ families, we first align two GD, then slide a 100-aa window over the sequence of one protein, and compute the seq.id. at all sequence positions of the window. The results of all the possible comparisons are plotted for GD40 (dark red) and GD80 (light violet) families. For genes with AS but no duplicates (AS+/GD−) (dark green), local seq.id. was computed between the two substituted stretches resulting from AS events. As for AS+/GD+ families (
In contrast, while the local seq.id. in alternative splice variants is usually low, it is clearly higher for GD, in particular for GD80 families (
Global and local seq.id. provide a first view on interchangeability between AS and GD. However, to understand the effects of substitutions, it is also important to know the location of the changes [
In general, we find that the distribution of amino acid replacements along the protein sequence is different between AS and GD substitutions. In gene duplicates, changes are spread all over the sequence, while in alternative splice variants the changes are concentrated at very precise locations, in accordance with the underlying molecular mechanisms [
The comparisons described above have been derived from gene families with alternative splice variants and gene duplicates (AS+/GD+,
To complete our analysis of substitutions, we compare the nature of the amino acid replacements in AS and GD, focusing on nonconservative replacements. These involve amino acids of very different physico–chemical properties, and are thus more likely to alter protein structure, stability, and function [
Further, we use the maximal distance between replacements as a measure of the distribution of nonconservative changes along the sequence. We find clear differences between AS and GD (
The maximal mismatch distance between nonconservative substitutions is much smaller in AS than in GD. The maximal mismatch distance is the number of residues between the two most distant, nonconservative substitutions, normalized by sequence length. Nonconservative mismatches have a negative value in the Blosum62 matrix [
The example of mitogen-activated protein kinase 9 (MAPK9). The example of human MAPK9 illustrates how differences between AS and GD in the distribution of sequence changes result in different distributions of physico–chemical properties across the 3-D structure. The original structure of MAPK9 was homology-modelled after MAPK10 and is shown in blue; the residue changes are indicated following a colour scale related to the associated difference in hydrophobicity (we use the absolute value of the difference in order to avoid too many colours; the colour scale goes from blue to red, where the latter corresponds to the largest change). For comparison purposes, the location of the AS changes in the three structures is indicated by a yellow box. As a hydrophobicity measure, we used the free energy of water to octanol transfer [
(A) Alternative splice isoforms of MAPK9.
(B) Gene duplicates of high seq.id. (MAPK10; isoform alpha2, 84% seq.id. to MAPK9).
(C) Gene duplicates of medium seq.id. (MAPK13; 46% seq.id. to MAPK9).
We observe, in accordance with the results from the sequence analysis, that while AS changes are located at a very specific location, GD changes are spread all over the protein surface. As expected, the number of changes between MAPK9 and MAPK13 is the largest. Neither of MAPK9's paralogues (MAPK10 and MAPK13) shows a set of residue changes identical to that in the alternative splice variant.
In summary, AS replacements are generally more concentrated and physico–chemically drastic than those observed for GD. The nature of these differences argues against interchangeability between AS and GD, on the basis of existing studies [
Second, we studied indels, which modify protein structure in a different way than substitutions. A first and intuitive measure of their impact is provided by indel size: small indels are more likely to have a small effect on structure than larger ones. We find that indel sizes are substantially different for AS and GD (
All analyses of indels have been made for gene families with both AS and GD (i.e., AS+/GD+).
(A) AS indels are longer than GD indels. Indels for GD were obtained from the alignments of GD families at 40% (dark red) and 80% (light violet) seq.id. Information on AS indels (green) was obtained from the SwissProt record of the corresponding protein. Indel size distributions for both GD40 and GD80 are very similar, with most of the indels being shorter than five residues. In contrast, many AS indels are longer than 100 residues.
(B,C) Size distribution for external and internal indels in AS and GD. External indels (B) lie at the N- or C-terminal ends of the protein; internal indels (C) lie in the middle. AS and GD40 indel sizes are different depending on the position of the indels in the sequence. While AS indels are generally larger than GD indels (also see
Given that, in general, sequence changes are severely constrained by structure and stability requirements [
We separated the AS and GD indels according to their location in sequence (N-/C-terminal ends or internal) and plotted the corresponding size distributions (
We also examined the overlap between the location of indels in splice variants and in duplicates of the same gene (
The overlap between AS and GD indels is very small. For the frequency distribution of the overlap between AS and GD indels, AS indels were taken as reference. GD data at 80% seq.id. are shown in light violet, while GD data at 40% seq.id. are shown in dark and light blue for both all indels and only short indels (≤30aa), respectively. Given the small overlap, AS and GD indels are likely to affect different locations in protein structure.
There is also a small fraction of instances (~15%) with an obvious overlap between AS and GD indels (
In summary, the results obtained in the study of indels lead to the same conclusions as for substitutions: in general, the impact of AS and GD on protein function is not interchangeable, irrespective of whether we consider GD at the 80% or 40% seq.id. levels.
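The GD indel size distributions discussed in this section rely on the redundancy correction described in Materials and Methods, in which each indel contributes to the histogram with a weight of one over the number of sequences in its family. The following is a minimal sketch of that binning step only. It assumes that the pairwise alignments have already been computed and are supplied as gapped strings; the function names and the toy alignments are hypothetical.

```python
import re
from collections import Counter


def gap_lengths(aligned_a, aligned_b, gap="-"):
    """Lengths of all gap runs (indels) in one pairwise alignment."""
    pattern = re.compile(re.escape(gap) + "+")
    lengths = []
    for seq in (aligned_a, aligned_b):
        lengths.extend(len(m.group()) for m in pattern.finditer(seq))
    return lengths


def indel_size_histogram(families):
    """Redundancy-corrected indel size histogram.

    families: iterable of (n_sequences, alignments), where alignments is a
    list of (aligned_a, aligned_b) pairs for that family. Each indel adds
    1 / n_sequences to the bin of its length.
    """
    histogram = Counter()
    for n_sequences, alignments in families:
        weight = 1.0 / n_sequences
        for aligned_a, aligned_b in alignments:
            for length in gap_lengths(aligned_a, aligned_b):
                histogram[length] += weight
    return histogram


# Toy input (illustrative only): one family of two sequences and one of three.
toy_families = [
    (2, [("ACD--FGH", "ACDEEFG-")]),
    (3, [("AC-DE", "ACQDE"), ("ACQDE", "ACQDE"), ("AC-DE", "ACQDE")]),
]
toy_histogram = indel_size_histogram(toy_families)
```

Dividing by family size keeps large families, which generate many pairwise alignments, from dominating the distribution. This sketch does not separate external from internal indels.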
AS and GD are anticorrelated at the genomic level (
Summary of the Effects of Alternative Splicing and Gene Duplication on Sequence and Structure
To explain the apparent paradox between the relationship of AS and GD at the genome and at the protein level, we speculate on alternative explanations for the depletion of AS observed in large GD families (
Finally, if a gene with AS has duplicated, subsequent loss of an isoform in one of the copies may be tolerated due to the existence of an identical version of this isoform in the other copy of the gene. This explanation is supported by recent findings on the evolution of AS upon GD [
A combination of these effects would, in general, result in a smaller proportion of genes with AS in gene families with more than one duplicate, in particular for recent duplicates, suggesting that the chronological order of events plays a role. Subsequent divergence of the gene duplicates may alleviate the negative impact of the dosage balance effect, allowing the evolution of AS and reducing the anticorrelation between AS and GD.
To test the relationship between GD and AS at the genomic level, we used: (i) clusters of homologous sequences inferred from the seq.id. (equivalent to gene families); (ii) sets of known or predicted alternative splice variants, or isoforms, for a particular gene. An overview of the data is provided in
In addition, we compared the findings using the AltSplice dataset with findings using data from the SwissProt database [
To estimate GD, we used the Ensembl protein predictions for human (version 37.35j) and mouse (version 37.34e) (
Retrotransposition creates gene duplicates that have only a single exon, and hence are unlikely to show any AS. To test for a possible bias in GD families (GD+) stemming from retrotransposition, we examined the distribution of single-exon genes across AS and GD sets using the SEGE database [
To determine whether the inverse relationship between AS and GD is specific to genes of particular functions, we analyzed functions for human proteins of the four different sets of genes with or without AS or GD (AS+/GD+, AS+/GD−, AS−/GD+, AS−/GD−). We analyzed gene functions from both the SwissProt and the AltSplice database using the DAVID Web server (
In this part of our work, we compared the extent of coexpression between alternative splice isoforms (AS coexpression) with that of coexpression between gene duplicates (GD coexpression). To estimate AS coexpression across human genes, we analyzed data on absolute expression levels of exon junctions of 3,840 human genes, measured across 44 different tissues [
The expression of all exon junctions of a particular gene was then summarised (averaged) to form one vector representing the overall expression pattern of a gene. We measured GD coexpression of a GD family, i.e., the amount of coexpression amongst gene duplicates, as the average pairwise PC between the gene expression vectors of all family members. For gene families of >80% seq.id., the 3,840 genes in the dataset [
We assessed GD at the whole-gene level in which two proteins are assigned to the same gene family when their seq.id. is above 40%, or above 80% (GD40 or GD80, respectively). These families were obtained by clustering the SwissProt [
Our set of genes with AS was obtained after querying the SwissProt database [
SwissProt [
For all of the distributions shown in the different figures, we computed the confidence interval corresponding to each proportion in the distribution, following Goodman [
Our work involves detailed comparison between AS and GD sequence changes which occur between alternative splice isoforms or between gene duplicates (
Local seq.id. refers to the seq.id. between parts of the sequences. Local seq.id. between alternative splice isoforms was always computed in the same way, comparing the sequence stretches substituted between them. To this end, we first obtained the location of these stretches from SwissProt [
To obtain local seq.id. between gene duplicates, we distinguished two cases (
In AS+/GD+ cases, we followed two different procedures. The first one uses a sliding window of the size of the AS substitution,
In the second procedure analysing the AS+/GD+ case (
In AS−/GD+ cases, a direct comparison between AS and GD local seq.id. is no longer possible. Here, local seq.id. was estimated using a moving window of size
In the case of AS, the maximal distance between mismatches was computed using the same equation as before:
Note that the maximal mismatch distance was normalized by the sequence length to allow comparison of all results independent of the protein size.
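As an illustration of the normalized maximal mismatch distance, the sketch below takes two aligned sequences and a scoring function, collects the positions of nonconservative mismatches (negative substitution score), and divides the span between the two most distant ones by the sequence length. Several details are assumptions made for the example: the alignment length is used as the sequence length, gapped positions are ignored, and the toy score table is a hypothetical stand-in for a full Blosum62 lookup.

```python
def maximal_mismatch_distance(seq_a, seq_b, score, gap="-"):
    """Normalized distance between the two most distant nonconservative mismatches.

    seq_a, seq_b: aligned sequences of equal length.
    score: callable returning a substitution score for two residues;
           a negative score marks a nonconservative replacement.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    positions = [
        i
        for i, (a, b) in enumerate(zip(seq_a, seq_b))
        if a != gap and b != gap and a != b and score(a, b) < 0
    ]
    if len(positions) < 2:
        return 0.0  # fewer than two nonconservative mismatches: no span
    return (positions[-1] - positions[0]) / len(seq_a)


# Toy scores for a handful of residue pairs (hypothetical, illustration only).
TOY_SCORES = {
    frozenset("LD"): -4,
    frozenset("GW"): -2,
    frozenset("KR"): 2,
}


def toy_score(a, b):
    """Look up the toy score for an unordered residue pair."""
    return TOY_SCORES[frozenset((a, b))]


# L/D and G/W are nonconservative here; K/R is conservative and is skipped.
distance = maximal_mismatch_distance("ALKDGAALKE", "ADKDWAALRE", toy_score)
```

A value close to 0 corresponds to replacements concentrated in one stretch, as reported for AS, whereas a value close to 1 corresponds to replacements spread over the whole sequence.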
For all the variables, we conducted a comparison between AS and GD as described before (
In the case of the indel size distribution for GD the procedure was: (i) for each gene family in our dataset (see above) we obtained the length of all indels (gaps) for all the possible alignments between the proteins in the family, and (ii) the resulting lengths were binned after a simple redundancy correction. The redundancy correction consisted of dividing the contribution of each indel in the frequency histogram by the number of sequences in the family cluster. The resulting distribution follows a power law very similar to that previously found by Benner and colleagues in a massive alignment experiment [
Both the AS and GD indel datasets were subsequently broken down into two subsets, according to whether indels were external (positioned at the N- or C-terminal ends of the protein sequence) or internal (positioned within the protein sequence). The resulting length distributions are shown in
As mentioned before, the SwissProt database [
Distribution of human sequences across GD families as determined by different seq.id. cutoffs (40%, 60%, 80%, 90%). GD families of size 1 denote singletons, i.e., genes without paralogues (GD−).
(54 KB PPT)
We tested for functional biases across proteins with AS and/or GD using the GO annotation available for humans from the GO database [
Human genes were annotated with respect to biological process (A) and molecular function (B) using GO annotation [
(749 KB PPT)
We show the fraction of duplicated genes per gene family that have different chromosomal locations, using a 40% seq.id. cutoff (dark red). (Data for GD80 families are not shown because of the small amount of data.) In all except one group of families, on average >55% of genes within a family have different chromosomal locations. This indicates different regulation between duplicates [
(55 KB PPT)
The four figures reproduce, for mouse, the analysis shown in
(A) Substitutions in AS have different effects on global versus local seq.id. Light and dark green correspond to global and local seq.id. for AS substitutions, respectively. Global seq.id. is obtained after aligning two isoforms for the same gene, for which the AS event involved a substitution. Local identity applies only to the substituted stretches. Dark red corresponds to the seq.id. distribution for GD families at 40%, after sequence alignment between paralogues. The global seq.id. between splice isoforms is very high while the local seq.id. in alternative splicing variants is very low. Both seq.id. distributions for AS contrast with those of GD families.
(B) Maximal mismatch distance between nonconservative substitutions is much smaller in AS than in GD. The maximal mismatch distance is the number of residues between the two most distant, nonconservative substitutions, normalized by whole sequence length. Nonconservative mismatches have a negative value in the Blosum62 matrix and were chosen for their stronger impact in protein structure and function. The plot depicts AS data in green, and GD data for families at 40% seq.id. in dark red. Substitutions in alternative splice variants are much more localized than those in gene duplicates.
(C) Size distribution for indels. The AS distribution is shown in green. Indels for GD are shown for the whole-gene model (dark red). Clear differences are found between both distributions.
(D) Frequency distribution of the amount of overlap between AS and GD indels, taking as reference the sequence of the AS indel (see
(1.1 MB PPT)
To provide another definition of gene families, we estimated GD families based on domain families. We used domain annotations from the Pfam database [
(A) Global seq.id. distribution. The distribution for human AS sequences is shown in green; that for GD whole-gene families (40% level) is shown in dark red; that for GD families defined by Pfam domains is shown in light red. We observe that the range of seq.id. for the latter is much lower than for AS and GD whole-gene families. At the local level (results not shown) the range of seq.id. for the Pfam model of GD is lower than that observed for AS. However, for the Pfam model the amino acid replacements spread over the whole sequence, contrary to what we observe for AS.
(B) The indel size distribution of human AS sequences is shown in green. Indel sizes for GD whole-gene families (seq.id. cutoff of 40%) are shown in dark red; indel sizes for GD families defined by Pfam domains are shown in light red. In the former, whole sequences were compared within each family to obtain the indel size distribution. In the domain-based GD families, indels were obtained from the multiple sequence alignments of the Pfam databank [
(C,D) Size distributions for external and internal indels, respectively, with the same colour code as in (B). These distributions indicate that indels from Pfam domains and GD families show similar trends when compared with AS indels. Overall, our results indicate that GD and AS are in general different in their sequence/structure changes, independently of the model representing GD.
(1.1 MB PPT)
No significant differences are found between the original results and those obtained after eliminating from the AS dataset all the isoforms that may be targets of the NMD machinery [
(A) Overall versus local seq.id. Original AS global and local seq.id. are shown in light and dark green, respectively. Overall and local seq.id. for NMD-filtered AS are shown in orange and yellow, respectively.
(B) Maximal mismatch distance between nonconservative substitutions. Original AS data are shown in dark green, NMD-filtered data are shown in orange.
(C) Indel size. Original AS data are shown in dark green, NMD-filtered data are shown in orange.
(D) Overlap between AS and GD indels. Original data are shown in violet, dark blue, and light blue, while the corresponding NMD-filtered data are shown in yellow, orange, and light green.
(1.3 MB PPT)
To exclude biases in our results introduced by the use of the SwissProt database [
(35 KB PPT)
To obtain the number of exons per gene, we followed the procedure employed by Saxonov and colleagues to build the EID database [
Three distributions show the number of exons per gene, corresponding to the following cases: singleton genes with AS (AS+/GD−, dark green); genes that are both duplicated and have AS (AS+/GD+, light green), and duplicated genes with no AS (AS−/GD+, dark blue). The results are obtained for gene families at both the 80% level (A) and the 40% level (B). In both cases we see that there is a trend for AS−/GD+ to have a smaller number of exons than AS+/GD+ and AS+/GD− genes.
(525 KB PPT)
We describe the two procedures followed to compute the local seq.id. between duplicates (see
(A) The first procedure uses a moving window of size N, where N is the size of the AS event. The window is moved along the aligned sequences of both duplicates, and at each position the seq.id. between them is computed (within the limits of the window).
(B) In the second procedure, we first aligned the sequence of the protein with known splicing to one of its duplicates. The former was always the sequence of the SwissProt [
(462 KB PPT)
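The moving-window procedure in (A) can be sketched as follows. The example assumes that the two duplicates are supplied as an existing pairwise alignment (gapped strings of equal length), that seq.id. within a window is the percentage of window positions with identical residues, and that gapped positions count as mismatches; the function name and the toy sequences are hypothetical.

```python
def window_identities(aligned_a, aligned_b, window, gap="-"):
    """Percent seq.id. at every position of a window slid along an alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if window < 1 or window > len(aligned_a):
        raise ValueError("window must be between 1 and the alignment length")
    # True where both sequences carry the same residue (gaps never match).
    matches = [a == b and a != gap for a, b in zip(aligned_a, aligned_b)]
    identities = []
    for start in range(len(matches) - window + 1):
        n_identical = sum(matches[start:start + window])
        identities.append(100.0 * n_identical / window)
    return identities


# Toy alignment of two hypothetical duplicates, window of five residues.
local_ids = window_identities("ACDEFGHIKL", "ACDQFGHAKL", window=5)
```

Setting `window` to the length of the AS substitution gives, for each window position, a GD local seq.id. value that can be compared with the AS local seq.id. of the substituted stretches.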
(A) Illustrates the basic comparisons of coexpression, whose results are shown in
(B) In the datasets published by Johnson et al. [
For each gene family, we can produce a second matrix of gene expression patterns of the duplicates across different tissues. We estimate GD coexpression by analyzing the variation of expression values in each gene family's matrix. GD coexpression was analyzed for the dataset by Johnson et al. [
We tested the following measures for analysis of coexpression. (i) The average pairwise PC. We calculated the average of the PCs between all pairs of row vectors in the AS or GD matrix. PC close to 0 indicates no correlation in expression between exon junctions (representing AS) or gene duplicates, respectively. PC close to 1 indicates strong correlation between the row vectors and is indicative of little AS or of little differential expression amongst gene duplicates. (ii) The number of
While matrices in the figure show binary expression data, calculations were done on both raw and binary data. All results are similar irrespective of the cutoff for binarization (600 or 150). They are also similar irrespective of the cutoff for gene family definitions (40%, 60%, or 80% seq.id.) or of the underlying AS+ datasets employed (SwissProt or AltSplice).
(746 KB PPT)
An anticorrelation between AS and GD [
(66 KB PPT)
Provides an overview of the genomic data from the Ensembl database (human release 37.35j, mouse release 37.34e) [
(54 KB DOC)
Retrotransposition produces duplicates that consist of only one exon. To test for possible bias in families of gene duplicates (GD+) stemming from retrotransposition, we examined the distribution of single-exon genes across AS and GD sets using the SEGE database [
(53 KB DOC)
The table lists a selection of functions as obtained from the DAVID Web server [
All function annotations are significantly different from the background (E-value < 10⁻¹⁰). We removed redundant annotations and annotations that were too broad to be meaningful (e.g., “binding”). Duplication of particular gene families that are depleted in AS, such as ribosomal proteins or some receptors, has contributed to the inverse relationship between AS and GD, but cannot explain it completely.
(96 KB DOC)
(375 KB DOC)
The table shows the number of genes with AS, and the number of multiple gene families, together with the respective number of sequences. Information on AS was taken from SwissProt [
(50 KB DOC)
The accession numbers used in this paper are from Swiss-Prot (
The authors are grateful to M. Brandl, J. Castresana, C. Chothia, K. Hannay, D. Kramer, M. A. Martínez-Balbás, A. Ortiz, J. Pereira-Leal, J. Valcárcel, and C. Voelckel for helpful comments on the manuscript, and H. Dopazo, J. Dopazo, and N. L. Barbosa-Morais for useful discussions. We are grateful to the SwissProt team for their support. We thank M. Carmo-Fonseca and T. R. Pacheco for kindly providing the U2AF35 sequences from their analysis. We are grateful to the anonymous reviewers whose suggestions led to valuable additions to our work. CV acknowledges funding by the Boehringer Ingelheim Foundation, the Medical Research Council, and the International Human Frontier of Science Program. XdlC and DT acknowledge funding from the Spanish government (grant BIO2003–09327).
aa, amino acid(s)
AS, alternative splicing
AS−, gene without known splice variants
AS+, gene with splice variants
GD, gene duplication
GD−, gene without duplicates (as inferred from lacking sequence similarity to other sequences)
GD+, gene with duplicates (inferred from sequence similarity to other sequences)
GD40, gene families clustered at >40% sequence identity
GD80, gene families clustered at >80% sequence identity
indel, insertion/deletion
NMD, nonsense-mediated decay
PC, Pearson correlation coefficient
seq.id., sequence identity