Figure 1.

Equivalence between Alternative Splicing and Gene Duplication

(A) The alignment shows an example of molecular equivalence between the effects of alternative splicing (AS) and gene duplication (GD). The human U2AF35 gene has two known splice variants, Hs_U2AF35a and Hs_U2AF35b, that differ along the region marked with a red box. The fugu orthologue Fr_U2AF35-a has no known splice variants, but instead has a paralogue, Fr_U2AF35-b [9]. All sequences were kindly provided by T. R. Pacheco and M. Carmo-Fonseca. For some residues (bold, highlighted in light blue), the substitutions between the human splice variants are equivalent to those between the fugu gene duplicates. The cartoon illustrates the relationship between the human and fugu sequences. The names of genes and their protein products are written in lowercase and capital letters, respectively. In this example, AS and GD produce equivalent sequence changes at the molecular level, and are therefore likely to have interchangeable effects on the structure and function of the proteins. In this work we study whether such molecular interchangeability holds in general.

(B) We compared the characteristics of two types of sequence changes, indels and substitutions, between AS (both shown in dark blue) and GD (shown in dark and light blue). On top, we illustrate an indel event (the deleted stretch is highlighted in red, and two dotted lines denote its location); at the bottom, we illustrate substitution events (red lines represent residue matches between sequences, linked by dotted lines; the continuous lines between alternative splice isoforms represent the boundaries of the interchanged stretches).

(C) We used this protocol in all sequence comparisons between AS and GD. Changes between alternative splice isoforms are obtained after comparing the SwissProt [44] reference isoform with the remaining isoforms. Changes between duplicates are obtained by comparing the SwissProt [44] reference isoforms of the genes that are part of one GD family.

Figure 2.

The Relationship between AS and GD at the Genomic Level

(A) The diagram shows the uneven distribution of AS amongst GD families of different sizes for the human genome. Information on AS has been taken from the AltSplice database [43]. GD families were obtained by clustering all sequences of more than 40%, 60%, 80%, or 90% seq.id., respectively, using CD-HIT [47]. The dashed line marks the expected fraction of genes with AS, given an unbiased distribution of all known genes with splice variants across the whole genome. In accordance with previous results [12,13], for large GD families we observe fewer genes with AS than expected at random.

(B) The cartoons illustrate that alternative splice isoforms and gene duplicates may be expressed in the same number and/or types of tissues. Here, we compared the extent of coexpression amongst alternative splice variants (AS coexpression) and gene duplicates (GD coexpression).

(C) Coexpression levels amongst gene duplicates (GD coexpression) are estimated as the average pairwise PC between expression patterns of all genes within a GD family. GD coexpression amongst duplicates of >40% seq.id. (white diamonds) is more similar to the overall AS coexpression (red line indicating the value displayed in Figure 2D) than GD coexpression amongst duplicates of >80% seq.id. In other words, coexpression of alternative splice variants is similar to coexpression amongst gene duplicates of >40% seq.id.

As this dataset [17] is too small for GD80 families to be split into further subsets, we examined GD coexpression in an additional dataset [53] (black diamonds). For both 40% and 80% seq.id., expression variation amongst gene duplicates with alternative splice variants (AS+) is slightly higher than variation amongst gene duplicates without alternative splice variants (AS−). p-Values are based on t-test calculations. Data on alternative splice variants were taken from the AltSplice database [43]. Further details and results are provided in Table S4 and Figure S10A and S10B.

(D) Coexpression levels amongst alternative splice variants (AS coexpression) are estimated as the average pairwise PC between the expression patterns of all exon junctions of a gene. A high PC indicates little variation (high coexpression), and vice versa. The figure shows average AS coexpression across all genes in the dataset [17], and across subsets of the genes: GD families (GD+) and singletons (GD−) as defined by >40% and >80% seq.id., respectively. The overall AS coexpression is marked as a red diamond and indicated as a red line in Figure 2C. Further details are provided in Table S4 and Figure S10A and S10B. p-Values are based on t-test calculations. Gene duplicates of high seq.id. (>80%) have slightly lower AS coexpression than singletons (p-value < 0.001).
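The coexpression measure used in panels C and D can be sketched as follows. This is a minimal illustration, assuming that PC denotes the Pearson correlation coefficient and that each gene (or exon junction) is represented by a vector of expression values across the same tissues. The function names and the expression values are hypothetical and are not taken from the datasets [17,53].

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles.

    Raises ZeroDivisionError if either profile is constant.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_pairwise_pc(profiles):
    """Mean Pearson correlation over all unordered pairs of profiles."""
    pairs = list(combinations(profiles, 2))
    return sum(pearson(x, y) for x, y in pairs) / len(pairs)


# Hypothetical expression values across four tissues (not data from the study).
family = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [4.0, 3.0, 2.0, 1.0],
]
print(average_pairwise_pc(family))
```

For GD coexpression the profiles would be the genes of one GD family; for AS coexpression they would be the exon junctions of one gene.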

Figure 3.

Global and Local Sequence Identity in AS and GD Substitutions

AS data were obtained by querying SwissProt [44] database version 40, with the keywords VARSPLIC and HUMAN. GD data were obtained by clustering the SwissProt [44] data using CD-HIT [47] to 40% or 80% seq.id. (GD40 and GD80, respectively). We focus on AS+/GD+ cases, i.e., those sequences with both AS and GD, in Figure 3A–3C, and discuss the AS−/GD+ versus AS+/GD− case in Figure 3D.

(A) Global seq.id. The seq.id. in GD families depends on the cutoff used for clustering, e.g., GD40 (dark red) or GD80 (light violet), respectively. The global seq.id. between alternative splice isoforms (light green) is very high (>90% seq.id.), reflecting the underlying nature of AS changes.

(B) Local seq.id. in alternative splice isoforms (dark green) is measured between substituted stretches, usually arising from mutually exclusive exons. The local seq.id. between gene duplicates is obtained using a moving window (GD80: light violet, GD40: dark red) and reporting the seq.id. observed in all possible window positions.

(C) Local seq.id. in AS and GD at equivalent positions. The graph compares local seq.id. found in alternative splice variants of a gene with the local seq.id. of a duplicate of the same gene. The AS local seq.id. was computed between substituted sequence stretches. For GD, we mapped the sequence positions of the AS event to the aligned GD, and computed the seq.id. between the GD, considering only the aligned positions within that region. The comparison is shown for AS and GD40 (red) and GD80 (blue), respectively.

The diagonal separates the plot into two halves: the upper half corresponds to the region for which GD seq.id. is higher than that for AS; the lower half corresponds to the opposite. For both types of gene families (GD40 and GD80), most substitutions show higher seq.id. amongst gene duplicates than amongst alternative splice variants, and this bias is significant (GD80: 111 of 142, χ² test p-value < 1.9 × 10^−11; GD40: 492 of 786, χ² test p-value < 6.5 × 10^−15). This result confirms the overall distributions examined in Figure 3B: changes in AS are stronger and more localized than those in GD.
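The significance of such a bias can be checked with a chi-square goodness-of-fit test. The sketch below assumes a null hypothesis in which points fall above and below the diagonal equally often (a 50/50 split); the caption does not state the null model, so this is an illustration rather than the authors' exact procedure. The function name is hypothetical.

```python
from math import erfc, sqrt


def chi2_even_split(above, total):
    """Chi-square goodness-of-fit test (1 d.f.) against a 50/50 split."""
    below = total - above
    expected = total / 2
    stat = ((above - expected) ** 2 + (below - expected) ** 2) / expected
    # For 1 degree of freedom, the upper-tail probability of the
    # chi-square distribution equals erfc(sqrt(stat / 2)).
    p_value = erfc(sqrt(stat / 2))
    return stat, p_value


# Counts reported for GD80: 111 of 142 substitutions lie above the diagonal.
stat, p = chi2_even_split(111, 142)
print(f"chi2 = {stat:.2f}, p = {p:.2e}")
```

Using the closed form for one degree of freedom avoids a dependency on a statistics library; with more categories, a library routine for the chi-square distribution would be needed.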

(D) Local seq.id. in AS−/GD+ and AS+/GD− substitutions. To compute local seq.id. in AS−/GD+ families, we first align two gene duplicates, then slide a 100-aa window over the sequence of one protein, and compute the seq.id. at all sequence positions of the window. The results of all the possible comparisons are plotted for GD40 (dark red) and GD80 (light violet) families. For genes with AS but no duplicates (AS+/GD−) (dark green), local seq.id. was computed between the two substituted stretches resulting from AS events. As for AS+/GD+ families (Figure 3B), we find that, in general, local seq.id. values are substantially lower for AS events (AS+/GD−) than for GD (AS−/GD+ families). The overlap between the AS and GD40 families is higher than that between AS and GD80 families, which may partly be due to differences in the structural constraints applying to the proteins in each set.
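The moving-window calculation can be sketched as follows. This is a minimal illustration under stated assumptions: the two duplicates are already aligned with `-` as the gap character, the window slides over the residues of the first sequence, and a residue aligned to a gap counts as a mismatch. The caption does not specify how gaps are handled, and the function name and toy sequences are hypothetical.

```python
def local_identities(aligned_a, aligned_b, window=100):
    """Percent identity for every window position along sequence A."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only alignment columns where sequence A has a residue, so the
    # window slides over the residues of one protein.
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-"]
    identities = []
    for start in range(len(columns) - window + 1):
        chunk = columns[start:start + window]
        matches = sum(1 for a, b in chunk if a == b)
        identities.append(100.0 * matches / window)
    return identities


# Toy alignment (hypothetical sequences); a short window keeps it readable.
a = "MKT-AYIAKQR"
b = "MKTLAYLAK-R"
print(local_identities(a, b, window=5))
```

Pooling the values returned for every pair of duplicates in a family would give a distribution of the kind plotted for GD40 and GD80.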

Figure 4.

The Distribution of Nonconservative Changes along Sequences

The maximal mismatch distance between nonconservative substitutions is much smaller in AS than in GD. The maximal mismatch distance is the number of residues between the two most distant, nonconservative substitutions, normalized by sequence length. Nonconservative mismatches have a negative value in the Blosum62 matrix [65] and were chosen for their stronger impact on protein structure and function. The plot depicts AS data in green, and GD data for families at 80% and 40% seq.id. in light violet and dark red, respectively. We observe that nonconservative substitutions in AS are much more localized than those in GD.
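The definition of the maximal mismatch distance can be made concrete with a short sketch. It assumes an ungapped pairwise alignment and takes "the number of residues between" the two outermost nonconservative substitutions to be the difference of their positions; the caption leaves both points open. The score table holds only the few BLOSUM62 entries needed for the toy example, and the function name and sequences are hypothetical.

```python
def maximal_mismatch_distance(seq_a, seq_b, scores):
    """Span between the outermost nonconservative substitutions,
    divided by the sequence length."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to equal length")
    # A substitution is nonconservative when its matrix score is negative.
    positions = [
        i for i, (a, b) in enumerate(zip(seq_a, seq_b))
        if a != b and scores[frozenset((a, b))] < 0
    ]
    if len(positions) < 2:
        return 0.0
    return (positions[-1] - positions[0]) / len(seq_a)


# Excerpt of BLOSUM62; a real analysis would load the full matrix.
scores = {
    frozenset("LD"): -4,
    frozenset("WG"): -2,
    frozenset("IV"): 3,
    frozenset("KR"): 2,
}

seq_a = "ALKTIKEWAG"
seq_b = "ADKTVREGAG"
print(maximal_mismatch_distance(seq_a, seq_b, scores))
```

In the toy pair, only the L/D and W/G substitutions are nonconservative, so the conservative I/V and K/R changes between them do not affect the result.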

Figure 5.

The 3-D Distribution of Physico–Chemical Changes in the Affected Residues of AS and GD

The example of human mitogen-activated protein kinase 9 (MAPK9) illustrates how differences between AS and GD in the distribution of sequence changes result in different distributions of physico–chemical properties across the 3-D structure. The original structure of MAPK9 was homology-modelled after MAPK10 and is shown in blue; the residue changes are coloured according to the associated difference in hydrophobicity (we use the absolute value of the difference to limit the number of colours; the colour scale goes from blue to red, where red corresponds to the largest change). For comparison purposes, the location of the AS changes in the three structures is indicated by a yellow box. As a hydrophobicity measure, we used the free energy of water to octanol transfer [77].

(A) Alternative splice isoforms of MAPK9.

(B) Gene duplicates of high seq.id. (MAPK10; isoform alpha2, 84% seq.id. to MAPK9).

(C) Gene duplicates of medium seq.id. (MAPK13; 46% seq.id. to MAPK9).

We observe, in accordance with the results from the sequence analysis, that while AS changes are confined to a very specific location, GD changes are spread all over the protein surface. As expected, the number of changes between MAPK9 and MAPK13 is the largest. Neither of MAPK9's paralogues (MAPK10 and MAPK13) shows a set of residue changes identical to that in the alternative splice variant.

Figure 6.

The Size Distribution of Insertions/Deletions in AS and GD

All analyses of indels have been made for gene families with both AS and GD (i.e., AS+/GD+).

(A) AS indels are longer than GD indels. Indels for GD were obtained from the alignments of GD families at 40% (dark red) and 80% (light violet) seq.id. Information on AS indels (green) was obtained from the SwissProt record of the corresponding protein. Indel size distributions for both GD40 and GD80 are very similar, with most of the indels being shorter than five residues. In contrast, many AS indels are longer than 100 residues.

(B,C) Size distribution for external and internal indels in AS and GD. External indels (B) lie at the N- or C-terminal ends of the protein; internal indels (C) lie in the middle. AS and GD40 indel sizes are different depending on the position of the indels in the sequence. While AS indels are generally larger than GD indels (also see Figure 6A), external indels (B) are larger than internal ones (C), both for AS and GD. The shift in indel sizes implies that large indels (as often introduced by AS) are better-tolerated at the N- and C-termini of proteins, where they are less likely to induce important structural changes.

Figure 7.

The Overlap between AS and GD Insertions/Deletions

The overlap between AS and GD indels is very small. For the frequency distribution of the overlap between AS and GD indels, AS indels were taken as reference. GD data at 80% seq.id. are shown in light violet, while GD data at 40% seq.id. are shown in dark blue for all indels and in light blue for short indels only (≤30 aa). Given the small overlap, AS and GD indels are likely to affect different locations in protein structure.
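One way to quantify such an overlap is sketched below. It assumes that "taking AS indels as reference" means expressing the overlap as the fraction of each AS indel's positions that are also covered by a GD indel, with all indels mapped to the same sequence coordinates; the caption does not spell out the formula. The function name and the intervals are hypothetical.

```python
def overlap_fraction(as_indel, gd_indels):
    """Fraction of an AS indel's positions also covered by any GD indel.

    Intervals are (start, end) pairs, 1-based and inclusive.
    """
    start, end = as_indel
    covered = set()
    for gd_start, gd_end in gd_indels:
        lo, hi = max(start, gd_start), min(end, gd_end)
        covered.update(range(lo, hi + 1))  # empty range when lo > hi
    return len(covered) / (end - start + 1)


# Hypothetical coordinates: one 50-residue AS indel and three GD indels.
print(overlap_fraction((101, 150), [(95, 105), (140, 142), (300, 310)]))
```

Collecting positions in a set means that GD indels which overlap each other are not counted twice.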

Table 1.

Summary of the Effects of Alternative Splicing and Gene Duplication on Sequence and Structure
