J. Hull and D. Kwiatkowski conceived and designed the experiments. J. Hull, S. Campino, K. Rowlands, and G. Elvidge performed the experiments. J. Hull, S. Campino, M.-S. Chan, R. R. Copley, M. S. Taylor, and G. Elvidge analyzed the data. J. Hull, S. Campino, K. Rockett, M.-S. Chan, R. R. Copley, M. S. Taylor, K. Rowlands, G. Elvidge, B. Keating, and J. Knight contributed reagents/materials/analysis tools. J. Hull, S. Campino, R. R. Copley, J. Knight, and D. Kwiatkowski wrote the paper.
The authors have declared that no competing interests exist.
Alternative splicing of genes is an efficient means of generating variation in protein function. Several disease states have been associated with rare genetic variants that affect splicing patterns. Conversely, splicing efficiency of some genes is known to vary between individuals without apparent ill effects. What is not clear is whether commonly observed phenotypic variation in splicing patterns, and hence potential variation in protein function, is to a significant extent determined by naturally occurring DNA sequence variation and in particular by single nucleotide polymorphisms (SNPs). In this study, we surveyed the splicing patterns of 250 exons in 22 individuals who had been previously genotyped by the International HapMap Project. We identified 70 simple cassette exon alternative splicing events in our experimental system; for six of these, we detected consistent differences in splicing pattern between individuals, with a highly significant association between splice phenotype and neighbouring SNPs. Remarkably, for five out of six of these events, the strongest correlation was found with the SNP closest to the intron–exon boundary, although the distance between these SNPs and the intron–exon boundary ranged from 2 bp to greater than 1,000 bp. Two of these SNPs were further investigated using a minigene splicing system, and in each case the SNPs were found to exert
Genetic variation, through its effects on gene expression, influences many aspects of the human phenotype. Understanding the impact of genetic variation on human disease risk has become a major goal for biomedical research and has the potential of revealing both novel disease mechanisms and novel functional elements controlling gene expression. Recent large-scale studies have suggested that a relatively high proportion of human genes show allele-specific variation in expression. Effects of common DNA polymorphisms on mRNA splicing are less well studied. Variation in splicing patterns is known to be tissue specific, and for a small number of genes has been shown to vary among individuals. What is not known is whether allele-specific splicing events are an important mechanism by which common genetic variation affects gene expression. In this study we show that allele-specific alternative splicing was observed in six out of 70 exon-skipping events. Sequence analysis of the relevant splice sites and of the regions surrounding single nucleotide polymorphisms correlated with the splicing events failed to identify any predictive bioinformatic signals. A genome-wide study of allele-specific splicing, using an experimental rather than a bioinformatic approach, is now required.
The sequencing of the human genome [
In this study, we used a different approach—that of evaluating effects on splicing efficiency—to study the effects of common genetic polymorphism on gene function. The vast majority of human genes are comprised of three or more exons that need to be efficiently spliced together to form mature mRNA. Variation in this process occurs naturally and is thought to be an important mechanism whereby different protein products can be derived from the same gene sequence [
Our initial aim was to investigate whether there was variation among individual LCLs in simple cassette exon events. These events were defined as the occurrence of complete exon skipping in two or more mRNA isoforms. We used a strategy of exon selection that we believe increased the likelihood of detecting allele-specific effects on alternative splicing. We argue that for genes in which common SNPs affect splicing, at least two mRNA transcript isoforms of that gene will be relatively commonly observed. Conversely, where only one transcript isoform has been observed and documented, the likelihood of a SNP-related splicing event is reduced. We identified 2,281 simple cassette exon events from the European Bioinformatics Institute Alternative Splicing Database (EBI-ASD) in which each transcript isoform had been observed in at least two clone libraries. From these, we selected the 250 genes with the highest expression levels in LCLs as detected by global microarray analysis. We carried out reverse transcriptase PCR (RT-PCR) analysis of these 250 genes and found that in LCLs both transcript isoforms were present in 70 (28%) of the genes.
We proceeded to investigate whether the amount of different isoforms varied between 22 different LCLs. Of the 70 events that produced both full-length and exon-skipped products, we found that 18 (26%) showed significant variation among cell lines, in which at least one cell line showed a ratio of PCR products that differed by more than 10% of the mean value for the entire sample set of 22 cell lines (10% difference in relative abundance is the lower limit of sensitivity of the detection assay). These 18 events were retested using RNA derived from an independent round of cell culture. Six events, centered around genes
Details of the Alternatively Spliced Exons and Associated SNPs
We next investigated the relationship between DNA sequence variation and observed differences in splice isoforms among LCLs. We looked at the correlation between SNP genotype and splicing pattern over the 500-kb region surrounding each of the six splicing events that showed consistent variation among the LCLs. Two sources of SNP genotyping data were used. First, we analysed SNP genotypes from the International HapMap Project [
In each graph, Pearson's
The correlations between splice pattern and individual SNPs are highly significant even after allowing for correction for multiple comparisons. If we use a simple Bonferroni correction for the 350 SNPs that were tested for each simple cassette exon event, all results remain significant at the 0.05 level. This level of correction is overly conservative, since the LD relationship among the SNPs means that they are not independent of one another. Furthermore, it is remarkable that for five of the six events it is the SNP closest to the intron–exon boundary that is the strongest predictor of splicing phenotype. When we analysed the effects of the SNP nearest the intron–exon boundary of each event, a clear effect of genotype on relative abundance of each product was found. The measured ratios of the two splice products are plotted by genotype in
The ratio of transcript abundance (skipped product/full-length product) for each of the six alternative splicing events that showed consistent variation between different individuals is shown. For each gene, the ratios are grouped by the genotype of the SNP nearest the intron–exon junction of the splicing event.
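The association between SNP genotype and splice ratio, and the Bonferroni correction for 350 tested SNPs, can be illustrated with a minimal sketch. The genotype coding (number of copies of one allele), the function names, and the example data are assumptions made for illustration and are not taken from the study.

```python
import math


def allele_dosage(genotype, counted_allele):
    """Count copies of one allele in a two-letter genotype such as "CT".

    Additive 0/1/2 coding is an assumption made for this sketch.
    """
    return genotype.count(counted_allele)


def pearson_r(xs, ys):
    """Pearson's correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a simple Bonferroni correction."""
    return alpha / n_tests


# Hypothetical genotypes and skipped/full-length ratios for six cell lines.
genotypes = ["CC", "CT", "TT", "CT", "CC", "TT"]
ratios = [0.10, 0.30, 0.55, 0.28, 0.12, 0.50]

dosages = [allele_dosage(g, "T") for g in genotypes]
r = pearson_r(dosages, ratios)

# 350 SNPs tested per event at an overall level of 0.05, as in the text.
threshold = bonferroni_threshold(0.05, 350)
print(f"r = {r:.3f}; per-SNP threshold = {threshold:.2e}")
```

With 350 tests, a SNP must reach a P value below roughly 1.4 × 10⁻⁴ to remain significant at the 0.05 level. As noted in the text, this is conservative when the SNPs are in LD.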
For five of the six events, there is an apparent dose-dependent effect with larger effects seen in homozygotes compared with heterozygotes. For the
Splice site signal scores from the donor and acceptor sites of the test exons predicted to show alternative splicing were compared with those from a genome-wide set of constitutively spliced exons (
The distribution of splice site scores was compared using a density plot generated using the statistical package R (
The potential effects on exonic splice enhancer strength of the four exonic SNPs shown to correlate with splice pattern were tested using four different prediction algorithms (see
There were no differences in the number of SNPs in the 50-bp regions around the intron–exon junction for the 180 exons that did not show alternative splicing in our experimental model, compared to the 70 exons that did. This suggests that using the position of known SNPs to select for exons with allele-specific splicing patterns is unlikely to be fruitful. Furthermore, of the six exons that showed allele-specific splicing patterns, three showed splicing patterns that were correlated with SNPs situated more than 50 bases from the intron–exon junctions.
To investigate whether SNP genotype directly defined splice isoform pattern, we carried out minigene analysis in two genes. In
The graph shows the relative abundance of the two transcripts derived from the minigene plasmid, expressed as a ratio of the shorter transcript (with the test exon skipped) to the longer “full length” transcript. Data shown are means of four measurements with confidence intervals. For each test exon there are significant differences in exon exclusion between the two tested allelic variants (T or C for each gene).
This study describes reproducible phenotypic variation in splicing among individuals, in each case arising from a simple cassette exon event that is associated with genotypic variation in SNPs close to the corresponding intron–exon boundaries. Our starting point was to screen for phenotypic variation in splicing in 22 lymphoblastoid cell lines, and then to identify SNPs associated with this phenotypic variation. Interestingly, the splicing-associated SNPs identified experimentally in this study did not show any clear difference in position or sequence context from other SNPs that were not associated with splicing variation.
The mechanisms by which alternative splicing is regulated are poorly understood. Exon recognition and splicing requires the presence of basic “classic” splice sites (the branch point, polypyrimidine tract, and the 3′ and 5′ splice sites). The efficiency of the splicing process can be affected in some exons by the presence of auxiliary or modulating elements (
In this example, a SNP (represented by a star) in an exonic splice enhancer sequence has disrupted binding of the SR proteins, reducing the efficiency of exon definition and potentially leading to an alternative splice site being used. Similar disruption could affect exonic splice suppressor, intronic splice enhancer, and intronic splice suppressor elements.
ESE, exonic splice enhancer; ESS, exonic splice suppressor; ISE, intronic splice enhancer; ISS, intronic splice suppressor; and p-py, polypyrimidine.
It is perhaps not surprising that we were unable to detect any specific patterns in the sequence context of the six SNPs identified in this study, given the apparently degenerate nature of consensus sequences that bind splice modulator proteins. Overall, the splice-site strengths of the exons that were predicted to be skipped by the EBI-ASD were weaker than those of constitutive exons, and those that we were able to demonstrate to have alternative splicing in our experimental system had the weakest splice-site strength. However, there was significant overlap among the groups, and splice-site strength cannot be used to identify the most likely exons to study. Equally, the presence of SNPs close to the intron–exon boundaries did not differ between those exons that did and did not show alternative splicing, suggesting that selecting exons to study according to whether there is a “splice site SNP” (defined for example on Ensembl as a SNP lying within 10 bp of the intron–exon junction) will not enrich for those SNPs that actually affect the splicing process. Only two out of the six SNPs identified in this study were within 10 bp of an intron–exon junction. The exonic SNPs that correlated with splice pattern in this study showed no consistent effects on splice enhancer strength using four different predictive models. Thus, the sequence context or position of the SNPs would not identify those likely to influence splicing efficiency. A different approach to identify allele-specific alternative splicing events that does not rely on the sequence context or the position of SNPs is to identify allele-specific RNA isoforms from EST databases [
Our method of isoform quantification and pooling strategy meant that our ability to detect rare events was limited. Dilution experiments determined that both the full-length and exon-skipped transcript products were detectable even when their starting concentrations differed by 100-fold. Thus, provided that both transcripts were present in at least one of the 22 cell lines, and the minor transcript was present at an abundance of 30% or greater, the event would be detected. If the rare transcript was present in three or more cell lines, the sensitivity increased to a lower abundance of 10%. The method we used is not readily scalable to whole genome analysis. Microarray-based approaches to the analysis of alternative splicing have been published [
For the splicing phenotypes, our experiments using the minigene system suggest that the SNP closest to the intron–exon boundary that shows correlation with the splicing phenotype is very likely to be the functional element. For four of the genes in this study there were additional SNPs in complete LD with the SNP nearest the intron–exon boundary, and although most were over 2 kb away from the exon-skipping event, it is possible that the presence of these SNPs influences the splicing process. Further work is needed to define the consequences of the loss of these exons on the functional activities of the encoded protein isoforms and on the levels of expression. There is already evidence that biological consequences of the alternative splicing event we describe in
Biological Consequences of Identified Allele-Specific Alternative Splicing Events
In this study we focused on only one form of splicing variation in a relatively small number of genes. Larger-scale whole genome studies investigating additional splicing patterns, such as alternative donor and acceptor sites, will be needed to determine the extent of SNP-associated splicing phenotypes. Our findings raise the possibility that SNP effects on splicing may be at least as prevalent in the genome as those on overall gene expression [
A number of databases of observed mRNA transcripts are publicly available. We used the EBI-ASD (
LCLs from 22 unrelated CEPH individuals selected from the HapMap collection were obtained from the Coriell Institute for Medical Research. Cells were cultured at 37 °C in a 5% CO2 environment using RPMI 1640 cell culture medium with 10% fetal calf serum, 200 mM L-glutamine, penicillin, and streptomycin. Cell density was maintained between 200,000 and 800,000 cells/ml. DNA and RNA were each extracted from 10 million cell aliquots. Constitutive expression levels in CEPH cell lines were defined for pooled RNA from four LCLs using an Affymetrix human U133A expression microarray (Affymetrix,
RNA was extracted from cell pellets using TRIREAGENT (Sigma-Aldrich,
Pooled cDNA from all 22 cell lines was used to test each set of primer pairs. Identification of the expected full-length product and the shorter product lacking the cassette exon (and no other products) was used to confirm that the predicted alternative splicing event was detectable in our experimental system. Primer sets showing the two expected RT-PCR products were subsequently taken forward to determine if there was variation in the proportion of the two products among different individual cell lines. Detection of variation among cell lines was carried out by performing RT-PCR on RNA from each cell line separately. For each cell line, the relative amount of each of the two RT-PCR products (representing the full-length and skipped mRNA) was quantified using image analysis of the products visualised on ethidium bromide gels (ImageQuant software; Amersham Biosciences,
We determined the sensitivity of the ethidium bromide quantification system using known starting concentrations of DNA fragments of different lengths, and then quantifying the resulting amplicons. We were able to show that over a range of different product signal intensities, differences in the ratios of the different sized starting material of 10% or greater could be detected reliably (
Each of the 22 cell line samples was assayed in duplicate. The mean ratio of the abundance of the two RT-PCR products from each primer set was calculated for each cell line. When the relative abundance for an individual cell line differed by more than 10% from the average value for the full set of 22 samples, the experiment was repeated using a fresh aliquot of cell culture material. Those events that gave consistent differences in the repeat analysis were then analysed further.
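The screening rule described above can be sketched in a few lines. This is an illustrative sketch only: it assumes that "differed by more than 10%" means a relative difference from the mean across cell lines, and the function name, cell line labels, and ratios are hypothetical.

```python
from statistics import mean


def flag_variable_lines(duplicate_ratios, threshold=0.10):
    """Return cell lines whose mean product ratio deviates from the
    average across all cell lines by more than `threshold` (relative).

    duplicate_ratios maps a cell line label to its duplicate measurements
    of the ratio of the two RT-PCR products.
    """
    # Average the duplicate assays for each cell line.
    per_line = {name: mean(values) for name, values in duplicate_ratios.items()}
    # Average value for the full sample set.
    overall = mean(per_line.values())
    return sorted(
        name
        for name, ratio in per_line.items()
        if abs(ratio - overall) > threshold * overall
    )


# Hypothetical duplicate measurements for four cell lines.
example = {
    "LCL_A": (1.00, 1.02),
    "LCL_B": (0.98, 1.00),
    "LCL_C": (1.00, 1.00),
    "LCL_D": (1.30, 1.34),
}
flagged = flag_variable_lines(example)
print(flagged)
```

Events with at least one flagged cell line would then be repeated on a fresh aliquot of cell culture material, as described in the text.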
Genotypes for SNPs positioned within 250 kb on either side of the exon-skipping event were downloaded from the International HapMap Project Web site (
For each splicing event with reproducible variation, we calculated Pearson's correlation between the ratio of band intensities for the two RT-PCR products and the SNP genotype. In this analysis, we have assumed that any functional SNPs will be
Splice donor and acceptor sequences were scored using a position specific score matrix (PSSM) method [
Both allelic forms of the SNPs showing correlation with splice patterns in the
Allele-specific differences in
The ratio of transcript abundance (skipped product/full length product) for each of the six alternative splicing events was accurately predicted by the SNP genotype.
(355 KB PDF)
Allelic imbalances were determined using an exonic polymorphism to distinguish the relative abundance of transcript arising from the two alleles (G/A) in 16 unrelated CEPH heterozygous individuals. RNA ratios were normalized with the DNA ratios, and the data plots represent the average from two independent experiments. Variability between biological replicates was small (mean relative difference of 9%).
(237 KB PDF)
Relationship between the measured ratios of band intensity of two fragments of DNA after amplification using competitive PCR compared with ratios of the two fragments in the starting material. The two DNA templates were themselves PCR products of different sizes (250 and 463 bp) amplified with M13-tagged primers. These PCR products were diluted and quantified using the PicoGreen system. A range of different ratios of each of the starting templates was then generated by mixing different volumes together. The mixed samples were then amplified in a single reaction using the M13 primer set, generating two products of different lengths. The products were run out on agarose gels stained with ethidium bromide and visualised with ultraviolet light. Digital photographs of the images were quantified using ImageQuant software (Amersham Biosciences). Each point on the graph represents the mean of eight measurements for each ratio; the bars show 95% confidence intervals. The assay is designed to be sensitive to changes in relative abundance rather than to detect actual molar ratios. Thus, for example, an assay result showing a measured ratio of 3:1 compared with a known ratio of 1:1 does not affect the sensitivity of the assay to detect differences in actual starting concentrations.
(257 KB PDF)
The National Center for Biotechnology Information (NCBI) Entrez Gene (
Antonio Velayos-Baeza and Clotilde Levecque provided expert technical assistance for the minigene experiments.
CEPH, Centre d'Etude du Polymorphisme Humain
EBI-ASD, European Bioinformatics Institute Alternative Splicing Database
LCL, lymphoblastoid cell line
LD, linkage disequilibrium
RT-PCR, reverse transcriptase PCR
SNP, single nucleotide polymorphism