Fig 1.
Combining sources to identify genes associated with COPD.
(A) The COPD-associated gene SERPINA1 is alternatively spliced to produce 11 different transcripts. Two transcription start sites are indicated by arrows; splice sites are shown as colored lollipops. Exons are indicated by white bars and introns by horizontal gray bars. The coding sequence start and stop codons are indicated with pink lines. All 11 transcripts are depicted and colored by splice site selection. With 11 transcripts, SERPINA1 is a particularly complex gene in terms of alternative splicing: 99.5% of human genes have fewer transcripts. (B) COPD-associated genes were identified by merging disease-associated genes from different literature reviews (left) and combining them with genes from the NHGRI GWAS catalog (right). Other comparative disease lists were compiled in the same manner, including (C) Parkinson’s disease-associated genes, (D) type 2 diabetes-associated genes and (E) Alzheimer’s disease-associated genes. The number of genes from each source is indicated in the Venn diagrams.
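The list merging described for panels B to E can be illustrated with plain set operations. This is a minimal sketch, not the authors' pipeline: the function name `venn_counts`, the placeholder symbols `GENE_A` and `GENE_B`, and the list memberships shown are all invented for illustration.

```python
def venn_counts(literature_genes, gwas_genes):
    """Merge two gene lists and report the counts a two-set Venn diagram would show."""
    lit = set(literature_genes)
    gwas = set(gwas_genes)
    return {
        "literature_only": len(lit - gwas),  # genes found only in literature reviews
        "both": len(lit & gwas),             # genes found in both sources
        "gwas_only": len(gwas - lit),        # genes found only in the GWAS catalog
        "combined": sorted(lit | gwas),      # merged disease-associated gene list
    }


# Hypothetical inputs: memberships are made up, not taken from the paper.
literature = ["SERPINA1", "AGER", "GENE_A"]
gwas = ["AGER", "GENE_B"]
result = venn_counts(literature, gwas)
```

Using sets removes duplicate symbols within each source before counting, so a gene reported by several reviews is counted once.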
Fig 2.
COPD-associated genes are transcriptionally complex.
(A) We calculated the average number of transcripts per locus for length-normalized control loci and each disease-associated gene list. The control sets used as comparisons are labeled. The mean and standard deviation of the control lists are shown (black bar, gray box). Only the COPD-associated gene list is significantly different from the normalized control (p = 0.001). (B) We analyzed the number of gene loci that produce 1, 2, 3, 4, 5, and 6 or more transcripts and plotted these gene loci by their proportion in each list. The COPD-associated gene set has significantly fewer loci with 1 transcript and significantly more loci with more than 5 transcripts (p = 0.005 and p = 0.001, respectively).
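One way to compare a disease gene list against length-normalized controls is an empirical resampling test. The sketch below is an assumption about how such a comparison could be set up, not the procedure used for this figure: the function name, the 10 kb length bins, the 1,000 control sets, and the add-one correction are all illustrative choices.

```python
import random
from collections import defaultdict
from statistics import mean


def empirical_p(disease, pool, n_sets=1000, bin_size=10_000, seed=0):
    """Compare mean transcripts per locus with length-matched random control sets.

    disease, pool: lists of (gene_length_bp, transcript_count) tuples.
    Assumes every disease gene shares a length bin with at least one pool gene.
    Returns the observed mean and a one-sided empirical p-value.
    """
    rng = random.Random(seed)

    # Group control transcript counts by gene-length bin (hypothetical binning).
    by_bin = defaultdict(list)
    for length, n_tx in pool:
        by_bin[length // bin_size].append(n_tx)

    observed = mean(n_tx for _, n_tx in disease)

    hits = 0
    for _ in range(n_sets):
        # Draw one length-matched control locus per disease locus.
        control = [rng.choice(by_bin[length // bin_size]) for length, _ in disease]
        if mean(control) >= observed:
            hits += 1

    # Add-one correction so the p-value is never reported as exactly zero.
    return observed, (hits + 1) / (n_sets + 1)


# Toy data, invented for illustration only.
disease_loci = [(5_000, 11), (15_000, 9)]
control_pool = [(1_000, 1), (2_000, 2), (12_000, 1), (18_000, 3)]
observed_mean, p_value = empirical_p(disease_loci, control_pool)
```

With 1,000 control sets the smallest reportable p-value under this correction is 1/1001, so the resolution of the test depends directly on the number of control sets drawn.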
Fig 3.
The transcript complexity enrichment of COPD-associated genes is robust.
(A) Microarray data from the sputum of ex-smokers diagnosed with COPD stage 2, 3, or 4 identified 85 COPD-associated genes with significant differential expression (GDS4265). These 85 genes are shown on the y-axis and individuals are shown on the x-axis with their COPD status marked by disease stage along the top. (B) We calculated the average number of transcripts per locus for these 85 differentially expressed COPD-associated genes along with length-normalized control gene sets. The mean and standard deviation of the control lists are shown (black bar, gray box). These expressed genes are significantly more transcriptionally complex than controls (p = 0.001).
Table 1.
Transcriptionally complex loci associated with COPD from genes expressed in stage 2 versus stages 3 and 4 COPD patients [38].
Fig 4.
COPD splice junctions have lower GC content than expected.
(A) Diagram of a representative gene with the regions of interest labeled, including the transcription start site (TSS), the first donor splice site (Donor), the first acceptor splice site (Accept) and the subsequent internal donor and acceptor splice sites (DonorMid and AcceptMid). (B) We calculated the average GC content for the 60 nucleotides surrounding each region in COPD-associated genes as well as genes associated with other diseases (PRK, T2D and ALZ). As a comparison, we also analyzed the GC content of genes with disease-associated SNPs within the TSS or splice junctions (black dots).
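The GC calculation in panel B can be sketched as a windowed base count. This is a minimal illustration that assumes the 60 nucleotides are split as 30 on each side of the site and that coordinates are 0-based; the caption specifies neither, and the function names are hypothetical.

```python
def gc_content(seq):
    """Return the fraction of G and C bases in a nucleotide sequence."""
    if not seq:
        raise ValueError("sequence must not be empty")
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def window_gc(sequence, site, flank=30):
    """GC content of the window spanning `flank` bases on each side of `site`.

    `site` is a 0-based index into `sequence`; the window is clipped
    at the sequence ends (an assumption for this sketch).
    """
    start = max(0, site - flank)
    end = min(len(sequence), site + flank)
    return gc_content(sequence[start:end])


# Toy sequence: 30 A bases followed by 30 G bases, with the site at the boundary.
toy = "A" * 30 + "G" * 30
toy_gc = window_gc(toy, 30)
```

Averaging `window_gc` over every TSS or splice site in a gene list would give one value per region type, which is the quantity compared across disease lists in the panel.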
Table 2.
In COPD, the difference between the average GC content of the transcription start site and the first acceptor splice site is significantly smaller than expected (p = 0.004).
Fig 5.
Skipped exons and alternative acceptor splicing events are differentially enriched in COPD genes.
(A) SERPINA1 contains a number of different splicing events including (B) skipped exons, (C) alternative donors and (D) alternative acceptors. (E) On average, COPD genes contain fewer skipped exons and significantly more alternative acceptor splice events (p = 0.030) per total splice events in comparison with normalized reference genes.
Fig 6.
Highly expressed COPD-associated gene transcripts are enriched in lung tissue.
The highest expressed transcript from each COPD-associated gene is shown as a percentage of expression in each of the 16 tissues from BodyMap. COPD-associated gene expression is highest in lung, white blood cells, testes and liver (14.5%, 9.9%, 9.0% and 8.7%, respectively). Other COPD-associated transcripts not in these tissues are still tissue-specific and may be detrimental if expressed in the lung. Splice variants were determined as alternative splice sites through ASprofile [95].
Fig 7.
Usage of SERPINA1 and AGER splice variants by tissue and COPD status.
(A) The expression of each splice variant in SERPINA1 was normalized across tissues. The pattern of expression shows that these splice variants are not broadly used in every tissue, but are specific to the kidney, liver, lung and white blood cells. (B) Likewise, AGER is expressed in nearly every tissue tested, but its splice variants have specific patterns that imply regulation. Splice variants were determined as alternative splice sites through ASprofile [95]. (C) RNA-seq data from control subjects and COPD patients indicate that in SERPINA1 four exons are significantly differentially used (p < 0.05). (D) In the AGER gene, twelve exons are differentially expressed in COPD patients compared to normal controls (p < 0.05). Exon usage was generated with DEXSeq [101].
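The per-variant normalization across tissues mentioned for panel A could take several forms. The sketch below assumes the simplest one, scaling each variant's expression so its values sum to 1 across tissues; the caption does not state which normalization was used, and the function name, variant label, and values are hypothetical.

```python
def normalize_across_tissues(expression):
    """Scale each splice variant's expression so its tissues sum to 1.

    expression: {variant_id: {tissue: expression_value}}
    Variants with zero total expression are returned as all zeros.
    """
    normalized = {}
    for variant, by_tissue in expression.items():
        total = sum(by_tissue.values())
        if total == 0:
            normalized[variant] = {tissue: 0.0 for tissue in by_tissue}
        else:
            normalized[variant] = {
                tissue: value / total for tissue, value in by_tissue.items()
            }
    return normalized


# Toy values, invented for illustration only.
toy_expression = {"variant_1": {"liver": 30.0, "lung": 10.0, "kidney": 0.0}}
toy_normalized = normalize_across_tissues(toy_expression)
```

Normalizing within each variant makes tissue preference comparable between highly and weakly expressed variants, at the cost of hiding their absolute expression levels.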