Union Exon Based Approach for RNA-Seq Gene Quantification: To Be or Not to Be?

doi:10.1371/journal.pone.0141910

Fig 1.

Analysis workflow and illustration of different methods for gene quantification.

A) The overall mapping and counting workflow. B) A hypothetical gene, its two isoforms and read coverage profile. Assume that the mapped reads from all genes total 1 million, and that each small and large exon is 1 kb and 2 kb long, respectively. C) featureCounts results, denoted as fc_rpkm. After exon flattening, the ‘union exons’ are 2 kb, 1 kb and 2 kb long, respectively. The calculated RPKM is 6.4. D) RSEM results, denoted as rsem_txSum_rpkm. Mapped reads are first distributed to individual isoforms, and the expression levels of the two isoforms are 2 RPKM and 6 RPKM, respectively. Accordingly, the expression of the entire gene is 8 RPKM. Note that in Fig 1D, the reads distributed to individual isoforms can instead be added up first, and the sum of reads then divided by the length of the union exons to give rsem_rpkm.
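The arithmetic in panels C and D can be checked with a minimal sketch. The `rpkm` helper and the variable names are illustrative. The count of 32 reads is not stated in the caption: it is back-calculated from the reported 6.4 RPKM, the 5 kb union length and the 1 million total reads.

```python
def rpkm(reads, length_bp, total_mapped_reads):
    """Reads Per Kilobase of feature length per Million mapped reads."""
    return reads / (length_bp / 1_000) / (total_mapped_reads / 1_000_000)


TOTAL_MAPPED = 1_000_000            # stated in the caption
UNION_EXONS_BP = [2_000, 1_000, 2_000]  # stated in the caption
union_length = sum(UNION_EXONS_BP)  # 5 kb after exon flattening

# Assumption: 32 reads, inferred as 6.4 RPKM * 5 kb * 1 M reads.
gene_reads = 32

# Union-exon approach (panel C): all reads over the flattened length.
fc_rpkm = rpkm(gene_reads, union_length, TOTAL_MAPPED)

# Transcript-sum approach (panel D): add the per-isoform RPKM values
# given in the caption.
isoform_rpkm = [2.0, 6.0]
rsem_txSum_rpkm = sum(isoform_rpkm)

print(f"fc_rpkm = {fc_rpkm:.1f}, rsem_txSum_rpkm = {rsem_txSum_rpkm:.1f}")
```

If all of those reads were also assigned by RSEM, dividing their sum by the union length would give the same 6.4. The gap to 8 arises because the transcript-level sum normalizes each isoform by its own length rather than by the union length.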

Table 1.

The STAR mapping and featureCounts counting summaries.

Table 2.

The total number of reads counted by featureCounts and RSEM.

Fig 2.

A) The scatter plot for “fc_rpkm versus rsem_rpkm”. In general, the results reported by RSEM and featureCounts are very close, and nearly identical for highly expressed genes. B) The scatter plot for “rsem_txSum_rpkm versus rsem_rpkm”. The difference between rsem_txSum_rpkm and rsem_rpkm is much larger than the difference between fc_rpkm and rsem_rpkm. C) The cumulative distribution of the RPKM ratio (rsem_txSum_rpkm / rsem_rpkm).

Note that in A) and B), the x-axis and y-axis represent log2(RPKM + 0.5). To avoid division by zero, the ratio in C) is calculated from RPKM values after a background value of 0.01 is added to each original RPKM.
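The two transformations described in this note can be sketched as follows. The function names are illustrative, and the sketch assumes the 0.01 background is added to both the numerator and the denominator of the ratio.

```python
import math


def log2_rpkm(rpkm, pseudocount=0.5):
    """Axis transform used in panels A and B: log2(RPKM + 0.5)."""
    return math.log2(rpkm + pseudocount)


def rpkm_ratio(rsem_txsum_rpkm, rsem_rpkm, background=0.01):
    """Ratio used in panel C. Assumption: the background value is
    added to both RPKM values before dividing."""
    return (rsem_txsum_rpkm + background) / (rsem_rpkm + background)


# An unexpressed gene stays finite under both transforms.
print(log2_rpkm(0.0))        # log2(0.5)
print(rpkm_ratio(0.0, 0.0))  # 0.01 / 0.01
```

Without the background value, a gene with rsem_rpkm equal to zero would raise a division error. With it, such genes contribute a ratio close to 1 when both estimates are zero.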

Fig 3.

The effect of intron retention reads on gene quantification.

Gene CLEC4GP1 has only one known transcript, consisting of 9 exons. In sample HBRR_C4, many intron retention reads are mapped to gene CLEC4GP1 and counted by featureCounts. The calculated fc_rpkm (19.04) is much higher than rsem_rpkm (4.00). The exon and intron regions in the coverage profile track are colored in red and dark blue, respectively. Note that sample HBRR_C4 is paired-end sequenced, and only the first reads are shown in the read alignment profile.

Fig 4.

The large difference between rsem_rpkm and rsem_txSum_rpkm for genes RP11-6N17.4 and HAMP.

A) The structures and expression levels of gene RP11-6N17.4 and its isoforms. B) The structures and expression levels of gene HAMP and its isoforms. The ‘union-exon’-based approach is incorrect for a gene that expresses multiple isoforms of varying length when the short isoforms dominate expression.
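A toy calculation shows why the discrepancy grows when a short isoform dominates. All lengths and read counts here are hypothetical and are not taken from RP11-6N17.4 or HAMP. The sketch assumes the union length equals the long isoform's length.

```python
def rpkm(reads, length_bp, total_mapped_reads):
    """Reads Per Kilobase of feature length per Million mapped reads."""
    return reads / (length_bp / 1_000) / (total_mapped_reads / 1_000_000)


def compare_methods(short_reads, long_reads,
                    short_len_bp=1_000, long_len_bp=10_000,
                    union_len_bp=10_000, total_mapped=1_000_000):
    """Return (transcript-sum RPKM, union-exon RPKM) for a hypothetical
    two-isoform gene. Default lengths are invented for illustration."""
    tx_sum = (rpkm(short_reads, short_len_bp, total_mapped)
              + rpkm(long_reads, long_len_bp, total_mapped))
    union = rpkm(short_reads + long_reads, union_len_bp, total_mapped)
    return tx_sum, union


# Same 10 reads, assigned entirely to the short or to the long isoform.
print(compare_methods(short_reads=10, long_reads=0))
print(compare_methods(short_reads=0, long_reads=10))
```

When all reads come from the long isoform, the two estimates agree. When they all come from the short isoform, the union-exon estimate divides by a length that the expressed transcript never spans, and so underestimates expression tenfold in this toy case.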

Table 3.

The rsem_rpkm and rsem_txSum_rpkm for genes RP11-6N17.4 and HAMP and their isoforms.

Table 4.

The average RPKM for the 11634 filtered genes.

Fig 5.

The impact of gene structural features on the difference between rsem_txSum_rpkm and rsem_rpkm.

Genes are sorted according to the structural features, and then non-overlapping bins are defined containing comparable numbers of genes. A) The ratio distribution across the different bins for the number of union exons. B) The ratio distribution across the different bins for the number of transcript isoforms. Taken together, the difference between rsem_rpkm and rsem_txSum_rpkm is primarily affected by the number of transcripts in a gene.
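The binning step described above can be sketched as follows. This is a simplified version that assumes bins are contiguous slices of the sorted gene list and that ties in the feature value may be split across bins. The gene records and ratio values are invented placeholders, not data from the study.

```python
from statistics import median


def equal_count_bins(genes, feature, n_bins):
    """Sort genes by a structural feature and split them into n_bins
    non-overlapping bins whose sizes differ by at most one gene."""
    ordered = sorted(genes, key=lambda g: g[feature])
    base, extra = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        size = base + (1 if i < extra else 0)  # spread the remainder
        bins.append(ordered[start:start + size])
        start += size
    return bins


# Placeholder records: "ratio" stands for rsem_txSum_rpkm / rsem_rpkm.
genes = [
    {"id": "g1", "n_isoforms": 1, "ratio": 1.0},
    {"id": "g2", "n_isoforms": 2, "ratio": 1.1},
    {"id": "g3", "n_isoforms": 3, "ratio": 1.3},
    {"id": "g4", "n_isoforms": 5, "ratio": 1.6},
    {"id": "g5", "n_isoforms": 8, "ratio": 2.0},
    {"id": "g6", "n_isoforms": 12, "ratio": 2.4},
    {"id": "g7", "n_isoforms": 20, "ratio": 3.0},
]

bins = equal_count_bins(genes, "n_isoforms", 3)
bin_medians = [median(g["ratio"] for g in b) for b in bins]
print([len(b) for b in bins], bin_medians)
```

Running the same function with a different `feature` key, such as a count of union exons, gives the per-bin distributions needed to compare how strongly each structural feature tracks the ratio.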

Fig 6.

The intersection between DE_Gene_Gene and DE_Tx_Gene.

Differential analysis is performed at both the gene and transcript levels, and DE_Gene_Gene contains the list of DE genes from the gene-level analysis. All DE isoforms are grouped by gene, and the resulting gene list is denoted as DE_Tx_Gene. As many as 2638 genes are not differentially expressed as a whole but have one or more differentially expressed isoforms.
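The construction of the two lists and their intersection can be sketched with set operations. The function name and all identifiers below are placeholders. The sketch assumes the gene-level result is available as a collection of gene IDs and the transcript-level result as a mapping from each DE isoform to its parent gene.

```python
def partition_de_genes(de_gene_gene, de_isoform_to_gene):
    """Group DE isoforms by gene (DE_Tx_Gene) and compare the result
    with the gene-level DE list (DE_Gene_Gene)."""
    de_gene_gene = set(de_gene_gene)
    de_tx_gene = set(de_isoform_to_gene.values())
    return {
        "both": de_gene_gene & de_tx_gene,
        "gene_only": de_gene_gene - de_tx_gene,
        # DE at the isoform level but not for the gene as a whole
        "tx_only": de_tx_gene - de_gene_gene,
    }


# Placeholder identifiers, not results from the study.
de_genes = ["geneA", "geneB"]
de_isoforms = {"txB1": "geneB", "txC1": "geneC", "txC2": "geneC"}

result = partition_de_genes(de_genes, de_isoforms)
print({name: sorted(ids) for name, ids in result.items()})
```

The `tx_only` set corresponds to the category highlighted in the caption: genes that show no change overall while one or more of their isoforms do.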

Table 5.

Differential analysis results and read counts for genes ENSG00000185963.9 and ENSG00000122126.11, and their isoforms.

Fig 7.

Isoform changes and switches. Gene BICD2 (Ensembl ID: ENSG00000185963.9) consists of two very similar isoforms.

At the gene level, there is not much overall difference between samples UHRR_C1 and HBRR_C4. However, the difference is dramatic at the transcript level. In the human brain sample HBRR_C4, only the long transcript ENST00000356884.6 is expressed, while in sample UHRR_C1 both isoforms are present. Note that in sample UHRR_C1, 48 reads span the junction between exons #7 and #8 (colored in indigo), and such reads can only originate from the short transcript ENST00000375512.

Fig 8.

Isoform changes and switches. Gene OCRL (Ensembl ID: ENSG00000122126.11) has 4 isoforms.

The two short isoforms are not appreciably expressed. The remaining two long transcripts are nearly identical and differ by a single exon encoding 8 amino acids. The longer isoform ENST00000371113.4 is the only form in brain, whereas both isoforms are present in UHRR_C1. Overall, the gene OCRL is not differentially expressed when comparing the UHRR group with the HBRR group, but its isoforms are, and the isoform changes run in opposite directions.

Fig 9.

The inaccuracy of isoform quantification is influenced by the strengths of the isoforms.

A) The hypothetical gene has two isoforms, termed #a and #b. Isoform #a skips exon #2 and is shorter than isoform #b. Exons #1 and #2 are each twice as long as exon #3. B) Only the short isoform #a is expressed. C) Only the long isoform #b is present. In B) and C), a portion of the reads is assigned to the other isoform, and the accuracy of isoform quantification is strongly influenced by the expression level of the isoforms. More accurate isoform quantification would be obtained if the number of exon-exon spanning reads and the read coverage pattern across isoforms were taken into account.
