Recurrent functional misinterpretation of RNA-seq data caused by sample-specific gene length bias

doi:10.1371/journal.pbio.3000481

Fig 1.

Sample-specific length effect couples differential gene expression and length in RNA-seq data.

(A) An RNA-seq experiment that measured gene-expression profiles in TNF- and vehicle-treated samples (both silenced for REL-A) (GEO accession: GSE64233) shows a significant coupling between gene length and FC of expression levels (after RPKM normalization of gene counts and averaging over three replicate samples of each condition). Spearman's correlation coefficient is indicated together with its statistical significance. The red line is the linear regression line. (See S1 Fig and S1 Table for results on a collection of 35 publicly available RNA-seq datasets.) Note that throughout the paper, gene length refers to the length of the gene's principal transcript (Methods). (B) Same analysis as in (A), but here the comparison is between two individual replicate samples of the same biological condition (TNF-treated cells silenced for REL-A replicate 1 versus replicate 3, as defined in S2 Table). (By definition, differences in gene expression between replicates reflect experimental technical effects.) Note that in both (A) and (B), data were RPKM-normalized before FC calculation, supposedly accounting for the length effect. Still, there is a technical coupling between FC and length. (C) Sample-specific length effect. Analyzing the two replicate samples from (B), we split the genes into 10 equally sized bins according to length (approximately 1,210 genes in each bin) and examined the distribution of gene expression in each bin. The length effect on expression markedly varies between these two replicates: shorter genes (bins 1–3) show higher expression in replicate 3, whereas longer genes (bins 8–10) show elevated expression in replicate 1. This sample-specific length bias underlies the strong technical link between differential expression and gene length that is shown in (B). (Average length in each bin is indicated below the bins.) Data underlying the results presented in this figure are provided in S1 Data.
FC, fold change; GEO, Gene Expression Omnibus; RNA-seq, RNA sequencing; RPKM, reads per kilobase of transcript length per million reads; TNF, tumor necrosis factor.
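
The quantities named in this caption (RPKM, log FC, and Spearman's correlation with gene length) can be illustrated with a minimal sketch. This is not the authors' pipeline: the toy counts, the gene lengths, the pseudocount of 1, and the function names are all assumptions made for illustration, and the significance test is omitted.

```python
import math

def rpkm(counts, lengths_bp):
    """Reads per kilobase of transcript length per million mapped reads."""
    total_reads = sum(counts)
    return [c * 1e9 / (total_reads * length)
            for c, length in zip(counts, lengths_bp)]

def log2_fc(expr_a, expr_b, pseudo=1.0):
    """log2 fold change of a over b; the pseudocount is an assumption."""
    return [math.log2((a + pseudo) / (b + pseudo))
            for a, b in zip(expr_a, expr_b)]

def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

# Toy data (hypothetical): two replicates with opposite length trends.
lengths = [500, 1000, 2000, 4000, 8000, 16000]
counts_rep1 = [40, 90, 210, 450, 1000, 2200]
counts_rep3 = [60, 110, 200, 380, 760, 1400]

fc = log2_fc(rpkm(counts_rep1, lengths), rpkm(counts_rep3, lengths))
rho = spearman(lengths, fc)
print(f"Spearman rho between gene length and log2 FC: {rho:.2f}")
```

Because RPKM divides every sample by the same gene lengths, it cannot remove a length trend that differs between samples, so the correlation survives the normalization in this toy example.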


Fig 2.

Sample-specific length bias is not removed by widely used RNA-seq normalization methods.

We applied six of the most popular normalization methods to the RNA-seq data shown in Fig 1B, which compare two replicate samples: RPKM, RPKM followed by qnorm, TMM normalization with FC estimation using edgeR model fit, RLE normalization with FC estimation using edgeR model fit, RLE followed by RPKM, and UQ followed by RPKM. Importantly, none of these methods removed the technical coupling between FC and gene length in this replicate-versus-replicate comparison. Data underlying the results presented in this figure are provided in S1 Data and in https://github.com/ElkonLab/RNA-seq_length_bias. FC, fold change; qnorm, quantile normalization; RLE, relative log expression; RNA-seq, RNA sequencing; RPKM, reads per kilobase of transcript length per million reads; TMM, Trimmed Mean of M values; TNFa, tumor necrosis factor alpha; UQ, upper quartile.
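
One way to see why scaling-based methods leave the coupling intact: TMM, RLE, and UQ each reduce to a single scaling factor per sample, and dividing a sample by a constant shifts every gene's log FC by the same offset. The sketch below illustrates this with an upper-quartile factor. The expression values and function names are hypothetical, the quartile is a simplified stand-in for what edgeR computes, and the argument does not cover qnorm, which is not a single-factor scaling.

```python
import math
import statistics

def upper_quartile(values):
    """75th percentile, used here as a simplified per-sample scaling factor."""
    return statistics.quantiles(values, n=4)[2]

def log2_fc(a, b, scale_a=1.0, scale_b=1.0):
    """log2 fold change of a over b after dividing each sample by its factor."""
    return [math.log2((x / scale_a) / (y / scale_b)) for x, y in zip(a, b)]

# Toy expression values (hypothetical), ordered from short to long genes.
rep_a = [8.0, 16.0, 32.0, 64.0]
rep_b = [4.0, 16.0, 64.0, 256.0]

raw_fc = log2_fc(rep_a, rep_b)
scale_a, scale_b = upper_quartile(rep_a), upper_quartile(rep_b)
scaled_fc = log2_fc(rep_a, rep_b, scale_a, scale_b)

# Every gene moves by the same constant, so the ordering of genes by FC,
# and therefore any rank correlation with length, is unchanged.
offset = math.log2(scale_b / scale_a)
shifts = [s - r for s, r in zip(scaled_fc, raw_fc)]
print(all(abs(shift - offset) < 1e-9 for shift in shifts))
```

A correction that removes the trend therefore has to depend on gene length within each sample, as cqn does.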


Fig 3.

Sample-specific length bias leads to false positive results by GSEA.

(A) As an example, GSEA analysis applied to the comparison between the two replicate samples shown in Fig 1B detects the GO category "mitochondrial-membrane-part" as a significantly enriched gene set (FDR < 0.001) (top). Genes assigned to the "mitochondrial-membrane-part" category are colored in blue in the scatter plot (red line is a lowess line) (bottom). Genes assigned to this GO category are significantly shorter than the set of all other genes expressed in the dataset (background set shown in gray) (p-value calculated using Wilcoxon test) (right). (B) cqn was applied to correct sample-specific length effects and cancel the coupling between gene length and differential expression. (C) Same GSEA analysis as in (A) but performed here after the data were corrected by cqn. Notably, cqn canceled the sample-specific length bias, and consequently, the GO category mitochondrial-membrane-part is no longer enriched. (See S3 Fig for numerous additional examples.) Data underlying the results presented in this figure are provided in S2 Data. cqn, conditional quantile normalization; FDR, false discovery rate; GO, Gene Ontology; GSEA, gene-set enrichment analysis; lowess, locally weighted scatterplot smoothing; NES, normalized enrichment score.
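
To illustrate how a length-driven ranking can produce an enriched gene set, the sketch below computes an unweighted running-sum enrichment score, which is equivalent to a Kolmogorov–Smirnov-style statistic. This is a simplification, not the GSEA software itself: it omits the correlation weighting, the permutation-based significance, and the NES normalization. The gene names and the set are hypothetical.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum score over a ranked gene list.

    Walks down the ranking, stepping up at genes in the set and down
    otherwise, and returns the largest deviation from zero.
    """
    n_hits = sum(1 for g in ranked_genes if g in gene_set)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must cover some, but not all, ranked genes")
    hit_step, miss_step = 1.0 / n_hits, 1.0 / n_miss
    running, best = 0.0, 0.0
    for gene in ranked_genes:
        running += hit_step if gene in gene_set else -miss_step
        if abs(running) > abs(best):
            best = running
    return best

# Hypothetical replicate-versus-replicate ranking driven only by length:
# the shortest genes come first.
ranked_by_fc = ["short1", "short2", "short3", "mid1", "mid2", "long1", "long2"]
short_gene_set = {"short1", "short2", "short3"}

print(enrichment_score(ranked_by_fc, short_gene_set))
```

A set composed mainly of short genes clusters at one end of such a ranking and reaches an extreme score even though no biological difference exists between the samples.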


Fig 4.

Sample-specific length bias correction by cqn reduces GSEA false calls without compromising the detection of true ones.

(A) Application of GSEA to the original data comparing TNF- and vehicle-treated samples (Fig 1A) detects both biologically true gene sets (in this example, the GO category “inflammatory response”) and false gene sets that stem from the FC–length technical effect (in this example, the GO category “mitochondrial protein complex”). (B) After cqn, the false call is no longer significant, and the detection of the genuine set is not compromised. (See S5 Fig for additional examples.) Data underlying the results presented in this figure are provided in S2 Data. cqn, conditional quantile normalization; FC, fold change; FDR, false discovery rate; GO, Gene Ontology; GSEA, gene-set enrichment analysis; NES, normalized enrichment score; TNF, tumor necrosis factor.


Fig 5.

cqn correction in an EMT dataset as a test case in which the true biological response is genuinely coupled to gene length.

EMT is known to involve strong induction of ECM genes. (A) True and false gene sets (GO “extracellular structure organization” and GO “ribosomal subunit” gene sets, respectively) detected by GSEA on RPKM-normalized EMT RNA-seq data (GSE114572). (B) cqn correction does not compromise the detection of the true set (ECM) but abolishes the false one (ribosomal subunit). (C) Length distribution of genes assigned to the GO extracellular structure organization and GO ribosomal subunit gene sets. Data underlying the results presented in this figure are provided in S3 Data. cqn, conditional quantile normalization; ECM, extracellular matrix; EMT, epithelial–mesenchymal transition; FDR, false discovery rate; GO, Gene Ontology; GSEA, gene-set enrichment analysis; NES, normalized enrichment score; RNA-seq, RNA sequencing; RPKM, reads per kilobase of transcript length per million reads.


Table 1.
