Fig 1.
Mean-dispersion plots for the human RNA-Seq dataset.
The left panel shows the control group and the right panel shows the E2-treated group. Each group has seven biological replicates. The sequencing depth for this dataset is 30 million. Each point represents one gene, with its method-of-moments (MOM) dispersion estimate on the y-axis and its estimated relative mean frequency on the x-axis. The fitted curves for five dispersion models are superimposed on the scatter plot.
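As a rough illustration of the two quantities plotted for each gene, the sketch below computes a method-of-moments dispersion estimate and a relative mean frequency. It assumes the negative binomial variance form Var = μ + φμ² and equal library sizes across replicates; the helper name `mom_dispersion` and these simplifications are assumptions for illustration, not necessarily the estimator used to produce the figure.

```python
import numpy as np

def mom_dispersion(counts, lib_size):
    """Hypothetical helper: MOM dispersion and relative mean for one gene.

    Assumes Var = mu + phi * mu^2 and that every replicate has the
    same library size (an illustrative simplification).
    """
    counts = np.asarray(counts, dtype=float)
    m = counts.mean()            # sample mean of the read counts
    s2 = counts.var(ddof=1)      # unbiased sample variance
    if m == 0:
        return float("nan"), 0.0 # dispersion is undefined for all-zero genes
    phi = (s2 - m) / m**2        # solve s2 = m + phi * m^2 for phi
    rel_mean = m / lib_size      # estimated relative mean frequency
    return phi, rel_mean

phi, rel_mean = mom_dispersion([10, 20, 30], lib_size=1_000_000)
```

Under these assumptions the estimate can be negative when the sample variance is smaller than the mean; such genes cannot be shown on a log-scale dispersion axis.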
Table 1.
Proportion of variation in the dispersion estimates explained by the fitted models.
Table 2.
Estimated level of residual dispersion variation in five real RNA-Seq datasets.
Table 3.
Summary of DE test methods compared.
Fig 2.
True Positive Rate (TPR) vs. False Discovery Rate (FDR) plots for the six DE test methods performed on RNA-Seq datasets simulated to mimic real datasets.
The fold changes of DE genes are estimated from real data. The columns correspond to the following datasets (left to right) used as templates in the simulation: human, mouse, zebrafish, Arabidopsis, and fruit fly. The level of residual dispersion variation, σ, is set to the estimated value (σ̂) in the panels labeled A (first row) and to half the estimated value (σ̂/2) in the panels labeled B (second row). In each plot, the x-axis is the TPR (the same as recall and sensitivity) and the y-axis is the FDR (the same as one minus precision). The percentage of truly DE genes is set to 20% in all datasets. The FDR values are highly variable when the TPR is close to 0, because the denominator TP + FP is then close to 0.
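The TPR and FDR definitions in this legend can be made concrete with a short sketch. It ranks genes by p-value and, for each possible cutoff, computes TPR = TP / (number of truly DE genes) and FDR = FP / (TP + FP). The function name `tpr_fdr_curve` and the toy inputs are hypothetical; this is not the code used to draw the figures.

```python
import numpy as np

def tpr_fdr_curve(pvalues, is_de):
    """Hypothetical helper: TPR and FDR after calling the top-k genes DE."""
    pvalues = np.asarray(pvalues, dtype=float)
    is_de = np.asarray(is_de, dtype=bool)
    order = np.argsort(pvalues, kind="stable")  # most significant first
    hits = is_de[order]
    tp = np.cumsum(hits)        # true positives among the top-k calls
    fp = np.cumsum(~hits)       # false positives among the top-k calls
    tpr = tp / is_de.sum()      # recall / sensitivity
    fdr = fp / (tp + fp)        # one minus precision; tp + fp = k >= 1
    return tpr, fdr

tpr, fdr = tpr_fdr_curve([0.01, 0.2, 0.03, 0.5], [True, False, False, True])
```

With only a handful of calls, TP + FP is small, so a single false positive moves the FDR by a large step. This is the instability near TPR = 0 noted in the legend.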
Fig 3.
True Positive Rate (TPR) vs. False Discovery Rate (FDR) plots for the six DE test methods performed on RNA-Seq datasets simulated to mimic real datasets.
The fold changes of DE genes are fixed at 1.2 (half of the DE genes are over-expressed and the other half are under-expressed). Other simulation settings are identical to those described in the legend of Fig 2.
Fig 4.
True Positive Rate (TPR) vs. False Discovery Rate (FDR) plots for the six DE test methods performed on RNA-Seq datasets simulated to mimic real datasets.
The fold changes of DE genes are fixed at 1.5 (half of the DE genes are over-expressed and the other half are under-expressed). Other simulation settings are identical to those described in the legend of Fig 2.
Table 4.
Actual FDR for a nominal FDR of 0.1.
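To illustrate what "actual FDR for a nominal FDR" means in a simulation where the truth is known, the sketch below adjusts p-values, calls genes DE at the nominal level, and reports the fraction of calls that are false. It assumes a Benjamini-Hochberg adjustment, which this caption does not specify; the function names `bh_adjust` and `actual_fdr` are hypothetical.

```python
import numpy as np

def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values (assumed adjustment method)."""
    p = np.asarray(pvalues, dtype=float)
    n = p.size
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working from the largest p-value downwards.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.minimum(scaled, 1.0)
    return adjusted

def actual_fdr(pvalues, is_de, nominal=0.1):
    """Fraction of false calls among genes called DE at the nominal level."""
    is_de = np.asarray(is_de, dtype=bool)
    called = bh_adjust(pvalues) <= nominal
    n_called = called.sum()
    if n_called == 0:
        return 0.0  # convention: no calls means no false discoveries
    return float((called & ~is_de).sum() / n_called)

rate = actual_fdr([0.001, 0.01, 0.02, 0.8], [True, True, False, False])
```

An actual FDR above the nominal 0.1 would indicate a liberal test, and a value below it a conservative one.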
Fig 5.
True Positive Rate (TPR) vs. False Discovery Rate (FDR) plots for the six DE test methods performed on an RNA-Seq dataset simulated to mimic the human dataset.
On each curve, we marked the position corresponding to a reported FDR of 10% with a cross. The fold changes of DE genes are fixed at 1.2 (half of the DE genes are over-expressed and the other half are under-expressed). Other simulation settings are identical to those for the upper row of Fig. 2.
Fig 6.
Histograms of p-values for the non-DE genes from the six DE test methods.
The simulation dataset is based on the human dataset with σ set to the estimated value σ̂. Out of a total of 5,000 genes, 80% are non-DE.
Fig 7.
Histograms of p-values for the non-DE genes from the six DE test methods.
The simulation dataset is based on the human dataset with σ set to half the estimated value (σ̂/2). Out of a total of 5,000 genes, 80% are non-DE.
Fig 8.
MA plots for the edgeR:trended, NBPSeq:genewise, edgeR:tagwise-trend and QuasiSeq:QLSpline methods performed on the mouse dataset.
Predictive log fold changes (posterior Bayesian estimators of the true log fold changes, the “M” values) are shown on the y-axis. Averages of log counts per million (CPM), the “A” values, are shown on the x-axis. The M- and A-values are calculated using edgeR. The highlighted points correspond to the top 200 DE genes identified by each of the DE test methods.
Table 5.
Summary of RNA-Seq datasets analyzed in this article.
Fig 9.
In the simulation, the dispersion is generated according to an NB2 (left panel) or an NBQ (right panel) trend with added individual variation ε_i ∼ N(0, σ²). The x-axis is the true σ value and the y-axis is the estimated σ̂. For each true σ value, the simulation is repeated three times. The blue dots correspond to the median values.
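The dispersion-generating step described in this legend can be sketched as follows. The sketch assumes that the noise is added on the log-dispersion scale, that ε_i is normally distributed, and that the NB2 trend is a constant dispersion φ₀ across genes. The function name `simulate_nb2_dispersions` and the value φ₀ = 0.1 are hypothetical and chosen only for illustration.

```python
import numpy as np

def simulate_nb2_dispersions(n_genes, phi0, sigma, seed=0):
    """Hypothetical helper: log(phi_i) = log(phi0) + eps_i, eps_i ~ N(0, sigma^2).

    Assumes a constant (NB2-style) trend phi0 and noise on the log scale.
    """
    rng = np.random.default_rng(seed)
    eps = rng.normal(loc=0.0, scale=sigma, size=n_genes)  # residual variation
    return phi0 * np.exp(eps)

phi = simulate_nb2_dispersions(n_genes=20_000, phi0=0.1, sigma=0.5)
# Spread of the log-dispersions around the trend, which should be near sigma.
spread = np.std(np.log(phi) - np.log(0.1), ddof=1)
```

Here the spread is computed from the true dispersions, so it recovers σ almost directly. Estimating σ from count data is harder, because the dispersions themselves must first be estimated.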
Fig 10.
The calibration plot for estimating residual dispersion variation σ for the mouse dataset.
The x-axis is the σ value used to generate the data. The y-axis is the estimated σ̂. The horizontal line corresponds to the σ̂ estimated from the mouse dataset.
Table 6.
Calibrated values for the five real datasets.