The Level of Residual Dispersion Variation and the Power of Differential Expression Tests for RNA-Seq Data

  • Gu Mi ,

    neo.migu@gmail.com

    Affiliation Department of Statistics, Oregon State University, Corvallis, Oregon, United States of America

  • Yanming Di

    Affiliations Department of Statistics, Oregon State University, Corvallis, Oregon, United States of America, Molecular and Cellular Biology Program, Oregon State University, Corvallis, Oregon, United States of America

Abstract

RNA-Sequencing (RNA-Seq) has been widely adopted for quantifying gene expression changes in comparative transcriptome analysis. For detecting differentially expressed genes, a variety of statistical methods based on the negative binomial (NB) distribution have been proposed. These methods differ in the ways they handle the NB nuisance parameters (i.e., the dispersion parameters associated with each gene) to save power, such as by using a dispersion model to exploit an apparent relationship between the dispersion parameter and the NB mean. Presumably, dispersion models with fewer parameters will result in greater power if the models are correct, but will produce misleading conclusions if not. This paper investigates this power and robustness trade-off by assessing rates of identifying true differential expression using the various methods under realistic assumptions about NB dispersion parameters. Our results indicate that the relative performances of the different methods are closely related to the level of dispersion variation unexplained by the dispersion model. We propose a simple statistic to quantify the level of residual dispersion variation from a fitted dispersion model and show that the magnitude of this statistic gives hints about whether and how much we can gain statistical power by a dispersion-modeling approach.

Introduction

Over the last ten years, RNA-Sequencing (RNA-Seq) has become the technology of choice for quantifying gene expression changes in comparative transcriptome analysis [1]. The negative binomial (NB) distribution has been widely used for modeling RNA-Seq read counts [2–4]. Although early studies showed that the Poisson model is adequate for modeling RNA-Seq count variation from technical replicates [5], many recent RNA-Seq analyses have revealed that RNA-Seq counts from biological replicates show significant extra-Poisson variation. The NB distribution can be derived as a mixture of Poisson distributions in the so-called Gamma-Poisson model. For a random variable $Y$ having an NB distribution with mean $\mu$ and dispersion $\phi$, the variance is given by $\mathrm{Var}(Y) = \mu + \phi\mu^2$, and the dispersion parameter $\phi$ determines the extent to which the variance exceeds the mean. The square root of $\phi$ is also termed the “biological coefficient of variation” (BCV) in [6].
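As a small numerical illustration of the variance formula above, the following sketch evaluates the NB variance and the BCV for a few dispersion values. The function names `nb_variance` and `bcv` are hypothetical helpers introduced here for illustration and do not come from any of the packages discussed in the paper.

```python
import math


def nb_variance(mu, phi):
    """NB variance for mean mu and dispersion phi: Var(Y) = mu + phi * mu^2."""
    return mu + phi * mu ** 2


def bcv(phi):
    """Biological coefficient of variation: the square root of the dispersion."""
    return math.sqrt(phi)


# With phi = 0 the variance equals the mean (the Poisson case);
# larger phi gives more extra-Poisson variation.
for phi in (0.0, 0.01, 0.1):
    print(phi, nb_variance(100.0, phi), bcv(phi))
```

At a mean of 100 reads, even a modest dispersion makes the quadratic term dominate the variance, which is why the dispersion estimate matters so much for DE tests.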

The dispersion $\phi$ is a nuisance parameter in tests for differential expression (DE), but correct estimation of $\phi$ is essential for valid statistical inference. In a typical RNA-Seq experiment, our ability to detect truly DE genes is hampered by the large number of genes, the small sample size, and the need to estimate the dispersion parameters. To ameliorate this difficulty, many different NB dispersion models have been proposed (see the Background section for more details) with a common theme of “pooling information across genes”. An NB dispersion model relates the dispersion to some measure of read abundance, $a$, through a simple parametric or smooth function $f$ with a small number of parameters $\alpha$ (estimated from data): $\log(\phi_{ij}) = f(a_{ij}; \alpha)$ (1), where $i$ indexes genes and $j$ indexes biological samples. For example, in [4] we let $a$ be preliminarily estimated mean relative frequencies and let $f$ be a linear or quadratic function of $\log(a)$. This and other dispersion models are motivated by empirical evidence of a trend, over all genes, of decreasing dispersion with increasing relative frequency of RNA-Seq reads. By introducing a dispersion model $f$, one hopes to summarize the dispersion parameters for all genes by a small number of model parameters $\alpha$ and thus drastically reduce the number of nuisance parameters to estimate. A dispersion-modeling approach as described above can lead to power saving if a correct or “close enough” model is used. While empirical evidence overwhelmingly suggests a general trend between dispersion level and mean expression, goodness-of-fit measures [6, 7] suggest that simple parametric and smooth function models may not be able to capture the total variation in dispersion (see the subsection “Background/Goodness-of-Fit Tests” for more details).

The key question that motivates this study is: even when a dispersion model shows lack-of-fit, to what degree can it still be useful in improving the power of the DE test? It will be convenient for us to consider a general trend in the dispersion parameter, but also allow for variation about the trend, as follows: $\log(\phi_{ij}) = f(a_{ij}; \alpha) + \varepsilon_i$ (2), where $\varepsilon_i$ represents an individual component in $\phi$ that is unexplained by the trend. Intuitively, the strategy of “pooling information across genes” through a dispersion model $f$ will be most effective if the overall level of residual variation in $\varepsilon$ is low. In this paper, as an approximation, we model $\varepsilon$ using a normal distribution, $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$, and quantify the level of variation in $\varepsilon$ by $\sigma^2$. We estimate $\sigma$ for five real RNA-Seq datasets (from human, mouse, zebrafish, Arabidopsis and fruit fly) and then investigate the power and robustness of DE tests when the amount of residual variation in dispersion matches that from the real data. We also explore how the relative performances of different DE test methods change as the magnitude of $\sigma$ changes.

In this paper, we focus on the overall level of deviation (summarized by $\sigma$) from an estimated model for log dispersion. Zhou et al. [8] discussed the impact of “outliers” (a small number of highly influential outlying cases) on the performance of DE tests. Under our framework, it is possible to investigate the impact of such individual outliers by considering non-normal models (such as a binomial or Poisson point process model) for $\varepsilon$, but such extensions are nontrivial and we will not pursue them in this paper. Our approach for estimating $\sigma^2$ is related to the empirical Bayes approach for estimating $\varepsilon$ under a normal prior distribution. However, our focus in this paper is on estimating $\sigma^2$, not the individual $\varepsilon_i$'s. The quantity $\sigma^2$ is related to the quantity $d_0$ discussed in [9]. We explain this connection in more detail in the subsection “Background/Weighted Likelihood and Empirical Bayes Methods”.

Background

RNA-Seq

In brief, a typical RNA-Seq pipeline can be summarized as follows: purified RNA samples are converted to a library of cDNA with attached adaptors, and then sequenced on an HTS platform to produce millions of short sequences from one or both ends of the cDNA fragments. These reads are aligned to either a reference genome or transcriptome (called sequence mapping), or assembled de novo without the genomic sequence. The aligned reads are then summarized by counting the number of reads mapped to the genomic features of interest (e.g., exons or genes), and the expression profile is eventually represented by a matrix of read counts (non-negative integers) where rows are genes (or some other genomic features like exons) and columns are samples. Subsequent steps that rely heavily on statistical analyses include normalization of reads and testing DE genes between samples under different environmental or experimental conditions.

NB Regression Models

An NB regression model for describing the mean expression as a function of explanatory variables includes the following two components:

  1. An NB distribution for the individual RNA-Seq read counts $Y_{ij}$: $Y_{ij} \sim \mathrm{NB}(\mu_{ij}, \phi_{ij})$, where $i = 1, \ldots, m$ indexes genes, $j = 1, \ldots, n$ indexes samples, $\mu_{ij}$ is the mean, and $\phi_{ij}$ is the dispersion parameter such that $\mathrm{Var}(Y_{ij}) = \mu_{ij} + \phi_{ij}\mu_{ij}^2$.
  2. A log-linear regression model for the mean $\mu_{ij}$ as a function of $p$ explanatory variables $X_{jk}$ ($k = 1, \ldots, p$): $\log(\mu_{ij}) = \log(N_j) + \log(R_j) + \sum_{k=1}^{p} \beta_{ik} X_{jk}$ (3)
These two components resemble a generalized linear model (GLM) [10], but note that the dispersion $\phi_{ij}$ is unknown (see the “NB Dispersion Models” subsection below). The two additive constants, $\log(N_j)$ and $\log(R_j)$, have to do with count normalization: accounting for different observed library sizes ($N_j$) and the apparent reduction/increase in the expression levels of non-DE genes resulting from the increased/decreased expression of a few truly DE genes [3, 11]. The normalization constants, $N_j$ and $R_j$, are pre-estimated and treated as known during GLM fitting. In many applications, the same constant ($N_j R_j$) is assumed for all genes in a sample, but it may be advantageous to introduce between-gene normalization factors to account for some gene-specific sources of technical biases such as GC-content and gene length [12]. Between-gene normalization can be incorporated into the GLM framework as well. See [13–15] for relevant discussions.

DE Tests

Testing differential expression can often be reduced to testing that one or more of the regression coefficients equal zero. For example, for comparing gene expression levels between two groups, we can let $p = 2$, $X_{j1} = 1$ for all $j$; $X_{j2} = 1$ if sample $j$ is from group 2 and $X_{j2} = 0$ if sample $j$ is from group 1. Under this parameterization, $\beta_1$ corresponds to group 1’s relative mean expression level and $\beta_2$ corresponds to the log fold change between group 2 and group 1. The null hypothesis is $H_0: \beta_2 = 0$.
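The two-group parameterization can be sketched in a few lines. The sketch below assumes the log-linear mean model takes the form log(μ_j) = log(N_j) + log(R_j) + Σ_k β_k X_jk for a single gene, which is an assumption made for this illustration; the helper names `two_group_design` and `nb_means` are hypothetical.

```python
import math


def two_group_design(groups):
    """Rows (X_j1, X_j2): an intercept and an indicator for group 2."""
    return [(1.0, 1.0 if g == 2 else 0.0) for g in groups]


def nb_means(beta, design, lib_sizes, norm_factors):
    """Means for one gene: mu_j = N_j * R_j * exp(sum_k beta_k * X_jk)."""
    return [
        n * r * math.exp(sum(b * x for b, x in zip(beta, row)))
        for row, n, r in zip(design, lib_sizes, norm_factors)
    ]


# Illustrative values only: relative frequency 1e-5 in group 1,
# a two-fold increase in group 2, one million reads per library.
design = two_group_design([1, 1, 2, 2])
beta = (math.log(1e-5), math.log(2.0))
print(nb_means(beta, design, [1e6] * 4, [1.0] * 4))
```

Setting the second coefficient to zero gives identical means in both groups, which is the null hypothesis being tested.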

In general NB regression settings, exact tests are not available, but asymptotic tests, such as the likelihood ratio test, can be used. Di et al. [16, 17] showed that the performance of the likelihood ratio test in small-sample settings can be improved with a higher-order asymptotics (HOA) adjustment. Lund et al. [18] discussed quasi-likelihood (QL) methods that replace the likelihood ratio test with a QL F-test for better FDR control, where the test statistic is based on quasi-dispersion parameter estimates or on two variants, called QLShrink and QLSpline, that pool information across genes.

NB Dispersion Models

As mentioned in the Introduction section, many current DE analysis methods use an NB dispersion model to capture the general trend between dispersion and read abundance. The different DE analysis methods can be put into the following general categories according to the functional form f of the dispersion model and the treatment of individual variation (see Equation (2)):

  1. Common: Earlier works of Robinson and Smyth [19] discussed a common dispersion model where $f$ is a constant. In other words, $\phi_{ij} = c$ for all $i$, $j$.
  2. Parametric function: Recognizing an evident trend between the dispersion and relative gene expression, Di et al. [4] adopted a parametric NBP model where the log dispersions are modeled as a linear function of the log relative mean frequencies. Referring to Equation (1), in an NBP model, $a_{ij} = \pi_{ij} = \mu_{ij}/(N_j R_j)$ and $f(a_{ij}; \alpha) = \alpha_0 + \alpha_1 \log(\pi_{ij})$. A natural extension to NBP is the NBQ model which incorporates an extra quadratic term: $\log(\phi_{ij}) = \alpha_0 + \alpha_1 \log(\pi_{ij}) + \alpha_2 [\log(\pi_{ij})]^2$ (4)
  3. Smooth function: Anders and Huber [3] suggested fitting a non-parametric curve to capture the dispersion-mean dependence. McCarthy et al. [6] introduced a similar “trended” (non-parametric) model. NBPSeq added an NBS model for a non-parametric smooth dispersion model.
The methods above ignore possible individual dispersion variation (i.e., $\varepsilon_i$ in Equation (2)) in subsequent DE tests.
  4. Shrinkage methods: McCarthy et al. [6] discussed options to use a weighted average between genewise dispersion estimates and trended estimates in an empirical Bayes framework (we will call this method “tagwise-trend”). The genewise estimates can also be shrunk towards a common value [20]. Love et al. [12] added a shrinkage option in DESeq2.
  5. Quasi-likelihood methods: Lund et al. [18] suggested fitting a quasi-likelihood (QL) model by specifying (for gene $i$ and sample $j$): $\mathrm{Var}(Y_{ij}) = \Phi_i V_i(\mu_{ij})$ (5) with the NB variance function $V_i(\mu_{ij}) = \mu_{ij} + \omega_i \mu_{ij}^2$. Both the NB dispersion parameter ($\omega_i$) and the quasi-likelihood dispersion parameter ($\Phi_i$) are estimated from the data and used to model the variance of the read count $Y_{ij}$. The QL-dispersion $\Phi_i$ adjusts for degrees of freedom and accounts for uncertainty in the estimated NB variance. A shrinkage method is used to estimate $\Phi_i$, and two variants, “QLShrink” and “QLSpline”, differ in the formulation of the prior distribution of $\Phi_i$. These QL-based approaches are implemented in the QuasiSeq package. (See also the review in the subsection “Weighted Likelihood and Empirical Bayes Methods” below.)
  6. Genewise: The NBPSeq package allows for fitting an NB regression model and performing a DE test for each gene separately without assuming any dispersion model. An HOA adjustment is used to improve the performance of the likelihood ratio test.
In the above, we mainly summarized methods implemented in the R/Bioconductor packages DESeq, DESeq2, edgeR, NBPSeq and QuasiSeq [21, 22]. They represent the wide range of currently available options. These packages use slightly different predictors ($a_{ij}$ in Equation (1)) in their dispersion models, and also use different methods to estimate dispersion models, but these differences are not of primary interest in our power-robustness analysis. As we will see later, the main factor that influences the DE test performance is how the individual dispersion variation is handled.
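To make the parametric trend functions in the list above concrete, here is a minimal sketch of the common, NBP and NBQ dispersion models as functions of the relative mean frequency π. It assumes the NBQ trend is quadratic in log(π) with no centering of the predictor, which may differ from how a given package parameterizes it. The function names and coefficient values are hypothetical and chosen only for illustration.

```python
import math


def common_dispersion(pi, c):
    """Common model: the same dispersion c for every gene and sample."""
    return c


def nbp_dispersion(pi, a0, a1):
    """NBP model: log(phi) = a0 + a1 * log(pi)."""
    return math.exp(a0 + a1 * math.log(pi))


def nbq_dispersion(pi, a0, a1, a2):
    """NBQ model (assumed form): log(phi) = a0 + a1*log(pi) + a2*log(pi)^2."""
    z = math.log(pi)
    return math.exp(a0 + a1 * z + a2 * z * z)


# A negative slope a1 reproduces the decreasing dispersion-mean trend.
for pi in (1e-6, 1e-5, 1e-4):
    print(pi, nbp_dispersion(pi, -4.0, -0.3), nbq_dispersion(pi, -4.0, -0.3, 0.01))
```

With the quadratic coefficient set to zero, the NBQ sketch reduces to NBP, and with the slope also set to zero it reduces to the common model.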

Goodness-of-Fit Tests

Mi et al. [7] discussed a resimulation-based goodness-of-fit (GOF) test for negative binomial models fitted to individual genes, and then extended the test to multiple genes using Fisher’s method for combining p-values. The paper also introduced diagnostic plots for judging GOF. McCarthy et al. [6] transformed genewise deviance statistics to normality and used QQ-plots to examine GOF of different dispersion models. In particular, their QQ-plots (Fig. 2 in their paper) indicated that simple dispersion models, such as a common or trended dispersion model, showed lack-of-fit when used to model an RNA-Seq dataset from a study on oral squamous cell carcinomas (OSCC). One question that motivated this study is how different DE test methods perform when the fitted dispersion model (the trend part) shows lack-of-fit. Intuitively, the performance of different test methods, especially the ones that do not explicitly account for individual residual variation, should be related to the level of residual dispersion variation. We want to make this statement more precise. This motivated us to quantify the level of residual dispersion variation using $\sigma^2$ and relate the power/robustness analysis to the magnitude of $\sigma^2$.
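For readers unfamiliar with Fisher's method, the sketch below shows the generic version. It combines k independent p-values through the statistic −2 Σ log(p), which follows a chi-square distribution with 2k degrees of freedom under the null. This is the textbook procedure only, not the resimulation-based test of [7], and the function name `fisher_combined_pvalue` is hypothetical.

```python
import math


def fisher_combined_pvalue(pvalues):
    """Combine independent p-values (each in (0, 1]) with Fisher's method."""
    k = len(pvalues)
    half = -sum(math.log(p) for p in pvalues)  # half of the chi-square statistic
    # Chi-square survival function with even df 2k has a closed form:
    # exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return math.exp(-half) * total


print(fisher_combined_pvalue([0.04, 0.20, 0.35]))
```

A single p-value is returned unchanged, which is a convenient sanity check on the closed-form tail probability.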

Weighted Likelihood and Empirical Bayes Methods

In the edgeR package, one can estimate the genewise (or tagwise) dispersion by maximizing the weighted average of two adjusted profile likelihoods: $\mathrm{APL}_i(\phi_i) + G_0\,\mathrm{APL}_S(\phi_i)$ (6), where $\mathrm{APL}_i$ is computed from each gene separately, $\mathrm{APL}_S$ represents the general trend in mean-dispersion dependence, and $G_0$ is the weight given to $\mathrm{APL}_S$. The detailed formulation of $\mathrm{APL}_S(\phi_i)$ has been evolving over the years. For example, it can be formed by a (weighted) average of $\mathrm{APL}_i$ values for genes near $i$. This weighted likelihood method has its roots in the empirical Bayes method, and $\mathrm{APL}_S$ serves as the prior likelihood [6, 9, 20].

To estimate $G_0$, Chen et al. [9] considered an empirical Bayes approach using quasi-likelihood. A variance function $V(\mu)$ was used to specify the mean-variance relationship according to, for example, a Poisson or a negative binomial model, and a quasi-likelihood function: $\mathrm{Var}(y_{ij}) = \sigma_i^2 V(\mu_{ij})$ (7) was used to model the additional variation in the mean-variance relationship between genes (they indexed genes with the letter $g$ while we use $i$ in this paper). Chen et al. [9] assumed a scaled inverse $\chi^2$ prior distribution for $\sigma_i^2$: $\sigma_i^2 \sim s_0^2\, d_0 / \chi^2_{d_0}$, with parameters $s_0^2$ and $d_0$. In comparison, the model (Equation (2)) in this paper is on the dispersion parameter. The parameter $d_0$ is called the prior degrees of freedom, and it plays a role analogous to that of $\sigma^2$ in this paper. For a series of simulated datasets, our estimates of $\sigma^2$ are approximately inversely proportional to estimates of $d_0$, as explained below (see Fig. E in the Supporting Information S1 File).

Under an empirical Bayes framework, the parameters of the prior distribution are estimated from the data. Let $D_i$ be the residual deviance of the generalized linear model fitted to read counts and $d_i$ be the known effective residual degrees of freedom for gene $i$. Chen et al. [9] explained that, given $\sigma_i^2$, the mean residual deviance, defined as $s_i^2 = D_i/d_i$, has, approximately, a scaled chi-square conditional distribution: $s_i^2 \mid \sigma_i^2 \sim \sigma_i^2 \chi^2_{d_i}/d_i$. It then follows that the marginal distribution of $s_i^2$ is a scaled F-distribution: $s_i^2 \sim s_0^2 F_{d_i, d_0}$. The parameters $s_0^2$ and $d_0$ can be estimated from $s_i^2$ using the method of moments. Chen et al. [9] suggested that one can use $d_0/d_i$ as $G_0$ in the weighted likelihood (Equation (6)). Recent versions of edgeR provide this option. However, for the simulations performed in this paper, when performing DE tests using edgeR, we estimated the dispersion parameters using the edgeR functions estimateGLMTrendedDisp and estimateGLMTagwiseDisp, where a similar weighted likelihood was considered, but the default value $G_0 = 10$ was used (see also McCarthy et al. [6]).

The variance function ($V(\mu)$) and quasi-likelihood function (7) described above are essentially the same ones as considered in [18] (cf. Equation (5)), but the estimation methods and the definition of $d_i$ used in the two papers were slightly different (e.g., one of the reviewers pointed out that a refinement was made in Chen et al. [9] where $d_i$ is decreased slightly to allow for bias in the residual deviance associated with exact zero counts). In [18], the estimated $d_0$ was used for constructing the quasi-likelihood F-test. Wu et al. [23] proposed another empirical Bayes shrinkage estimator for the dispersion parameter which aimed to adequately capture the heterogeneity in dispersion among genes. The empirical Bayes strategy has also been used in [24] for modeling microarray data.

Other Related Work

There are also recent works on comparing the performances of DE tests: Soneson and Delorenzi [25] evaluated 11 tools for their ability to rank truly DE genes ahead of non-DE genes, the Type-I error rate and false discovery rate (FDR) controls, and computational times. Landau and Liu [26] discussed dispersion estimation and its impact on DE test performance, mainly focusing on different shrinkage strategies (none, common, tagwise or maximum). The key aspects of this paper are to explicitly quantify the level of inadequacy of a fitted dispersion model using a simple statistic, and to link the magnitude of this statistic directly to the performance of the associated DE test.

Results

We investigate the power and robustness of DE tests under realistic assumptions about the NB dispersion parameters. We fit the NBQ dispersion model (see Equation (4)) to real datasets to capture the general trend in the dispersion-mean dependence. We model the residual variation in dispersion using a normal distribution (see Equation (2)), and the level of residual variation is then summarized by a simple quantity, the normal variance $\sigma^2$. Because biological variations are likely to differ across species, and experiments involve varied sources of uncertainty, we choose to analyze five datasets from different species that represent a broad range of characteristics and diversity for typical RNA-Seq experiments. The species include human (Homo sapiens), mouse (Mus musculus), zebrafish (Danio rerio), Arabidopsis (Arabidopsis thaliana) and fruit fly (Drosophila melanogaster). The Methods section includes descriptions of the datasets. For each experiment/dataset, unless otherwise specified, we will provide the following results:

  1. Mean-dispersion plot with trends estimated from NB dispersion models;
  2. Gamma log-linear regression as informal model checking;
  3. Estimation of the variance $\sigma^2$ of dispersion residuals from a fitted dispersion model;
  4. Power-robustness evaluations of DE tests using datasets simulated to mimic real datasets.
The main focus of this paper is on the quantification of the level of residual dispersion variation and power-robustness investigation under realistic settings (3 and 4 above). The diagnostic plots and statistics (1 and 2 above) are useful in routine analysis of RNA-Seq data, and they also help us verify that the NBQ dispersion model largely captures the general trend in the dispersion-mean dependence.

Anders et al. [27] suggested keeping only genes with more than one read per million (rpm) in at least n of the samples, where n is the size of the smallest group of replicates. We follow a similar criterion but set n = 1 in order to keep more (lowly-expressed) genes in the study. In R, this is achieved by subsetting the row indices by rowSums(cpm(data)>1)>=1. The library size adjustments are computed for genes passing this criterion.
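The filtering rule can be mirrored outside R. The sketch below is a plain-Python analogue of rowSums(cpm(data)>1)>=1, assuming counts per million are computed from raw column totals with no extra normalization factors. The helper names `cpm` and `keep_genes` are defined here for illustration and are not the edgeR functions.

```python
def cpm(counts):
    """Counts per million for a genes-by-samples matrix given as a list of rows."""
    lib_sizes = [sum(col) for col in zip(*counts)]
    return [[1e6 * y / n for y, n in zip(row, lib_sizes)] for row in counts]


def keep_genes(counts, threshold=1.0, min_samples=1):
    """True for genes with cpm above the threshold in at least min_samples samples."""
    return [
        sum(v > threshold for v in row) >= min_samples
        for row in cpm(counts)
    ]


# Toy matrix: three genes, two samples, two million reads per sample.
counts = [
    [1, 2],              # cpm 0.5 and 1.0: never strictly above 1, dropped
    [3, 0],              # cpm 1.5 in the first sample: kept
    [1999996, 1999998],  # highly expressed: kept
]
print(keep_genes(counts))
```

Raising `min_samples` to the size of the smallest replicate group gives the stricter criterion described above.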

Mean-Dispersion Plots with Estimated Trends from Dispersion Models

Fig. 1 shows the mean-dispersion plots for the two treatment groups in the human dataset (with sequencing depth of 30 million). In each plot, method-of-moment (MOM) estimates ($\hat{\phi}^{\mathrm{MOM}}$) of the dispersion $\phi$ for each gene are plotted against estimated relative mean frequencies (on the log-log scale). For each gene $i$, $\hat{\phi}_i^{\mathrm{MOM}}$ is defined as $\frac{\sum_{j=1}^{n}\left[(y_{ij} - \tilde{\mu}_i)^2 - \tilde{\mu}_i\right]}{n\,\tilde{\mu}_i^2}$, where $y_{ij}$ are the read counts and $\tilde{\mu}_i$ is their mean. Note that for this dataset, the library sizes (column totals) are roughly the same. Genes with $\hat{\phi}_i^{\mathrm{MOM}} \le 0$ were not used in the mean-dispersion plots and the gamma log-linear regression analysis. We also overlaid the trends from five fitted dispersion models representing the wide range of currently available options: common, NBP, NBQ, NBS and trended (see the “Background/NB Dispersion Models” subsection above). We make the following remarks:

  1. The fitted NBP, NBQ, NBS and trended dispersion models all capture the overall decreasing trend in the MOM genewise estimates.
  2. The fitted models agree more in the mid-section of the expression distribution and less in the tails where genes have extremely low or high expression levels. This kind of behavior is common in non-parametric smooth estimates and regression models, and it has some implications for how we design the power simulations later.
  3. Such mean-dispersion plots are informative in checking how different dispersion models may potentially over-/under-estimate the dispersion parameters, which in turn will influence DE test results.
  4. Note that the deviation of the genewise MOM estimates from the fitted dispersion models is not the same as the $\varepsilon$ in Equation (2), since this deviation also reflects the additional estimation error due to small sample size.
Mean-dispersion plots for the other four datasets show similar features and are included in Figs. A–D of the Supporting Information S1 File.
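The method-of-moment estimate used in these plots is simple enough to compute by hand. The sketch below evaluates Σ_j[(y_ij − μ̃_i)² − μ̃_i] / (n μ̃_i²) for one gene, assuming roughly equal library sizes as noted for the human dataset. The function name `mom_dispersion` is hypothetical.

```python
def mom_dispersion(counts):
    """Method-of-moment dispersion estimate for one gene's read counts.

    Returns None when the mean count is zero, where the estimate is undefined.
    """
    n = len(counts)
    mean = sum(counts) / n
    if mean == 0:
        return None
    return sum((y - mean) ** 2 - mean for y in counts) / (n * mean ** 2)


# Overdispersed counts give a positive estimate.
print(mom_dispersion([10, 20, 30]))
# Counts less variable than Poisson give a non-positive estimate;
# such genes were left out of the mean-dispersion plots.
print(mom_dispersion([5, 5, 5]))
```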

Fig 1. Mean-dispersion plots for the human RNA-Seq dataset.

The left panel is for the control group and the right panel is for the E2-treated group. Each group has seven biological replicates. The sequencing depth for this dataset is 30 million. Each point on the plots represents one gene with its method-of-moment (MOM) dispersion estimate ($\hat{\phi}^{\mathrm{MOM}}$) on the y-axis and estimated relative mean frequency on the x-axis. The fitted curves for five dispersion models are superimposed on the scatter plot.

https://doi.org/10.1371/journal.pone.0120117.g001

Gamma Log-Linear Regression Analysis

As informal model checking, we fit polynomial gamma log-linear regression models of $\hat{\phi}^{\mathrm{MOM}}$ on $\log(\hat{\pi})$. Table 1 summarizes the variability in the logged genewise dispersion estimates $\log(\hat{\phi}^{\mathrm{MOM}})$ explained by the linear, quadratic and cubic models (results shown for the control group only and without pre-filtering lowly-expressed genes). The proportion of variation in $\log(\hat{\phi}^{\mathrm{MOM}})$ explained by the fitted models varies across species (e.g., for the quadratic fit, it ranges from 31% to 75%) and also depends on sequencing depths. The quadratic regression model improves over the simple linear regression model by explaining an additional 2% to 11% of variation, while adding a cubic term has an almost negligible effect.

Table 1. Proportion of variation in $\log(\hat{\phi}^{\mathrm{MOM}})$ explained by fitted models.

https://doi.org/10.1371/journal.pone.0120117.t001

Quantification of the Level of Residual Dispersion Variation

As discussed in the Introduction section, we model the dispersion residuals using a normal distribution, $\varepsilon = \log(\phi) - \log(\hat{\phi}) \sim \mathcal{N}(0, \sigma^2)$, and thus quantify the level of residual variation using $\sigma^2$ or equivalently $\sigma$. Using the approach described in the Methods section, we estimate $\sigma$ from each of the five real datasets after fitting an NBQ dispersion model (see Equation (4)). Table 2 summarizes the estimates and the corresponding standard errors. The magnitudes of $\hat{\sigma}$ indicate that the fitted dispersion models do not fully explain the total variation in the dispersion. The NBQ dispersion model uses estimated mean relative frequencies ($\hat{\pi}_{ij}$) as predictors, and the results here suggest that there is still substantial individual variation among genes with the same values of $\hat{\pi}_{ij}$.

Table 2. Estimated level of residual dispersion variation in five real RNA-Seq datasets.

https://doi.org/10.1371/journal.pone.0120117.t002

It is possible to turn the estimate $\hat{\sigma}$ into a goodness-of-fit test for the fitted dispersion model. However, we want to ask whether a dispersion model is useful even when the fitted model shows lack-of-fit. For this purpose, the quantitative measure $\hat{\sigma}$ is more intuitive than a test p-value, since it directly reflects the degree of deviation from the fitted dispersion model. In the next section, we will explore the connection between the magnitude of $\hat{\sigma}$ and the performance of DE tests in terms of power and FDR.

Power-Robustness Evaluations

We compare the power and FDR/Type-I error control of a range of DE test methods on datasets simulated to mimic the five real datasets.

Simulation Setup.

In our power-robustness analysis, we will compare the performance of six DE test methods. We choose one representative method from each of the categories summarized in the “Background/NB Dispersion Models” subsection (prefixed with the name of the R/Bioconductor package that implements the method, and a colon): NBPSeq:genewise, edgeR:common, NBPSeq:NBQ, edgeR:trended, edgeR:tagwise-trend, and QuasiSeq:QLSpline. These methods represent a range of available options for handling the dispersion estimation. The edgeR:common method is included solely as a benchmark, as it is over-simplified and not recommended for practical use. The NBPSeq:NBQ method represents parametric dispersion models, and the NBQ dispersion model generally provides a better fit than the simpler NBP model [7]. The edgeR:tagwise-trend method represents the empirical Bayes shrinkage methods [6]. The QuasiSeq:QLSpline method represents quasi-likelihood methods [18]. These methods also use different tests for DE analysis. For testing DE, methods from edgeR use the likelihood ratio test, methods from NBPSeq use the likelihood ratio test with HOA adjustment, and the QuasiSeq:QLSpline method uses the QL F-test. Table 3 provides a summary of the DE test methods compared.

We simulate two-group comparison datasets that mimic the five real RNA-Seq datasets. From each real dataset, we randomly select 5,000 genes and fit NB regression models to them (see Equation (3) and the “Background/DE Tests” subsection above). We generate a new dataset of 5,000 genes based on the fitted models. We specify the mean expression levels based on the estimated $\hat{\beta}_{ik}$, with $R_j = 1$ and $N_j$ reflecting the sequencing depth (e.g., $N_j = 2.5 \times 10^7$ for the human dataset and $1.5 \times 10^7$ for the mouse dataset). For all genes, we set $\beta_{i1}$ as the estimated value from the real data. If gene $i$ is designated as DE, we either use $\hat{\beta}_{i2}$ estimated from the real data as its log fold change (i.e., we set $\beta_{i2} = \hat{\beta}_{i2}$), or let $\beta_{i2}$ correspond to fixed fold changes of 1.2 or 1.5. For any non-DE gene $i'$, we set $\beta_{i'2} = 0$. In real data analysis, it is unknown which genes are DE. For each dataset, we randomly designate $m_1$ genes as DE. We consider two levels, 0.1 and 0.2, for the percentage of DE genes ($\pi_1 = m_1/m$). Approximately (when using estimated DE fold changes) or exactly (when using fixed DE fold changes) half of the simulated DE genes are over-expressed and half are under-expressed. Early microarray studies had shown that a smaller proportion of DE genes tends to make it more difficult to control FDR at the nominal level [28].
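The fixed fold change scenario can be sketched as follows. This is a minimal illustration, not the authors' simulation code: it randomly designates a proportion of genes as DE and gives exactly half of them a positive log fold change and half a negative one. The function name `assign_log_fold_changes` and the seed are hypothetical.

```python
import math
import random


def assign_log_fold_changes(m, prop_de, fold_change, seed=0):
    """Return beta_2 for m genes: 0 for non-DE genes, +/- log(fold_change) for DE genes."""
    rng = random.Random(seed)
    m1 = round(prop_de * m)             # number of DE genes
    de_genes = rng.sample(range(m), m1)  # random designation of DE genes
    beta2 = [0.0] * m
    for rank, i in enumerate(de_genes):
        # Alternate signs so that half are over- and half under-expressed.
        sign = 1.0 if rank % 2 == 0 else -1.0
        beta2[i] = sign * math.log(fold_change)
    return beta2


beta2 = assign_log_fold_changes(m=5000, prop_de=0.2, fold_change=1.5, seed=1)
print(sum(b > 0 for b in beta2), sum(b < 0 for b in beta2), sum(b == 0 for b in beta2))
```

When fold changes are instead taken from the real data, the split between over- and under-expressed genes is only approximately even, as noted above.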

We specify the dispersion parameters according to Equation (2) with the trend part, $f(a_{ij}; \alpha)$, being the fitted NBQ model (fitting Equation (4) to real data). The deviation from the trend is controlled by $\varepsilon_i$ and will be simulated according to a $\mathcal{N}(0, \sigma^2)$ distribution. We want to choose $\sigma^2$ to match the real data, but there is some subtlety in how to achieve this: in practice, when fitting the NBQ model, we use the fitted values $\hat{\pi}_{ij}$ as the predictors since true $\pi_{ij}$ values are not available, but when we simulate counts, the $\hat{\pi}_{ij}$ values are not available. Our solution is to use $\pi_{ij}$ as the predictor in the NBQ model when simulating $\varepsilon$, but choose $\sigma = \tilde{\sigma}$ through a calibration approach such that if we were to fit the NBQ model to the simulated data later, using the estimated $\hat{\pi}_{ij}$ as predictor, the estimated $\hat{\sigma}$ would match the one estimated from the real data (also using the estimated $\hat{\pi}_{ij}$ as predictor). The estimated values of $\hat{\sigma}$ from real datasets are summarized in Table 2. The calibrated values $\tilde{\sigma}$ and the details about the calibration approach are presented in the Methods section. In our simulations, we will consider different levels of residual dispersion variation and set $\sigma$ to $\tilde{\sigma}$, $0.5\tilde{\sigma}$ or 0.
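A minimal sketch of this step is shown below. It assumes a quadratic trend in log(π) and, as a simplification, a single relative frequency per gene rather than one per gene and sample. It draws dispersions as trend plus normal noise on the log scale at three noise levels. The coefficients and the value used for σ̃ are placeholders, not estimates from the paper, and `simulate_dispersions` is a hypothetical name.

```python
import math
import random


def simulate_dispersions(pi, alpha, sigma, seed=0):
    """Draw phi_i with log(phi_i) = a0 + a1*log(pi_i) + a2*log(pi_i)^2 + eps_i,
    where eps_i ~ N(0, sigma^2); sigma = 0 returns the trend itself."""
    rng = random.Random(seed)
    a0, a1, a2 = alpha
    dispersions = []
    for p in pi:
        z = math.log(p)
        trend = a0 + a1 * z + a2 * z * z
        eps = rng.gauss(0.0, sigma) if sigma > 0 else 0.0
        dispersions.append(math.exp(trend + eps))
    return dispersions


pi = [1e-6, 1e-5, 1e-4]       # placeholder relative mean frequencies
alpha = (-1.0, -0.2, 0.01)    # placeholder trend coefficients
sigma_tilde = 1.0             # placeholder for the calibrated value
for sigma in (sigma_tilde, 0.5 * sigma_tilde, 0.0):
    print(sigma, simulate_dispersions(pi, alpha, sigma, seed=1))
```

The calibration of σ̃ itself is described in the Methods section and is not reproduced here.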

There are other factors that may potentially contribute to the difference in DE test performance, such as the presence of outliers, the proportion of up- and down-regulated genes, and potential correlation between gene expression levels, to name just a few. In this paper, we will focus on the impact of unmodeled dispersion variation on DE test performance.

Power Evaluation.

For power evaluation, we plot true positive rates (TPR) versus false discovery rates (FDR). For a DE test, a true positive (TP) indicates the test correctly identifies a DE gene; a false positive (FP) indicates the test incorrectly identifies a non-DE gene as DE; and a false negative (FN) indicates the test incorrectly declares a DE gene as non-DE. The TPR and FDR are defined as: TPR = TP/(TP + FN) and FDR = FP/(TP + FP). A TPR-FDR curve contains information equivalent to that of a precision-recall curve or an ROC curve, but focuses on the relationship between TPR (power) and FDR. The power of a DE test depends on the alternative hypothesis and will likely vary between genes. The TPR reflects the average power of a test to detect truly DE genes in a simulated dataset. If we compare the TPR of the tests at the same FDR level, we are essentially comparing the size-corrected power.
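The definitions above translate directly into code. The sketch below ranks genes by p-value and records (TPR, FDR) after each additional gene is called DE, which traces out one TPR-FDR curve for a simulated dataset with known truth. The function name `tpr_fdr_curve` is hypothetical, ties between p-values are not treated specially, and at least one truly DE gene is assumed.

```python
def tpr_fdr_curve(pvalues, is_de):
    """Return a list of (TPR, FDR) pairs as genes are called DE in order of p-value."""
    order = sorted(range(len(pvalues)), key=lambda i: pvalues[i])
    total_de = sum(is_de)  # TP + FN, fixed by the simulation truth
    tp = fp = 0
    curve = []
    for i in order:
        if is_de[i]:
            tp += 1
        else:
            fp += 1
        curve.append((tp / total_de, fp / (tp + fp)))
    return curve


# Four genes, two of them truly DE.
pvalues = [0.001, 0.2, 0.01, 0.5]
is_de = [True, False, True, False]
for tpr, fdr in tpr_fdr_curve(pvalues, is_de):
    print(tpr, fdr)
```

Reading the curve at a fixed FDR gives the TPR achieved at that FDR, which is how two methods can be compared on size-corrected power.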

The upper row of Fig. 2 shows the TPR-FDR plots for the six tests performed on each of the five datasets simulated to mimic the five real datasets. In particular, the simulated datasets have the same level of residual dispersion variation $\sigma^2$ as estimated from the five real datasets, and the fold changes of DE genes are also estimated from real data. A better method will have its TPR-FDR curve closer to the lower-right corner, indicating a lower FDR for achieving a fixed power, or a higher power for a fixed tolerable FDR. For four of the datasets, the QuasiSeq:QLSpline, edgeR:tagwise-trend and NBPSeq:genewise methods outperform the NBPSeq:NBQ, edgeR:trended and edgeR:common methods, with the edgeR:common method being the worst. For the simulation dataset based on the Arabidopsis real dataset, no test dominates at all FDR levels.

Fig 2. True Positive Rate (TPR) vs. False Discovery Rate (FDR) plots for the six DE test methods performed on RNA-Seq datasets simulated to mimic real datasets.

The fold changes of DE genes are estimated from real data. The columns correspond to the following datasets (left to right) used as templates in the simulation: human, mouse, zebrafish, Arabidopsis, and fruit fly. The level of residual dispersion variation, σ, is specified at the estimated value (σ̃) in panels labeled with A (first row), and half the estimated value (0.5σ̃) in panels labeled with B (second row). In each plot, the x-axis is the TPR (which is the same as recall and sensitivity) and the y-axis is the FDR (which is the same as one minus precision). The percentage of truly DE genes is specified at 20% in all datasets. The FDR values are highly variable when TPR is close to 0, since the denominator TP + FP is close to 0.

https://doi.org/10.1371/journal.pone.0120117.g002

It is somewhat surprising that the performance of the simple NBPSeq:genewise method is comparable to the best methods in all cases. This indicates that if the level of residual dispersion variation is as high as estimated (see Table 2), the potential power saving through dispersion modeling is quite limited.

The relative performance of the tests will change if the level of residual dispersion variation (σ²) changes. The lower row of Fig. 2 shows the TPR-FDR plots when σ is simulated to be half the estimated values (σ = 0.5σ̃), again with DE fold changes estimated from real data. The performance of the NBPSeq:NBQ and edgeR:trended methods is much improved and is better than that of the NBPSeq:genewise method in three of the datasets (the ones based on mouse, zebrafish and Arabidopsis). When we further reduced σ to 0 in our simulations, all methods outperformed the NBPSeq:genewise approach. The QuasiSeq:QLSpline and edgeR:tagwise-trend methods performed consistently well as we varied the magnitude of σ.

To understand how each method performs under a wide range of situations, we also performed simulations where the fold changes for DE genes were fixed instead of estimated from real data, while other settings (e.g., the percentage of DE genes, σ and σ̃) remained the same as before. Figs. 3 and 4 show the TPR-FDR plots when the fold changes of DE genes were fixed at 1.2 (low) and 1.5 (moderate) respectively. In general, the NBPSeq:genewise, edgeR:tagwise-trend and QuasiSeq:QLSpline methods perform better than edgeR:common, NBPSeq:NBQ and edgeR:trended, which is consistent with the observations when the fold changes are estimated from real data. In the low DE fold change case and when the residual dispersion variation is as estimated (upper row of Fig. 3), there is more separation between the QuasiSeq:QLSpline method and the edgeR:tagwise-trend method. In the simulation based on the mouse data, the NBPSeq:genewise method outperforms all other methods for finding the first 25% of truly DE genes (i.e., in the plot region where TPR ≤ 0.25), but it is eventually outperformed by QuasiSeq:QLSpline and edgeR:tagwise-trend if a greater percentage of truly DE genes needs to be detected. A similar trend is observed in simulations based on the zebrafish and fruit fly datasets. This indicates that the NBPSeq:genewise method can have an advantage for detecting DE genes with small fold changes. There is less separation between the QuasiSeq:QLSpline and edgeR:tagwise-trend methods when the DE fold changes were specified to be 1.5. Again, the performance of all methods assuming a dispersion model (i.e., all methods except NBPSeq:genewise) improves significantly when the residual dispersion variation is halved.

Fig 3. True Positive Rate (TPR) vs. False Discovery Rate (FDR) plots for the six DE test methods performed on RNA-Seq datasets simulated to mimic real datasets.

The fold changes of DE genes are fixed at 1.2 (half of the DE genes are over-expressed and the other half are under-expressed). Other simulation settings are identical to those described in Fig. 2 legend.

https://doi.org/10.1371/journal.pone.0120117.g003

Fig 4. True Positive Rate (TPR) vs. False Discovery Rate (FDR) plots for the six DE test methods performed on RNA-Seq datasets simulated to mimic real datasets.

The fold changes of DE genes are fixed at 1.5 (half of the DE genes are over-expressed and the other half are under-expressed). Other simulation settings are identical to those described in Fig. 2 legend.

https://doi.org/10.1371/journal.pone.0120117.g004

FDR and Type-I Error.

In practice, the Benjamini-Hochberg method [29] is commonly used to control the FDR of DE tests. In Table 4, we compare the actual FDR of the different DE tests based on the simulation results when the nominal FDR is set to 10% using the Benjamini-Hochberg method. The results are based on the datasets simulated to mimic the human dataset, where we vary the percentage of DE genes (10% and 20%) and we vary σ from the estimated value (σ = σ̃), to half the estimated value (σ = 0.5σ̃), and then to 0. We consider three ways to specify fold changes (FC) for DE genes: estimated from data, FC = 1.2 and FC = 1.5. The QuasiSeq:QLSpline and NBPSeq:genewise methods control FDR well in all cases, and are conservative in some cases. The edgeR:tagwise-trend method has good FDR control when the percentage of DE genes is high (20%), but underestimates FDR in several cases when the percentage of DE genes is low (10%). For the NBPSeq:NBQ and edgeR:trended methods, the FDR control improves as the residual dispersion variation decreases and as the percentage of truly DE genes increases. The edgeR:common method fails to control FDR in almost all scenarios.
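The comparison of nominal and actual FDR can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the code behind Table 4: the function names `benjamini_hochberg` and `actual_fdr` and the toy inputs are hypothetical. The first function computes Benjamini-Hochberg adjusted p-values (the reported FDR); the second computes the realized proportion of false discoveries among genes whose reported FDR is at or below the nominal level, which is only possible in a simulation where the truth is known.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def actual_fdr(adjusted, is_de, nominal=0.10):
    """Proportion of non-DE genes among those called DE at the nominal FDR."""
    called = [i for i, q in enumerate(adjusted) if q <= nominal]
    if not called:
        return 0.0                      # no discoveries, so no false discoveries
    false_pos = sum(1 for i in called if not is_de[i])
    return false_pos / len(called)


# Toy example (made-up numbers): four genes, two truly DE.
p_values = [0.001, 0.04, 0.03, 0.5]
is_de = [True, False, True, False]
adjusted = benjamini_hochberg(p_values)
realized = actual_fdr(adjusted, is_de, nominal=0.10)
```

A test that underestimates FDR is one for which `realized` tends to exceed `nominal` over repeated simulations.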

Fig. 5 shows what will happen if one uses the reported FDR to identify DE genes. We use one of the simulated human datasets as an example (the fold change is specified to be 1.2 for the designated 20% DE genes, and σ = σ̃), since the tests are well separated here. For methods that do not correctly control FDR, such as NBPSeq:NBQ and edgeR:trended, if one identifies DE genes according to a cutoff on reported FDR (e.g., 10%), more genes will be detected as DE (than if one were able to use the actual FDR) at the cost of an underestimated FDR.

Fig 5. True Positive Rate (TPR) vs. False Discovery Rate (FDR) plots for the six DE test methods performed on an RNA-Seq dataset simulated to mimic the human dataset.

On each curve, we marked the position corresponding to a reported FDR of 10% with a cross. The fold changes of DE genes are fixed at 1.2 (half of the DE genes are over-expressed and the other half are under-expressed). Other simulation settings are identical to those for the upper row of Fig. 2.

https://doi.org/10.1371/journal.pone.0120117.g005

The FDR control is closely related to the test p-values. Fig. 6 shows the histograms of p-values computed for the non-DE genes in one of the datasets used for the FDR comparison above (fold change estimated from data, 20% DE and σ = σ̃). The histograms from the NBPSeq:genewise and QuasiSeq:QLSpline methods are closer to uniform. For the edgeR:common, NBPSeq:NBQ and edgeR:trended methods, the histograms are asymmetric and v-shaped: there is an overabundance of small p-values as compared to a uniform distribution, but the histograms also indicate that these tests are conservative for many genes. Similar patterns have been observed for other dispersion-modeling methods by Lund et al. [18]. The edgeR:tagwise-trend method produces conservative p-values.
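As a small illustration of this diagnostic, the Python sketch below bins null p-values and compares the first bin with the count expected under uniformity. It is a sketch only: the function name `pvalue_histogram` is hypothetical, and the p-values are drawn from a uniform random number generator as a stand-in for a well-calibrated test rather than taken from any of the six methods.

```python
import random


def pvalue_histogram(p_values, n_bins=20):
    """Count p-values in n_bins equal-width bins on [0, 1]."""
    counts = [0] * n_bins
    for p in p_values:
        k = min(int(p * n_bins), n_bins - 1)   # p == 1.0 goes in the last bin
        counts[k] += 1
    return counts


# Stand-in for the p-values of 4,000 non-DE genes (80% of 5,000) from a
# well-calibrated test: uniform draws, not output from a real DE method.
rng = random.Random(0)
null_p = [rng.random() for _ in range(4000)]

counts = pvalue_histogram(null_p)
expected = len(null_p) / len(counts)           # 200 per bin under uniformity
excess_small = counts[0] / expected            # > 1 suggests too many small p-values
```

A ratio `excess_small` well above 1 corresponds to the overabundance of small p-values described above, and a ratio well below 1 corresponds to a conservative test.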

Fig 6. Histograms of p-values for the non-DE genes from the six DE test methods.

The simulation dataset is based on the human dataset with σ specified as the estimated value σ = σ̃. Out of a total of 5,000 genes, 80% are non-DE.

https://doi.org/10.1371/journal.pone.0120117.g006

Fig. 7 shows similar histogram comparisons when σ was reduced to half the estimated value (0.5σ̃), while fold change and DE percentage remained the same. The null p-value histograms from the NBPSeq:NBQ and edgeR:trended methods have improved and are closer to the uniform distribution. The edgeR:tagwise-trend method produces a slight overabundance of small p-values. The edgeR:common method is still unsatisfactory.

Fig 7. Histograms of p-values for the non-DE genes from the six DE test methods.

The simulation dataset is based on the human dataset with σ specified as half the estimated value σ = 0.5σ̃. Out of a total of 5,000 genes, 80% are non-DE.

https://doi.org/10.1371/journal.pone.0120117.g007

Conclusion and Discussion

We quantified the residual dispersion variation in five real RNA-Seq datasets. Using simulations, we compared the performance, in terms of power and FDR/Type-I error control, of six representative DE test methods based on different dispersion models. We demonstrated that the level of residual dispersion variation is a crucial factor in determining the performance of DE tests. When the residual dispersion variation is as high as we estimated from the five real datasets, methods such as NBPSeq:NBQ and edgeR:trended, which ignore possible residual dispersion variation, fail to control Type-I errors and give suboptimal power. The QuasiSeq:QLSpline and edgeR:tagwise-trend methods have similar size-corrected power, but the edgeR:tagwise-trend method underestimates FDR when the percentage of DE genes is low or when the fold changes of the DE genes are low. QuasiSeq:QLSpline and edgeR:tagwise-trend both account for individual dispersion variation. QuasiSeq:QLSpline also makes a degrees-of-freedom adjustment to address the uncertainty in estimated NB dispersions. Based on these observations, we recommend incorporating individual variation and using a degrees-of-freedom adjustment to improve robustness and Type-I error control for DE test methods that use a dispersion model.

The NBPSeq:genewise method does not rely on a dispersion model, and it uses an HOA technique to improve small-sample performance of the likelihood ratio test. The NBPSeq:genewise method has good Type-I error and FDR control in all simulations. The power of the NBPSeq:genewise method is comparable to that of the QuasiSeq:QLSpline and edgeR:tagwise-trend methods when the level of residual dispersion variation is high. This indicates that when the level of dispersion variation is high, the power saving available through dispersion modeling is limited.

Reducing the level of dispersion variation boosts the performance of DE tests that use a dispersion model. One may attempt to improve the dispersion model by considering different functional forms of the trend and/or including additional predictors. We plan to explore such possibilities in our future research. It is not well understood what factors contribute to the count and dispersion variation in an RNA-Seq experiment: possible factors to consider include transcript length, GC-content, and so on.

One notable difference between the NBPSeq:genewise method and a dispersion-modeling method is that the former detects more DE genes with small fold changes, while a method using a dispersion model tends to detect DE genes with large fold changes. This phenomenon agrees with what we observed in the power simulation when the DE fold change was fixed to be low, 1.2. Fig. 8 illustrates this point using MA plots. This is because current dispersion models often assume the dispersion is the same for genes with similar mean levels (those genes having the same x-values). Under such assumptions, large fold changes tend to correspond to more significant test results. The behaviors of the edgeR:tagwise-trend and the QuasiSeq:QLSpline methods are intermediate between the NBPSeq:genewise method and a dispersion-modeling method such as the edgeR:trended model.

Fig 8. MA plots for the edgeR:trended, NBPSeq:genewise, edgeR:tagwise-trend and QuasiSeq:QLSpline methods performed on the mouse dataset.

Predictive log fold changes (posterior Bayesian estimators of the true log fold changes, the “M” values) are shown on the y-axis. Averages of log counts per million (CPM) are shown on the x-axis (the “A” values). The M- and A-values are calculated using edgeR. The highlighted points correspond to the top 200 DE genes identified by each of the DE test methods.

https://doi.org/10.1371/journal.pone.0120117.g008

For the six methods we compared, edgeR:common, edgeR:trended and edgeR:tagwise-trend use the likelihood ratio test. NBPSeq:genewise and NBPSeq:NBQ use the HOA-adjusted likelihood ratio test. From our past studies, we know that HOA adjustment mainly corrects for Type-I error and does not significantly change the power when compared to the unadjusted likelihood ratio test. So the differences between these five methods in the power comparison are mainly attributable to how they handle the dispersion estimation, especially with respect to the two factors highlighted in Table 3: 1) whether they consider a trend f in log dispersion, and 2) whether they consider possible additional individual variation εi. The HOA adjustment in NBPSeq may have contributed to the different Type-I error performances. QuasiSeq:QLSpline uses a different test for DE and differs from the above five methods in more aspects. Regarding the dispersion estimation, it considers the general trend f in the dispersion, considers additional individual variation, and uses a degrees-of-freedom adjustment. We believe all three aspects contributed to its performance.

We used a 𝒩(0,σ²) distribution to model the residual dispersion variation εi (see Equation (2)). We believe this is a reasonable starting point. The authors in [23] made a similar assumption and used simple diagnostic plots to show the normality assumption was reasonable. To rigorously test this assumption, however, is challenging due to the small sample size. It might be more useful to consider alternative model assumptions on ε, compare results and investigate sensitivity to model assumptions. In the future, we will also consider the possibility that σ may vary with some other variables, such as the mean level. However, the general conclusion that the performance of the DE tests depends on the level of the residual dispersion variation should remain valid.

Methods

Description of RNA-Seq Datasets

Experiment information for all species and the raw/processed data are available at the Gene Expression Omnibus (GEO) of the National Center for Biotechnology Information (NCBI). Table 5 gives a brief summary of the datasets analyzed in this paper, including the dataset names in the SeqDisp R package we developed (see the Software Information section), the SRA accessions that provide all the metadata describing a particular study (see the NCBI website for different accession types), and published references. In the Supporting Information S1 File, see the “Access to the Datasets” section and Table A for more details.

Table 5. Summary of RNA-Seq datasets analyzed in this article.

https://doi.org/10.1371/journal.pone.0120117.t005

Human RNA-Seq Data.

The Homo sapiens (human) RNA-Seq experiment was discussed in [30]. In this study, researchers compared the gene expression profiles for human cell line MCF7 cells (from the American Type Culture Collection) under treatment (10 nM 17β-estradiol (E2)) versus control. Information for this experiment and the raw and processed data are available at NCBI GEO under accession number GSE51403.

Liu et al. [30] focused more on the technical side of RNA-Seq experiments by investigating the trade-offs between sequencing depth (where a higher depth generates more informational reads) and the number of biological replicates. Seven biological replicates of both control and E2-treated MCF7 cells were sequenced, and the RNA-Seq reads in each sample were down-sampled to generate datasets of different depths (a total of seven depths from 2.5M to 30M). We include datasets from two sequencing depths (5M and 30M) in our R package, but mainly focus on the dataset with 30M sequencing depth for analyses. See [30] and NCBI GSE51403 for detailed descriptions of the dataset.

Mouse RNA-Seq Data.

The Mus musculus (mouse) RNA-Seq experiment was discussed in [31]. This experiment used RNA-Seq to study the impact of competent versus abnormal human embryos on endometrial receptivity genes in the uteri of 25-day wild-type C57BL/6 mice. Information for this experiment and the raw data are available at NCBI GEO under accession number GSE47019. The raw data are downloaded from NCBI Sequence Read Archive (SRA), and processed using the pipeline described in [27].

We summarize the samples of “Control Salker”, “Developmentally competent embryo conditioned media Salker” (abbreviated as DCECM) and “Arrested embryo conditioned media Salker” (abbreviated as AECM) into the mouse dataset in the SeqDisp R package. We only consider the control and DCECM groups in the analyses.

Zebrafish RNA-Seq Data.

The Danio rerio (zebrafish) RNA-Seq experiment was discussed in [32], and information for this experiment and the raw data are available at NCBI GEO under accession number GSE42846. This study compared gene expression profiles of zebrafish embryos infected with Staphylococcus epidermidis versus control. Four biological replicates were prepared for the control group (Non-injected 5 DPI) and for the treatment group (S. epi mcherry O-47 5 DPI).

Arabidopsis RNA-Seq Data.

The Arabidopsis thaliana (Arabidopsis) RNA-Seq experiment was discussed in [33], and information for this experiment and the raw data are available at NCBI GEO under accession number GSE38879. This study analyzed 7-day-old seedlings from two lines of Arabidopsis (rve8-1 RVE8::RVE8:GR and rve8-1) treated with dexamethasone (DEX) or mock. The overall design includes the transgenic lines rve8-1 RVE8::RVE8:GR and rve8-1 treated with DEX or mock with three biological replicates each, for a total of 12 samples. Our analyses focus only on the RVE8:GR_mock control group and the RVE8:GR_DEX treatment group.

Fruit Fly RNA-Seq Data.

The Drosophila melanogaster (fruit fly) RNA-Seq experiment was discussed in [34], and information for this experiment and the raw data are available at NCBI GEO under accession numbers GSM461176 to GSM461181. The experiment compared gene expression profiles of fruit fly S2-DRSC cells (FlyBase cell line) depleted of mRNAs encoding RNA binding proteins versus control. The dataset fruit.fly in our SeqDisp package is directly obtained from the pasilla Bioconductor package [35], which provides per-exon and per-gene read counts computed for selected genes in [34]. It can also be accessed from data(pasillaGenes) once pasilla is loaded. The dataset contains three and four biological replicates of the knockdown and the untreated control, respectively. See the pasilla package vignette for more information.

Quantifying the Level of Residual Dispersion Variation

Estimating σ².

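In the RNA-Seq context, we use Yij to denote the read count for gene i in sample j, where i = 1,⋯,m and j = 1,⋯,n. We model a single read count as negative binomial with mean μij and dispersion ϕij:
$$Y_{ij} \sim \mathrm{NB}(\mu_{ij}, \phi_{ij}),$$
and assume a log-linear model for μij, i.e., log(μij) = offset + Xj′βi (see also Equation (3)). We further assume a parametric distribution as the prior distribution for the dispersion parameter ϕij:
$$\log(\phi_{ij}) = \log(\phi_{ij0}) + \varepsilon_i,$$
where εi ∼ 𝒩(0,σ²). The prior mean, log(ϕij0), is preliminarily estimated according to a dispersion model (e.g., NBQ or a smooth fit like NBS) and is treated as known. Our goal is to estimate σ².

Let θij = log(ϕij) and θij0 = log(ϕij0), so that θij = θij0 + εi. Across all m genes, we assume that the εi’s are independent, and denote the prior distribution of εi by π(εi∣σ²). The joint likelihood function of the unknown parameters (σ², β) is
$$L(\sigma^2, \beta) = \prod_{i=1}^{m} \int L_i(\beta_i \mid \varepsilon_i)\, \pi(\varepsilon_i \mid \sigma^2)\, d\varepsilon_i, \tag{8}$$
where Li(βi∣εi) is the likelihood of βi from gene i for a given εi:
$$L_i(\beta_i \mid \varepsilon_i) = \prod_{j=1}^{n} \Pr(Y_{ij} = y_{ij} \mid \beta_i, \varepsilon_i).$$
We want to estimate σ² by maximizing the profile likelihood of σ²:
$$L_p(\sigma^2) = \max_{\beta} L(\sigma^2, \beta) = \prod_{i=1}^{m} \max_{\beta_i} \int L_i(\beta_i \mid \varepsilon_i)\, \pi(\varepsilon_i \mid \sigma^2)\, d\varepsilon_i. \tag{9}$$
It is difficult to maximize an integrated likelihood with respect to βi. We instead consider
$$\tilde{L}_p(\sigma^2) = \prod_{i=1}^{m} \int L_i(\hat{\beta}_i(\varepsilon_i) \mid \varepsilon_i)\, \pi(\varepsilon_i \mid \sigma^2)\, d\varepsilon_i, \tag{10}$$
where β̂i(εi) is the MLE of βi for fixed εi (and thus fixed ϕij). β̂i(εi) can be obtained by the standard iteratively reweighted least squares algorithm [36].

Let li(εi) = log(Li(β̂i(εi)∣εi)) and let π(εi∣σ²) be the 𝒩(0,σ²) density. Equation (10) can be rewritten as
$$\tilde{L}_p(\sigma^2) = \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi\sigma^2}} \int \exp\!\left( l_i(\varepsilon_i) - \frac{\varepsilon_i^2}{2\sigma^2} \right) d\varepsilon_i. \tag{11}$$
The dependence on yij is implicit through li(εi) in Equation (11). We approximate the integral in Equation (11) using Laplace’s method [37]. Let εi* maximize
$$l_i(\varepsilon_i) - \frac{\varepsilon_i^2}{2\sigma^2},$$
so that
$$l_i'(\varepsilon_i^*) - \frac{\varepsilon_i^*}{\sigma^2} = 0.$$
Then the factor
$$\frac{1}{\sqrt{2\pi\sigma^2}} \int \exp\!\left( l_i(\varepsilon_i) - \frac{\varepsilon_i^2}{2\sigma^2} \right) d\varepsilon_i$$
in Equation (11) can be approximated by
$$\frac{\exp\!\left( l_i(\varepsilon_i^*) - \dfrac{(\varepsilon_i^*)^2}{2\sigma^2} \right)}{\sqrt{1 - \sigma^2\, l_i''(\varepsilon_i^*)}}.$$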
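The Laplace step for a single gene can be checked numerically with a short Python sketch. Everything here is illustrative and not part of SeqDisp: the function names, the golden-section search, the finite-difference second derivative, and in particular `toy_loglik`, which is a made-up concave quadratic standing in for the per-gene profile log-likelihood li(εi). For a quadratic li the Laplace approximation is exact, which makes the comparison with numerical integration easy to verify; for an actual NB log-likelihood the two would agree only approximately.

```python
import math


def golden_section_max(g, lo, hi, tol=1e-6):
    """Maximize a unimodal function g on [lo, hi] by golden-section search."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if g(c) > g(d):
            b = d          # maximum lies in [a, d]
        else:
            a = c          # maximum lies in [c, b]
    return 0.5 * (a + b)


def laplace_factor(l, sigma, lo=-10.0, hi=10.0, h=1e-3):
    """Laplace approximation to (2*pi*sigma^2)^(-1/2) * integral of
    exp(l(e) - e^2 / (2*sigma^2)) over e."""
    g = lambda e: l(e) - e * e / (2.0 * sigma ** 2)
    e_star = golden_section_max(g, lo, hi)
    # Second derivative of l at e_star by central differences.
    l2 = (l(e_star + h) - 2.0 * l(e_star) + l(e_star - h)) / (h * h)
    return math.exp(g(e_star)) / math.sqrt(1.0 - sigma ** 2 * l2)


def numeric_factor(l, sigma, lo=-10.0, hi=10.0, n=20000):
    """The same quantity by composite Simpson's rule (n must be even)."""
    f = lambda e: math.exp(l(e) - e * e / (2.0 * sigma ** 2))
    step = (hi - lo) / n
    total = f(lo) + f(hi)
    for k in range(1, n):
        total += (4.0 if k % 2 else 2.0) * f(lo + k * step)
    return total * step / 3.0 / math.sqrt(2.0 * math.pi * sigma ** 2)


def toy_loglik(e):
    # Hypothetical stand-in for l_i(eps): concave quadratic peaked at 0.5.
    return -0.5 * 2.0 * (e - 0.5) ** 2


sigma = 0.8
approx = laplace_factor(toy_loglik, sigma)
exact = numeric_factor(toy_loglik, sigma)
```

Summing the logarithm of this factor over genes gives an approximate log-likelihood in σ², which could then be maximized with a one-dimensional optimizer.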

Evaluation of σ̂.

To evaluate the estimation accuracy for σ, we perform a set of simulations using the human RNA-Seq dataset as the “template” in order to preserve observed relationships between the dispersions and gene-specific mean counts. We simulate 5,000 genes with a single group of seven replicates: the mean structure μ is randomly generated according to a log-normal distribution with mean 8.5 and standard deviation 1.5 (both on the log scale; the values are chosen to mimic the real dataset); the trend of the dispersion is estimated from the real dataset according to an NB2 or an NBQ model; individual residual variation εi is simulated according to 𝒩(0,σ²) and added to the trend. We compare σ̂ with the true σ specified at eight levels that are within a reasonable range for typical RNA-Seq data: 0.1, 0.3, 0.5, 0.7, 0.9, 1.2, 1.5 and 2.0. At each level of σ we repeated the simulation three times using different random number seeds for generating εi ∼ 𝒩(0,σ²). Fig. 9 shows the simulation results. We highlight the median value (out of three repetitions) as a solid blue point at each σ level, and ideally these points should follow the y = x reference line. We see that there is some bias in the estimation. The bias will increase for smaller sample sizes. We see that σ̂ is more accurate for σ values between 0.3 and 0.9 and less so for σ values outside this range. The results (not shown) are similar when we use the NBP (log-linear) and NBS (smooth function) models to capture the general trend in the dispersion.
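The dispersion part of this simulation design can be sketched in Python as follows. This is an illustration under stated assumptions rather than the simulation code used here: `example_trend` is a made-up quadratic in the log mean standing in for the trend estimated from the human dataset, the function names are hypothetical, and the NB read counts themselves are not generated.

```python
import random
import statistics


def example_trend(log_mu):
    # Hypothetical dispersion trend on the log scale; the coefficients are
    # made up and are NOT estimates from any real dataset.
    return -1.0 - 0.3 * (log_mu - 8.5) + 0.02 * (log_mu - 8.5) ** 2


def simulate_log_dispersions(m, sigma, trend, seed=1):
    """Simulate log means and log dispersions for m genes.

    log(mu_i) ~ N(8.5, 1.5^2), matching the log-normal mean structure in the
    text, and log(phi_i) = trend(log(mu_i)) + eps_i with eps_i ~ N(0, sigma^2).
    """
    rng = random.Random(seed)
    log_mu = [rng.gauss(8.5, 1.5) for _ in range(m)]
    eps = [rng.gauss(0.0, sigma) for _ in range(m)]
    log_phi = [trend(x) + e for x, e in zip(log_mu, eps)]
    return log_mu, log_phi


sigma_true = 0.7
log_mu, log_phi = simulate_log_dispersions(5000, sigma_true, example_trend)

# With the true trend and the true dispersions known, the residual standard
# deviation recovers sigma up to sampling error. In real data neither is
# observed, which is why a likelihood-based estimator is needed.
residuals = [lp - example_trend(x) for x, lp in zip(log_mu, log_phi)]
sigma_check = statistics.pstdev(residuals)
```

Repeating the call with different `seed` values mirrors the three repetitions per σ level described above.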

Fig 9. Estimation accuracy of σ̂.

In the simulation, the dispersion is simulated according to an NB2 (left panel) or an NBQ (right panel) trend with added individual variation εi ∼ 𝒩(0,σ²). The x-axis is the true σ value and the y-axis is the estimated σ̂. For each true σ value, the simulation is repeated three times. The blue dots correspond to the median σ̂ values.

https://doi.org/10.1371/journal.pone.0120117.g009

Calibration.

As discussed in the “Results/Power-Robustness Evaluations/Simulation Setup” subsection, when simulating the RNA-Seq datasets, we want to choose a σ that matches the level of residual dispersion variation in real data. We want to correct for potential bias in the estimator σ̂. We also need to account for the discrepancy between πij (used when simulating the data) and π̂ij (used when fitting the dispersion model). This is achieved by a calibration approach [38]. The calibrated σ̃’s are essentially obtained from a calibration plot. Fig. 10 shows the calibration plot for the mouse dataset (subsetted to 5,000 genes). We choose the σ value at eight levels: 0.5, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2 and 1.5, and simulate the dispersion ϕij according to
$$\log(\phi_{ij}) = \alpha_0 + \alpha_1 \log(\pi_{ij}) + \alpha_2 [\log(\pi_{ij})]^2 + \varepsilon_i,$$
where εi ∼ 𝒩(0,σ²) is the residual variation on top of an NBQ dispersion model with the parameters α0, α1 and α2 estimated from the mouse dataset. At each level of σ, we simulate three datasets and obtain three σ̂’s. We then fit a quadratic curve to the eight median σ̂ values as a function of σ, with a 95% prediction interval superimposed in dashed curves. The σ̂ estimated from the mouse dataset is also calculated, and the value is shown as a horizontal solid line. The intersection of the fitted quadratic curve and the horizontal line (the solid red point) has the calibrated σ̃ as its x coordinate. Similarly, the intersections between the upper and lower bounds of the 95% prediction interval and the horizontal line determine the associated 95% calibration interval (CI) for the calibrated σ̃. We only include the calibration plot for the mouse dataset as an illustration. Table 6 summarizes the calibrated σ̃ with 95% CI for each of the five real datasets.
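The point estimate from the calibration plot can be sketched in Python: fit a quadratic to the median σ̂ values as a function of the simulated σ, then solve for the σ at which the fitted curve equals the σ̂ observed in the real data. This is a minimal sketch with hypothetical function names; the median σ̂ values below are made up (generated from an arbitrary quadratic so that the answer is known) and are not the values behind Fig. 10 or Table 6. The 95% prediction interval is omitted.

```python
import math


def fit_quadratic(xs, ys):
    """Least-squares fit of y = c0 + c1*x + c2*x^2 via the normal equations."""
    s = [sum(x ** k for x in xs) for k in range(5)]
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    a = [[s[0], s[1], s[2], t[0]],
         [s[1], s[2], s[3], t[1]],
         [s[2], s[3], s[4], t[2]]]
    # Gauss-Jordan elimination with partial pivoting on the 3x3 system.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(3):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [u - factor * v for u, v in zip(a[r], a[col])]
    return [a[k][3] / a[k][k] for k in range(3)]


def calibrate(sigma_hat_obs, coef, lo, hi):
    """Return the sigma in [lo, hi] where the fitted curve equals sigma_hat_obs,
    or None if there is no such intersection."""
    c0, c1, c2 = coef
    if abs(c2) < 1e-12:                       # fitted curve is effectively linear
        roots = [(sigma_hat_obs - c0) / c1]
    else:
        disc = c1 * c1 - 4.0 * c2 * (c0 - sigma_hat_obs)
        if disc < 0.0:
            return None
        root = math.sqrt(disc)
        roots = [(-c1 + root) / (2.0 * c2), (-c1 - root) / (2.0 * c2)]
    inside = [x for x in roots if lo <= x <= hi]
    return inside[0] if inside else None


# The eight sigma levels from the text, with made-up median sigma-hat values.
sigmas = [0.5, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.5]
median_sigma_hat = [0.1 + 0.8 * x + 0.05 * x * x for x in sigmas]

coef = fit_quadratic(sigmas, median_sigma_hat)
sigma_hat_observed = 0.95                     # hypothetical estimate from real data
sigma_tilde = calibrate(sigma_hat_observed, coef, min(sigmas), max(sigmas))
```

Applying `calibrate` to the upper and lower prediction bounds instead of the fitted curve would give the endpoints of the calibration interval in the same way.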

Fig 10. The calibration plot for estimating residual dispersion variation σ for the mouse dataset.

The x-axis is the σ value used to generate the data. The y-axis is the estimated σ̂. The horizontal line corresponds to the σ̂ estimated from the mouse dataset.

https://doi.org/10.1371/journal.pone.0120117.g010

Table 6. Calibrated σ̃ values for the five real datasets.

https://doi.org/10.1371/journal.pone.0120117.t006

Software Information

The proposed approach of estimating the level of residual dispersion variation σ is implemented as an R package named SeqDisp (released at https://github.com/gu-mi/SeqDisp, under GPL-2 License). The package also provides graphical functionality to generate diagnostic plots for comparing different dispersion methods. All datasets (raw read count tables) analyzed in this article are included in the package. The R code for reproducing all results in this article is available at the first author’s GitHub page.

Supporting Information

S1 File. Supplementary Information on Datasets, Plots and Discussions.

Access information to the datasets analyzed in this article (Table A), the mean-dispersion plots (Figs. A–D), and discussion of the relationship between σ̂ and d̂0 (Fig. E) are provided in the Supporting Information S1 File.

https://doi.org/10.1371/journal.pone.0120117.s001

(PDF)

Acknowledgments

We thank Daniel W. Schafer, Sarah C. Emerson, Yuan Jiang and Jeff H. Chang for helpful discussions. This article is part of a doctoral dissertation written by the first author, under the supervision of YD and DWS.

Author Contributions

Conceived and designed the experiments: GM YD. Performed the experiments: GM YD. Analyzed the data: GM. Contributed reagents/materials/analysis tools: GM YD. Wrote the paper: GM YD.

References

  1. Wang Z, Gerstein M, Snyder M. RNA-Seq: a revolutionary tool for transcriptomics. Nature Reviews Genetics. 2009;10(1):57–63. pmid:19015660
  2. Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010;26(1):139–140. pmid:19910308
  3. Anders S, Huber W. Differential expression analysis for sequence count data. Genome Biology. 2010;11(10):R106. pmid:20979621
  4. Di Y, Schafer DW, Cumbie JS, Chang JH. The NBP Negative Binomial Model for Assessing Differential Gene Expression from RNA-Seq. Statistical Applications in Genetics and Molecular Biology. 2011;10(1):1–28.
  5. Marioni JC, Mason CE, Mane SM, Stephens M, Gilad Y. RNA-seq: An assessment of technical reproducibility and comparison with gene expression arrays. Genome Research. 2008;18(9):1509–1517. pmid:18550803
  6. McCarthy DJ, Chen Y, Smyth GK. Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation. Nucleic Acids Research. 2012;40(10):4288–4297. pmid:22287627
  7. Mi G, Di Y, Schafer DW. Goodness-of-Fit Tests and Model Diagnostics for Negative Binomial Regression of RNA Sequencing Data. PLOS ONE. 2015;10:e119254.
  8. Zhou X, Lindsay H, Robinson MD. Robustly detecting differential expression in RNA sequencing data using observation weights. Nucleic Acids Research. 2014;42(11):e91. pmid:24753412
  9. Chen Y, Lun AT, Smyth GK. Differential Expression Analysis of Complex RNA-seq Experiments Using edgeR. In: Nettleton D, Datta S, editors. Statistical Analysis of Next Generation Sequence Data. Springer; 2014. p. 51–74.
  10. Nelder JA, Wedderburn RWM. Generalized Linear Models. Journal of the Royal Statistical Society Series A (General). 1972;135(3):370–384.
  11. Robinson MD, Oshlack A. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biology. 2010;11(3):R25. pmid:20196867
  12. Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology. 2014;15(12):550. pmid:25516281
  13. Hansen KD, Irizarry RA, Wu Z. Removing technical variability in RNA-seq data using conditional quantile normalization. Biostatistics. 2012;13(2):204–216. pmid:22285995
  14. Risso D, Schwartz K, Sherlock G, Dudoit S. GC-content normalization for RNA-Seq data. BMC Bioinformatics. 2011;12(1):480. pmid:22177264
  15. Risso D, Ngai J, Speed TP, Dudoit S. The role of spike-in standards in the normalization of RNA-seq. In: Nettleton D, Datta S, editors. Statistical Analysis of Next Generation Sequence Data. Springer; 2014. p. 169–190.
  16. Di Y, Emerson SC, Schafer DW, Kimbrel JA, Chang JH. Higher order asymptotics for negative binomial regression inferences from RNA-sequencing data. Statistical Applications in Genetics and Molecular Biology. 2013;12(1):49–70. pmid:23502340
  17. Di Y. Single-gene negative binomial regression models for RNA-Seq data with higher-order asymptotic inference. Statistics and Its Interface. 2014;In press.
  18. Lund SP, Nettleton D, McCarthy DJ, Smyth GK. Detecting differential expression in RNA-sequence data using quasi-likelihood with shrunken dispersion estimates. Statistical Applications in Genetics and Molecular Biology. 2012;11(5):8.
  19. Robinson MD, Smyth GK. Small-sample estimation of negative binomial dispersion, with applications to SAGE data. Biostatistics. 2008;9(2):321–332. pmid:17728317
  20. Robinson MD, Smyth GK. Moderated statistical tests for assessing differences in tag abundance. Bioinformatics. 2007;23(21):2881–2887. pmid:17881408
  21. R Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria; 2014. Available from: http://www.R-project.org/.
  22. Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, et al. Bioconductor: open software development for computational biology and bioinformatics. Genome Biology. 2004;5(10):R80. pmid:15461798
  23. Wu H, Wang C, Wu Z. A new shrinkage estimator for dispersion improves differential expression detection in RNA-seq data. Biostatistics. 2013;14(2):232–243. pmid:23001152
  24. Smyth GK. Linear models and empirical bayes methods for assessing differential expression in microarray experiments. Statistical Applications in Genetics and Molecular Biology. 2004;3(1):Article 3.
  25. Soneson C, Delorenzi M. A comparison of methods for differential expression analysis of RNA-seq data. BMC Bioinformatics. 2013;14(1):91. pmid:23497356
  26. Landau WM, Liu P. Dispersion Estimation and Its Effect on Test Performance in RNA-seq Data Analysis: A Simulation-Based Comparison of Methods. PLOS ONE. 2013;8(12):e81415. pmid:24349066
  27. Anders S, McCarthy DJ, Chen Y, Okoniewski M, Smyth GK, Huber W, et al. Count-based differential expression analysis of RNA sequencing data using R and Bioconductor. Nature Protocols. 2013;8(9):1765–1786. pmid:23975260
  28. Li SS, Bigler J, Lampe JW, Potter JD, Feng Z. FDR-controlling testing procedures and sample size determination for microarrays. Statistics in Medicine. 2005;24(15):2267–2280. pmid:15977294
  29. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society Series B (Methodological). 1995;p. 289–300.
  30. Liu Y, Zhou J, White KP. RNA-seq differential expression studies: more sequence or more replication? Bioinformatics. 2014;30(3):301–304. pmid:24319002
  31. Brosens JJ, Salker MS, Teklenburg G, Nautiyal J, Salter S, Lucas ES, et al. Uterine selection of human embryos at implantation. Scientific Reports. 2014;4:Article 3894. pmid:24503642
  32. Veneman WJ, Stockhammer OW, De Boer L, Zaat SA, Meijer AH, Spaink HP. A zebrafish high throughput screening system used for Staphylococcus epidermidis infection marker discovery. BMC Genomics. 2013;14(1):255. pmid:23586901
  33. Hsu PY, Devisetty UK, Harmer SL. Accurate timekeeping is controlled by a cycling activator in Arabidopsis. eLife. 2013;2:e00473. pmid:23638299
  34. Brooks AN, Yang L, Duff MO, Hansen KD, Park JW, Dudoit S, et al. Conservation of an RNA regulatory map between Drosophila and mammals. Genome Research. 2011;21(2):193–202. pmid:20921232
  35. Huber W, Reyes A. pasilla: Data package with per-exon and per-gene read counts of RNA-seq samples of Pasilla knock-down by Brooks et al., Genome Research 2011. R package version 0.2.16.
  36. McCullagh P, Nelder JA. Generalized Linear Models. CRC Press; 1989.
  37. Laplace PS. Memoir on the probability of the causes of events. Statistical Science. 1986;1(3):364–378.
  38. Ramsey FL, Schafer DW. The Statistical Sleuth: A Course in Methods of Data Analysis. Cengage Learning; 2012.