A complete statistical model for calibration of RNA-seq counts using external spike-ins and maximum likelihood theory

doi:10.1371/journal.pcbi.1006794

Table 1.

Symbols and definitions.

Fig 1.

Diagrammatic summarization of the approach.

An asterisk (*) on a variable or constant means that it is known by design or can be measured or calculated. A double asterisk (**) on a variable means that its value is an estimate defined in this study. Index i = 1…s denotes a spike-in, while i = s + 1…s + q denotes a cellular RNA. For clarity in the diagram, these indices are written as ERCC_{1…s} and mRNA_{s+1…s+q}. The remaining mathematical notation in this figure follows that of Table 1 exactly. (A) A fixed amount of spike-in RNA is added to fresh lysates from m cells in r repeats. The quantity of added spike-ins is known, and we want to calculate the quantity of endogenous mRNAs. (B) RNA is extracted from the lysates; RNA-seq libraries are prepared using a multi-step protocol and sequenced; reads are aligned; and count tables are constructed for spike-ins (i) and cellular RNA (ii). We use the spike-in count table together with the vector of spike-in abundances to estimate the library calibration factor ν, which is in turn applied to estimate the nominal abundance of endogenous RNA in the sample. The mathematical definitions of relative yield (α) and nominal abundance (z) are also shown. Note that z cannot be estimated (**) from its definition as a function of α, because neither α_{mRNA_s+1} nor n_{mRNA_s+1, j} is known.
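The calibration step in (B) can be illustrated with a minimal sketch. It assumes a deliberately simplified Poisson model, y_i ~ Poisson(ν n_i) for spike-in i with known abundance n_i, under which the maximum likelihood estimate is ν = Σy_i/Σn_i. The full model of this study is richer than this, and the function names and numbers below are hypothetical.

```python
def calibration_factor(spike_counts, spike_abundance):
    """MLE of nu under the simplified model y_i ~ Poisson(nu * n_i)."""
    if len(spike_counts) != len(spike_abundance):
        raise ValueError("counts and abundances must have the same length")
    total_abundance = sum(spike_abundance)
    if total_abundance <= 0:
        raise ValueError("total spike-in abundance must be positive")
    return sum(spike_counts) / total_abundance


def nominal_abundance(counts, nu):
    """Convert raw counts to nominal abundances, z = y / nu."""
    return [y / nu for y in counts]


# Toy library: three spike-ins with known abundance (molecules per cell).
spike_abundance = [1000.0, 250.0, 50.0]
spike_counts = [210, 48, 12]
nu = calibration_factor(spike_counts, spike_abundance)

# Apply the same factor to counts from two endogenous transcripts.
z = nominal_abundance([27, 540], nu)
```

Because one factor per library is estimated from the spike-ins alone, the endogenous counts never enter the estimate of ν in this sketch.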

Fig 2.

k-fold cross-validation: Inferred vs. actual molecules per cell of spike-ins.

(A) GR experiment data. Inferred (mean) values vs. actual values are plotted (symbols) for each spike-in molecule in each of 3 leave-out conditions: carbon-limited growth at rates of 0.12 (red), 0.20 (green) and 0.30 h^-1 (blue). (B) Ciona lineage specification data. Each symbol corresponds to the inferred value in each of 3 leave-out conditions: LacZ (red), Fgfr^DN (green), and M-Ras^CA (blue). Although in (A) and (B) each leave-out condition is plotted with a distinct symbol, a symbol can appear multiple times for some values along the x-axis, because these values are represented by several different spike-ins; i.e., among the 92 spike-in molecules there are 22 unique abundance values. (C) Measure of performance in the three-fold cross-validation in (A). Mean Fold Error (MFE) is computed between inferred and actual molecules per cell. Symbols plot the average value, over 10,000 Monte Carlo (MC) trials, of the ratio MFE/MFE_syn versus the mean spike-in library size in the leave-out condition. Vertical bars span the central 0.95 quantile range of the MFE/MFE_syn values obtained in the 10,000 MC trials for each leave-out condition. (D) Measure of performance in the three-fold cross-validation study in (B) for the Ciona data.
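The performance summary in (C) and (D) can be sketched as follows. The caption does not define MFE, so this sketch assumes a fold error of max(inferred/actual, actual/inferred) averaged over spike-ins. `mean_fold_error` and `summarize_trials` are hypothetical names, and the trial ratios are supplied by the caller rather than simulated.

```python
from statistics import mean, quantiles


def mean_fold_error(inferred, actual):
    """Assumed definition: average of max(x/y, y/x) over spike-ins.

    All values must be positive.
    """
    folds = [max(x / y, y / x) for x, y in zip(inferred, actual)]
    return mean(folds)


def summarize_trials(ratios):
    """Mean and central 0.95 quantile range of per-trial ratios."""
    # n=40 gives cut points every 2.5%; the first and last are the
    # 2.5% and 97.5% quantiles.
    cuts = quantiles(ratios, n=40, method="inclusive")
    return mean(ratios), cuts[0], cuts[-1]


# Two spike-ins, one over- and one under-estimated by a factor of 2.
mfe = mean_fold_error([200.0, 50.0], [100.0, 100.0])

# Toy "trial" values chosen so the quantiles are easy to check by hand.
center, low, high = summarize_trials([float(k) for k in range(1001)])
```

In the figure, the plotted symbol would correspond to `center` and the vertical bar to the interval from `low` to `high`.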

Fig 3.

MA plot for dilution study.

MA plot for mean RNA abundance (z-values) for libraries prepared with high- and low-dilution spike-in aliquots. The abundance z_i,j corresponding to count y_i,j was obtained by the maximum likelihood normalization z_i,j = y_i,j/ν_j in Eq (2). The ordinates of the scatter plot (one point for each transcript) should be centered around zero, which corresponds to equal inferred transcript abundance for libraries prepared with high- and low-dilution spike-in aliquots.
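A minimal sketch of how the plotted coordinates could be computed from calibrated abundances. It uses the conventional MA definitions, M = log₂(z_high) − log₂(z_low) and A = ½(log₂ z_high + log₂ z_low); which library is the numerator is an assumption here, and the function name and toy numbers are hypothetical.

```python
import math


def ma_values(z_high, z_low):
    """Return (A, M) pairs for transcripts with positive abundance in both."""
    points = []
    for zh, zl in zip(z_high, z_low):
        if zh <= 0 or zl <= 0:
            continue  # log2 is undefined for non-positive abundances
        m = math.log2(zh) - math.log2(zl)          # log ratio (ordinate)
        a = 0.5 * (math.log2(zh) + math.log2(zl))  # mean log abundance
        points.append((a, m))
    return points


# Toy counts for two transcripts in two libraries, each library with its
# own calibration factor nu_j; abundances follow z = y / nu.
y_high, nu_high = [64, 512], 0.5
y_low, nu_low = [32, 256], 0.25
z_high = [y / nu_high for y in y_high]
z_low = [y / nu_low for y in y_low]
points = ma_values(z_high, z_low)
```

In this toy case the raw counts differ two-fold between libraries, but the calibrated abundances agree, so every M value is zero.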

Fig 4.

Exponential rate constants for exponential dependence of RNA abundance on growth rate.

Histogram (density scale) of the exponential constant ϕ₁-values in Eq (6), which describes the empirical exponential dependence of RNA abundance on growth rate. The histogram includes only those values found to be statistically significant at an FDR of 0.01 [39].

Fig 5.

Up-regulation of ribosomal RNA molecules by growth rate.

(A) Small subunit ribosomal RNA (SSU rRNA) molecules (GO:0015935). Normalized abundances (filled symbols) of significantly up-regulated SSU rRNAs, plotted as a function of growth rate on log-linear coordinates, and corresponding exponential-model values (solid lines), drawn from Eq (6), where γ is growth rate. Each filled symbol at a given growth rate is the normalized mean over 3 replicates at that growth rate; the normalization factor is the mean at the lowest growth rate, 0.12 h^-1. The determination of the maximum likelihood parameters, ϕ₀ and ϕ₁, in Eq (6) was based on all replicates, so the model values (solid lines) are not constrained to go through the mean normalized value of 1 at the lowest growth rate. The mean ± sd for exponential constant ϕ₁ is 8.0 ± 1.5. (B) Large subunit ribosomal RNA (LSU rRNA) molecules (GO:0015934). Normalized abundances of significantly up-regulated LSU rRNAs. Symbols and lines as in panel A. The total mean ± sd for exponential constant ϕ₁ is 7.9 ± 1.4.
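The normalization and fit described in this caption can be sketched as below. Eq (6) is not reproduced here, so the sketch assumes the form z = exp(ϕ₀ + ϕ₁γ) and fits it by ordinary least squares on log abundance, which stands in for the maximum likelihood fit of the study. The function names and synthetic data are hypothetical.

```python
import math


def fit_log_linear(growth_rates, abundances):
    """Least-squares fit of ln(z) = phi0 + phi1 * gamma over all replicates."""
    logs = [math.log(z) for z in abundances]
    n = len(growth_rates)
    mean_g = sum(growth_rates) / n
    mean_l = sum(logs) / n
    sxx = sum((g - mean_g) ** 2 for g in growth_rates)
    sxy = sum((g - mean_g) * (l - mean_l) for g, l in zip(growth_rates, logs))
    phi1 = sxy / sxx
    phi0 = mean_l - phi1 * mean_g
    return phi0, phi1


def normalize_to_lowest(mean_by_rate):
    """Divide each per-rate mean by the mean at the lowest growth rate."""
    reference = mean_by_rate[min(mean_by_rate)]
    return {g: v / reference for g, v in mean_by_rate.items()}


# Synthetic data: 3 replicates at each growth rate, generated exactly from
# the assumed model with phi0 = 1.0 and phi1 = 8.0.
rates = [0.12] * 3 + [0.20] * 3 + [0.30] * 3
abundances = [math.exp(1.0 + 8.0 * g) for g in rates]
phi0, phi1 = fit_log_linear(rates, abundances)

mean_by_rate = {g: math.exp(1.0 + 8.0 * g) for g in (0.12, 0.20, 0.30)}
normalized = normalize_to_lowest(mean_by_rate)
```

Under the assumed form, an exponential constant of 8.0 would correspond to roughly a 4.2-fold increase in abundance between 0.12 and 0.30 h^-1, since exp(8.0 × 0.18) ≈ 4.2.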

Fig 6.

Global normalization in the first round ignoring spike-ins.

(A) Based on data from the yeast growth rate study. Relative log expression (RLE) plots of raw counts normalized by median size factors [34]. The condition-dependent variation in the 0.5 quantile of log relative expression of S1D Fig has been largely eliminated. (B) PCA biplot corresponding to (A). (C) RLE plots of normalized counts produced by applying RUVg normalization [15], with one factor of unwanted variation, to the median-normalized counts in panels A and B. The ERCC spike-ins (same median global normalization applied as to counts from cellular RNA) served as the negative controls. The RLE plots exhibit reduced variation of relative log expression within libraries compared to (A). (D) PCA biplot corresponding to (C). The sensible clustering before RUVg normalization (B) has been disturbed. (E) RLE plots produced by applying a different RUV technique instead, RUVs [15], to the median-normalized counts in (A) and (B). Variation within libraries is somewhat reduced compared to that with median normalization alone in panel A. (F) PCA biplots corresponding to the RLE plots in (E). These PCA plots are very similar to those in (B) for median normalization only. Pairwise testing for differential gene expression between growth rates of 0.30 and 0.12 h^-1 gave very similar results for median normalization with and without RUVs normalization.
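As a rough illustration of panel (A), the sketch below computes median-of-ratios size factors and the relative log expression values that an RLE plot summarizes per library. It assumes the common median-of-ratios definition (each library's size factor is the median ratio of its counts to the per-gene geometric mean), which may differ in detail from the method of [34]. The function names and toy counts are hypothetical.

```python
import math
from statistics import median


def median_size_factors(counts):
    """counts: one row per gene, each row a list of per-library counts."""
    n_lib = len(counts[0])
    log_ratios = [[] for _ in range(n_lib)]
    for row in counts:
        if min(row) <= 0:
            continue  # skip genes whose geometric mean would be zero
        log_geo_mean = sum(math.log(y) for y in row) / n_lib
        for j, y in enumerate(row):
            log_ratios[j].append(math.log(y) - log_geo_mean)
    # Raises StatisticsError if every gene was skipped.
    return [math.exp(median(r)) for r in log_ratios]


def relative_log_expression(counts, size_factors):
    """Per-library lists of log2(normalized count) minus the gene's median."""
    rle = [[] for _ in size_factors]
    for row in counts:
        if min(row) <= 0:
            continue
        logs = [math.log2(y / s) for y, s in zip(row, size_factors)]
        gene_median = median(logs)
        for j, value in enumerate(logs):
            rle[j].append(value - gene_median)
    return rle


# Toy table: library 2 is sequenced exactly twice as deeply as library 1.
counts = [[10, 20], [100, 200], [30, 60]]
size_factors = median_size_factors(counts)
rle = relative_log_expression(counts, size_factors)
```

When libraries differ only in depth, as in this toy table, the RLE values are all zero after normalization; residual spread or shifted medians in a real RLE plot indicate variation that global scaling has not removed.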

Fig 7.

Differential gene expression in Ciona embryonic differentiation study.

(A) Diagnostic histogram of p-values for the null hypothesis of no differential gene expression between the Fgfr^DN and LacZ cell types. (B) Fold change (log₂ scale) for the Fgfr^DN/LacZ comparison. The vast majority of the significant (FDR = 0.01) transcripts were down-regulated in the Fgfr^DN cell type, 4,778 out of 4,493. The average fold change for up- and down-regulated transcripts was 2.5 and 1.9, respectively. (C) Similar to panel A, but for the null hypothesis of no differential gene expression between the Fgfr^DN and M-Ras^CA cell types. (D) Fold change, similar to panel B, but for the M-Ras^CA/Fgfr^DN comparison. 1,934 transcripts were differentially expressed at an FDR value of 0.01 (corresponding cutoff p-value equal to 0.0017). Of the 1,934 significant (FDR = 0.01) fold changes, 1,560 were greater than 1 and 374 were less than 1. The average fold change for up- and down-regulated transcripts was 1.9 and 2.8, respectively.
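The relationship between an FDR level and its "corresponding cutoff p-value" can be sketched with the Benjamini-Hochberg step-up procedure. Whether this is the procedure behind the caption's FDR values is an assumption; the function name and toy p-values are hypothetical.

```python
def bh_cutoff(p_values, fdr=0.01):
    """Largest p-value declared significant by Benjamini-Hochberg, or None."""
    m = len(p_values)
    cutoff = None
    for rank, p in enumerate(sorted(p_values), start=1):
        # Step-up rule: keep the largest p with p <= fdr * rank / m.
        if p <= fdr * rank / m:
            cutoff = p
    return cutoff


# Toy example at FDR = 0.05 with five tests.
p_values = [0.2, 0.001, 0.5, 0.025, 0.004]
cutoff = bh_cutoff(p_values, fdr=0.05)
n_significant = sum(p <= cutoff for p in p_values) if cutoff is not None else 0
```

Every test with a p-value at or below the returned cutoff is called significant, which is how a single FDR level maps to a data-dependent p-value threshold.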
