A Mechanistic Beta-Binomial Probability Model for mRNA Sequencing Data

A main application for mRNA sequencing (mRNAseq) is determining lists of differentially-expressed genes (DEGs) between two or more conditions. Several software packages exist to produce DEGs from mRNAseq data, but they typically yield different DEGs, sometimes markedly so. The underlying probability model used to describe mRNAseq data is central to deriving DEGs, and not surprisingly most softwares use different models and assumptions to analyze mRNAseq data. Here, we propose a mechanistic justification to model mRNAseq as a binomial process, with data from technical replicates given by a binomial distribution, and data from biological replicates well-described by a beta-binomial distribution. We demonstrate good agreement of this model with two large datasets. We show that an emergent feature of the beta-binomial distribution, given parameter regimes typical for mRNAseq experiments, is the well-known quadratic polynomial scaling of variance with the mean. The so-called dispersion parameter controls this scaling, and our analysis suggests that the dispersion parameter is a continually decreasing function of the mean, as opposed to current approaches that impose an asymptotic value to the dispersion parameter at moderate mean read counts. We show how this leads to current approaches overestimating variance for moderately to highly expressed genes, which inflates false negative rates. Describing mRNAseq data with a beta-binomial distribution thus may be preferred since its parameters are relatable to the mechanistic underpinnings of the technique and may improve the consistency of DEG analysis across softwares, particularly for moderately to highly expressed genes.


Introduction
Since the advent of the microarray around the turn of the 20 th century, whole transcriptome profiling has been of great importance to systems biology [1][2][3][4][5][6][7][8]. The ability to observe how every transcript in a cell population responds to, for example, treatment with a drug or a change in the expression of a gene-of-interest, gives insight into the wiring and function of biological systems. A common method for deriving biological knowledge from such perturbation experiments is to identify lists of differentially expressed transcripts or genes (DEGs) between two (or more) conditions. By analyzing the genes which show up on such lists, one can identify larger functional units such as biological processes, pathways, networks, and organelles that are involved in the response, giving clear hypotheses for further targeted experiments [9][10][11][12][13]. The centralized collection of most transcriptome experiments in databases such as the gene expression omnibus (GEO) and the connectivity map (CMAP) has given further insight by enabling the use of big data methods to identify general trends and connections that do not emerge from a single experiment (or even a handful) [14][15][16].
While the microarray was the transcriptomic workhorse in the 2000s, the advent of massively parallel sequencing has given rise to deep mRNA sequencing (mRNAseq) [17,18], an alternative way to measure the transcriptome. Like most new technologies, mRNAseq was originally much more expensive than microarrays; however, it has now become quite competitive, and in many ways a superior technical method for transcriptome profiling [19][20][21][22]. The basic premise is to isolate mRNA from a sample, PCR amplify it, and then subject it to tens-of-millions of "short" (~50-100 bp typically) sequencing reads. By aligning the resulting sequence reads with the known genome, and then counting the number of reads that align to a particular gene or transcript, one obtains a measurement of expression. One caveat of this traditional form of quantification is the inherent PCR bias that can distort the original number of transcripts in the sample. A recent method based on incorporating a short unique molecular identifier (UMI) sequence into every transcript molecule provides a new method of quantification that reduces PCR bias and thus improves linearity and precision [23][24][25].
Several open source software suites with associated probability models have been developed to analyze mRNAseq data and identify DEGs. The first was Cufflinks / Cuffdiff [17], which has an elegant underlying mathematical model to estimate the "fragments per kilobase of transcript length per million mapped reads" (FPKM) metric of gene expression, and a t-test based on approximate normality of the resulting FPKM estimate. Cuffdiff2 [26] more accurately estimates false discovery rates for DEGs. Using this FPKM metric, Cuffdiff2 is specialized to a transcript-resolution of gene expression, and comparison across different transcripts, but not to count based data, which we focus on here. Other widely used software suites are EdgeR [27], DESeq2 [28] and BaySeq [29], which, as opposed to the FPKM metric of Cufflinks/Cuffdiff, retains the count-based nature of mRNAseq data and describes it with a negative binomial model (also called Poisson-gamma). This probability model describes mRNAseq count data well, and was predominantly used because it is the common choice to describe count-based data that are "overdispersed" (i.e. variance that is greater than the mean) relative to the Poisson distribution (variance = mean); it is well established that mRNAseq data are overdispersed [30,31]. A recent meta-analysis found that each of these softwares can produce quite different DEGs from the same dataset, a result that is common and not entirely surprising given the different modeling and assumptions used. Further, it was shown that the intersection of DEGs from these softwares are preferred to reduce false positives, which indicates that each might benefit from improvements to the underlying probabilistic treatment of the mRNAseq data [32].
To that end, other probabilistic distributions have been examined. The beta-binomial distribution has also been explored, and it also reflects the overdispersion of the data [33,34]. DEG analysis based upon a beta-binomial distribution is now available as an option for BaySeq solely for paired data (distinct from traditional DEG analyses) [35] and in the software BBSeq [36]; however, a derivation of the mean-variance relationship inherent in the beta-binomial distribution has yet to be undertaken. Furthermore, each software, as with negative-binomial or Poisson methods, has its own specific interpretation of the probabilistic models utilized resulting in often very different selections of DEGs following analysis. This suggests the necessity of a theoretical derivation of an appropriate probabilistic distribution: a ground-up, first-principles approach to modeling the mean-variance relationship and overdispersion which, to date, has not been deeply investigated.
Here, we propose that the basic mRNAseq experimental process is mechanistically a binomial experiment: a series of N trials (reads) with an essentially constant probability of success for a particular transcript/gene in each trial. This gives rise to a binomial distribution for counts from technical mRNAseq replicates, with parameters that have physical interpretation. We highlight how this binomial model agrees well with literature data for technical replicates. For biological replicates, we propose that a beta-binomial distribution, where the probability of success follows a beta distribution, can describe the data, and demonstrate its fit to two large literature datasets. Given ranges of beta-binomial parameter values typical for mRNAseq experiments, a quadratic polynomial scaling between variance and mean emerges, as is consistently experimentally observed. The dispersion parameter is the quadratic coefficient that controls this scaling, and our analysis suggests that the dispersion parameter is a continually decreasing function of the mean. Surprisingly, this is different from current approaches that impose an asymptotic value on the dispersion parameter at moderate and high mean read counts. We show how this leads to overestimating variance for moderately to highly expressed genes, which inflates false negative rates in downstream DEG analysis. Because the beta-binomial model emerges from the mechanism of the mRNAseq technique, it may be preferred, and its use might not only help improve consistency in deriving DEGs, but also variance estimation for moderately to highly expressed genes.

Solving for the Dispersion Parameter
For each gene i, we assume s 2 ij ¼ m ij þ φ i m 2 ij and solve for φ i as follows. First, we expand the right hand side of the equation: Including the left hand side provides the following equation: Writing in terms of φ i : After some cancellation, this can be broken into two terms: N j % 1 and 1 N j % 0. Therefore, we find that:

Downloading mRNAseq Data
UMI count data were obtained from the DToXS LINCS website (http://research.mssm.edu/ pst/DToxS) on July 1 st , 2015, from DToXS LINCS ID Raw-Data-R2015-06-30. Raw (Level 1) transcriptomic data released June 30 th , 2015 were downloaded, and data from batch identifier SR-1 were used in this study. There were 15 control samples (with sample name prefix CTRL), but the sample CTRL.1.C1 was excluded because it showed poor correlation with the remaining 14 samples. There were six samples treated with the kinase-inhibitor Sorafenib, (SOR), but samples 1 and 3 were excluded as they had poor correlation compared to the remaining four. Gierlinski yeast data were acquired from the European Nucleotide Archive (ENA) (http:// www.ebi.ac.uk/ena/data/view/ERP004763) consisting of 672 fastq files: 2 cell lines each with 48 biological replicates each with 7 technical replicates. Raw reads from the fastq files were then aligned using Bowtie [37] against the Saccharomyces cerevisiae genome removing reads with multiple alignments to the genome. Aligned reads were then sorted using Samtools [38] and converted into files of gene read counts using Bedtools [39]. We followed the author's method for removing "bad replicates" that did not satisfy a quality score based upon median correlation coefficient, outlier fraction and median reduced χ 2 of pileup depth. We corroborated their calculations and removed six WT biological replicates (21,22,25,28,34,36) and four Δsnf2 biological replicates (6,13,25,35) just as they had done. All raw data are given in S1-S4 Tables.

Estimating Beta-Binomial Distribution Parameters
First, the integer count data in S1-S4 Tables were divided by their respective sequencing depth, which was calculated by summing the counts along a single column (sample). The resulting probability of success estimates for each gene were fit to a beta distribution using method of moments estimates for α and β as follows: ðX i À xÞ 2 is the sample variance. These α and β parameter estimates for each gene are also given in S1-S4 Tables.

Data Normalization
We normalize the data by scaling each sample to have an equivalent sequencing depth as the sample with the maximum sequencing depth. That is, we take N ¼ maxðN j Þ and for each sample j, the normalized read counts are:

Estimation of Dispersion
To obtain a smooth trend of dispersion that follows the data as implied by our beta-binomial formulation, we fit an empirical quadratic polynomial to the plot of log(mean) vs log(dispersion) using the MATLAB fit tool (y = p1 Ã x 2 +p2 Ã x+p3). The parameter values for each data set, in order of (p1,p2,p3) are Gierlinski WT ( To compare our approach of modeling dispersion with previous methods, we downloaded the R packages DESeq2, Version 1.12.2 [28], and EdgeR, Version 3.14.0 [27]. For DESeq2, we uploaded each data set and used the estimateDispersions command which generates three separates formulations of dispersion for each gene: dispGeneEst reflects the raw dispersion estimate from the data, dispFit represents a curve fit to the dispGeneEst data following the distribution expected by DESeq2 and lastly dispersion which is a modified version of dispGeneEst with outliers corrected to reflect the trend of dispFit values. For the purposes of our work, we use the dispersion value for each gene in each data set as that is the recommended setting by DESeq2. For EdgeR, we use the estimateDisp command which also generates three dispersion estimates for each gene: common.dispersion is a single value over all genes as a best estimate of global dispersion, trended.dispersion represents a curve fit to genewise dispersion similar to DESeq2's dis-pFit, and tagwise.dispersion is a gene-specific estimate of dispersion that is modified to reflect the value in trended.dispersion again similar to DESeq2's dispersion value. For our work, we chose the tagwise.dispersion value for each gene.

Estimation of p-values
For each method of acquiring a dispersion estimate, we calculate an estimated variance dependent upon the mean by solving the formula s 2 ij ¼ m ij þ φ i m 2 ij given the normalized mean read counts μ ij and dispersion estimate φ i for each gene i in each dataset. Then we conduct a Welch's t test for the hypotheses that the UMI CTRL and SOR samples have the same mean for a given gene and that the Gierlinski WT and Δsnf2 mutant samples have the same mean for a given gene. To do this, we modified the Matlab method ttest2 to accept as input parameters an estimate for the mean and variance for each sample as opposed to the normalized read counts themselves generating a p value for each gene in each dataset. This is to show how different estimates of dispersion, and thus different estimates of variance, affect the resulting p values for each gene tested in each dataset.

mRNA Sequencing as a Binomial Experiment
An mRNA sequencing (mRNAseq) experiment consists of three main steps (Fig 1). First is isolating mRNA from biological samples (sample index j 2 {1,2,. . .,m}). Second, the mRNA samples are converted into a library that is compatible with the sequencing platform. This often includes fragmenting the original mRNA molecules, along with one or more PCR steps, into n j total fragments (sometimes isolation of mRNA from total RNA is part of the library preparation). Let the number of molecules from a particular transcript i in the library j be n ij = γ ij t ij , where i is the transcript index, t ij is the original number of transcript i molecules in library j, γ j ! 0 is the amplification factor, and n j ¼ The library is then subjected to the sequencing process, where N j of the n j library molecules are randomly chosen for sequencing. The number of trials N j is often called the sequencing depth. The probability of choosing a molecule for sequencing from library j that maps to transcript i is (except in relatively rare cases of capture bias) Denote p ij as the probability of success for transcript i in library j. If the total number of molecules in the library far exceeds the total number of reads (n j >>N j ), then "taking" a fragment from the library for sequencing has negligible effect on this probability, making it essentially constant throughout the selection process. For the common Illumina platform, n j~1 0 9 library molecules are loaded onto the instrument (e.g.~75 μL of a 20 pM library), and a typical sequencing depth for an mRNAseq experiment is N j~1 0 7 reads, giving n j >>N j and essentially constant p ij for all but the few most lowly expressed transcripts.
An mRNAseq experiment with library j can thus be cast as a series of N j trials, with each trial selecting one library fragment for sequencing. We define a trial to be a success for transcript i if a fragment subsequently aligned to it is chosen for sequencing; the probability of success is p ij . This scenario, as described, is analogous to a binomial experiment [40]. Therefore, the probability of selecting k ij fragments from library j that map to transcript i should follow a binomial distribution,  The random variable k ij is often referred to as the number of uniquely mapped reads to transcript i, and has mean μ = N j Á p ij and variance σ 2 = N j Á p ij Á (1−p ij ). In general, p ij << 1 due to the large number of different expressed transcripts in a cell (typically~10,000 [41,42] and see non-zero entries in S1 and S2 Tables). This gives μ = σ 2 for most transcripts, as one expects from a Poisson distribution. This is in excellent agreement with data from technical replicates sequenced from the same library [22], giving direct experimental support for the notion that the mRNAseq process can be cast as a binomial experiment.

Describing Inter-Library Variability with a Beta-Binomial Distribution
When mRNAseq experiments are performed across biological replicates which have different libraries, the probability of success for a transcript varies. Dividing the number of mapped reads for a transcript by the sequencing depth N j gives an estimate of the true (inter-library) probability of success, p i . Because p i is continuous on the unit interval (0 p i 1), a potentially suitable model is a beta random variable [40], with density where B denotes a Beta function of the first kind and α i and β i are parameters to be estimated from biological replicates. The expected value of p i is We have also used Eq 1 and the fact that the total number of library molecules is essentially constant across libraries, due to concentration normalization during loading. When the probability of success for a binomial random variable follows a beta distribution, the resulting random variable is said to follow a beta-binomial distribution. The mean and variance of a beta-binomial distribution are, respectively [43]  As described above, predominantly, p i << 1. Given Eq 4, this implies that β i >> α i for the majority of transcripts. Moreover, since the number of molecules in the library n j is much greater than 1, it is likely that β i >> 1. Given these considerations, the mean and variance reduce to This reveals a characteristic scaling prediction between the mean and the variance via a "dispersion parameter" 1/α i . Such scaling has indeed been well-described for mRNAseq experiments [27,28,30,31]. The full functional form for the dispersion parameter given a betabinomial distribution is given in the Methods section.

Evaluating the Beta-Binomial Model with Data from Multiple Biological Replicates
Two large mRNAseq datasets were utilized to evaluate the beta-binomial model proposed above. The first is available via the Library of Integrated Network-Based Cellular Signatures (LINCS) (see Methods-DToXS LINCS ID Raw-Data-R2015-06-30). The dataset consisted of 14 biological replicate samples (RNA isolated from independent cell batches) of PromoCell cardiomyocyte-like cells treated under control (DMSO/vehicle) conditions (S1 and S2 Tables). The sequencing libraries were prepared using unique molecular identifiers (UMI) [23][24][25], which allows removal of PCR biases (by experimentally estimating the γ ij factor-see Fig 1) via quantification by UMI counts, on the level of genes. We refer to this metric as "Unique UMI Counts". It is also possible to retain quantification by the traditional means of counting the number of reads that uniquely align to a gene. We refer to this metric as "Unique Mapped Read Counts". The beta distribution parameters for each gene were estimated as described in Methods from the 14 biological replicates.
A second mRNAseq dataset developed by Gierlinski et al is available on the ENA archive (see Methods -project ID PRJEB5348), consisting of 48 biological replicate samples in two S. cerevisiae lines: WT and snf2 knock-out mutant [44]. The replicates underwent standard Illumina multiplexed TruSeq library preparation. Each biological replicate consists of seven technical replicates producing 336 datasets in each cell line resulting in "Unique Mapped Read Counts" (S3 and S4 Tables). As with the LINCS data, the beta distribution parameters for each gene were then estimated for each cell line as described in the Methods.
We first sought to understand the space of estimated α and β parameters for the datasets studied. Given the relationship between the beta distribution parameters and expected value for the probability of success in Eq 4, one would predict that β i should remain relatively constant across genes, since most transcript types are a very small fraction of the total number of transcripts in a cell. Furthermore, we would like to evaluate the assumption above that β i >> α i . Fig 2 shows log scale plots of α and β values plotted against the mean for two sets of count data: the LINCS UMI Counts (Fig 2A) and the Gierlinski Yeast WT Mapped Read Counts ( Fig  2B). Two further sets of count data are shown in S1 Fig: the LINCS Mapped Read Counts (S1A Fig) and the Gierlinski Yeast Δsnf2 Mapped Read Counts (S1B Fig). In each panel, α values are represented by x's and β values are represented by circles. First, we observe that β values are indeed significantly larger than α values for all genes tested. Second, β is largely invariant across the transcriptome, consistent with expectations, only slightly decreasing for genes at higher counts (relative to changes in α values). With more typical mRNAseq datasets where one might expect to have three or even fewer replicates, this result implies that a global fit of β across genes may be quite appropriate, similar to "information sharing" approaches of current softwares [27,28]. This might allow improved estimation of the dispersion parameter for each gene, particularly for those with low abundance, which is critical for estimation of variance and downstream differential expression testing [27,28,30,31]. Lastly, it is clear that the mean is largely determined by α, implying that dispersion is strongly linked to the mean.
We next evaluated whether the beta-binomial model captured the mean-variance structure of the mRNAseq data, which is critical for determining differential expression. Here, we focus on a global gene-independent dispersion parameter, and explore gene-specific dispersion parameters subsequently. We calculated the mean and variance for each gene in each of the datasets studied and compared this to the Eq 8 prediction given a beta-binomial model and one of two global estimates for the dispersion parameter. The first estimate for dispersion is based on previous approaches: CV 2 [27]. The second estimate utilizes least squares (LS) regression. We made this comparison for each dataset both before and after a simple scaling normalization procedure (see Methods) to account for differences in sequencing depth between samples. Table 1

Relationship Between Dispersion and Mean
Previous work allows for gene-specific estimation of dispersion [27,28], which imposes a relationship where the gene-specific dispersion parameter asymptotes to a lower bound as mean increases. This relationship derives from the widely accepted quadratic function between variance and mean. This fixed lower bound of dispersion is sometimes called the biological squared coefficient of variation [27], and typically reaches this lower limit at moderate read counts.
The beta-binomial model makes a different prediction about the dependence of dispersion with the mean. Namely, because increases in mean are predominantly driven by increases in α (β is mostly constant across genes), and the dispersion parameter is essentially inversely proportional to α (Eq 8), then we expected the dispersion parameter to be smaller than that imposed by the currently used formalisms in DESeq2 and EdgeR. We compared the beta-binomial dispersion trends with those calculated by DESeq2 and EdgeR (Fig 4 and S3 Fig) for both datasets analyzed above, along with direct estimates of dispersion based on the data themselves. The results indeed displayed evidence that current estimation methods were overestimating dispersion at read counts starting at~100 (5-10% of the genes). We conclude that a beta-binomial representation of mRNAseq data might allow for more precise estimation of gene-specific dispersion, and further that current methods might overestimate dispersion and therefore variance for moderately to highly expressed genes. This may have implications for downstream DEG analysis, since a larger variance would lead to a higher false negative rate.

Statistical Significance of Moderately to Highly Expressed Genes
To demonstrate explicitly how overestimating dispersion could lead to identification of new DEGs, we explored a comparison of treated vs. control data for the UMI data set (DMSO vs. sorafenib) and the Gierlinski dataset (WT vs. Δsnf2). We expected that for genes with moderate to high mean read counts, we would have on average higher statistical significance than current negative binomial based methods. As representative of negative binomial methods we used DESeq2 and EdgeR. Fig 5 shows precisely this prediction; as mean read counts increase, the pvalues calculated for dispersion estimates of a beta-binomial model are much lower than that from typical negative binomial models. This is evidenced by a preponderance of data below zero on the difference of p-value scatter plots above 100 counts for UMI, and 200 for Gierlinski ( Fig 5). This leads to several new genes being called as DEGs, which gives rise to potential new biology being uncovered. Specifically, 597 genes from the Gierlinski dataset and 1023 genes from the LINCS dataset (S5 and S6 Tables). Thus, not only does the beta binomial distribution better capture the statistical dispersion properties of mRNAseq data, but it also has biologically meaningful implications.

Conclusions
Use of mRNAseq to measure transcriptomes is expected to increase, and derivation of DEGs is essential for extracting knowledge from such data. There is no uniform agreement on what probabilistic assumptions and models to use and as such various mRNAseq analysis softwares produce different (sometimes markedly) DEGs. This paper proposes that the mRNAseq process is inherently a binomial process, and a beta-binomial model is an appropriate choice for describing mRNAseq data. We found that current methods may be overestimating dispersion and therefore variance for moderately to highly genes, and that the beta-binomial description can correct this to achieve better sensitivity for medium to highly expressed genes. Standardizing modeling approaches can help to harmonize the DEG outputs from different softwares and thus help to increase knowledge extracted from these increasing amounts of data. Each panel reflects a comparison of p-values for beta binomial-based dispersion or negative binomial-based dispersion generated from the UMI count data, CTRL vs SOR (A-C), or the Gierlinski data, WT vs Δsnf2 (D-F). Each panel is a scatter plot of the base-10 logarithm of the maximum normalized mean (maximum of the CTRL mean or SOR mean for UMI or the WT mean or Δsnf2 mean for Gierlinski) against the difference in base-10 logarithm of the corresponding p-values being compared for each gene. Color indicates density of points. The top row compares the beta binomial formulation versus DESeq2 (A,D). The second row compares beta binomial versus EdgeR (B,E). The third row compares EdgeR and DESeq2 (C,F).