Mixture models reveal multiple positional bias types in RNA-Seq data and lead to accurate transcript concentration estimates

Accuracy of transcript quantification with RNA-Seq is negatively affected by positional fragment bias. This article introduces Mix2 (rd. “mixquare”), a transcript quantification method which uses a mixture of probability distributions to model and thereby neutralize the effects of positional fragment bias. The parameters of Mix2 are trained by Expectation Maximization resulting in simultaneous transcript abundance and bias estimates. We compare Mix2 to Cufflinks, RSEM, eXpress and PennSeq; state-of-the-art quantification methods implementing some form of bias correction. On four synthetic biases we show that the accuracy of Mix2 overall exceeds the accuracy of the other methods and that its bias estimates converge to the correct solution. We further evaluate Mix2 on real RNA-Seq data from the Microarray and Sequencing Quality Control (MAQC, SEQC) Consortia. On MAQC data, Mix2 achieves improved correlation to qPCR measurements with a relative increase in R2 between 4% and 50%. Mix2 also yields repeatable concentration estimates across technical replicates with a relative increase in R2 between 8% and 47% and reduced standard deviation across the full concentration range. We further observe more accurate detection of differential expression with a relative increase in true positives between 74% and 378% for 5% false positives. In addition, Mix2 reveals 5 dominant biases in MAQC data deviating from the common assumption of a uniform fragment distribution. On SEQC data, Mix2 yields higher consistency between measured and predicted concentration ratios. A relative error of 20% or less is obtained for 51% of transcripts by Mix2, 40% of transcripts by Cufflinks and RSEM and 30% by eXpress. Titration order consistency is correct for 47% of transcripts for Mix2, 41% for Cufflinks and RSEM and 34% for eXpress. We, further, observe improved repeatability across laboratory sites with a relative increase in R2 between 8% and 44% and reduced standard deviation.

For the Mix 2 model this means that and ∂ ∂β kj Q(θ ′ |θ) = 0 (4) where i is the index of transcript t = i and k is the index of group g = k. As usual, the update formula of the relative abundances α i is given by where α (n+1) i and p (n) (t = i|r) are the relative abundance and posterior probability after the n + 1-th and n-th iteration of the EM algorithm. In addition to (4), the β kj have to satisfy the constraint M j=1 β kj = 1 (6) where M is the number of mixture components. This constraint can be enforced with the Lagrange method. Taking the derivative with respect to β kj leads to r∈R p(g = k, b = j|r) + β kj λ = 0 (7) which after some rearrangement results in where, as previously, β (n+1) kj and p (n) (·) are the mixture components and posterior probabilities after the n + 1-th and after the n-th iteration, respectively. The posterior probabilities in (8) are given by and where the sums in (9) and (10) extend over all transcripts t = i in group g = k and the posteriors on the right-hand side of these equations can be derived according to Bayes formula as follows and The posterior probability p(r|t = i, b = j) in (11) and (12) is independent of the iteration. In the main paper the p(r|t = i, b = j) where chosen to be Gaussians which are equidistantly distributed across the transcript t = i. Without any tying, the group g = k consists of a single transcript t = i and (8) therefore becomes For global tying, on the other hand, the group consists of all the transcripts within the locus and therefore As a result, the update formula (8) becomes It is interesting to note, that (15) is similar to the update formula for the relative abundances α i , equation (5). This is the case, because for global tying the following holds which is similar to the superposition

Multi-mapping reads and sequence specific bias
The previous discussion assumes that a fragment r maps uniquely to the genomic reference. If, on the other hand, fragment r has multiple hits H(r) on the reference, then needs to be taken into account when estimating the parameters of the Mix 2 model. Rather than calculating (18) during parameter estimation p(h|r) is often set to 1/#H(r) [2]. Equation (18) can be extended to cover the situation of a sequence specific bias. In this case, the probability that a sequence seq(r) within or surrounding fragment r is generated can be smaller than 1 and the right-hand side of equation (18) needs to be multiplied by this sequence specific probability, p(generate|seq(r)). The probability p(generate|seq(r)) can, for instance, be estimated as in [4] by calculating the ratio of the probability of the sequence seq(r) under the biased model to the uniform model. Most commonly, seq(r) is a sequence directly preceding or following r and p(generate|seq(r)) therefore reflects the probability that a primer with start sequence seq(r) anneals to the sample. Details on how equation (18) and its generalization to a sequence specific bias fits into the parameter estimation of the Mix 2 model are given in Section "Parameter estimation". It should be noted that in our current implementation of the Mix 2 model we do not take sequence specific bias into consideration, nor do we use (18) to calculate the posterior probability of a hit. If fragment r has multiple hits H(r) and a sequence specific bias then and the update formula for β kj , equation (8) .
Here p(h|r) is given by equation (18) or the right-hand side of equation (18) multiplied by p(generate|seq(r)) the probability of generating the sequence seq(r), which is either part of or surrounding fragment r.

Identifiability and uniqueness of maximum likelihood solution
The Mix 2 model is identifiable on the set of fragments R iff the mapping θ → p θ (R) is injective, where, as in the previous section, θ is the vector of pairs of parameters The mapping θ → p θ (R) is given by the product of two mappings where A is the linear map given by which is the value of the j-th Gaussian of transcript i for fragment r. Hence r is an index for the rows and the pair (i, j) is an index for the columns of A. The second mapping in (22) is componentwise multiplication of θ given by The mapping M is invertible on the parameters θ since and thus equation (22)  The dimension of the kernel of A is, for instance, greater than 1 if two transcripts t = i and t = i ′ share the same Gaussian b = j and b = j ′ , which happens only if the transcripts have the same length and their exons are properly aligned. This situation can be avoided by shifting the Gaussians p(r|t = i, b = j), p(r|t = i ′ , b = j ′ ) away from each other, which ensures that and removes therefore identical columns in A. Shifting the Gaussians means that some of them are not equidistantly distributed along a transcript but has otherwise a minor effect on the properties of the Mix 2 model. Summarizing, we state the following Equation (26) shows further that the Mix 2 model is equivalent to a mixture model of the distributions p(r|t = i, b = j) with mixture weights c ij if no Gaussian is shared between two transcripts. In this case, the maximum likelihood solution for the c ij is unique, since the log likelihood surface of mixture models is concave [3], and the c ij and the parameters of the Mix 2 model stand in a one-to-one relationship. This can be summarized as follows.
Proposition 2. The Mix 2 model is equivalent to a mixture of the distributions p(r|t = i, b = j) with respective mixture weights c ij if no two transcripts share the same Gaussian. Since the log likelihood function for a mixture is concave there exists a unique maximum likelihood solution for the c ij to which the EM algorithm converges. The α i and β ij of the Mix 2 model can be derived, in this case, from the c ij as follows.

Fragment start distributions in Cufflinks
The Mix 2 model in the main paper factorizes the transcript specific fragment distribution p(r|t = i) as follows where s(r) and l(r) are the start and length of fragment r. Cufflinks [5], on the other hand, reverses the order of s(r) and l(r) in (31) and factorizes p(r|t = i) according to The fragment length distribution p(l(r)|t = i) in (32) is derived from the cumulative distribution of fragment lengths p(l(r)) for the complete data set. For this purpose, p(l(r)) is truncated to the possible fragment lengths for transcript t = i and subsequently renormalized such that where l(t = i) is the length of transcript t = i. The fragment start distribution p(s(r)|l(r), t = i), on the other hand, is assumed to be uniform over the possible fragment starts s(r) for transcript t = i and fragment length l(r), i.e.
The fragment start distribution p(s(r)|t = i) for t = i according to the Cufflinks model can be derived by summing l(r) out of (32). In the absence of fragment length information, e.g. for single-end RNA-Seq data, Cufflinks assumes by default a Gaussian with mean 200 and standard deviation 80 for the cumulative fragment length distribution p(l(r)). For this default setting the fragment start distribution p(s(r)|t = i) is given in Figure  2 (a) of the main article for transcripts with length between 400 bps and 3000 bps. It can be seen that for long transcripts the Gaussian distribution p(l(r)) produces a short and steep tail at the end of p(s(r)|t = i), whereas this tail shifts increasingly to the 5' end of the transcript for shorter transcripts. The assumption of a Gaussian with mean 200 and standard deviation 80 corresponds to a size selection of the fragments prior to sequencing. Thus, Figure 2 (a) in the main text shows that even for a uniform fragment distribution, size selection generates a transcript length specific bias.