IQSeq: Integrated Isoform Quantification Analysis Based on Next-Generation Sequencing

With the recent advances in high-throughput RNA sequencing (RNA-Seq), biologists are able to measure transcription with unprecedented precision. One problem that can now be tackled is that of isoform quantification: here one tries to reconstruct the abundances of isoforms of a gene. We have developed a statistical solution for this problem, based on analyzing a set of RNA-Seq reads, and a practical implementation, available from archive.gersteinlab.org/proj/rnaseq/IQSeq, in a tool we call IQSeq (Isoform Quantification in next-generation Sequencing). Here, we present theoretical results which IQSeq is based on, and then use both simulated and real datasets to illustrate various applications of the tool. In order to measure the accuracy of an isoform-quantification result, one would try to estimate the average variance of the estimated isoform abundances for each gene (based on resampling the RNA-seq reads), and IQSeq has a particularly fast algorithm (based on the Fisher Information Matrix) for calculating this, achieving a speedup of times compared to brute-force resampling. IQSeq also calculates an information theoretic measure of overall transcriptome complexity to describe isoform abundance for a whole experiment. IQSeq has many features that are particularly useful in RNA-Seq experimental design, allowing one to optimally model the integration of different sequencing technologies in a cost-effective way. In particular, the IQSeq formalism integrates the analysis of different sample (i.e. read) sets generated from different technologies within the same statistical framework. It also supports a generalized statistical partial-sample-generation function to model the sequencing process. This allows one to have a modular, “plugin-able” read-generation function to support the particularities of the many evolving sequencing technologies.


Maximum Likelihood Estimation
= P r ( Z s,k = 1, s|Θ (n) ) P r ( s|Θ (n) ) (10) E step: (for all Z s, * , one and only one can have a value of 1) Observed Fisher information matrix
We also have: This means that we only need I(Θ) in order to estimate the performance of our MLE with different sampling method combinations.

Efficient Computation of I(Θ)
where is the expected Fisher information matrix of a single partial sample based on Samp m . Thus we need to be able to compute I (m) (Θ) in order to obtain I(Θ). where is the Fisher information matrix of a partial sample s from Samp m at [a, b) in I k .

Proofs on Equivalent partial samples
where is the Fisher information matrix of a partial sample s from Samp m at [a, b) in I k .    Proof. If δ s1,k = δ s2,k = 0, then obviously δ s1,k G s2,k = 0. Otherwise, if δ s1,k = δ s2,k = 1, then both s 1 and s 2 are partial samples that may be generated by Samp m from I k . In this case, according to Definition 3, G Proof. We first prove by contradiction that ∀I k ∈ I, δ s1,k = δ s2,k : If δ s1,k ̸ = δ s2,k , without loss of generality, we assume that δ s1,k = 1 and δ s2,k = 0. Then there must exist an exon junction e ki → e ki+1 , i ∈ {1, 2, ..., n − 1}, such that e ki → e ki+1 is not compatible with I k (otherwise, as a part of e k1 → e k2 → ... → e kn , s 2 will be compatible with I k ). Since s 1 overlaps with the junction of e ki → e ki+1 , s 1 is not compatible with I k either, which will lead to a contradiction to the previous assumption that δ s1,k = 1. Thus the original statement δ s1,k = δ s2,k must be true.
Then according to Lemma 3, s 1 and s 2 are equivalent w.r.t. Samp m .

Use of empirical G function
To illustrate how non-uniform G function works, we modeled the bias of RNA-Seq data by aggregating signal of mapped reads along annotated transcripts. A signal map of the first base of mapped reads was generated. The signal was subsequently mapped onto the transcript and aggregated for all genes with signal isoform. The aggregation plot for Young Adult is shown in Figure S1. Each transcript is divided into 100 bins, accounting for their different lengths. The signals are normalized by the sum of signals in all the bins. The normalized signal at each bin represents the probability that a read is generated at certain position of the transcript. These non-uniform probabilities gave more realistic estimation of how the reads are generated, and were plugged into the EM calculations.

Results with different read generation assumptions
We have also carried out additional analysis comparing the result of IQSeq with uniform read generation assumption and with an empirical read generation model derived from the MAQC-3 dataset [1]. Figure S2 shows that while the overall results are still correlated, there are a significant number of isoforms with different estimated quantity under such different read generation assumptions.

Replicate variance vs. FIM based variance estimation
With the same MAQC-3 data, we also compared the isoform quantification variances between replicates with the FIM based variance estimations, and their logarithmic values have a correlation of 0.59 ( Figure S3). Figure S1. Aggregated signal over all single isoform genes