rSeqDiff: Detecting Differential Isoform Expression from RNA-Seq Data Using Hierarchical Likelihood Ratio Test

High-throughput sequencing of transcriptomes (RNA-Seq) has recently become a powerful tool for the study of gene expression. We present rSeqDiff, an efficient algorithm for the detection of differential expression and differential splicing of genes from RNA-Seq experiments across multiple conditions. Unlike existing approaches which detect differential expression of transcripts, our approach considers three cases for each gene: 1) no differential expression, 2) differential expression without differential splicing and 3) differential splicing. We specify statistical models characterizing each of these three cases and use hierarchical likelihood ratio test for model selection. Simulation studies show that our approach achieves good power for detecting differentially expressed or differentially spliced genes. Comparisons with competing methods on two real RNA-Seq datasets demonstrate that our approach provides accurate estimates of isoform abundances and biological meaningful rankings of differentially spliced genes. The proposed approach is implemented as an R package named rSeqDiff.


Introduction
Alternative splicing is an important mechanism in posttranscriptional regulation of eukaryotes. Through alternative splicing, a single gene can produce multiple different transcript isoforms that usually lead to different protein isoforms with different structures and biological functions, which can greatly enrich the diversity of eukaryote transcriptomes [1][2][3]. Several studies also show that many human disease-causing mutations affect alternative splicing rather than directly affecting coding sequences and ill-regulated alternative splicing events have been implicated in a large number of human pathologies [4][5][6]. Due to its vital role in biological processes such as gene regulation, cell differentiation, development and disease pathophysiology, there is an urgent need for the development of new technologies and methodologies for the study of alternative splicing events and the quantification of the expression of alternative isoforms.
In recent years, high-throughput sequencing of transcriptomes (RNA-Seq) has rapidly evolved as a powerful tool for the study of alternative splicing in humans and model organisms [1][2][3]7]. Many RNA-Seq experiments have been conducted to investigate the following two problems: (i) the discovery of novel transcripts and (ii) the estimation and detection of differentially expressed transcripts. Here we focus on the second problem. Several statistical approaches have been proposed in recent years towards this end. One type of approach is exon-based, which focuses on the detection of differential usage of exons [8][9][10][11]. The other type of approach is isoform-based, which focuses on the estimation of differential expression of isoforms across different biological conditions [12][13][14][15][16].
In this article, we present an isoform-based approach for the detection of differential isoform expression from multiple RNA-Seq samples. In particular, we extend the linear Poisson model in [17,18] for the estimation of isoform abundances from single-end or paired-end RNA-Seq data. Unlike existing approaches which detect differential expression of transcripts, we consider three cases for each gene: 1) no differential expression, 2) differential expression without differential splicing and 3) differential splicing. We specify statistical models characterizing each of these three cases and use hierarchical likelihood ratio test for model selection. The remaining part of the paper is organized as follows: We first introduce the statistical model and method, and then use simulations to study the type-I error and statistical power of the proposed method, followed by the analyses of two real RNA-Seq datasets. For the first dataset (an ESRP1 dataset published in [11]), we compare our approach with two other methods (MATS [11] and Cuffdiff 2 [16]) using RT-PCR assays performed in [11]. For the second dataset (an ASD dataset published in [19]), we present a genome-widely analysis of differential splicing between Autism Spectrum Disorder (ASD) and normal brain samples.

Notations
We use similar notations as in [18] to present the statistical model, which are summarized in Table 1 and explained below in details.

The linear Poisson model for multi-sample RNA-Seq data
We extend the linear Poisson model for one-sample RNA-Seq data in [17,18] to multiple samples. Assume there are K conditions in the study, and in the kth condition there are J k distinct read types. A read type refers to a group of reads (single-end or pairedend) mapped to same position in a transcript [18]. We write J k as J to avoid cluttering but note this quantity depends on the condition k. For a gene G of interest with I annotated transcripts (isoforms), we define h as the K6I isoform abundance matrix for all the K conditions, where the kth row vector of this matrix, h k~hk1 ,h k2 , . . . ,h kI ½ T denotes the isoform abundance vector of G in the kth condition, and h ki denotes the abundance of the ith isoform in the kth condition. Correspondingly, each condition has its own read sampling rate matrix where a kij denotes the rate that read type j is sampled from isoform i in condition k. In our implementation we adopt the uniform sampling model in [18] for single-end reads which assumes all the possible read types from a transcript are generated with the same rate. For paired-end reads we adopt the insert length model in [18], which assumes the sampling rate of a particular paired-end read type depends on its insert size. The sampling rate matrix A k can be estimated based on all the mapped reads in condition k [18]. Each condition also has its own read count vector N k~nk1 ,n k2 , . . . ,n kJ ½ T , where n kj denotes the number of reads of type j mapped to any of the I isoforms in condition k. Given h k and A k , N k is assumed to follow the one-sample linear Poisson model [17,18]. In particular, the probability mass function of N k is where h k : a kj~P I i~1 h ki a kij .
Given A k and N k for k = 1, 2, …, K, our goal is to jointly estimate h combining the data from all the samples. This will be complicated by the fact that the h k 's may not be independent of each other under different biological situations. Therefore, we need to re-parameterize h according to the underlying biological situation of whether the gene and its isoforms show differential expression. In particular, we propose the following three nested models ( Figure 1) corresponding to three possible underlying biological situations regarding the pattern of gene expression across multiple conditions.
Model 0 [no differential expression] characterizes the situation where none of the gene's isoforms show differential expression across the K conditions ( Figure 1B, row 1, where the hypothetical gene structure is given in Figure 1A). Under this model, all K conditions have the same isoform expression levels so that all the rows of h are the same and equal to a joint isoform abundance vectorh k~h h 0 , k = 1, 2, … K. Under the assumption that the reads of each condition are generated independently, the joint likelihood function ofh h 0 combining all K conditions is the product of the likelihood of each condition Model 1 [differential expression without differential splicing] characterizes the situation where the gene shows differential expression, but not differential splicing of its isoforms across the K conditions ( Figure 1B, row 2). Under this model, the relative abundances between the isoforms are the same across the K conditions and the rows of h are therefore proportional to each other. Accordingly, we re-parameterize h as the outer product of a K61 vector t and an I61 vectorh h 1 , whereh h 1 is the basic isoform abundance vector for all K conditions, and t is the isoform ratio vector. To make the model identifiable, t is subject to a linear constraint: For the example of model 1 in Figure 1B, h h 1~½ 90,60 T and t~½ J k (J) Total number of read types in the kth condition (we write J k as J to avoid cluttering, but note this quantity depends on the condition k).

A k
The I6J k read sampling rate matrix for the kth condition.

N k
The J k 61 read count vector for the kth condition. h The K6I isoform abundance matrix for all K conditions. The kth row corresponds to the isoform abundance vector for the kth condition.
h h 0 The I61 joint isoform abundance vector for all K conditions (for model 0 only).
The I61 basic isoform abundance vector (for model 1 only). t The K61 isoform ratio vector (for model 1 only).
The kth element of t which is the ratio between the isoform abundance vector for the kth condition and the basic isoform abundance vector, i.e. h k = t kh h 1 (for model 1 only).
Model 2 [differential splicing] characterizes the situation where the gene shows differential isoform usage across the K conditions ( Figure 1B, row 3). Under this model, each condition has its own independent isoform abundance vector h k . Therefore, the joint likelihood function is Maximum likelihood estimation of the three models The parameters of each of the three models can be estimated using maximum-likelihood estimation (MLE). As discussed in [18], one computational burden in solving the MLE is that J could be quite large, especially for paired-end RNA-Seq data. We adopt the two data reduction techniques introduced in [18]: (i) We take only read types with non-zero mapped reads and further group them to form larger read categories; (ii) For each condition k, we compute the total sampling rate for each isoform i w kid ef P J j~1 a kij (denote W k~wk1 ,w k2 , . . . w kJ ½ T as the total sampling rate vector for all isoforms) without enumerating each particular sampling rate a kij .
In practice, we work with the reduced form of the likelihood functions for the three models, and the details of these data reduction techniques are given in Text S1.
Similar to the log-likelihood function for one-sample linear Poisson model given in equation (1) (see also [17,18]), all the loglikelihood functions for the above three models are concave. Therefore, the MLEs for all of the three models can be obtained by linear constraint convex optimization algorithms. In practice, we use an expectation-maximization (EM) algorithm to calculate the MLEs, and the details are given in Text S1.

Model selection using hierarchical likelihood ratio test
Since model 0 is nested within model 1, which is again nested within model 2, we use the likelihood ratio test (LRT) for model selection. For large sample size, the LRT statistics for nested models asymptotically follow x 2 distributions. The degrees of freedom (DF) of the three models are DF(model 0) = I (the free parameters are the I61 joint isoform abundance vectorh h 0 ), DF(model 1) = I+K-1 (the free parameters are the I61 basic isoform abundance vectorh h 1 and the K61 isoform abundance ratio vector t subjects to one linear constraint P K k~1 t k~1 ) and DF(model 2) = K6I (the free parameters are the K6I isoform abundance matrix h), respectively.
Given a pre-specified significance level a (e.g., 0.05), we perform model selection using the following hierarchical likelihood ratio test (hLRT) procedure ( Table 2). The first round tests include two parallel tests which compare model 0 vs. model 1 and model 0 vs. model 2, each at significance level a/2. If neither of the two tests is significant, then model 0 is selected. If only one of the two tests is significant, model 1 or model 2 is selected accordingly. If both tests are significant, we perform the second round test which compares model 1 vs. model 2 at significance level a and selects model 2 if this test is significant or model 1 otherwise.

Ranking of differentially spliced genes
When comparing between two biological conditions (e.g., normal vs. diseased), it is often useful to generate a ranking of genes being differentially spliced (i.e., model 2 genes). We rank model 2 genes as follows: Supposeĥ h 1 andĥ h 2 are the estimated isoform abundance vectors for the two conditions, we calculate the statistic: where ||?|| 1 denotes the vector L 1 norm ([20] uses a similar statistic without the constant 1/2, which is introduced here to have 0ƒTƒ1). Large T values indicate high level of differential splicing. The T value is 0 for model 0 and model 1 genes. Alternatively, genes classified in model 1 or model 2 can also be ranked according to their p-values from the hLRT, if statistical significance is of major interest. The proposed approach is implemented as an R package named rSeqDiff, which is available at http://www-personal.umich.edu/ ,jianghui/rseqdiff/. The analysis pipeline of using rSeqDiff is outlined in Figure S1 and Text S1.

Simulation studies
We study the performance of our proposed hLRT approach by simulating read counts from genes with a wide range of abundances (from lowly expressed genes to highly expressed genes) and report the specificity and sensitivity of our approach for the detection of differential expression and differential splicing events. Detailed procedure and results of the simulation studies are given in Text S1, and here we briefly outline the methods that we applied in the simulations. We test differential expression and differential splicing of a hypothetical gene with a well-annotated known isoform structure ( Figure S2) between two biological conditions with sequencing depths of total 50 million and 55 million reads, respectively. The gene structure and the sequencing depths are fixed in the simulations. For each of the three models, we vary the expression level (denoted as G in Text S1) of the gene within a broad range, and for each G we simulate the number of reads mapped to each of the two isoforms according to the three models (equations (2), (3) and (4)). For each G, we simulate 1000 replicated pairs of samples. We run the hLRT with significance level a = 0.05 using rSeqDiff on the 1000 simulated pairs of samples and report the proportions of the simulated pairs of samples for which our approach correctly selects the true underlying model (i.e., true classification rate). Table S1, S2 and S3 show the true classification rates under model 0, 1 and 2, respectively.
In summary, the simulation studies show that our proposed hLRT approach has well controlled type I error rate at a = 0.05 (Table S1) and good statistical power for detecting differential expression and differential splicing for genes with moderate to high abundance in both conditions (Table S2 and S3). When the gene is lowly expressed in one condition but moderately or highly expressed in the other condition, our proposed hLRT approach still has good power in selecting model 1, i.e., differential expression without differential splicing. The power in detecting differential expression or differential splicing is low when the gene has low expression levels in both conditions, which is well expected. In real data analysis, genes with very low expression levels in all the conditions are usually filtered out prior to the analysis. By default, rSeqDiff filters out genes with less than 5 reads in all the conditions.

Applications of rSeqDiff to real RNA-Seq datasets
We demonstrate the practical usage of rSeqDiff and compare it with two other approaches by analyzing two real RNA-Seq datasets: the ESRP1 dataset and the ASD dataset.
Analysis of the ESRP1 dataset. Epithelial splicing Regulatory Protein 1 (ESRP1) is a master cell-type specific regulator of alternative splicing that controls a global epithelial-specific splicing network [11]. This dataset was published in [11], where Shen et al performed single-end RNA-Seq experiments on the MDA-MB-231 cell line with ectopic expression of the ESRP1 gene and an empty vector (EV) as control. The dataset contains 136 million reads for the ESRP1 sample and 120 million reads for the EV sample. Shen et al used this dataset to demonstrate their exonbased approach MATS for detect differential splicing, and performed RT-PCR assays to test for 164 exons skipping events. Since the biological significance of this dataset was further analyzed in a follow-up paper by Shen and collaborators [21], our analysis here is solely focused on the validation and comparisons of our proposed hLRT approach with other methods using the 164 RT-PCR tested alternative exons as gold standard.
MATS is an exon-based method and its results cannot be directly compared with our isoform-based approach. In the MATS model (Figure 2A, adapted from [11]), exon 2 is the alternatively spliced exon (skipped exon) unique for the longer isoform and exon 1 and 3 are common exons shared by both of the two isoforms. The exon inclusion level y of the skipped exon was defined as the abundance ratio between the longer isoform and the sum of both the two isoforms, which was estimated as by MATS (Figure 2A). The exon inclusion level difference between the two conditions (ESRP1 and EV) was calculated as Dy~y ESRP1 {y EV . The genome coordinates, junctions read counts (UJC, DJC and SJC), y ESRP1 , y EV and Dy values from MATS and RT-PCR for the 164 exons are provided in [11]. We first apply rSeqDiff to these 164 exons using only the junction read counts from [11]. We transform the ''exonexon junction model'' (Figure 2A) to a ''two-isoform'' model ( Figure 2B), where the hypothetical ''isoform 1'' contains two ''exons'' each with length of 84 bp (the length of the exon-exon junction region in [18]) corresponding to the upstream junction (UJC) and downstream junction (DJC), respectively, and the hypothetical ''isoform 2'' contains a single ''exon'' with length of 84 bp corresponding to the skipping junction (SJC). Hence, the abundances of ''isoform 1'' (h 1 ) and ''isoform 2'' (h 2 ) ( Figure 2B) are equivalent to the abundances of the longer and shorter isoforms in exon-based method (Figure 2A), respectively. The exon inclusion level y is then estimated as y~h 1 h 1 zh 2 . For the 164 RT-PCR tested exons, we first use rSeqDiff to estimate h 1 and h 2 using the junction read counts (UJC, DJC and SJC) from [11], and then calculatey ESRP1 , y E V and Dy accordingly. Figure 3A shows the scatter plot of the Dy values estimated by rSeqDiff (using junction reads only) and MATS, and figure 3B shows the scatter plot of the Dy values estimated by rSeqDiff (using junction reads only) and RT-PCR (MATS and RT-PCR results are adapted from [11]). We can see that rSeqDiff gives very similar results as MATS when only junction reads are used, and overall both methods agree well with the RT-PCR assays ( Figure 3B and Table 3). We then apply rSeqDiff using its default settings (detailed method is given Text S1) where all the reads mapped to exons and exon-exon junctions are used (referred as rSeqDiff (all reads) below). We also run another isoform-based approach Cuffdiff 2 [16,22] on the same dataset (details are given in Text S1). These two methods give the estimates of the abundances of all the isoforms. Based on the gene symbols and the genome coordinates of the 164 RT-PCR tested exons in [11], we identify genes containing these exons from the results of rSeqDiff (all reads) and Cuffdiff 2, and calculate the Dy values for these exons based on the isoform abundances estimated by rSeqDiff (all reads) and Cuffdiff 2. Figure 3C shows the scatter plot of the Dy values estimated by rSeqDiff (all reads) and RT-PCR, and Table 3 shows the correlation coefficients of the Dy values between RT-PCR assays and the three methods, rSeqDiff, MATS and Cuffdiff 2, respectively. We can see that rSeqDiff (all reads) outperforms MATS and Cuffdiff 2 significantly.
One major advantage of isoform-based approaches like rSeqDiff and Cuffdiff 2 over exon-based approaches like MATS is that isoform-based approaches use all the reads mapped to exons   Figure 4. In the first example ( Figure 4A), the ARHGAP17 gene has only two isoforms differed by an alternative exon. The isoform structure of this gene is relative simple, and all the three algorithms provide similar estimates which are also validated by RT-PCR. In the second example ( Figure 4B), the ATP5J2 gene has four isoforms differed by an alternative exon in the middle and an alternative 59 splice site on the exon at the 59 end. For this gene with a relative complex isoform structure, the two isoform-based methods, Cuffdiff 2 and rSeqDiff, give more accurate estimates than MATS, and rSeqDiff is slightly more accurate according to the RT-PCR result. In the third example ( Figure 4C), the CSF1 gene has an even more complex isoform structure with four isoforms differed by an alternative exon in the middle and two mutually exclusive exons at the 39 end. For such an isoform structure, some isoforms (NM_172212 and NM_000757) can only generate upstream junction reads (UJC) for the alternatively spliced middle exon but not downstream junction reads (DJC). As a result, the estimate of MATS is less accurate than that of rSeqDiff. rSeqDiff classifies this gene as model 1, which is consistent with the RT-PCR result. Cuffdiff 2 fails to test (it reports as ''FAIL'' [22]) this gene due to ''an ill-conditioned covariance matrix or other numerical exception prevents testing''. We also compare the estimates of all the gene between rSeqDiff (all reads) and Cuffdiff 2. Cuffdiff 2 fails to test (it reports as ''LOWDATA'', ''HIDATA'' or ''FAIL'' [22]) several hundred genes with relative complex isoform structures. Figure 3D shows the scatter plot of the log2 fold changes of transcript abundances between ESRP1 and EV estimated by the two approaches (genes with low read counts or failed to be tested by Cuffdiff 2 are excluded). Overall the two approaches agree well with each other (Pearson Correlation Coefficient = 0.834, Spearman Correlation Coefficient = 0.933), and the degree of agreement is generally higher when the alternative spliced transcripts are more differentially expressed: the Pearson Correlation Coefficient (PCC) and Spearman Correlation Coefficient (SCC) of transcripts classified in each of the three models are PCC = 0.685, SCC = 0.802 (model 0), PCC = 0.827, SCC = 0.932 (model 1) and PCC = 0.862, SCC = 0.954 (model 2). We also run rSeqDiff with different fractions of reads from the dataset to check for possible saturation ( Figure S3 and Table S4).
Analysis of the ASD dataset. Increasing evidence has indicated that alternative splicing plays an important role in brain development [23,24] and the pathology of many neurological disorders [25,26]. This dataset was published by Voineagu et al [19], where single-end RNA-Seq experiments were performed on three brain samples of Autism Spectrum Disorder (ASD) patients with down-regulated A2BP1 gene levels (a.k.a. FOX1, an important neuronal specific splicing factor that regulates alternative splicing in the brain) and three control brain samples with normal A2BP1 levels.
In [19], the authors separately pooled the reads for ASD and control to generate sufficient read coverage for the quantitative analysis of alternative splicing events (referred as ''pooled dataset'' below), and then used an exon-based method similar to MATS in their analysis and detected 212 significantly differentially spliced exons (belonging to 196 unique genes). As we have shown in the analysis of the ESRP1 dataset, the exon-based methods provide less accurate results for complex alternative splicing events and cannot infer the abundances of the isoforms, here we analyze this pooled dataset using rSeqDiff (detailed method is given in Text S1). rSeqDiff classifies 4,507 genes (with 6,850 transcripts) as model 0, 12,374 genes (with 19,556 transcripts) as model 1, 1,769 genes (with 5,848 transcripts) as model 2 (Table S7), and 7,349 genes (with 8,884 transcripts) are filtered out because they have less than 5 mapped reads in both conditions ( Figure S4). We also run Cuffdiff 2 [16,22] on this dataset with its default settings. We find Cuffdiff 2 to be relatively conservative for detecting differential expression of spliced transcripts and it only identifies 43 transcripts as significant under default settings (FDR,0.05). Figure S5 shows the scatter plot of the log2 fold changes of transcript abundances between ASD and control estimated by the two approaches (genes with low read counts or failed to be tested by Cuffdiff 2 are excluded). Similar to the analysis of the ESRP1 dataset, the two methods generate concordant results overall (PCC = 0.825, SCC = 0.937). The correlation coefficients for transcripts classified in each of the three models are PCC = 0.539, SCC = 0.796 (model 0), PCC = 0.847, SCC = 0.940 (model 1) and PCC = 0.854, SCC = 0.953 (model 2), which also show the same pattern as we observed in the ESRP1 dataset. We also run rSeqDiff on each individual biological replicate and get consistent results as the analysis on the pooled dataset (Table S6).
The authors of [19] tested 7 differentially spliced exons with relevant neurological functions using semi-quantitative RT-PCR assays, and validated 6 of them. Table 4 shows the ranking of these   genes by rSeqDiff and Cuffdiff 2 (The CDC42BPA gene was not validated in [19]). rSeqDiff is able to detect all the 6 confirmed genes as differentially spliced (model 2) and also gives a more meaningful ranking of these genes than Cuffdiff 2, which might be helpful for biologists to design follow-up experiments. We also compare the estimates of the exon inclusion levels of the six RT-PCR validated exons by rSeqDiff with the exon-based method in [19]. Five out of the six genes (except AGFG1) have concordant annotations for the skipped exons in the RefSeq annotation database are used in our analysis. Table S5 shows the comparisons between the two methods. Basically, rSeqDiff consistently recovers the results from the exon-based method in [19]. We further analyze the biological significance of the differentially spliced genes detected by rSeqDiff. The enrichment of gene ontology (GO) terms and functional categories for the 1769 genes classified in model 2 are tested. 88 GO terms related to biological processes (Table S8), 48 GO terms related to cellular components (Table S9) and 30 functional categories (Table S10) are significantly enriched at FDR,0.01 level. rSeqDiff captures majority of the relevant enriched GO terms reported in [19],  (Table S8 and S9). The result of enriched functional categories further confirms that alternative splicing is the top enriched functional category (p = 7:14|10 {133 ) (Table  S10). We further test the enrichment of genetic association disease classes (Table S11) and the tissue expression pattern of these genes (Table S12). Neuropsychiatric and neurological diseases are the two significantly enriched disease classes (p = 3:53|10 {10 and 6:64|10 {3 respectively, Table S11), and brain is the top enriched tissue (p = 2:76|10 {51 , Table S12). We also search the relevance of the top 400 differentially spliced genes with autism and other related neurological diseases in the NIH Genetic Association Database [27] and the two autism spectrum disorder genetic database, AutDB [28] and SFARI Gene [29]. Among these genes, 173 are found to be associated with a variety of neurological and/ or neuropsychiatric disorders from these databases, and 20 of them are associated with ASD (Table S7). Three of the 20 ASDassociated genes, NRCAM, EHBP1 and GRIN1, are validated by RT-PCR assays in [19]. All together, these results further  demonstrate the biological significance of the findings of the differentially spliced genes. Figure 5 shows three examples of genes with differential expression or differential splicing reported by rSeqDiff for the purpose of demonstrating rSeqDiff's capability in dealing with very complex isoform structures. In the first example ( Figure 5A-C), the NRCAM gene has five annotated alternative spliced isoforms ( Figure 5A) and the estimation of their abundances between ASD and control is shown in Figure 5C. Figure 5B shows the differentially spliced exon that was validated by RT-PCR in [19]. This gene encodes a neuronal cell adhesion molecule which involves in neuron-neuron adhesion and promotes directional signaling during axonal cone growth [30] and has been reported to be associated with ASD by two genetic association studies [31,32]. The second example is the BACE1 gene ( Figure 5D-F) with six annotated alternative isoforms. This gene has a complex isoform structure, with an alternative 59 splice site and an alternative 39 splice site (the part in the red box of figure 5D, enlarged in figure 5E). The estimates of the abundances of the gene and its isoforms are shown in figure 5F. This gene encodes the b-site APP cleaving enzyme 1 (BACE1), which plays an important role in the pathology of Alzheimer's disease [33]. Previous studies show that the isoforms of this gene have different enzymatic activities in the brain [34][35][36]. Although this gene has not been reported to be associated with ASD, several recent studies have showed that the expression levels of three BACE1 processed protein products, secreted amyloid precursor protein-a form (sAPP-a), secreted amyloid precursor protein-b form (sAPP-b) and amyloid-b peptide (Ab), have substantial changes in severely autistic patients [37][38][39][40]. The third example is the SCIN gene ( Figure 5G-I) with two alternative isoforms which differ by the mutually exclusive exons at the 59 end (the part in the red box of figure 5G, enlarged in figure 5H). This gene is identified as model 1 by rSeqDiff, which has a significant higher expression level in autism than control. Also, there is no read mapped to the short exon unique to NM_033128 at its 59 end ( Figure 5H), therefore this isoform is estimated to have low abundances in both conditions. This gene encodes Scinderin (also known as Adseverin), a calcium-dependent actin filament severing protein that controls brain cortical actin network [41].

Discussion
The two types of approaches for detecting differential transcription across multiple conditions, exon-based approaches and isoform-based approaches, each have their own strengths and weaknesses. Exon-based approaches do not rely on annotated fulllength transcripts and provide relatively accurate inference for the differential splicing of a local exon from a gene with relative simple isoform structure [9,11]. However, they cannot provide estimates of isoform abundances and provide less accurate inference for the differential splicing of genes with complex isoform structures. Isoform-based approaches can directly infer isoform abundances and are more accurate for estimating the differential splicing of multi-isoforms with complex splicing events. Since the final functional units are the protein isoforms translated from the alternatively spliced transcripts, isoform-based methods are more biologically informative for follow-up studies. However, isoformbased approaches may give inaccurate estimates if the annotation of full length transcripts is incorrect. We believe that isoform-based approaches will be increasingly used with the improvement of the transcript annotation databases.
One limitation of our approach is that it ignores the biological variations across biological replicates, which will be handled in our future work by extending our model. One way to handle biological variations is to use the negative binomial model as implemented in edgeR [42], DEseq [43], DSS [44] and Cuffdiff 2 [16], where an over-dispersion parameter is introduced and estimated using the empirical Bayes method that borrow information from all the genes. Another way is to use hierarchical Bayesian models, where choosing appropriate prior distributions and efficient parameter estimation (typically using Markov chain Monte Carlo (MCMC) algorithms) are challenging. It is also possible to extend our model to more complicated experimental designs such as crossed experiments by incorporating the covariates into the sampling rate matrix for each sample, since the hLRT is generally applicable to comparisons of complex models.  Figure S4 Scatter plots for examining differential expression and differential splicing. (A) Plot of the -log 10 based p values from the likelihood ratio test between model 1 and 0 v.s. the log2 fold changes of the estimated gene abundance, which can be used for visualizing differential expression of each gene. The red box highlights the SCIN gene that is shown as an example in the main text. (B) Plot of the -log 10 based p values from the likelihood ratio test between model 2 and 0 v.s the T values, which can be used for visualizing differential splicing of each gene. The red box highlights the BACE1 gene that is shown as an example in the main text.       Text S1 Supplementary methods and results.

Supporting Information
(DOC)