Two-step mixed model approach to analyzing differential alternative RNA splicing

Changes in gene expression can correlate with poor disease outcomes in two ways: through changes in relative transcript levels or through alternative RNA splicing leading to changes in relative abundance of individual transcript isoforms. The objective of this research is to develop new statistical methods in detecting and analyzing both differentially expressed and spliced isoforms, which appropriately account for the dependence between isoforms and multiple testing corrections for the multi-dimensional structure of at both the gene- and isoform- level. We developed a linear mixed effects model-based approach for analyzing the complex alternative RNA splicing regulation patterns detected by whole-transcriptome RNA-sequencing technologies. This approach thoroughly characterizes and differentiates three types of genes related to alternative RNA splicing events with distinct differential expression/splicing patterns. We applied the concept of appropriately controlling for the gene-level overall false discovery rate (OFDR) in this multi-dimensional alternative RNA splicing analysis utilizing a two-step hierarchical hypothesis testing framework. In the initial screening test we identify genes that have differentially expressed or spliced isoforms; in the subsequent confirmatory testing stage we examine only the isoforms for genes that have passed the screening tests. Comparisons with other methods through application to a whole transcriptome RNA-Seq study of adenoid cystic carcinoma and extensive simulation studies have demonstrated the advantages and improved performances of our method. Our proposed method appropriately controls the gene-level OFDR, maintains statistical power, and is flexible to incorporate advanced experimental designs.

informative cancer signatures than standard gene expression profiles [6,7]. The regulation of differential 45 alternative RNA splicing has been implicated to correlate with poor disease outcome in two aspects: 46 transcript level differential expression [8], and differentially spliced isoforms, e.g. changes in relative 47 abundance of isoform expression for a gene [9]. Analyzing across the whole genome presents numerous 48 obstacles, especially when the data should be analyzed in the context of more advanced experimental 49 designs, such as time-course, factorial or multiple confounding designs. 50 High-throughput next-generation sequencing technologies such as RNA-sequencing (RNA-Seq) provide 51 powerful tools for transcriptome analysis, reconstruction and quantification and offer an unprecedented 52 opportunity to discover novel genes, transcripts and splice variants underlying complex diseases. While 53 high-throughput RNA-Seq has become a standard for quantifying whole-transcriptome patterns of gene 54 expression, the statistical methods for analyzing alternative RNA splicing and investigating differential 55 expressed/spliced isoforms are still in their infancy. The early statistical methods for RNA-sequencing 56 analysis focused on gene level expression analysis and utilized count-based modeling strategies, such as 57 DESeq [10], edgeR [11], PoissonSeq [12], baySeq [13] and SAMseq [14]. These count-based strategies 58 are not applicable to transcript level data, where the alignments of read counts to overlapping transcripts 59 conditions. For the th condition there are % samples and the total number of samples is = ' + ) + ⋯ + 131 + . For a given gene m we assume that there are Lm isoforms and we denote %-./ as the abundance in the 132 log scale of the th isoform of gene m in sample nested in condition ( = 1, 2, … , , = 1, 2, ⋯ , % , 133 = 1, 2, … , / , = 1,2, … , ). For simplicity in notation throughout the following, we assume we are 134 discussing gene m and drop the fourth subscript m. We assume %-. follows a normal distribution and can 135 be modeled using the following linear mixed effects model for two-factor split-plot design with unequal 136 variances among the effects of isoforms: 137 %-. = : + .
; is the logarithm of the 139 expected relative expression of the th isoform. % < is the logarithm of the fold change in overall expression 140 of the given gene under condition . %.
;< is the effect that condition has on the relative expression of the 141 th isoform. -(%) is the effect of the th sample (whole plot), nested within the th condition and we assume 142 that -(%)~ (0, E ) ). %-. is random error and -%.~( 0, . ) ). -(%) and %-. are mutually independent. 143 The mixed model (1) contains the effects of conditions and isoforms and the condition by isoform 144 interaction as fixed effects and the subject effects as random effects. 145 146 Let's denote the observations of the th sample (whole plot), nested within the th condition as %-= 147 ( %-' , %-) , ⋯ , %-G ) H . Then the covariance structure for %-is 148 (2) 149 reduced to compound symmetry when ' = ) = ⋯ = G , and is generalized to unstructured when there 151 are no constraints on the covariance elements in (2). 152 153

154
To identify isoforms that are differentially expressed or spliced among different conditions, we will 155 consider two types of hypothesis tests as described below: 156 Type 1: Identify genes whose isoforms are differentially expressed or differentially spliced (either Model 157 1 or Model 2 genes in Fig 1) 158 Type 2: Identify genes whose isoforms are differentially spliced only (Model 2 in Fig 1) 160 ;< = 0 (4) 161 We first fit the linear mixed effects Model (1), and then perform likelihood ratio tests (LRT) of the 162 coefficients to identify genes with differentially expressed/spliced isoforms. 163 164 A two-step hierarchical hypothesis-testing framework 165 Many existing approaches for detecting differentially expressed isoforms treat each isoform as a single 166 genomic feature and perform the hypothesis tests for all isoforms simultaneously. These approaches usually 167 lack statistical power due to the large number of hypothesis tests performed at the same time. A two-stage 9 procedure was proposed to test differentially expressed gene sets [26], which was later generalized to a 169 two-step hierarchical hypothesis set testing framework in microarray time-course experiments [32]. A 170 similar two-step approach (stageR) was proposed to control the gene level false discovery rate (FDR) and 171 boost power in analysis of differential transcript usage [27]. We will utilize a similar two-step hierarchical 172 hypothesis-testing framework for identifying isoforms of interest. For genes, the first step of the proposed 173 framework is to perform gene level tests to identify genes that have differentially expressed or spliced 174 isoforms, which we refer to as the screening tests. In the second step we examine only the isoforms of genes 175 that have passed the screening tests. 176 In the initial screening, we will test for two types of null hypotheses, i.e. (3) and (4), for identification of 177 differentially expressed isoforms or only differentially spliced isoforms. In the subsequent step we test 178 genes that passed the screening test to see whether each individual isoform is differentially expressed. We 179 propose two options: the first option is to perform the test using either the t-test or one-way ANOVA; and 180 the second option is to perform the test using the contrast (Wald-test) based on the linear mixed effects 181 model. The null hypotheses for testing the differential expression of the th isoform among J conditions are 182 described in the following equation: 183 Utilizing a two-step hierarchical hypothesis-testing framework, we will control for the overall false 186 discovery rate (OFDR) at the gene level, which was recommended over FDR because it focuses on the 187 inferential units of interest [26]. Compared with standard approaches that perform the hypothesis tests on 188 all isoforms simultaneously, the number of tests performed in the proposed framework will be dramatically 189 reduced, and consequently the statistical power is expected to be increased [32]. 191 We control the OFDR which is defined as the expected proportion of falsely discovered genes out of all 192 discovered genes:

Procedure to control for OFDR
where R is the total number of rejected hypotheses or discovered genes, and V is the number of discovered 194 genes where at least one null hypothesis (including screening and confirmatory hypotheses) was incorrectly 195 rejected. 196

197
We aim to perform two levels of inferences while controlling for the OFDR, i.e. test for differentially 198 expressed or differentially spliced genes and at the same time test for individual isoform differential 199 expression. The specific procedure is described as below in details: 200 (1) Screening stage: apply the Benjamini-Hochberg procedure [33] at level to the p-values of the 201 screening tests. Let R be the number of genes that pass the screening tests. 202 (2) Confirmatory stage: for each gene that passes the screening test, test the hypotheses for the 203 individual isoforms simultaneously, applying a p-value adjusting procedure on the p-values of the 204 tests such that the family-wise error rate (FWER) of these tests is controlled at level / . 205 Specifically we will control the FWER using three methods: Bonferroni, Holm, and Hochberg 206 methods. 207 It was proved that the above procedure controls the OFDR at level α under the condition that the individual 208 hypothesis tests in the second confirmatory stage are independent from all other screening tests [26,32]. 209 210 Application to an RNA-sequencing study of adenoid cystic carcinoma (ACC) 212 We applied the proposed linear mixed model approach to a study of adenoid cystic carcinoma where RNA-213 sequencing was performed on 20 ACC salivary gland tumors and five normal salivary glands. Details of 214 the RNA sequencing were described previously [28]. We present primary results using the covariance 215 structure with unequal error variances as shown in Equation (2). Cufflinks was used to estimate and quantify 216 the isoform abundance in FPKM prior to the differential expression/splicing analysis. We examined 2850 217 genes having two or more detected isoforms in our two-step analyses, and compared results using our 218 proposed methods with other widely used methods for analysis of RNA-Seq data: Cuffdiff [17], Limma, 219 and t-test with BH correction. 220 221 We performed differential isoform expression/splicing analysis between 8 patients who are free of cancer 222 vs. 6 patients who are not among whom have complete follow-up. The multi-dimensional multiple 223 comparisons were corrected using our proposed procedure, and statistical significance was considered at 224 OFDR=0.10 due to the exploratory nature of this study where the sample sizes and effect sizes are limited. 225 While no isoforms are called significantly different between the two patient outcome groups using the three 226 alternative methods ( Cuffdiff [17], Limma, and simple t-tests), our approach using the covariance structure 227 with unequal error variances identified 11 genes that have either differentially spliced or differentially 228 expressed isoforms using a simultaneous test for the main group effect along with the interaction effect 229 (Table 1). In addition, a Type 2 screening test identified 4 genes that are differentially spliced using a 230 screening test for the condition by isoform interaction term. A second stage confirmatory test using the 231 Hochberg method further identified 12 isoforms from 9 genes that are differentially expressed (Table 2). 232 The confirmatory test using the Bonferroni or Holm methods yielded the same results as the Hochberg 233 method. We also performed the differential alternative RNA splicing analysis using the proposed linear 234 mixed model with an unstructured covariance structure for the exploratory purpose, and summarize the 235 results in Supplementary Tables 1-2. The limitation for specifying the unstructured covariance matrix lies 236 in the possible inflated type I error rates due to the large number of covariance parameters included in the 237 estimation process. 238 239  that there is no differential splicing, the distributions of the relative isoform abundance are expected to be 253 similar, and the lines representing the mean isoform expression profile for the two conditions, normal vs. 254 tumor, are expected to be parallel. We have observed considerable non-parallel patterns for the 4 255 differentially spliced genes that passed Type 2 screening test: C3orf17, PRKAA1, RRBP1, and TRIP12. in ACC patients who are not free of cancer, which is a surrogate for poor ACC outcomes (Table 2 and 283 We also applied the proposed approach to identify differentially expressed/spliced isoforms that are 284 associated survival outcome in a large cohort of pediatric AML from the NCI/COG TARGET-AML 285 initiative [29]. The RNA-seq data at exon, isoform and gene levels as well as the clinical data for the AML 286 patients are downloaded from NCI/COG TARGET website at https://ocg.cancer.gov/programs/target. We 287 analyzed a subset of 234 patients who have clinical outcomes with 81 cases (relapsed within 3 years) and 288 153 controls (CCR for at least 3 years). After removing those with low abundances and those that are not 289 in the coding region, we analyzed a total of 35397 isoforms representing 8058 genes. The multi-dimensional 290 multiple comparisons, again, were corrected using the two-step procedure, and statistical significance was 291 considered at standard cutoff of. 0.05 for OFDR. We identified 786 genes that have either differentially 292 spliced or differentially expressed isoforms using the Type 1 screening tests that simultaneously test for the 293 main group effect along with the interaction effect; and 856 genes that are differentially spliced using the 294 Type 2 screening which tests for just the condition by isoform inter-action term. Table 3 summarizes the 295 distribution of the genes detected using our method that passed either Type 1 or Type 2 screening tests 296 grouped by the number of studied isoforms for each gene. The union of the two lists of genes is presented 297 in Supplementary differentially expressed based on the aforementioned two lists of the genes, respectively. Note that the 300 majority of the genes from Type 2 screening test cannot be identified by the traditional gene level 301 differential expression analysis. We performed the t-test with BH adjustment on the gene expression data 302 of these 856 genes and none of them are called significant at level FDR=0.05. Figure 3 shows the isoform 303 expression profiles of two genes, EEF1B2 and RUNX2. While there seems no difference in the overall 304 expressions of the genes there appears to be interesting splicing events for these two genes. EEF1B2 is one of the Eukaryotic translation factors that have received much attention recently with regards to their role in 306 the onset and progression of different cancers [45]. For this gene, two isoforms ENST00000435123 and 307 NEST00000455150 are significantly up-and down-regulated in cases as compared to controls while the 308 rest of 6 isoforms do not seem to differ between the two conditions. Another gene RUNX2 is a member of 309 RUNX family proteins that are generally considered to function as a tumor suppressor in the development 310 of leukemia. Our approach has identified one isoform NEST00000371436 that is significantly down-311 regulated in the cases as compared to the control while the rest of the isoforms showed no difference 312 between the two conditions. As a comparison to our approach we applied t-tests with BH multiple 313 comparison adjustment to the same data set. We found that only 13 isoforms (representing 12 unique genes) 314 were significantly differentially expressed between the two conditions at the level FDR=0.05. 315  322 We performed a simulation study to verify that our proposed two-step procedures with different options for 323 the confirmatory stage inference properly control the OFDR and to compare their statistical powers. In 324 order to capture the dependence structure of all the isoforms for each gene, we conducted the following 325 simulations based on the data set used in Section 3.2, where we considered two conditions, free of cancer 326 versus not. We considered Type 1 screening test and used the covariance structure with unequal error 327 variances as shown in Equation (2). The results for Type 2 screening should be similar. 328 In the case of two conditions ( = 2) we re-write model (1) as the following (7)  ;< if = 2, ⋯ , .
(9) 337 The matrix form of (7) can be written as 338 ;< ⋯ )G ;< ) H , 340 There are 2 parameters : , ) < , ) ; , ⋯ , G ; , )) ;< , ⋯ , )G ;< corresponding to the fixed effects and + 1 344 parameters in the variance-covariance matrix (2) for a full model, which we refer to as the model for 345 simulating a false null hypothesis gene (FNHG) or simply a full model. If the null hypothesis (3) holds, i.e. 346 ) < = )) ;< = ⋯ = )G ;< =0 then the number of parameters corresponding to the fixed effects is reduced to 347 where the variance-covariance matrix (2) keeps the same. We call this model as that for simulating a null 348 hypothesis gene (NHG) or a reduced model. 349 Our approach to simulation is to choose one gene with L isoforms, which we refer to as a template gene 350 and fit a full model and a reduced model to the data of the gene. Based on the estimated parameters of the 351 full model and the reduced model we simulate a number of RNA-Seq datasets with =1000 genes, L 352 isoforms for each gene, and two conditions ( = 2). 353 We conduct the simulation in three scenarios corresponding to three different effect sizes: small, medium 354 and large. In each scenario we simulate data for V NHGs and − V FNHGs from multivariate normal 355 distribution based on model (10). To accommodate various patterns of the differentially expressed isoforms 356 in real conditions, we simulate data for two types of FNHGs which we refer to as full FNHGs and partial 357 FNHGs; each accounts for half of the − V genes. The isoforms of a full FNHG are all assumed to be 358 differentially expressed, i.e. the mean difference (9) is not equal to zero for every isoform, whereas a partial 359 FNHG is defined such that only half of the isoforms of gene are differentially expressed. Data for the NHGs 360 are simulated based on the estimated parameters of the reduced model from real data and these parameters 361 are kept the same among the three scenarios with different effect sizes. In generating the data for the FNHGs 362 the parameters : and ) ; , ⋯ , G ; as well as the parameters for the covariance matrix are also kept the same 363 among the genes and among the three scenarios. The difference lies in the way ) < , )) ;< , ⋯ , )G ;< are specified 364 for the simulations. Suppose that the estimates of these parameters for the full model are ) < , )) ;< , ⋯ , )G ;< . If 365 ) < is greater than 0, then ) < 's for the full FNHGs are randomly chosen (uniformly) from the 366 intervals[0, ) < ], [ ) < − ) < /2, ) < + ) < /2], and [ ) < , 2 ) < ] corresponding to the small, medium and large 367 effect size scenarios. If ) < is less than 0 then the three intervals are [ ) < , 0] , [ ) < + ) < /2, ) < − 368 ) < /2], and [2 ) < , ) < ]. )) ;< , ⋯ , )G ;< are chosen in the same way. The parameters for simulating partial 369 FNHGs are chosen in the same way except that we force ).
;< = − ) < for = [( + 1)/2] + 1, ⋯ , , 370 where [ ] is the largest integer not greater than . This will force half of the isoforms not to be differentially 371 expressed. We choose a series of V ranging from 0 through , and for each V we simulate 100 datasets, 372 each includes =200 independent sample units for each condition. The level is set at 0.05 and the 373 simulation results for each V are the averages of the results from the 100 replications (simulated datasets). 374 We applied our two-step procedure to each simulated dataset with Type 1 screening test, two options on 375 the test statistic (t-test and Wald-test) and three options for controlling the FWER (Bonferroni, Holm, and 376 Hochberg methods) that were employed in the confirmatory stage. We compared our proposed two-step 377 procedures to standard isoform-by-isoform analysis with t-tests and BH correction. We called this analysis 378 as simple BH that is borrowed from [32]. There is no gene level screening test for simple B-H method. In 379 order for comparing this method with our two-step procedures we define a screening test for the simple BH 380 slightly improved power than our two-step procedure in most of settings, which is partly attributable to the 419 over inflated OFDR. When the percentage of NHGs is large (e.g. greater than 90%) the difference in power 420 between the simple BH and our two step procedure is often small. In the setting of small effect size with 421 the simulation based on gene MDM2 (Fig 4) which has the weakest signals among all simulation settings, 422 we observed that our two-step procedure shows an improved power (II) compared to the simple BH method 423 although the OFDR of simple BH is still overly inflated. The difference in power (II) substantially increases 424 as the number of NHGs increases. It implies that when the effect sizes are small and the number of NHGs 425 are large (which is common in real data sets) our method provides more statistical power than the simple 426 BH. This partially explains why in the real data example the simple BH method could not identify any 427 significant isoforms. 428 We examined the differences caused by using different tests (Wald-test or t-test) in the confirmatory stage 429 by comparing Figs 3, S2 and S3 with S4 -S6 Figs. The OFDR is well controlled regardless of whether the 430 Wald-test or the t-test is used in the confirmatory stage and the differences in powers are negligible for all 431 the simulation settings. 432 We repeat the evaluation of OFDR and power of our two-step procedure and simple BH under a setting of 433 is reduced. The power of the simple BH method has more reduction as compared to our two-step procedure 435 while the OFDR of the simple BH keeps inflated in all the settings. We also notice that when the percentage 436 of NHGs is large, the OFDR of our two-step procedure is also slightly inflated. This is because the 437 likelihood ratio tests using the standard Chi-square distribution for fixed effects in linear mixed effects 438 models tend to be "anticonservative" and their type I errors are sometimes inflated, which has been reported 439 tools that appropriately account for the unique data characteristics, e.g. differential expression/splicing 449 patterns. In this work, we developed a unified approach for simultaneously assessing genes with 450 differentially expressed or spliced isoforms, which also employed appropriate hierarchical hypotheses-451 testing framework to efficiently alleviate the multiple comparison burden. Our approach provides several advantages in modeling the patterns of differential expression/splicing while 469 appropriately accounts for the correlation among different isoforms. First, the utilized linear model 470 approach fully characterizes three different types of genes with distinct alternative RNA splicing patterns 471 with differential expression and or differential splicing profiles. We proposed two types of screening tests 472 using a linear mixed effects model approach that thoroughly characterizes and differentiates three types of 473 genes related to alternative RNA splicing events. Specifically, our approach differentiates among genes that 474 have no differentially expressed isoforms, genes that have differential expression of isoforms but no 475 differential splicing, and genes with differentially spliced isoforms with differential expression at the 476 isoform level but not necessarily at the gene level. Our proposed type 1 screening tests will identify genes 477 that have either differential isoform expression or differential isoform switches at the same time. The 478 significant genes detected using the type 1 screening tests provide a solid foundation for downstream 479 pathway analysis in order to identify biologically meaningful gene sets that are enriched for genes with 480 different types of alternative splicing events. This overcomes the limitation of existing pathway-based alternative splicing analysis which requires an extra step to combine the significance in differential 482 expression and splicing [47]. Second, our mixed model including fixed and random effects effectively 483 accounts for the biological correlation structure among isoforms of the same gene. These correlations are 484 often times ignored in other existing statistical tools, which may yield false positive or false negative results. 485 Third, we conducted extensive simulation studies and demonstrated that our proposed linear mixed effects 486 modeling framework coupled with the use of two-step hierarchical hypothesis testing procedure 487 appropriately controls the gene-level overall false discovery rate (OFDR) which also provides improved 488 statistical power and computational efficiency to discover genes with significant differential 489 expression/splicing patterns or genes with isoform switches. Lastly, the application to two real RNA-Seq 490 studies have demonstrated the advantages and improved performances of our method in the differential 491 alternative RNA splicing analysis. 492 In summary, we developed a unified approach for simultaneously assessing genes with differentially 493 expressed or spliced isoforms that offers much more flexible and improved performance than existing 494 methods which can 1) incorporate a broader range of models to analyze differential splicing and expression; 495 2) include testing between conditions that have multiple samples within each condition and allow more