Dose-Dependent Regulation of Alternative Splicing by MBNL Proteins Reveals Biomarkers for Myotonic Dystrophy

Alternative splicing is a regulated process that results in expression of specific mRNA and protein isoforms. Alternative splicing factors determine the relative abundance of each isoform. Here we focus on MBNL1, a splicing factor misregulated in the disease myotonic dystrophy. By altering the concentration of MBNL1 in cells across a broad dynamic range, we show that different splicing events require different amounts of MBNL1 for half-maximal response, and respond more or less steeply to MBNL1. Motifs around MBNL1 exon 5 were studied to assess how cis-elements mediate the MBNL1 dose-dependent splicing response. A framework was developed to estimate MBNL concentration using splicing responses alone, validated in the cell-based model, and applied to myotonic dystrophy patient muscle. Using this framework, we evaluated the ability of individual and combinations of splicing events to predict functional MBNL concentration in human biopsies, as well as their performance as biomarkers to assay mild, moderate, and severe cases of DM.


Author Summary
Our studies provide insight into the mechanisms of myotonic dystrophy, the most common adult form of muscular dystrophy. In this disease, a family of RNA binding proteins is sequestered by toxic RNA, which leads to mis-regulation and disease symptoms. We have created a cellular model with one of these family members to study how these RNA binding proteins function in the absence of the toxic RNA. In parallel, we analyzed transcriptomic data from over 50 individuals (44 affected by myotonic dystrophy) with a

Introduction
Alternative splicing increases the coding potential of a gene and importantly, allows for regulation of expression of specific isoforms in a developmental and tissue-specific manner. Regulation of alternative splicing is integral for a variety of biological processes including erythropoiesis, neuronal differentiation, and embryonic stem cell programming [1,2]. Misregulation of alternative splicing alters isoform ratios and can cause cancers, muscular dystrophies, and neurological diseases [1][2][3][4][5]. Isoform ratios can be altered by differential expression of tissue specific factors [6,7]. Active areas of investigation in the alternative splicing field include how concentrations of alternative splicing factors affect isoform ratios, and what properties determine whether an isoform is responsive to a broad range of splicing factor concentrations or sensitive to a threshold level of activity.
To address these questions, we focused on alternative splicing regulation by MBNL1, an RNA binding protein involved in muscle, heart, and CNS development [8,9]. In the disease myotonic dystrophy, the activities of MBNL1 protein and its paralogs, MBNL2 and MBNL3, are reduced via sequestration to toxic RNAs, resulting in a titration of MBNL proteins away from their pre-mRNA substrates. MBNL proteins bind to YGCY motifs using Zn finger RNA binding motifs [10,11], including sequences found in the toxic RNAs (expanded CUG and CCUG repeats) that cause both myotonic dystrophy type 1 (DM1) and myotonic dystrophy type 2 (DM2) [12][13][14][15][16]. As has been observed in other microsatellite repeat diseases, the average length of the repeat in patients is loosely correlated with age of onset; repeat lengths also vary across cells and tissues within a single patient, potentially sequestering MBNL proteins to differing degrees [17]. MBNL splicing targets are differentially affected by the disease; DM1 patient samples exhibit a broad range of alternative exon inclusion levels, as compared to control samples containing low numbers of repeats [17][18][19], and MBNL1-dependent splicing events were shown to behave differently in response to different doses of MBNL1 protein [20].
Tunable systems can be used to control expression of specific genes and they can be used to produce a range of mRNA and protein isoforms, and phenotypes that change gradually or sharply, in response to stimuli [21]. Here, we use a tunable system to demonstrate that splicing events are responsive to MBNL1 dose, and characterize their behavior over a range of MBNL1 protein levels. We found that dose-responsive behaviors differ in their steepness, and differ in the protein concentration required to reach half-maximal splicing activity. By further studying a splicing event that is highly conserved between species, we assessed the importance of MBNL1 cis-element organization in controlling dose-response behavior. Similar to the cellbased studies, analysis of muscle biopsies from DM1 patients revealed that splicing events are not perturbed uniformly; here we characterize those splicing events across patients and calculate the inferred MBNL dose, or predicted free MBNL protein concentration, in each patient. These studies lead to predictions for which alternative splicing events are the most robust disease biomarkers.

A cell model with a broad concentration range of MBNL1
A tetracycline-inducible Flp-In T-REx system (Invitrogen) was utilized to express HA-tagged MBNL1 in HEK293 cells (Fig 1A), to allow precise control of HA-MBNL1 concentrations as a function of doxycycline (dox) (0.5 ng/ml to 10 ng/ml). HA-MBNL1 expression covered a broad range (Fig 1B). Consistent with the low detection of MBNL1 protein in untreated HEK293 cells, low MBNL1 protein levels have been described in the literature in the kidney compared to skeletal muscle and heart in human [8,16]. Knock down of MBNL1 and sequestration of MBNL1/2/3 proteins or expression of 960 CUG interrupted repeats resulted in only minimal or modest changes in the percentage of splicing inclusion ΔC for the splicing events tested (Figs 2 and 3), also consistent with low MBNL protein expression in HEK293 cells and that MBNL1 is the predominant paralog. Quantification of transcripts from HEK293 RNA seq data confirmed that MBNL1 transcripts were expressed at higher levels than MBNL2/3 (S1 Fig). This system allowed us to generate a twenty-fold dynamic range of MBNL1 protein concentration and activity in cells.
Expression of MBNL1 was achieved at lower concentrations of doxycycline (dox) than typically used for tet-on experiments (> 5 ng/ml). A sigmoidal-shaped MBNL1 concentration curve was observed when steady-state MBNL1 levels were plotted against the log of the dox concentration ( Fig 1C). Maximal MBNL1 protein accumulation was approximately ten-fold higher than endogenous levels ( Fig 1B) and the concentration was reduced two-fold using siR-NAs to knock down the expression of endogenous MBNL1 to achieve the minimal concentrations ( Fig 1B).
A schematic representation of a typical dose-response curve used to study a hypothetical MBNL1 regulated cassette exon, where MBNL1 promotes skipping, is shown in Fig 1D. The percent spliced in value (C) is plotted as a function of the log of the MBNL1 dose. This schematic curve was fitted to a four-parameter dose-response equation so that parameters that relate to biological phenomena, i.e. concentration or EC50 and steepness of response, could be used to describe the dose-response data ( Fig 1D). The slope of each curve provides information about the responsiveness of an event over the applied MBNL1 concentration range while the EC50 describes the relative amount of protein required for splicing activity.
MBNL1 target exons exhibit dose-dependency, and differ in the amount of MBNL1 required for regulation HA-MBNL1 levels were controlled with the dox-inducible system to generate dose-response curves of seven MBNL1 regulated cassette exons within genes that are expressed in HEK293 cells (Fig 2). Six of the selected events were previously validated as MBNL regulated exons and are well characterized in the DM1 field: MBNL1 exon 5, MBNL2 exon 5, ATP2A1 exon 22, FN1 exon 25, INSR exon 10, and NFIX exon 7 [10,[22][23][24][25][26][27][28]. We also studied CLASP1 exon 19, an exon that was identified to be dysregulated in the DM1 patient data set that is described below. The CLASP1 event, to our knowledge, has not been described previously. MBNL1 promotes exon inclusion of the CLASP1 exon in this study likely through binding to YGCYs within the 5' region of the intron located downstream of the regulated exon; this observation is consistent with MBNL1 binding adjacent to enhanced exons [10,28,29]. A previous study reported that Dox (ng/ml) was titrated to induce MBNL1 expression in HEK293 cells or treated with control siRNA or siRNA against MBNL1. siRNA treatment is not shown for events with transcripts that would be targeted by knock-down (MBNL1/2). RNA was isolated from the cells, RT-PCR was performed and DNA products were resolved on a native gel and isoform ratios quantified. Ψ values were plotted against the log (HA-MBNL1) treatment and fitted to a four-parameter dose-response curve. Splicing assays and western blots for MBNL1 protein quantification were performed in triplicate. (H) Curve fitting parameters are shown in the table.
MBNL1 promoted skipping of an exon within the CLASP1 gene, but this event is a different exon; the YGCY motifs for the previously reported event are upstream [23].
Comparisons of dose-responses for each event revealed differences in the manner of regulation of each exon by MBNL1 (Fig 2). Exons in MBNL2, MBNL1, and ATP2A1 required less MBNL1 protein than the other four events, yet NFIX required approximately three times more MBNL1 protein than MBNL2. Estimated EC50 values partially correlated to previously reported affinity measurements of MBNL1 protein bound to minimal RNAs. For example, exons with lower EC50s (ATP2A1, MBNL1, and MBNL2) tended to have stronger binding (K D s of 15 nM, 11 nM, and 5.8 nM, respectively [10]); exons with higher EC50s (INSR, NFIX) tended to have weaker binding (K D s of 120 nM and 55 nM, respectively) [10], [11]). Absolute values of slopes ranged from 1.0-6.0, with values from 1-1. Binding site composition and organization mediate the dose-response behavior of MBNL1 exon 5 The relationship between MBNL binding site organization and dose-response behavior is likely complex and depends on multiple factors, including trans-factor environment and organization of other cis-elements. However, by studying the splicing behavior of sequence variants of a single event, we could limit the impact of these variables. We mutated cis-elements in the intron upstream of MBNL1 exon 5 to evaluate how putative MBNL binding sites affect doseresponse behavior (Fig 3A). This event contains clusters of YGCYs within an intronic splicing silencer (ISS) that directly precedes regulated exon 5 [24,30] (Fig 3A). Previous work showed MBNL1 binding to these YGCYs in mouse C2C12 myoblasts [29] (Fig 3A). The ISS is considered to be "ultraconserved", with 100% sequence identity between human and mouse [31]. Previously, we showed that individual regions containing YGCYs were unnecessary to achieve maximal splicing activity when eGFP-MBNL1 was overexpressed from a plasmid [24].
We asked whether altering the ultraconserved ISS YGCY organization altered MBNL1 dose-dependent splicing activity. Deletion of regions within the ISS (del1-5) altered YGCY organization and spacing of splicing signals, including a distant branch-site and the 3' splice site ( Fig 3A). Dox was titrated to induce MBNL1 expression, the del mutant pre-mRNA splicing reporters were transiently transfected, and the percentage of spliced transcripts including exon 5, or C, was quantified and fit to curves (Fig 3 and Table 1). Maximum C values were similar to WT for all del mutants (CUG condition, Fig 3), indicating that the mutations did not affect baseline levels of splicing. Minimum C values, or those at the highest MBNL1 dose, were similar to WT for all deletion mutants except del 3, similar to previous observations using high MBNL1 concentrations [24]. However, the shapes of the dose-dependent curves were distinct across the MBNL1 concentration range. The central deletion mutant, del3, had the largest effect on the dose-response. This 18 nucleotide deletion resulted in removal of a single YGCY in the center of the ISS, and led to both a shallower slope and a greater EC50 relative to WT. To rule out additional explanations for why del3 exhibited a distinct dose-response, another minigene reporter, 3M, was created in which the YGCY (UGCGCU) motif within the del3 region was mutated to a sequence that MBNL does not bind, YCCY (UCCCCU) [32]. The doseresponse for 3M was similar to that of del3.
In contrast, del4, a deletion mutant lacking the YGCY 3' of del3, exhibited no significant changes in dose-dependency parameters. Another mutant, 4M, in which the del4 YGCY was  [24]. Asterisks indicate C-T mutations from C2C12 cells (CLIP data, [29] (B-G) Mini-gene reporter dose-response curves were plotted (Ψ verses the log ([HA-MBNL1]), determined by western blot) for (B) del1 (C) del2 (D) del3 (E) del4 (F) del5 (G) 3M in triplicate. The WT (blue) dose-response curve is included with each deletion mutant (red) for comparison. Quantification of the upper band (exons 4-5-6) and lower band (exons 4-6) were used to determine Ψ. A transiently transfected HA-MBNL1 plasmid was used to achieve the highest MBNL1 dose in this experiment and CTG 960 transient transfection was used to achieve the lowest levels of functional MBNL proteins through sequestration. Representative splicing gels are shown. mutated, also exhibited dose-dependent behavior similar to that of WT (S3A Fig). These results indicate that the cis-element organization of this region, in particular the sequence around the central YGCY, may mediate specific dose-response characteristics. Most alterations to binding motifs in this sequence space led to changes in dose-dependent behavior, including reduced slope and increased EC50, potentially through reducing cooperativity or changing RNA structure.

MBNL concentration can be inferred using Ψ from multiple splicing events
In the HEK293 system, we observed that C of each splicing event exhibits a characteristic sigmoid shape with respect to MBNL1 concentration. This well-controlled system allowed us to derive these relationships, and directly measure functional MBNL1 levels by Western blot. However, a major goal in the DM field is to estimate the functional, non-sequestered concentration of MBNL in tissue of DM patients. This metric is impossible to obtain from tissue using current technologies, as free versus sequestered pools of MBNL are dynamic. However, the dose-dependent curves we characterized suggested that C could be used to infer the concentration of functional MBNL in cells. While we observed that cis-element organization plays an important role in dictating the shape of each dose-response curve, the trans-factor environment also likely plays a role, and therefore dose curve parameters will vary across tissues.
First, we assessed whether it was possible to infer MBNL concentration, given a set of C values measured across a range of MBNL doses. We framed the task of estimating [MBNL] and all sigmoid curve parameters, C min , C max , EC50, and slope, as a Bayesian estimation problem. Since we can compute the likelihood of observing C for all seven splicing events from HEK293, given any set of values for [MBNL], C min , C max , EC50, and slope for each splicing event, Bayes' Rule allows us to invert the problem to obtain the posterior probability distribution of each of those parameters, including the underlying MBNL concentration. Indeed, when estimated using this approach, inferred [MBNL] correlated extremely well (R 2 = 0.993) with measured MBNL levels relative to GAPDH, as assessed by Western blot (Fig 4A). Interestingly, a computationally simpler approach using the average splicing dysregulation across all splicing reporters, mean Δ C, also correlates extremely well with MBNL levels as assessed by Western blot (R 2 = 0.994, Fig 4B).

Estimation of functional MBNL concentration in DM1 tibialis biopsies
After using the dose-dependent relationship of C to [MBNL] in the HEK293 system to develop a method to estimate [MBNL] using C alone, we sought to apply this approach to measuring   Table). Samples are sorted by [MBNL] inferred across the horizontal axis, and splicing events are sorted by EC50 along the vertical functional MBNL concentration in the tibialis anterior skeletal muscle, a tissue preferentially affected in DM1. Cognizant that the behavior of each dosing curve would differ in tibialis as compared to HEK293, we separately characterized dosing curves in tibialis, by analyzing RNAseq data from 44 DM1 patients (including 5 patients with proto-mutations, or <100 CTG repeats as assessed by peripheral blood) and 11 healthy controls (sample cohort described in [17]). MISO was used to derive transcriptome-wide estimates for C values of alternative cassette exons [33]; we focused on 46 cassette exon splicing events exhibiting significant differential splicing regulation in at least 25% of DM1 samples (MISO Bayes Factor ! 5 when comparing DM1 to non-DM1, and |Δψ| > 0.2) (S1 and S2 Tables). MISO C estimates were in agreement with values obtained from RT-PCR [17]. While in the context of DM1 many factors potentially influence ΔC, the panel of events identified in this study is consistent with a predominant contribution from MBNL sequestration. YGCY motifs were enriched upstream of the regulated exon for events that are more included in patients (p = 0.001) while YGCY motifs were enriched downstream of the regulated exon for events that are more excluded in the patients (p = 0.006), consistent with previously described MBNL1 binding site maps and observations connecting MBNL1 putative motif location and association with either inclusion or skipping of the alternative exon [10,28,29]. Additionally, previous studies indicate that MBNL depletion accounts for the majority of splicing events observed in the HSA LR mouse model of DM1 [28].
Six out of seven splicing events studied in the HEK293 system were also observed to be dysregulated in DM1 tissue; FN1 was excluded because it lacked sufficient RNA-seq coverage to accurately infer splicing. To characterize dosing curves in tibialis, we applied our Bayesian inference framework, which infers C min , C max , EC50, and slope for each splicing event in tibialis, as well as functional MBNL concentrations for each individual. As expected due to differences in trans-factor environment and expression level, among other variables, dosedependent curves were observed to be similar, but not identical between HEK293 and tibialis anterior (S5 Fig). Interestingly, while curve shapes were similar across both cell types, those in tibialis generally exhibited a broader dynamic range in C, as well as steeper slopes, suggesting that the trans-factor differences in tibialis result in an enhanced dependency of C on MBNL concentration.
Similarly to the HEK293 system, we observed that the intracellular concentration of functional MBNL ([MBNL] inferred ) correlated well with mean splicing dysregulation ( Fig 4C); this is simply a reflection of the fact that the Bayesian Inference approach aggregates C from multiple events to generate its estimates. Ordering individuals by relative [MBNL] inferred resulted in non-DM1 individuals grouping together with individuals carrying DM1 proto-mutations, consistent with limited MBNL sequestration occurring in this subset of DM1 patients (Fig 4D). Some events showed dysregulation in almost all patients (86% of patients exhibit dysregulation for INSR), and others showing dysregulation in a smaller subset of patients (48% of patients exhibit dysregulation for ATP2A1), consistent with some events exhibiting higher EC50 values than others (Fig 4D and S1 Table). C for all samples and all splicing events, along with estimated sigmoid curves, are shown in S6 Fig. Although (17), it is well established that repeat lengths can differ from cell to cell of the same individual, and that long repeats are extremely difficult to size accurately.
axis. The |slope| for each event is indicated on the right-hand vertical axis. White boxes with slashes denote samples with insufficient read coverage to infer Ψ. Events studied in HEK293 are marked with an asterisk.

Accurate inference of MBNL concentration is splicing event and disease severity dependent
We sought to assess the suitability of each splicing event as a potential biomarker for levels of functional, non-sequestered MBNL, a key metric likely correlated to clinical outcomes in DM1. To simulate a hypothetical future clinical trial scenario in which we have estimated C min , C max , EC50, and slope using one cohort of DM1 patients, and would like to estimate [MBNL] for a new cohort of patients, we divided our samples into two groups to perform traditional cross-validation. We used 70% of the individuals to estimate C min , C max , EC50, and slope for every splicing event (training); these trained parameters could be used to plot sigmoid curves for each event (NFIX and CLASP1 shown in Fig 5A). We then assessed how well we could predict [MBNL] for the remaining 30% of samples (testing), by framing the question as another Bayesian inference problem. Here, we obtained the posterior probability distribution for We then used the trained parameters to estimate a posterior probability distribution for [MBNL], framing the question as another Bayesian inference problem. Posterior distributions for [MBNL], obtained using measurements of C for NFIX or CLASP1 are displayed in blue, green, and orange shading, for C values observed in 3 distinct biopsy samples (Fig 5A, right  panel). The "gold standard" estimates for [MBNL], previously calculated using 100% of the data, are displayed in blue, green and orange vertical lines. Apparent from these estimates is that the precision with which [MBNL] can be predicted is dependent on the shape of the sigmoid curve. For example, a precise value for [MBNL] in a severely affected individual (based on ADF measurements) is more difficult to obtain using NFIX as compared to CLASP1, because NFIX offers little discriminatory power between 0 and 0.5 [MBNL] units. In contrast, the predictive power of NFIX is slightly better than that of CLASP1 for mildly affected patients, as the sigmoid curve for NFIX changes more steeply than CLASP1 at the high end of [MBNL]. CLASP1 is an excellent predictor in the moderate range (green); the posterior probability estimate at the "true" [MBNL] value is close to 7 (Fig 5A).
These examples illustrate the principle that some splicing events are better than others for predicting [MBNL], and that their predictive power is dependent on disease severity. To quantitate the performance of each splicing event to estimate [MBNL], we calculated the mean predictive power of each splicing event to estimate "true" [MBNL] across 120 randomly sampled 70%/30% training/testing cohort divisions, where predictive power is defined as the posterior probability estimate at "true" [MBNL]. We calculated the mean predictive power for each event across several patient subgroups-severe ([MBNL] < 0.33), moderate (0.33 < [MBNL] < 0.66), and mild ([MBNL] > 0.66) DM1, as well as across the entire patient cohort. Splicing events best suited to predict [MBNL] in mild DM1 are distinct from those best suited to predict moderate or severe DM1 (Fig 5B). Interestingly, splicing events best suited to predict [MBNL] in moderate DM1 are also those best suited to predict [MBNL] across the entire spectrum (CLASP1, PDLIM3.2, CAC-NA2D1). Analyses of sigmoid curves for the best performers (S4 Fig) indicated that the best predictors exhibit non-zero slopes within the range of [MBNL] being predicted.

Measurement of therapeutic rescue requires proper biomarker selection, and usage of multiple biomarkers improves measurement accuracy
It is well established that the goal of many therapeutic efforts for DM is to increase the free, functional concentration of MBNL in cells and tissues; therefore, it is critical to be able to , were estimated for each splicing event and each sample, using 70% of the tibialis biopsies using a Bayesian inference framework. Sigmoid curves with 95% confidence intervals for NFIX and CLASP1, as estimated from 70% of the data, are shown, along with the Ψ values used to derive the curves (black points). Posterior distributions for [MBNL] were derived for 30% of the data (red points), and plotted for 3 specific samples (blue, green, and orange points). These distributions are also plotted on the right, along with the "gold standard" [MBNL] as estimated using 100% of the data (blue, green, and orange vertical lines). (B) The mean predictive power of each splicing event to predict "true" [MBNL] was calculated across 120 random subsets of training and test sets, where 70% of samples were used for training, and 30% for testing. Predictive power was defined as the posterior measure changes to this metric before and after a clinical study/trial. A hypothetical change in [MBNL] may or may not lead to changes in C for any given splicing event; the change in C is dependent on the shape of the sigmoid curve, as well as the starting point and ending point within the curve (Fig 5C). Thus far, our analysis has made use of only one biomarker at a time to estimate [MBNL]. We tested the ability to improve estimates by incorporating multiple biomarkers; we essentially computed the joint probability distribution of [MBNL] as estimated by each separate biomarker. We identified biomarker combinations that would maximize predictive power across all disease severities, again using 70%/30% cross-validation training/test sets. For each cross-validation trial, we first identified the single best biomarker with maximum predictive value for [MBNL]; we then identified the next best biomarker that minimized errors in concert with the best biomarker, then the third in concert with the first two, and so on. We performed this analysis across all 120 cross-validation trials, and recorded the biomarker sets with highest mean predictive power across all trials. Shown in Fig 5D are posterior estimates of [MBNL] for a severely affected biopsy and moderately affected biopsy, using the best 1, 2, 5, or 10 biomarkers ( Fig 5D). As expected, the probability distribution becomes sharper and more centered on the "true" [MBNL]. Adding more biomarkers can improve the estimation up to~30 markers (Fig 5E), after which further marker addition decreases performance. The decrease in performance is likely due to use of C values from splicing events whose inclusion is not as tightly linked to [MBNL].

Splicing events exhibit distinct responses to MBNL concentration
We studied the behavior of 7 splicing targets in a cell line we developed, in which MBNL1 expression could be titrated across a 20-fold dynamic range (Fig 6A). Each splicing event exhibited a unique, sigmoidal dose-response curve, with distinct EC50 and slope parameters. The steepness, or slope, of C relative to [MBNL1] varied across splicing events, likely due to cooperative binding of MBNL1 and/or other proteins to pre-mRNA. MBNL1 is known to regulate the splicing of other splicing factors such as PTB1 and hnRNPA1 and the stability of transcripts encoding splicing factors [29], so it is possible that secondary effects play a role in producing the observed sigmoid curves.
One might hypothesize that multiple binding sites are more likely to function in a cooperative fashion and that high affinity binding would lead to a lower EC50, but we did not observe a simple relationship between the number and location of MBNL binding sites and EC50 or slope. The location of MBNL bound cis-elements is similar when comparing MBNL1 exon 5 and MBNL2 exon 5, and they exhibited similar dose-dependent responses (Fig 2 and S2 Fig). The relationship between MBNL binding site number, location and affinity almost certainly play an important role in determining MBNL1 dose-responses, but it is also likely that other splicing factors, RNA structure and gene expression levels can modulate these curves. Other factors that regulate splicing in DM1 TA muscle, including CELF1 proteins, will also affect the curve parameters . CELF1 antagonistically regulates some MBNL regulated events including INSR [37][38][39].

The relationship between MBNL binding site organization and dosedependent behavior
We studied the role of specific binding sites upstream of MBNL1 exon 5 in mediating its MBNL dependent dose-response (Fig 3). Inclusion of MBNL exon 5 controls localization of the protein, and appears to be tightly regulated by MBNL and other proteins [29,40,41]. This region is an "ultra-conserved region" [31], and previous work has shown that in mouse myoblasts, MBNL1 protein interacts with the region mutated in these experiments [29]. Some YGCY motifs were dispensable for splicing when MBNL1 was high [24], but at low concentrations, some motifs were more critical than others.
The site that most dramatically altered dose-dependent behavior was a single, central YGCY. Affinity and cooperativity parameters were reduced for the del2/3 regions while del1/5 only had reduced cooperativity (Fig 6B). MBNL1 may bind to central sites first causing the surrounding sites to become more accessible, consistent with the footprinting data of this RNA (Fig 6B and S3B Fig). Structure probing indicated that the central YGCYs were single stranded in vitro ( [24] summarized in S3B Fig) and that the RNA was generally not structured. Notably, previous studies indicate that MBNL1 binds to regions of short single stranded RNAs through the Watson-Crick face of short unstructured RNAs [10,32,42]. Alternatively, differences in structure between the del mutants may have contributed to the observed differences in EC50 and slopes observed (S3B Fig) (mfold server, [43]). For example, the open structures for WT and del4 may facilitate a cooperative dose-response and increased sensitivity to MBNL1 concentrations, while the alternative structure may be a weaker substrate for MBNL1 (S3B Fig). Indeed, recent data suggests a model where MBNL1 interacts with the toxic CUG RNA when it is destabilized by U-U mismatches, allowing MBNL1 to access locally unfolded GC sites [44,45].

Alternative splicing event biomarkers
A key goal in the DM field is to accurately assess the remaining functional levels of MBNL proteins in human tissue. Using the cell-based system, we developed and validated a computational method to estimate MBNL1 concentration, as well as splicing curve parameters, using C alone. The strong correlation between estimated and measured MBNL1 protein levels in the cell-based system motivated us to take a similar approach in human tibialis biopsies from DM1 affected and non-affected individuals, where free MBNL proteins vary between individuals ( Fig  6A). We estimated tibialis-specific splicing curve parameters, which differed from those observed in the cell-based system, and also estimated the levels of functional MBNL protein remaining in tissue. As MBNL proteins regulate the splicing of many pre-mRNA substrates, a major unresolved question is which splicing events are most informative about disease status, and which splicing events in combination are best suited to measure therapeutic rescue in a clinical trial. It is well established that the remaining functional level of MBNL protein in tissue is informative about disease status, and here, we demonstrate that we can estimate this value using C alone, provided we first build a model using C from a number of splicing events across a spectrum of disease severities.
We showed that the range of [MBNL] across which C varies is different across events, and that this characteristic determines which subset of patients (mild, moderate or severe) for which that biomarker exhibits the greatest predictive power (Fig 6C). This has implications for being able to accurately place patient samples along the continuum of disease severities observed in DM, and for being able to accurately measure the extent of therapeutic rescue in a clinical trial. Furthermore, splicing biomarkers whose C values are invariant across a particular [MBNL] range do not provide as much information as biomarkers whose C values change dramatically across that same range. Finally, we showed that using multiple biomarkers in combination to predict [MBNL] holds more predictive power than using single biomarkers alone. The best biomarkers significantly overlap with those previously described [17]; we found the top set of five biomarkers for tibialis to be CLASP1, PDLIM3.2, CACNA2D1, MAPT.1 and CACNA1S (Fig 5D and S3 Table). Interestingly, CLASP1 is a novel event identified in our RNA-seq analysis; it is the top biomarker among all tested, and exhibits a broad range of splicing regulation across [MBNL] in both the cell based system and in tibialis. (Fig 2 and S6 Fig).

Dose-dependent behavior differs in different cellular contexts
The relationship between MBNL1 levels and C was previously investigated in myoblasts and mouse muscle [20]. C for five splicing targets was measured at five concentrations of MBNL1 in myoblasts, achieved by different doses of siRNA; C for similar events was measured at 0%, 50% and 100% MBNL1 level in mouse muscle, using normal, heterozygote, and homozygote MBNL1 knockouts. The relationship of C to [MBNL1] was slightly different in myoblasts versus muscle, depending on the specific exon, suggesting that a complex mixture of cis-and trans-factors mediates dose-dependent behavior, with differing stoichiometry in different cellular contexts. Here, we have separately measured C min , C max , EC50, and slope values for each splicing event in both HEK293 cells and human tibialis, and also observed that these parameters differ between HEK293 cells and human tibialis. These observations suggest that proper selection of splicing biomarkers for a given cell type requires characterization of biomarkers in that tissue, or a basic understanding of how C is modulated by the interaction of multiple trans-factors with pre-mRNA cis-elements.

HA-MBNL1 expression stable cell line production
The full length MBNL1 (isoform 41) with an N-terminal HA tag was cloned into the supplied vector (pcDNA5) and transfected into the HEK293 T-REx FLP cell line (Life Technologies) to create the inducible line following the manufacturer's protocol.

Cell culture and transfection
HEK293 cells with inducible MBNL1 were routinely cultured as a monolayer in Dulbecco's modified Eagle's medium (DMEM)-GlutaMax (Invitrogen) supplemented with 10% fetal bovine serum (Gibco) at 37°C under 5% CO2. Cells were plated in twenty-four-well plates at a density of 1.5 x 10 5 cells/well. Fresh doxycycline (Sigma) or tetracycline (Sigma) was prepared at 1 mg/ml, diluted, and added to the cells at the appropriate concentrations. A pool of three siRNA duplexes (CACUGGAAGUAUGUAGA GAdTdT, GGACAAAUGUGCUUG GUUUUUU, and GAGAGAAACCUGUAUAAUAUU) were transfected into the cells 24 hours prior to harvesting cells using TransIT-siQUEST transfection reagent as per the manufacturer's protocol. Prior to transfection, cells were plated in twenty-four-well plates at a density of 1.5 x 10 5 cells/well. Cells were transfected 24 h later at approximately 80% confluence. Plasmid (500 ng/well) was transfected into each well with 1.5 ul of TransiT 293 (Mirus) following the manufacturer's protocol. 250 ng of either empty vector (pcDNA3.1+) or DMPK-CTG 960 was co-transfected into a single well with 250 ng of mini-gene reporter for all splicing assays. For DMPK-CUG 960 and HA-MBNL (pcDNA3.1+) each reporter was reduced so that the total DNA transfected remained constant and the ratio of DNA to transfection reagent remained constant. Fresh doxycycline (Sigma) or tetracycline (Sigma) was prepared at 1 mg/ml, diluted, and added to the cells at the appropriate concentrations four hours post transfection. The DMPK-CUG 960 plasmid was obtained from the laboratory of Thomas Cooper.

Splicing activity assays
Cells were lysed in the plate and RNA isolated using an RNeasy kit (Qiagen) twenty hours after induction. Isolated RNA (200 ng) was reverse transcribed using Superscript II (Invitrogen), according to the manufacturer's protocol. The RNA for all endogenous splicing event conditions were internally controlled by recovering RNA from the same plated well of cells. Endogenous genes were reverse transcribed using random hexamers except for MBNL1, which a gene specific primer was designed to anneal to endogenous transcripts but not the HA-MBNL1 transgene transcripts: 5'-CTGAGGAACTTTTGTGTGTGTTGCTTGACG-3'. All reverse transcription reactions were subjected to PCR amplification in a 20 ul reaction mixture using flanking exon-specific primers. The number of amplification cycles was determined to be within the linear range for all primers used. Endogenous gene PCR primers (hg19 coordinates are the same as in S2 Table) are as follows: ATP2A1 primers are described in Purcell et al [11], PCR products were resolved by gel electrophoresis on 1.5-mm, 6% native polyacrylamide (19:1) gels run at 300 V for 90 min. Splice products were visualized and quantified using SYBR green I nucleic acid stain (Invitrogen) in combination with an AlphaImager HP system (Alpha Innotech). All reported values were obtained from at least three independent splicing experiments. All experiments with MBNL1 pre-mRNA mini-gene mutations included a WT internal control.

RNAseq
Muscle biopsies, repeat length estimation, and patient strength measurements were conducted as described in Nakamori et al [17] and [47]. 1 ug of total RNA was prepared for Illumina strand-specific RNA-seq using standard methods. In brief, rRNA was removed using the Ribozero beads (Epibio), and remaining RNA was fragmented using fragmentation buffer (Ambion #AM8740). RNA was precipitated and converted to cDNA using Superscript III, random primers, and dUTP, and 2nd strand cDNA using DNA polymerase I and RNase H. Ends were repaired, adenylated, and ligated to Illumina paired end adapters. Ligated fragments were purified by agarose gel electrophoresis (~200 bp band was excised), treated with USER enzyme, and subjected to 14-18 cycles of PCR. A second gel was run to purify product away from primers. Libraries were sequenced in pooled sets of 4 or 6 samples on the Illumina Hiseq, using 2 x 57 bp sequencing runs.

Identification of mis-spliced exons in DM1 patient samples
Illumina RNA-sequencing reads from the tibialis anterior muscle biopsies (44 DM1 and 11 unaffected control) were aligned against the hg19 human reference genome with GSNAP and allowing for novel splicing [48] (http://research-pub.gene.com/gmap/). GSNAP was run with the following options set:-s [splice sites map file]-N 1 -A sam-o FR-pairexpect 300-pairdev 100. A splice sites map file was generated from the hg19 gene models (GrCh37 release 75). Isoform abundances were estimated and each of the 44 DM1 samples was compared to each of the 11 control samples using MISO [33]. A custom set of hg19 alternative event annotations were generated from the Refseq, Ensembl, and UCSC gene models using custom scripts. For this analysis we focused only on the differential inclusion and exclusion of "cassette" exons between the DM1 and control samples. The comparison data was filtered based on the criteria described below to identify a high confidence set of misregulated alternative exons events.
Filtering criteria: 1. ! 70% of control and DM1 samples have sufficient coverage over the event.
a. median Bayes factor ! 5 Curve fitting (Figs 1-3) For data in Figs 1-3, splicing activity curves were fitted using Graphpad Prism software using non-linear curve fitting (log(agonist) vs. response-Variable slope (four parameters)). The min and max were restricted to fall between 0 and 1, respectively.

YCGY motif enrichment near differentially regulated exons
A student's t-test was used to compare the presence of 4-mer motifs that bind MBNL proteins (identified previously [10]), using a 20 bp sliding window within the 400 bp region upstream of cassette exons that are more excluded in controls, with respect to patients, to the same region of an identical number of randomly sampled exons (average sampled using 1000 iterations). A similar comparison was performed with motifs located 400 bp downstream of cassette exons that are more included in controls, with respect to patients.

Bayesian inference to estimate curve parameters and [MBNL]
We modeled the problem using Bayes' Rule as follows: pð0j c ! Þ / pð c ! j0Þ Ã pð0Þ where c ! is the observed inclusion level of an exon across multiple doses or samples, and ϕ describes C min , C max , log(EC50), slope, [MBNL], and a parameter σ. We assume that observed C values are drawn from a normal distribution centered around the modeled C value, with standard deviation σ. Priors for each estimated parameter were as follows: C min~U niform(0, 1), C max~U niform(0, 1), log(EC50)~Normal(μ = 0.5, σ = 1), slope~Normal(μ = 0, σ = 1), [MBNL]Ũ niform(0, 1), σ~Uniform(0, 1). The python package PyMC3 was used to implement Bayesian inference [49]. MCMC sampling was performed using the NUTS sampler and 1000 iterations. This approach was used to infer [MBNL] for all panels in Fig 4,  , ω). Here, ω describes C min , C max , log(EC50), slope and σ, a parameter describing the standard deviation of the normal distribution from which observed C values are drawn from the modeled C value (similar to above). In this case, however, σ is directly computed from the training data, so that events with observed C values that closely match predicted C are more highly favored as biomarkers. Fig 5A shows Fig 5B). To compute predictive power when using multiple splicing events as biomarkers, rather than take an exhaustive approach, which would require testing upwards of billions of combinations of splicing biomarkers (for example, to sample 46 choose 10 is~4 billion), we took a recursive addition approach, where we identified the best biomarker, then identified a second biomarker yielding the best performance in concert with the best biomarker, then a third biomarker yielding the best performance in concert with the first 2 biomarkers, and so on (the best biomarkers are shown in Fig 5D). Predictive power was computed as the joint probability distribution of all posterior estimates of [MBNL] for chosen biomarkers.

Schematic diagrams of splicing events
YGCY motifs were plotted to scale within the regulated exon, 200 nucleotides upstream of the regulated exon and 200 nucleotides downstream of the regulated exon.

TPM calculations (S1 Fig)
Kallisto [50] was used to estimate transcripts per million using a kallisto transcriptome index generated from the hg19 Refseq/Locuslink coding sequences (CDS) fasta data set. Paired-end FASTQ files were used for TPM quantification and the average fragment length of the elibary was automatically esimated by Kallisto. Total TPM values were calculated by summing Refeq TPM values across all corresponding Refseq IDs for a given geneID.  [24]. Single stranded regions are marked with a dashed line, nucleotides with increased cleavage in the presence of MBNL1 are marked with an E, and MBNL1 protected nucleotides are marked with a P (C) Predicted secondary structures obtained using the mfold server for MBNL1 WT and mutant RNAs [43]. (TIF) Schematic element spacing is drawn to scale. (TIF) S1 Table. Events misregulated in DM1 compared to unaffected control tibialis. Splicing event coordinates from MISO from hg19, gene symbol, mean C for control and DM1 samples, standard deviation, mean ΔC, number of samples used to calculate the mean for control and DM1, number of samples for each event with a Bayes Factor greater than 5, and % patients with dysregulated splicing for each event (ΔC > 50%, normalized to the maximum ΔC for each event) are shown in the table. (XLSX) S2 Table. C estimates for events perturbed in DM1 tibialis muscle compared to unaffected controls. C estimates from MISO for genes (indicated by gene symbol and hg19 coordinates) are shown for all samples in this study. NA is used for samples that did not have sufficient coverage to obtain an estimate for that splicing event. (XLSX) S3 Table. Curve-fitting parameters for events dysregulated in DM compared to unaffected controls. Bayesian posterior mean estimates for parameters min C, max C, EC50, slope, respective 5% and 95% confidence intervals, and σ are shown for each splicing event (event_parameters tab). Inferred [MBNL], 5% and 95% confidence intervals, and mean delta psi are shown for each sample (MBNL_inferred tab). Biomarker predictive power for all events ( Fig  5B) and optimal combinations for choosing one to forty-six biomarkers (Fig 5D) are shown (single_biomarkers and biomarker_combinations tabs, respectively). (XLSX)