Discovering Transcription and Splicing Networks in Myelodysplastic Syndromes

More and more transcription factors and their motifs have been reported and linked to specific gene expression levels. However, focusing only on transcription is not sufficient for mechanism research. Most genes, especially in eukaryotes, are alternatively spliced to different isoforms. Some of these isoforms increase the biodiversity of proteins. From this viewpoint, transcription and splicing are two of important mechanisms to modulate expression levels of isoforms. To integrate these two kinds of regulation, we built a linear regression model to select a subset of transcription factors and splicing factors for each co-expressed isoforms using least-angle regression approach. Then, we applied this method to investigate the mechanism of myelodysplastic syndromes (MDS), a precursor lesion of acute myeloid leukemia. Results suggested that expression levels of most isoforms were regulated by a set of selected regulatory factors. Some of the detected factors, such as EGR1 and STAT family, are highly correlated with progression of MDS. We discovered that the splicing factor SRSF11 experienced alternative splicing switch, and in turn induced different amino acid sequences between MDS and controls. This splicing switch causes two different splicing mechanisms. Polymerase Chain Reaction experiments also confirmed that one of its isoforms was over-expressed in MDS. We analyzed the regulatory networks constructed from the co-expressed isoforms and their regulatory factors in MDS. Many of these networks were enriched in the herpes simplex infection pathway which involves many splicing factors, and pathways in cancers and acute or chronic myeloid leukemia.


Introduction
Gene expression levels are highly dependent on the regulation of transcription factors which mainly bind to the near-promoter regions to facilitate or block the recruitment of DNA polymerase II (pol II) and other complexes. Some methods have been proposed to predict -gene expression using such binding information of transcription factors [1,2]. Conlon et al. [2]suggested associating gene expression with the interaction strength between the upper stream of the gene and motifs of its transcription factors. This interaction was defined in terms of degree of matching and occurrences of binding sites. After that, a probabilistic approach was proposed to infer regulatory rules of transcriptional networks from gene expression data and DNA sequences [1]. This method started with finding co-expressed genes, and then extracted a large number of putative regulatory DNA motifs in these co-expressed genes. However, these methods only considered transcription levels extracted from gene array data and lost sight of other mechanisms such as alternative splicing. Moreover, these methods may suffer high false positive rate when mining transcription factor binding sites. To control the rate of false positive, integrating information from other kinds of data, like conservation data, is also necessary.
Alternative splicing is one of the most versatile mechanisms of gene expression regulation and accounts for a considerable proportion of proteomic complexity in higher eukaryotes. The mRNA isoforms produced by this alternative processing comprise of different combination of exons, and may differ in structure, function, and other properties [3,4]. Different mRNA isoforms are translated into different protein isoforms (if they exist) which may have related, distinct or even opposing functions.
The alternative splicing process is regulated by many types of RNA binding proteins, especially the heterogeneous nuclear ribonucleoproteins (hnRNP) and the serine/arginine-rich (SR) family. In eukaryotic cells, hnRNP proteins participate in almost all pre-mRNA processing steps including splicing, mature mRNA export, localization, translation, and stability [5,6]. SR proteins are involved in regulating and selecting splice sites in eukaryotic mRNAs. These proteins, either 'classical' SR proteins or additional SR proteins [7], generally have at least one RRM (RNA recognition motif) domain for RNA binding and a Cterminus RS domain for controlling interactions with other proteins (including other SR proteins). In addition to alternative splicing, the SR proteins are also involved in mRNA nuclear export and mRNA translation. One of the most important characteristics of these splicing factors is their functional specificity. Many animal experiments suggest that the RNA binding ability of individual SR proteins are sequence-specific and their ability to regulate alternative splicing is different [8][9][10].
These highly specific and non-redundant characteristics of splicing factors motivated researchers to look for association between abnormalities in SR proteins and the development of human cancers. Although the underlying mechanisms are elusive and need further studies, splicing factors that regulate specific pathways in diseases can be treated as putative markers, especially when their targets in these disease-related pathways have experienced alternative splicing and produced different protein isoforms.
In this study, we built a systematic method to identify transcription factors and splicing factors that regulated genes to produce different RNA isoforms in diseases. Our framework consists of four steps. First, differentially expressed mRNA isoforms (DEIs) are extracted by comparing abnormal cells and . Process raw RNA-seq data, find out deferentially expressed isoforms using Tophat and Cufflinks and cluster these isoforms to get gene cluster that may be regulated by same TFs and SFs. (B). Construct two dataset, promote region data (PRD) and exon-intron data (EID), for mining the interaction strength of the TF-isoform interactions and SF-isoform interaction. (C). Use interaction strength to predict the expression levels of isoforms in a co-expressed group. (D). Link model-selected TFs and SFs with their target genes. doi:10.1371/journal.pone.0079118.g001 controls using RNA-seq analysis tools [11]. These DEIs that may experience abnormal splicing are putative targets of splicing factors that might function abnormally in disease. Assuming that co-expressed genes have a good probability of being functionally related, we clustered these DEIs to co-expressed groups using hierarchical clustering method. These co-expressed isoforms may be regulated by the same group of transcription factors (TFs) and splicing factors (SFs). Since co-expressed mRNAs are more likely to have their promoter regions bound by common transcription factors, we constructed a dataset called promoter region dataset (PRD) to mine the TF-isoform interactions (for example, coexpressed isoforms may also have common splicing factors that bind to the regions near their splicing sites). We then constructed an exon-intron (centered at splicing sites and extending 200 bp on both sides) dataset (EID) to explore SF-isoform interactions. The binding strength of the TF-isoform interactions is quantified by scoring the transcription factor's binding sites in the promoter region. Similarly, the binding strength of the SF-isoform interactions is defined by scoring the splicing factor's binding sites in the exon-intron regions of pre-mRNA. To integrate both kinds of regulation, we built a linear regression model and selected a subset of transcription factors and splicing factors that can regulate the expression of co-expressed isoforms using least-angle regression (LARS) [12] selection approach.
The proposed method was applied to a RNA-seq dataset comprising of 4 myelodysplastic syndromes (MDS) samples (RAEB subtype) and 5 matched controls. MDSs are defined as clonal stem cell disorder characterized by ineffective hematopoiesis and impaired difference in some bone marrow lineages, leading to peripheral-blood cytopenias. According to the WHO [13], the main categories of MDS include refractory cytopenia with unilineage dysplasia (RA), refractory cytopenia with ringed sideroblast (RARS), refractory cytopenia with multi-lineage dysplasia (RCMD), refractory cytopenia with excess blasts (RAEB), 5q-syndrome and unclassifiable MDS. In this paper, we focused on RAEB cases which are characterized by 5-20% myeloblasts in the marrow [14]. This is a higher risk subtype and likely to transform to acute myelogenous leukemia (AML).
Some work studied the genetic abnormality of MDS with mutations of key genes [15], e.g. TP53 and RUNX, which are highly related with poor overall survival. Recently, the recurrent mutation of splicing factor U2AF1 has been validated for MDS patients [16]. In-vitro experiments demonstrated that the mutated U2AF2 enhances splicing and promotes exon skipping. These genetic aberrations would improve the prediction of prognosis and development of novel treatment. However, mutation detection cannot determine which genes are altered due to the abnormality of splicing factors. Our method can recognize not only isoforms that experience an abnormal splicing process but also their putative regulatory factors. The co-expressed isoforms and their transcription and inferred splicing factors comprise our transcription and splicing networks.
To verify the biological significance of these regulatory networks composed of a group of co-expressed isoforms and their regulatory factors in MDS, we showed theses networks were significantly enriched (p,0.0001) in some known Gene Ontology (GO) biology processes and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways. Enrichment analysis demonstrated that more than half of these networks were enriched in herpes simplex infection due to involving many splicing factors. Six networks were significantly related with pathways in cancer, and five networks were significantly related with acute or chronic myeloid leukemia (AML).

Methods
Our four-step framework ( Figure 1) starts with raw RNA sequencing (RNA-seq) data. First, differentially expressed isoforms (DEIs) are identified from a large amount of isoforms. We cluster these DEIs into co-expressed groups that may be regulated by common TFs and SFs ( Figure 1A). At the second step, two datasets, promoter region data (PRD) and exon-intron data (EID), are constructed, where the PRD dataset is used for mining the interaction strength between TFs and isoforms and the EID is used for exploring the interaction strength between TFs and isoforms ( Figure 1B). This step outputs two interaction strength matrices with rows corresponding to isoforms and columns corresponding to TFs or SFs. Then, by taking the interaction strength matrices as observations of explanatory variables, we build linear model to regress the expression levels of isoforms in a co-expressed group. To detect the most important factors that regulate a co-expressed group, a reliable model selection method called least-angle regression (LARS) is applied ( Figure 1C). At the final step, the TFs and SFs selected by the LARS are linked with their target genes, forming regulatory networks ( Figure 1D).

Ethics Statement
This study was approved by the Institutional Review Board of The Methodist Hospital, Houston, Texas, USA and the need for written informed consent from the participants was waived by the IRB.

Data Preprocessing and Co-expressed Isoforms
RNA samples were prepared from 9 individuals, 4 samples from RAEB patients and 5 controls (http://ctsb.is.wfubmc.edu/MDS/ MDS.html). After sequencing using Genome Analyzer II (GAII) (Illumina, San Diego, CA), millions of reads were produced for each sample. The raw data were processed using a RNA-seq analysis pipeline [11]. Briefly, reads in the FASTA format were first aligned to whole genome using TopHat software. Since our sequencing data are pair-ended, the fragments were selected at 350 bp and the length of reads was 76, the inner distance between mate pairs was set as 198. The standard deviation for the distribution on inner distances between mate pairs was set as 30 based on our estimation. Then Cufflinks was called to estimate the expression level (RPKM, reads per kilobase of transcript per million mapped reads) of each isoform. Then, Cuffdiff was used to identify differentially expressed isoforms (using default parameters). This protocol returned 1056 isoforms that were differentially expressed in MDS compared with controls. Before using these DEIs to build splicing networks, we first filtered out those isoforms that had not been validated at the protein level. Then, we checked the condition (control or MDS) in which an isoform was upregulated and required that all FPKM values under this condition were higher than 5 [17].
Although there are debates on whether co-expressed genes are functionally related, numerous studies have suggested that at least some co-expressed clusters function together [18][19][20]. Here we applied hierarchical cluster analysis to those differentially expressed isoforms to obtain co-expressed groups. The distances between genes were defined as Pearson correlation. Allocco et al. [21] concluded that genes with strongly co-expressed mRNAs were more likely to have their promoter regions bound by common transcription factors. However, this co-regulating effect was significant only when expression of these co-expressed mRNAs were highly correlated. To control the similarity level of expression profiles of isoforms in the same group, we limited the  PSF SFPQ Splicing factor proline/glutamine-rich (polypyrimidine tract binding protein associated).

SF1
Splicing factor 1. Gomafu lncRNA UACUAAC repeats bind to mouse SF1 with a higher affinity than the mammalian branch point consensus regulating splicing efficiency by changing the splicing factors nuclear level (PMID: 21463453)
size of each group to be between 10 and 30 [19,22]. In our MDS study, this resulted in 34 clusters ( Table 1).

Mathematic Modeling
Transcription factors regulate transcription, which controls gene expression. Splicing factors regulate alternative splicing, which splices pre-mRNA to RNA isoforms and in turn changes protein expressions [23]. Some predictive methods have modeled gene expression levels as a linear function of occurrences of TF-binding sites (TFBSs) [2,24]. In this paper, we focus on RNA isoforms instead of genes and try to link isoform expression levels not only with the transcriptional factors but also with splicing factors. Isoform expression levels were formulated as linear regression of the strength of TF-isoform interactions and SF-isoform interactions, given by To estimate the binding strength between TFs and isoforms, we first constructed a promoter region database (PRD, Figure 1B) comprised of promoter sequences of isoforms. As in previous study [25], we extracted 2000 bp upstream of the transcription start sites as promoter regions. Similarly, we constructed an exon-intron dataset (EID, Figure 1B) to estimate the binding strength between SFs and isoforms. Most splicing factor binding sites locate near the splicing sites [26], especially 200 bp on both sides of this site. Therefore, for isoforms with alternative splicing, we gathered 200 bp of sequences around their splicing sites into EID to find splicing factor binding sites.
Based on the PRD and EID, we could define the TF-isoform interaction strength and SF-isoform interaction strength. First, we

Splicing Factors
Gene Name Description SRp38 FUSIP1 FUS interacting protein (serine/arginine-rich). Dephosphorylation converts SRp38 to a splicing repressor (PMID: 12419250). SRp38 functions as a general splicing repressor when dephosphorylated, but when phosphorylated it functions as a sequence-specific splicing activator (PMID: 18794844).

LARS algorithm
Data: Normalized expression levels of co-expressed isoforms Y N|1 , normalized interaction strength matrix X N6P

Output: Regression coefficients a P61
All coefficients a i (i~1, Á Á Á P) equal to zero; Active set A~1; Find predictor x j1 most correlated with Y N61 ; Let direction D~x j1 ;

Repeat
Adjust the coefficient in the direction D at the highest step possible until some other explanatory variable x jm has the same absolute correlation residual r~Y {Ŷ Y ;

Put x jm in A;
Let D in the direction that is equiangular with x j1 ,x j2 , Á Á Á ,x jm downloaded TF motif models (PWM, position weight matrixes) from TRANSFAC [27] and JASPAR [28]. Then, for each TF, a hidden markov model based method called MAPPER [29] was used to identify its sites (TFBSs) in PRD. For each binding site, the MAPPER also outputs a binding score measuring the binding affinity. TFs with no hits on PRD were removed. After retrieving all putative binding sites and their binding scores, we defined the interaction strength between a TF j and a RNA isoform i as S TFij~m ean(S ij,1 ,S ij,2 ,S ij,3 , Á Á Á ,S ij,N ). Where, S ij,n is the score of n-th binding site between i-th isoform and j-th transcription factor; and N is the number of binding sites. Unlike TFs, SFs lack a reliable PWM database available. Thus, we gathered 53 splicing factors that are related with bone marrow cancers and their binding motifs from the SpliceAid [30] repository. This database collects all the experimentally assessed motifs that are bound by splicing factors in humans by means of an exhaustive literature search. Motifs with positive scores facilitate exon definition as exonic splicing enhancers (ESEs) or intronic splicing silencers (ISSs) motifs, while motifs with negative scores   Table 2). Since motifs of these splicing factors are degraded short pieces (6 to 10 base pairs on average), when retrieving all binding sites of these SFs using alignment tool Bowtie [31], we got numerous hits, most of which were false positives. To identify putative binding sites with high reliability, we downloaded base-wise conservation scores (phyloP) by comparing 45 vertebrate genomes with human genome from the UCSC Genome Browser website (http://genome.ucsc.edu/). We defined the conservation score of a hit as the average conservation score of the nucleotides it spanned. Hits with scores lower than 2 were deleted. Though conversion score is not directly related with interaction strength, a highly conserved binding site on an isoform does provide evidence of a strong interaction. We further averaged conservation scores of all hits of a splicing factor on an isoform to eliminate the effect of exon numbers. This average is defined as interaction strength between a SF and an isoform, S SFij~m ean(S ij,1 ,S ij,2 ,S ij,3 , Á Á Á , S ij,N ), where, S ij,n is the conservation score of n-th binding site between i-th isoform andj-th splicing factor; S SFij is the interaction strength between j-th splicing factor and i-th isoform; Nis the number of binding sites. Some factors, especially the hnRNP family, are due to alternative splicing, for example hnRNP H1, hnRNP H2 and hnRNP H3, which means they may have similar structure [32],motifs [33] and binding profiles. After we computed their interaction strength with isoforms according to our definition, the correlation between interaction strength vectors were high. To increase the robustness of our linear regression model, we averaged the binding strength and obtained four new factors, called hnRNP A1/A2 (average of hnRNP A1 and A2), hnRNP H1/H2/H3(average of hnRNP H1, H2 and H3), hnRNP C/C1/ C2 (average of hnRNP C, C1 and C2) and hnRNP E1/E2 (average of E1 and E2). Integrating all these binding strength, the results were two interaction strength matrices with rows corresponding to isoforms and columns corresponding to TFs or SFs. These were bound  together to build linear regression model in equation (1) to fit isoform expression levels ( Figure 1C).

Model Selection
The dependent variable (isoform expression) and explanatory variables (interaction strength value of factors) in the linear regression model in (1) were all normalized. Many methods such as ordinary least squares (OLS), ridge regression with a L2 norm penalization [34], LASSO with a L1 norm term [35], and Leastangle regression (LARS), can be adopted for optimization and selection of linear model. However, when the number of explanatory variables is higher than number of observations, it is more critical to balance regression accuracy and interpretation capability. Unfortunately, the OLS method incurs both regression accuracy and interpretation ability problems. Ridge regression fails to achieve a parsimonious set of predictors, which may lead to stable regression accuracy but a poor interpretation of model. LASSO, in which a L1 norm penalization term is employed, helps to improve both issues. However, it adopts quadratic programming techniques to solve a constrained optimization problem, which may limit its application when the sample size is small with respect to the number of explanatory variables. Therefore, in this paper, we adopted a promising model selection method called LARS [12], which can balance the accuracy and interpretation capability better, to selected TF and SF factors that regulated each group of co-expressed isoforms. The LARS algorithm is summarized in Table 3.

Biological Significance of these Regulatory Networks
At the last step of the flowchart in Figure 1, a transcription and splicing network (TSN) was constructed with a group of coexpressed isoforms and the selected TFs and SFs using the method described above. To validate the biological significance of these regulatory networks we conducted a comprehensive functional enrichment analysis.
First, we studied the enrichment of these inferred regulatory networks in the KEGG pathway [36]. The KEGG database is a collection of manually drawn pathway maps representing exiting knowledge of the molecular interaction and reaction networks. We mapped all items (isoforms, TFs and SFs) in TSNs to their ENTREZ ID and gathered all KEGG pathways from http:// www.genome.jp/kegg/. The enrichment of a network in a pathway was evaluated using Fisher's exact test.
We also used the Gene Ontology (GO) [37] database for enrichment analysis. This project defines GO terms and structures GO ontology as a directed acyclic graph through which each term has relationships to one or more other terms in the same domain. We downloaded GO biological processes through package GOSim [38] for R language. Considering the size of our regulatory networks, we filtered out GO biological processes involving more than 100 genes or fewer than 10 genes. Fisher's exact test was performed using enrichment analysis function (GOenrichment) in the GOSim package.

Results
To demonstrate how our method works, we first looked at results of linear regression. After regression and model selection, 31 networks were obtained. Three co-expressed groups that did not fit the linear model well were removed. Table 1 lists the gene names of each co-expressed group and the corresponding TFs and SFs identified by our method. Here we choose Network 1 as an example for further explanation ( Table 4). In this network, there are 12 target co-expressed isoforms. For convenience, we used gene names instead of isoform names. The LARS algorithm identified 5 transcription factors and 5 splicing factors that may regulate their expression. The adjusted coefficient of determination (adjusted R squared) is 0.991, which means expression of these co-expressed isoforms was well fitted by their interaction with the selected factors. Coefficients in regression are indicators of contribution from corresponding factors. Factors with higher coefficients, such as MBNL1, PSF, Sam68 and SRp38, predominate in influencing expression of this group of isoforms. On the other hand, factors with lower coefficients, such as FOX factors and Blimp-1, contribute less to determining expression of their targets. Some of these coefficients are negative, which means these factors may inhibit these isoforms' expression. Details of the regression coefficients are also listed ( Table 4). Many factors may be involved in the expression of these isoforms. Here, LARS only selected the most representative ones that not only could regulate the expression of these isoforms linearly but also did not suffer from an over-fitting problem.
We also checked all the differentially expression isoforms using Ingenuity Pathway Analysis (IPA, http://www.ingenuity.com/ index.html). In all 14 genes (FLT3, KIT, RPL13A, RPL3, RPL6, RPL7, RPS14, RPS15A, RPS16, RPS19, RPS20, RPS4X, RPS6 and RUNX1) had been reported in MDS. Our isoform analysis asserted that they all experienced alternative splicing. However when we used traditional gene array analysis (7 MDS cases, 7 controls downloaded from Gene Expression Omnibus [Gene Expression Omnibus number: GSE16236]), only 2 were recognized as being differentially expressed. This result demonstrated that using only gene expression analysis is not sufficient. Isoform and splicing analysis provides more information about gene profiles and their regulatory mechanism.

MDS-related Factors
We then analyzed the transcription and splicing factors detected by our method. Some of these factors have already been linked with MDS or cancers in the literature, while others, such as SRSF11, may be candidate factors that need further validation.
EGR1. The early growth response gene EGR1 appeared in 58% (18 of 31) of our networks. This factor plays important roles in the regulation of cell growth, differentiation, and survival and has been confirmed as a candidate tumor suppressor gene within the commonly deleted segment of 5 q in MDS. However, accumulating evidence now indicates that it can act a tumor promoter in some cancer, such as prostate [39]. None of our MDS samples are del(5 q), the EGR1 was overexpressed in all disease samples compared to controls,. This suggests that EGR1 plays a significant role in the development of MDS and that its function in non-del(5 q) MDS needs further study.
STAT family. Like EGR1, the STAT family is also involved in cell growth, differentiation, and survival and plays a role as candidate regulator in almost one third of our networks. This group of factors is typically oncogenic through the constitutive activation of tyrosine kinase. The most canonical pathway for the STAT family is FLT3 signaling in hematopoietic progenitor cells. This relationship appeared in our network 18. In this network, two STATs were predicted to regulate the expression of FLT3. Though FLT3 positively regulates the tyrosine phosphorylation of STAT proteins in the FLT3 signaling pathway, STATs are still on the card to form signaling by regulating the expression level of FLT3 through an auto-regulatory loop. He et al. [40] discussed a positive auto-regulatory loop in the Jak-STAT pathway. STAT1/ 3 in this loop tends to induce the expression of various components of the Jak-STAT pathway to strengthen the signaling. Over activation of FLT3 via the ITD mutation is related to the pathogenesis of AML/MDS and is an adverse prognosis marker.
We downloaded splicing factors from RBPDB [43] and checked for alternative splicing. Among these 40 splicing factors, SRSF11 showed isoform switching in 7 of 20 MDS samples (Figure 2A), including 4 RAEBs used to build our model, one RCMD, one AML with MDS and one MDS for which the subtype is unknown. According to the annotation information from the UCSC database, the SRSF11 gene has eight isoforms, three of which (uc001deu.2, uc001dev.3, uc009wbj.1) are highly expressed in our samples. Isoform uc001deu.2 and uc001dev.3 have evidence at the protein level (called p54, Uniprot ID Q05519), while the protein of uc009wbj.1 has not been found yet. Although the total expression levels of these three isoforms were almost the same in MDS and control samples, 2 isoforms (uc001deu.2, uc001dev.3) were highly expressed in MDS samples, whereas the uc009wbj.1 isoform was highly expressed in control (average expression values). The shorter amino acid chain from uc009wbj.1 only contained the RRM (RNA recognition motif), while p54 not only contained RRM at the N-terminal but also had a C-terminal RS domain. This domain promotes protein-protein interactions to facilitate recruitment of the spliceosome [7,44,45] (Figure 2B). These two kinds of splicing factors (with or without SR domain) exhibit two different functions. One is RS-domain dependent (recruiting function) and the other is RS-domain independent (antagonist function) [46]. Hence, we hypothesized that, in MDS cases, the recruiting function of SRSF11 was enhanced and the antagonizing splicing inhibitors was weakened. However, we needed further data to evaluate the effect of this splicing switch.
We also downloaded the protein expression profile of SRSF11 from the Model Organism Protein Expression Database (MO-PED). We found that SRSF11 protein is highly expressed in hematologic diseases ( Figure 2D). It appears that the higher expression of SRSF11 protein is due to the higher expression of uc001deu.2 and uc001dev.3.

Enrichment Analysis
To evaluate the biological function of these 31 networks, we comprehensively analyzed their enrichment in KEGG pathways and GO biological process terms using the Fisher-exact test. Twenty (64.5%) of 31 networks were enriched in at least one KEGG pathway with an FDR-corrected q-value,0.05. Table 5 lists the MDS-related networks. The most enriched pathway is herpes simplex infection in which splicing factors are extensively involved. The second most enriched pathway is pathway in cancer. This is a very general pathway including many diseases, including AML, due to its important role in proliferation. There were also two networks (NT18 and NT20) enriched in the acute myeloid leukemia pathway, the PPAR signaling pathway and the Jak-STAT signaling pathway. Though these pathways are reported with AML, our RAEB subtype which has high risk of transforming to AML should a have similar gene profiles with AML.
These networks were also enriched in 42 different GO biological processes and 21 (68%) were enriched in at least one process (Pvalue ,1e-4). Table 6 lists three selected networks and their corresponding biological processes. Most of these biological processes are related with splicing, including mRNA 59-splice site recognition, regulation of RNA splicing, and mRNA 39-end processing.

Discussion
From transcription to translation, gene expression is modulated by many factors. Traditional predictive models of gene expression only consider the transcription. In this study, we proposed a systematic approach to recognize putative regulatory factors regulating co-expressed isoforms that were differentially expressed in disease. In case of MDS, the most recurrent transcription factors involved in regulating abnormally expressed genes were NKX2-5 and Egr-1. NKX2-5 is a master transcription factor. EGR1 is a candidate tumor suppressor gene within the commonly deleted segment of 5 q and has been claimed to play a role in murine leukemogenesis and development of AML/MDS characterized by abnormalities of chromosome 5. Its overexpression in our MDS cases indicates it may also act as tumor promoter as in prostate cancer. Additionally, we found some putative MDS-associated splicing factors, e.g. SF2 and SRSF11. They were highly related with developmental pathways that were deregulated in MDS cases. Previous reports confirm that SF2 is an oncogene and overexpression of SF2 may cause some tumor suppressors to lose function [49]. Our MDS samples verified its overexpression. We also detected a significant splicing switch of factor SFRS11. The ratio of the isoforms produced by the alternative splicing of SFRS11's pre-mRNA is significantly different in controls and MDS samples. This provided evidence that aberrant expression and regulation of splicing factors may result in the deregulation of splicing in diseases. Overall, our method is a good choice to detect these disease-related factors.
In addition, this study offered a method to construct transcription and splicing networks. Taking network 18 ( Table 1) as an example. We first input the target genes (DEIs) into the IPA system, and found that these genes are enriched in hematological system development and function, gene expression and cellular development networks. In order to look into the details, only a sub graph (dark nodes in the middle of Figure 3) was displayed. Then we added TFs and SFs to demonstrated regulatory relation. All these edges between target genes and factors (TFs and SFs) were determined based on the interaction strength matrix. Although the connection between factors and target genes were determined by our algorithm instead of experiments, some of them were supported by literatures. For example p53 is one of the most important tumor suppressors. In our network, it was connected with the IFI16 (interferon-inducible myeloid differentiation transcriptional activator), which was consistent with the results in [50]. Those previous experiments indicated that p53 could up-regulate IFI16, and that functional interactions between IFI16 protein and p53 contributed to cellular senescence. The relationship between the FLT and the STAT family was supported by the FLT signaling pathway. Though FLT3 is upstream of the STAT family, it is very likely that a regulatory loop like in the Jak-STAT pathway [40] exists. In myeloid progenitor cells, Egr-1 bound to the Egr-1 promoter [51]. The regulation of STAT1 to EGR1 has also been postulated [52]. However, some novel connections that have not yet been reported, and their reliability needs further validation.
We suggested that splicing factor SRSF11 might function differently in MDS patients and controls. To comprehensively examine its target genes, we screened the whole genome using its reported motif and found a number of putative binding sites. Conservation scores for each binding site were also computed to control the false positive rate. Only one-third of hits with the highest conservation scores were kept for further analysis. Finally, we obtained 1148 conserved binding sites, corresponding to 158 genes. They were all putative targets of SRSF11. All these genes were analyzed using IPA software. The associated network with the highest scores were those involved in cellular development, cellular growth and proliferation and cell morphology. A total of 29 genes were associated with cancers, and 12 were associated with hematological disease, including the MTOR gene which is in the Akt/mTOR pathway and is critical for cell survival and proliferation in high-risk MDS patients [53]. Since the Akt/ mTOR has been advised as a therapeutic target for treating MDS, our study suggested that its abnormality might be related with the splicing switch of SRSF11.

Supporting Information
File S1 RT-PCR. A portable document format (pdf) file contains details of RT-PCR. (PDF)