A Computational Systems Biology Study for Understanding Salt Tolerance Mechanism in Rice

Salinity is one of the most common abiotic stresses in agriculture production. Salt tolerance of rice (Oryza sativa) is an important trait controlled by various genes. The mechanism of rice salt tolerance, currently with limited understanding, is of great interest to molecular breeding in improving grain yield. In this study, a gene regulatory network of rice salt tolerance is constructed using a systems biology approach with a number of novel computational methods. We developed an improved volcano plot method in conjunction with a new machine-learning method for gene selection based on gene expression data and applied the method to choose genes related to salt tolerance in rice. The results were then assessed by quantitative trait loci (QTL), co-expression and regulatory binding motif analysis. The selected genes were constructed into a number of network modules based on predicted protein interactions including modules of phosphorylation activity, ubiquity activity, and several proteinase activities such as peroxidase, aspartic proteinase, glucosyltransferase, and flavonol synthase. All of these discovered modules are related to the salt tolerance mechanism of signal transduction, ion pump, abscisic acid mediation, reactive oxygen species scavenging and ion sequestration. We also predicted the three-dimensional structures of some crucial proteins related to the salt tolerance QTL for understanding the roles of these proteins in the network. Our computational study sheds some new light on the mechanism of salt tolerance and provides a systems biology pipeline for studying plant traits in general.


Introduction
Salinity is one of agriculture's most crucial problems in large parts of the world [1]. Rice (Oryza sativa L.), which provides a major food source for about half of the global population, is considered as the most important cereal crop in agriculture, but it is salt susceptible [2]. Soil salinity is a major abiotic stress, which limits rice production in about 30% of the rice-growing area worldwide [3,4]. Under the heavy pressure of global population explosion and global climate change, studying rice salt tolerance is of high importance. Genetic improvements leading to salt tolerance of cereal crops in molecular breeding could help maintain stable global food supply [5]. Some traditional cultivars and landraces have been identified as tolerant to abiotic stresses, despite their undesirable agronomic traits such as tall plant stature, photosensitivity, poor grain quality and low yield. For example, Pokkali, an Indian landrace, can maintain high K+/Na+ ratio in shoot in a high salinity environment, and it could be a donor of salt-tolerance strains in breeding programs. FL478, an F2-derived F8, inherited the salt tolerance property in recombinant inbred lines from parents Pokkali and IR29. FL478 is also an improved indica cultivar used as a salt-susceptibility standard [6].
With years of continuous exploration, some general molecular mechanisms of salt tolerance in plants have been revealed. The high-salinity environment mainly disrupts the ironic and osmotic equilibrium of cells, and as a result, genes in several pathways are activated in response to high sodium concentration. Pathways related to ion pumps [7], calcium [8], SOS pathway [9], ABA (abscisic acid) [10], mitogen-activated protein kinases [11], glycine betaine [12], proline [13], reactive oxygen species [14], and DEAD-box helicases [15] are of significance in high salinity environment. They play different roles in maintaining high K+/ Na+ ratio, synthesizing and segregating ions, and controlling ion concentration [16]. The genes and transcription factors that encode or regulate these components often demonstrate irregular activities in a high salinity environment. At the cell level, the most significant activities in dealing with excessive salt in plants is pumping ions out of a cell to keep the ion equilibrium, while the vacuole located in the cell helps store some ions. In salt-resistant detoxifying mechanisms, especially sequestration by vacuole [17], many salt tolerance genes with high level of activities in a high salinity environment are related to vesicle, membrane and ion transport. For example, H+-ATPase as a proton pump on cytoplasmic vesicle maintains the ion equilibrium of the cell by pumping H+ to the vacuole to retain pH and transmembrane proton gradient [18]; Na+ transporter plays an important role in maintaining high Na+/K+ ratio in various tissues [19,20]. However, the global picture of salt tolerance mechanisms, especially rice-specific salt tolerance mechanisms is still unclear; for example, how ABA induces H 2 O 2 control and how a plant transduces signals in response to salt tolerance are largely unknown.
Multiple sources of data can enhance the understanding of salt tolerance. The genetic variations of different rice responses to salt stress may shed some light on the roles of various genes in salt tolerance. The availability of rice genome sequencing [21,22] further paved the way for in-depth study of rice salt tolerance. Oryza sativa microarray gene expression data have provided information on regulatory networks of salinity response. Kawasaki et al. analyzed the initial phase of salt stress in rice based on gene expression profiles [23]. Huang et al. identified a zinc finger protein named DST that regulates drought and salt tolerance in rice [24]. Zhang et al. studied OsGAPC3 over-expression in rice tolerance [25]. Mito et al. found that expression of DREB-and ZAT-related genes might be involved in the salt tolerance of the AtMYB102 chimeric repressor line [26]. Schmidit et al. examined transcription factors like heat shock factors (HSFs) in response to salinity environment and they characterized OsHsfC1b as playing a role in ABA-mediated salt stress tolerance in rice [27]. Nevertheless, these studies were mainly focused on a single gene or some isolated genes, and they lack systems-level understanding of the global molecular mechanism of salt tolerance given that salt resistance reacts in a coordinated and effective manner. In view of these findings, we conducted a systems-level study to fill the gap between isolated genes and the global mechanism of salt tolerance.
Among tens of thousands of genes in microarray data, it is challenging to choose the set of genes that are most relevant to salt tolerance [28,29]. Biologists often use a volcano plot method, which reflects both fold of change and its statistical significance at the same time in a heuristic fashion [30]. However, such a method may not be sufficient to discover some complex relationships between genes and a certain phenotype, trait, or condition [31]. Some statistical methods for clustering and classification are extensively used to deal with this problem [32,33]. Several machine-learning methods, such as random forest [34] and SVM-RFE (support vector machine recursive feature elimination) [35] were also developed for this purpose. RFE is commonly used for feature selection, and it eliminates features iteratively until getting the minimum subset of features. By combining SVM with the RFE procedure, SVM-RFE becomes an effective method in selecting and ranking genes in microarray data analyses [36,37,38]. In this study, we improved the volcano plot method using bootstrapping SVM-RFE to select informative genes related to salt tolerance from microarray datasets.
There are both challenges and advantages in analyzing rice data. For crops, there are typically limited experimental data available, and little bioinformatics work has been done on these data. On the other hand, crops, especially rice, have rich data related to traits, such as Quantitative Trait Locus (QTLs). QTLs are stretches of DNA containing or linked to genes that underlie a quantitative trait. It is a classical and widely used breeding method in identifying the actual genes underlying a trait in breeding experiments. With the availability of rice genome sequence, QTLs provide useful relationships between genes in a QTL region and its corresponding trait [39,40]. In this work, we used QTLs in validating selected informative gene sets. We also used predicted protein-protein interactions to build a protein network of informative genes. Some crucial genes in the network with QTL evidence were studied by protein structure prediction, coexpression and gene regulatory motif analysis. Our computational study provides useful hypotheses for studying salt tolerance and may help improve molecular breeding of rice in salinity.

Results
Using the microarray data, GSE14403, to compare salinity susceptible and resistant rice genotypes with the Gene Expression Omnibus (GEO), we chose the threshold as 0.05 in t-test p-value and 0.5 in MergeValue (as described in the Method section) to obtain 556 probes in the improved volcano plot in Figure 1. We assume that many of these 556 genes are related to salt tolerance, and they are listed in Table S1. Table 1 lists the gene enrichment result using AgriGO [41], a plant-specific GO term enrichment analysis. In molecular functions, the chosen genes are over represented in the categories of iron binding, cation binding, ion binding and heme binding-all of which may be active due to the high ion concentration in salinity. The significant behavior of these genes in oxidoreductase activity may be related to electron transport in complex chemical reactions that balances the charges during ion transport. The oxidoreductase activity may also be related to reactive oxygen intermediates (ROI) that are produced in response to oxidative stress due to a water deficit during salinity stress [14]. ROI can seriously disrupt normal metabolism through oxidative damage on lipids [42,43], proteins and nucleic acids [42,44]. The increased oxidoreductase activity is consistent with known activation of the antioxidative enzymes such as catalase (CAT), ascorbate peroxidase (APX), guaicol peroxidase (POD), glutathione reductase (GR), and superoxide dismutase (SOD) under salt stress in plants [16]. In biological processes, the cellular nitrogen compound metabolic process is over represented. As proline and glycine betaine accumulate under stress, they are correlated with osmotic adjustment to improve plant salinity tolerance [45]. Proline is also involved in scavenging free radicals, stabilizing subcellular structures and buffering cellular redox potential under stresses. Polyamines can be synthesized under salt tolerance [46]. In this sense, we speculate these nitrogencontaining compounds may be synthesized in ''the cellular nitrogen compound metabolic process.'' In cellular components, according to the gene enrichment analysis, most of the chosen genes are related to vesicle and membrane, which is consistent with the detoxifying mechanism of salt resistant genotypes, especially in sequestration by vacuole [17]. It is plausible to infer that some of these chosen proteins on membranes act as transporting ions outside the cell or to the vacuole to maintain pH, transmembrane proton gradient [18], and high K+/Na+ ratio [19].
In order to assess the performance of improved volcano plot, we developed a Microarray-QTL test by using the QTL information as a criterion to evaluate the reliability of chosen genes. As shown in Table 2, our chosen genes are compared with the QTL regions mapped in the whole genome and these genes show high statistical We also compared the performance of other feature selection methods in Document S1. We constructed a rice salt tolerance protein interaction network using the 556 genes selected by improved volcano plot methods as the nodes and protein-protein interaction data from DIPOS as the edges [47]. By merging the isoforms, the network contains 189 nodes and 705 edges, as visualized by Cytoscape [48] in Figure 2.
By analyzing the constructed network, we identified 17 modules with nodes located in QTL and flanked QTL region. Including flanked regions could help give some tolerance to the errors in QTL mapping [49]. Table 3 shows the functions of these identified modules related to known salt tolerance mechanisms.
The largest module in this network contains 51 genes of 47 merged nodes and 372 edges. Figure 3 depicts the radical layout of this module and Table S2 shows the node annotation in detail. The GO enrichment analysis reveals related salt tolerance activities in Table 4, and protein phosphorylation represents the most significant function. It is known that protein phosphorylation plays a vital role in ion homeostasis under salinity stress in Arabidopsis [50,51]. Under the salinity stress, phosphorylation often becomes active in signaling pathways; for example, MAPK transduces salt and other abiotic stress signals. In rice, Os-MAPK5 as a kinase can be triggered by salt, drought, wounding, cold, and ABA, resulting in an increase in tolerance to these abiotic stresses [10]. The Na+/H+ antiporter SOS1 mediates Na+ efflux. SOS2, a Ser/Thr protein kinase with N-terminal kinase catalytic domain regulates the activity of SOS1. SOS3, which senses salt stressinduced Ca2+ signature, thereby activating SOS2 to transduce the salt stress signal. Abscisic acid-mediated phosphorylation also plays a significant role in many activities within cytoplasmic proteins in rice under salinity stress [52]. The largest module that we identified includes these genes and others related to phosphorylation in nucleotide binding and kinase activities.
At the transcription level, we also checked whether the 51 genes in the largest module are transcriptionally co-regulated by examining if the promoter regions of these genes share conserved motifs as the regulatory elements. Three candidate motifs are predicted, and only one motif is validated by sequence comparison with known cis regulatory motifs in the PLACE database [53] as shown in Figure 4 and Table S3. This conserved motif is detected as ''TCTCTCTCT'', the CTRMCAMV35S motif, which is a CT-rich motif found in a 60-nt region downstream of the transcription start site of the CaMV 35S RNA, and can enhance gene expression [54]. We also formed arbitrary reference gene sets selected randomly in whole genome scale and Chi-Square test on the validated motif showed significant statistical significance at pvalue of 4.72e-13 (Table S4). We also checked the co-expression among the genes in this module. The averaged Pearson correlation coefficient of expression profiles inside the module is 0.402, comparing to 0.241 between genes in the module and randomly selected 51 genes in whole genome (Document S2). The motif analysis and co-expression analysis provide some support that this module is likely co-regulated.
At the whole genome level, we mapped all 556 selected genes, QTL regions and extended QTL regions by their individual positions on the rice genome of MSU Rice Genome Annotation (Osa1) Release 6 [55] in Figure 5 using Circos [56]. Some selected genes are located in the QTL and extended QTL regions. We also mapped the 51 genes in the largest module on the genome, together with their interactions.
We mapped the 51 genes of the largest module to the KEGG pathway (http://www.genome.jp/kegg/pathway.html). Three genes can be mapped to their Arabidopsis orthologs in the Plant-Pathogen Interaction Pathway [57,58], as depicted in Figure 6. In this environmental adaptation pathway, CNGCs (Os02g0789100) is a cyclic nucleotide gated channel, CDPK (Os01g0622600) is a calcium-dependent protein kinase, and CaMCML (Os01g0505600) is the calcium-binding protein CML. All three of these proteins interact with each other and cooperate together in response to calcium ion signaling.
One of the most interesting genes is LOC_Os01g52640.3, which is a hub gene in the largest module and overlaps with a QTL region. This gene corresponds to a hypothetical protein   Os01g0725800, which interacts with 32 of the 51 proteins in the module. It contains four InterPro domains, namely, IPR000719, IPR001680, IPR011046, and IPR011009. IPR011009 domains can also be found in RIO kinase (IPR018935), a SPA1-related, serine/threonine-specific and tyrosine-specific protein kinase. This protein also has an ortholog in Arabidopsis thaliana as SPA4 (SPA1-RELATED 4), which is a binding protein and a signal transducer. We applied MUFOLD [59,60] to predict the structure for LOC_Os01g52640.3. Using the identified templates of 2GNQ, 3EMH, and 3DM0 in PDB, we constructed the model for the protein region of 196-627 for the protein with the length of 432, as shown in Figure 7. The protein structural model contains the WD40 structure motif repeats, each with a typtophan-aspartic acid (W-D) dipeptide termination. As WD40 proteins often play important roles in signal transduction and transcription regulation [61], the structure prediction suggests that this protein may be related to signal transduction in the salt resistance process. We also applied MUFOLD for structure predictions of hub proteins LOC_Os01g59580.1 and LOC_Os01g46720.1, each of which has a node degree of 30. According to the remote homology detection, LOC_Os01g59580.1 has a template 2QKW, which is in the process of phosphorylation (GO: 0006468), and plays a role as ATP binding (GO: 0005524), protein binding (GO: 0005515) and protein serine/threonine kinase activity (GO: 0004674) [62]. LOC_Os01g59580.1 may have these activities as well. LO-C_Os01g46720.1 has templates 1PKD, 3EZR, 2W06 and 2W17, which are involved in anaphase-promoting complex-dependent, proteasomal ubiquitin-dependent protein catabolic process (GO: 0031145), and cyclin-dependent protein kinase activity (GO: 0004693). The details of predicted structural models are described in Document S3.
Besides exploring the largest module related to phosphorylation activity in a systems biology point of view, we also explored the other 16 identified modules in a similar fashion. Among these modules, Module 3 has divergent functions, but other modules converge to consistent functions related to salt tolerance. Module 2 of cytochrome P450 [63], Module 6 of ubiquitin activity [64,65], and Module 16 of cytokinin dehydrogenase [66,67] are related to ABA mediated in salt tolerance. Module 4 of peroxidase precursor [68], Module 9 of flavonol synthase [69,70], and Module 15 of the aldo/keto reductase (AKR) family [71] are all related to ROSscavenging in salt tolerance. Module 13 of O-methyltransferase [72] is associated with sodium sequestration. Some previous experimental studies on the abiotic stress of plants based on gene expression patterns also linked salt tolerance to aspartic proteinase nepenthesin [73] in Module 5, starch [74] in Module 7, glucosyltransferase [73] in Module 8, glycosyl hydrolases [75] in Module 10, MYB family [73] in Module 11, gibberellin receptor GID1L2 [76] in Module 12, phosphatidylethanolamine-binding [77] in Module 14, and cysteine synthase [78] in Module 17. A detailed analysis on each of these 16 modules can be found in Document S4.

Discussion
In this paper, we focus on the salt tolerance mechanism of root tissue in rice. As there are commonly three samples in just one genotype of one condition on microarray experiments in contrast to tens of thousands of probe sets, it is a great challenge to determine feature selection on this small-sample, but highdimension data. Classical statistical feature selection methods, such as t-test assume the samples follow some specific distribution as its hypothesis; however, the limited number of samples narrows the usage of these statistical methods. From the feature selection prospective, the volcano plot method uses two dimensions of fold change and t-test p-value to select genes in microarray analysis. It is a fast, simple, and widely used method. However, the fold change of each differential expression gene does not necessarily reveal the nature of biological meaning. In this case, using machine-learning methods could be a good alternative in microarray analysis. The improved volcano plot method using some specific criteria like MergeValue based on an SVM-RFE procedure could improve the performance. The improved method used all three salt resistant genotypes as a whole to mine the common pattern of salt tolerance, which helps overcome the disadvantage of a limited number of samples. The improved  method also used a bootstraps approach to make the feature selection more robust. We incorporated the QTL information with transcription profiles to identify genes related to drought response. For a given QTL, there may be 25-30 genes per cM (,270 kbp in rice) [79]. Given so many possible genes in a QTL region associated to a phenotype, the proposed Microarray-QTL test, which used the same mechanism as GO term enrichment analysis, could help evaluate the relative relevance of these QTL genes to the phenotype quantitatively. Furthermore, we also combined predicted protein-protein interactions, protein structure prediction and gene regulatory motif analysis in studying potential genes related to salt tolerance. Such a systems approach is powerful in providing high-confidence predictions of salt-tolerant genes. Our study may provide richer and more concise predictions than a study done by Cotsaftis et al. [80], which only used the expression level of the gene probes in transcript profiling to predict salttolerant genes.
Walia et al. [5] summarized the following components in the salinity response based on their microarray study: 1) adaptive response, 2) non-adaptive response, 3) response to salt injury, and 4) heritable responses conferring tolerance. While these general categorizations are important, they do not provide a detailed mechanism, especially in terms of genes involved and these processes. Our study serves as an attempt to fill this gap. According to our constructed network, one kernel module of phosphorylation activity is detected. The role of phosphorylation in abiotic stress has been actively studied in recent years in addition to its relationship to salinity stress [81], cold stress [82] and heat stress [83]. Our result shows the central role of phosphorylation and redox action in salt tolerance, with implication of activity in signal transduction and oxidation. As our protein interaction networks were constructed from predicted interaction, our predictions of network modules may have significant false positives, which need further biological experiments to validate.
Our study may provide some useful hypotheses for researchers to design experiments for studying salt resistance and some guidance for molecular breeders to improve traits. Since some key proteins have been predicted and mapped to QTLs in our study, which means researchers could conduct experiments to clone and validate these genes. We plan to apply our computational pipeline to study other traits and other species.

Data Source
We obtained rice microarray data from Gene Expression Omnibus (GEO, http://www.ncbi.nlm.nih.gov/geo/), which contains 14 datasets, 182 platforms, 5,210 samples and 374 series   of Oryza sativa. Among these datasets, we used GSE14403 submitted by Ute Baumann January 13, 2009 and last updated August 2, 2012 to analyze salt tolerance. Unlike other datasets such as GSE3053 and GSE13735, this dataset contains the largest size of samples ever gathered related to salt tolerance of roots. The data were generated from Affymetrix Rice Genome Array (GPL 2025 in GEO), which contains 57,381 probes and each probe corresponded to an individual gene.
In the dataset GSE14403, we used salt resistant genotypes FL478, Pokkali and IR63731, and salt susceptible genotype IR29 under control and salinity-stressed conditions during vegetative growth, which ranged from GSM359902 to GSM359924. We merged these three salt-tolerant plants together as the salt-tolerant group. In the microarray experiments that collected the data [80], seedlings were cultured in sand and irrigated with a nutrient solution for 22 days (salt-treated) and 30 days (control) after germination, respectively. Salinity treatment was applied by adding NaCl and CaCl 2 (5:1 molar concentration) in two steps over a period of 3 days (final electrical conductivity: 7.4 dS m21) to prevent osmotic shock. All plants were harvested on day 30. All the data were collected from the root tissue of the plants. Table 5 gives the specific description of the data used, and the class label (21/1) is according to the different stress conditions. Table 6 shows 17 QTLs (13 unique QTLs) detected with salt tolerance in Gramene (http://www.gramene.org/qtl/). We mapped these QTLs regions using Gramene annotation of the rice genome of MSU Rice Genome Annotation (Osa1) Release 6 [55].
We obtained predicted protein-protein interaction data from Database of Interacting Proteins in Oryza Sativa (DIPOS) (http:// csb.shu.edu.cn/dipos/) [47]. This database used two different but complementary methods, i.e., interologs and domain interactions based methods to predict protein interactions for rice. DIPOS contains 14,614,067 pairwise interactions among 27,746 proteins, covering about 41% of the whole Oryaza sativa proteome.

Data Preprocessing
The preprocessing step intended to overcome the noises, including missing probes and mislabeled probes. We conducted the analysis using Bioconductor [84]. We used Bioconductor's affy package for estimation of expression values by Robust Multi-chip Average (RMA) [85]. The RMA procedure consists of three steps: a background adjustment, quantile normalization and summarization. As there are 123 probe sets designed for control in this microarray, the dataset excluded these probe sets and 57,258 valid probe sets were obtained for further analysis.

Choice of Gene List using Improved Volcano Plot
We proposed an improved volcano plot method to choose genes in this dataset. The improved method has a new measure MergeValue for selecting genes. The detail of this method is described in Document S5.

Validate Gene List from QTL
Firstly, we identified QTLs zone in the genome using Gramene QTL database (http://www.gramene.org/qtl/), where there are 17 QTLs ranging from chromosomes 1,3,4,5,6,7,9. Secondly, we mapped the candidate genes detected by the microarray probe sets to the genome. Thirdly, we defined a criterion named Microarry-QTLs (Eq.(1)) to evaluate the statistical significance of chosen genes related with the specific trait.
where N is the total number of valid genes in the microarray, m is number of valid genes covered by QTLs, n is the number of chosen genes by microarray feature selection method, and k is the number of chosen genes that are covered by QTLs. The Microarray-QTL test follows a Hypergeometric distribution, and the p-value reveals significance of chosen genes related with specific QTLs.

Motif Analysis on Promoters
To determine whether chosen genes are transcriptionally coregulated, a motif analysis could help to find conserved motif regulatory elements in their promoters. We first used MEME [86] to predict motifs on the upstream region of 1,000 bps from the translation start site of the chosen genes. Then we used FIMO [87] to conduct a Chi-squared test on the significance of these motifs in comparison to randomly selected genes in the genome. We compared the identified motifs with known plant motifs from the PLACE database [53] using CompariMotif [88]. For each pair of compared motifs, if their similarity score is more than 4 and the percentage of their matched positions is more than 80%, these two motifs were considered identical.