Using single-index ODEs to study dynamic gene regulatory network

With the development of biotechnology, high-throughput studies on protein-protein, protein-gene, and gene-gene interactions become possible and attract remarkable attention. To explore the interactions in dynamic gene regulatory networks, we propose a single-index ordinary differential equation (ODE) model and develop a variable selection procedure. We employ the smoothly clipped absolute deviation penalty (SCAD) penalized function for variable selection. We analyze a yeast cell cycle gene expression data set to illustrate the usefulness of the single-index ODE model. In real data analysis, we group genes into functional modules using the smoothing spline clustering approach. We estimate state functions and their first derivatives for functional modules using penalized spline-based nonparametric mixed-effects models and the spline method. We substitute the estimates into the single-index ODE models, and then use the penalized profile least-squares procedure to identify network structures among the models. The results indicate that our model fits the data better than linear ODE models and our variable selection procedure identifies the interactions that may be missed by linear ODE models but confirmed in biological studies. In addition, Monte Carlo simulation studies are used to evaluate and compare the methods.


Introduction
Gene regulatory networks (GRN) are complex and dynamic systems in nature.They are composed of genes that interact with each other and with other substances inside cells, such as RNAs and proteins.Over the past few decades, a variety of methods have been proposed to model GRN.Commonly used models include information theory models, Boolean networks, ordinary differential equation (ODE) models, and Bayesian networks [1].Information theory models [2][3][4] construct network architecture on correlation coefficients.Such models are simple and have a low computation cost, but cannot take into account the dynamic processes and situations when multiple genes participate in regulations.Boolean networks [5][6][7] are discrete dynamic networks and easy to understand, but have limitations because their networks' nodes are binary states: "off" or "on".Due to these simplifying assumptions, the study of kinetic gene regulation is still challenging because of the complexity of the biological process [8].
The Bayesian networks [9][10][11][12] integrate biological knowledge and measurements to infer network structures.But the estimated results obtained from Bayesian networks depend on the quality and completeness of prior knowledge.As pointed out by [13], the existing ODE models and associated methods used to study GRN are flexible but are limited to small scale gene expression levels.ODE models describe the dynamic behaviors of GRN in a quantitative manner and represent gene expression level changes by functions of gene expression levels: dX k ðtÞ dt ¼ Fðt; XðtÞ; θÞ; k ¼ 1; . . .; p; ð1Þ where X(t) = (X 1 (t), Á Á Á, X p (t)) T represents gene expression levels at the time t of the p genes; F(Á, Á, Á) is a function which can be linear or nonlinear; and θ is an unknown parameter vector which quantifies the regulations or interactions among the genes in GRN.
Once we can determine X(t), the gene expression levels which should be included in the ODE model (1), we can infer the interactions within a dynamic GRN.This motivates us to use appropriate models and to develop associated techniques in order to construct dynamic GRN for time course gene expression data.Within a dynamic GRN, the majority of the genes are not significantly relevant to each other.The precision of parameter estimation, model interpretability, and the accuracy of forecasting will be reduced when irrelevant genes are included in models [14].Thus, those irrelevant genes should be excluded from the final model.However, variable selection for ODE models using traditional statistical methods is important but challenging, especially when it comes to dynamic GRN.The difficulties arise from two aspects: one is the collinearity among genes, i.e., genes sharing same "pathway" are highly correlated in expressions; the other is the high-dimensional feature of GRN, i.e., a large-scale GRN involves hundreds or even thousands of genes.When the number (n) of measurements for individual genes is much smaller than the number (p) of genes, traditional statistical methods face significant challenges in developing statistical procedures and deriving theory [15].
Pioneering research has investigated gene regulatory networks using variable selection techniques.For example, [13] proposed linear ODE models: dX k (t)/dt = γ T X(t) and developed a variable selection procedure based on SCAD penalty.[13] further employed their method to construct a module-based dynamic network.However a linear ODE model has many limitations and is unable to capture certain patterns.In reality, the first derivatives of the gene expression profiles (the time-related changes of a gene expression) can be quantified as a function of gene expression levels of all related genes.The link functions that quantify the regulatory effects of genes on the first derivatives may be nonlinear.In other words, systems of cellular regulations may be nonlinear [1,16].Due to the limitations of linear ODE models, developing a flexible modeling approach to explore the interactions among genes has become necessary.When the linear assumption cannot be satisfied, it is natural to consider a singleindex model, E(Y|X) = η(X T β) with η being an unknown differentiable function and β an unknown parameter to be estimated.Single-index models have many advantages, such as being able to model the curvature of a smooth curve and circumventing the so-called "curse of dimensionality".More discussions about the usefulness of single-index models are provided in [17].A nonlinear ODE model (given the function η) may suffer from misspecification and "the curse of dimensionality", whereas single-index ODE models can avoid these two problems and are more flexible, and the index parameter (β) can be estimated with the root-n convergence rate though the link function is unknown.More importantly, single index ODE models allow the predictors to have interactions, which is common in characterizing gene-gene regulation.
Various methods have been proposed to estimate regression coefficients for single-index models.See [18][19][20][21][22][23] for parameter estimators.In addition, much research has been done on variable selection for single-index models.For example, [24] developed a variable selection method based on sliced inverse regression.[25] proposed a leave-m-out cross-validation method to select variables in a single-index model.[26] proposed semiparametrically efficient profile least-squares estimators for parameter estimation, and employed the SCAD approach to simultaneously select variables and estimate regression coefficients.[27] studied estimation and variable selection coupling with dimension reduction procedures.
Although parameter estimation and variable selection for single-index models have gained fruitful results, to the best of our knowledge, no method that couples single-index models with ODE to study dynamic GRN is available.In this paper, we propose a single-index ODE model to study dynamic GRN with the aim of overcoming the inadequacy of linear ODE models.This model can be written as where η k (Á) is an unknown differentiable function; β ½k 0 is a parameter vector with and the first element of β ½k 0 is positive (for identifiability), where k Á k denotes the Euclidean norm.X(t) = (x 1 (t), Á Á Á, x p (t)) T are state functions.Here X(t) can be gene-expressing levels of genes or population mean curves for functional modules.To study the interactions within dynamic GRN, one needs to identify the relevant X(t) for ODE models, that is β ½k 0 6 ¼ 0. We therefore apply the penalized least-squares approach for this aspect and for estimating dynamic parameters β ½k 0 .We will apply the mixed-effects nonparametric model with a mixture distribution framework to cluster the genes into functional modules in the first step.This clustering approach allows us to build the module-based dynamic network and identify the interesting functional modules.These interesting modules may play important roles in 'dynamic' regulations.Although these interesting modules may contain many genes with heterogeneous functions, it can allow scientists to focus on the genes in each module for further investigations.As shown in Fig 1 (below), most gene expression levels can be grouped in several clusters.In each cluster, these expression levels share a similar pattern.The genes in a cluster (represented by a node) may play a common function in biological procession.Such a network can single out regulator-regulator interactions which are helpful to avoid tedious experiments and to speed biological studies.
In Section of Methods, we briefly describe the procedure for GRN construction with details for penalized profile least-squares (PPrLS) estimation and variable selection.In Section of Numerical Results, we construct a module-based GRN structure by using PPrLS estimator for the yeast cell cycle gene expression data with additional results (S1 and S2 Tables), and conduct Monte Carlo simulation studies to evaluate the performance of the proposed procedure.The simulation settings were designed to mimic the gene expression patterns from the real data example.In Section of Discussions, we conclude the article with a brief discussion.All theory and associated technical details are given in the supporting materials (S1-S7 Files).

Methods
Time-course gene expressions are synchronized to many ongoing biological processes such as tissue repair, cell differentiation, or cell cycles [28,29].Through understanding the genes underlying the cell cycles, we can study the mechanisms of many diseases at a molecular level and in turn provide potential drug targets for treating those diseases.From the end of the last century, the identification of cell cycles associated genes has attracted considerable attention in biological study.For example, [28,30] performed genome-wide transcriptional analysis of the cell cycle process of yeast using microarrays and identified about 800 cell-cycle-regulated genes.GRN include genes, the products of genes, and the interactions among them, which together affect many cellular processes.To understand the dynamic mechanism of cellular processes, modeling and analysis of dynamic gene regulatory networks using time-course gene expression data has attracted much attention.We model dynamic network for functional modules based on observed time course gene expression levels in three steps.Step 1. Group genes into functional modules using the smoothing spline clustering (SSC) approach [31][32][33].Final number of clusters was selected by the Bayesian information criterion (BIC), and penalty parameters were determined by the leave-one-out cross-validation procedure (GCV); Step 2. Estimate state function X(t) and first derivative X 0 (t) for each functional module using nonparametric mixed-effect models (NPME) and the spline method respectively; Step 3. Select modules and estimate dynamic parameters using the PPrLS procedure given below, for which tuning parameter was selected by the BIC, and bandwidths were determined by the GCV.
We now describe the details for these three steps. Step

1-Clustering process
We assume the time-course gene expression levels for gene i can be represented by a smooth function of time and follow a mixture Gaussian distribution: where p k , k = 1, Á Á Á, p are the probability that gene i belongs to cluster k; μ k , k = 1, Á Á Á, p and S k , k = 1, Á Á Á, p are the vector representations of the mean curve and variances components for each cluster respectively.The gene expression levels for individual genes are assumed to follow an overall mean curve (fixed-effect) while having a gene-specific shift (random-effect).Therefore nonparametric mixed-effect model can be constructed by fitting the time-course gene expressions for each gene to a function over time by using the smoothing spline method.
Through maximizing the penalized log-likelihood, the SSC procedure estimates the probabilities p k .The means μ k and variances component S k can be estimated as by-products also.More details about the SSC procedure are available in [32] and [33]. Step

2-Applications of nonparametric mixed-effect models
After grouping genes into functional modules, we apply NPME models to estimate the state function X(t) and its first derivative X 0 (t) for each functional module.For notation simplicity, we consider the estimation of the state function and its first derivative for module 1 (k = 1) and denote them by X(t) and X 0 (t) respectively.Suppose the number of genes in module 1 is m, and the number of measurements collected from each gene is m i .The NPME model can be described as where g i (t) is the observed gene expression level for the i th gene; X(t) presents the fixed-effect or population curve which reflects an overall time-related trend of the gene expression level for module 1; v i (t) describe individual curve variations; ε i (t) are measurement errors; and v i (t) and ε i (t) are assumed to be independent.We can combine the penalized spline [34][35][36] with the linear mixed-effects (LME) modeling framework [37] to approximate X(t).For presentation completeness, we briefly summarize the estimation procedure.We first approximate X(t) and v i (t) by e XðtÞ and e v i ðtÞ, respectively, which are expressed as: The approximation of model ( 4) can be expressed as 6) is a standard LME model.As a result, α, b, u and w can be estimated by using the function lme (available in the R package nlme).Substituting the estimated b α and b u in Eq (5), we estimate the X(t) for module 1.After estimating X(t), we apply the spline method (available in the R package splines) to estimate the first derivative of b XðtÞ.The detailed estimation procedure is referred to [38] and [39].
Step 3-Estimation procedure based on the penalized profile least-squares approach Suppose a genome-wide time course gene expression levels were clustered into p modules; X j (t), j = 1, Á Á Á, p are the population mean curves estimated by NPME models; and b X 0 k ðtÞ are the estimates of the first derivative dX k (t)/dt for the k-th module.Substituting X j (t), j = 1, Á Á Á, p and the first derivative b X 0 k ðtÞ for the k-th module in model ( 2), we obtain a single-index ODE model for the k-th functional module which can be written as T , and ε is the sum of numerical errors due to integration and estimation.
This complexity ε makes it challenging to study the properties of the proposed estimator for β 0 s, For simplicity, we adopt an additive error model used in the literature [13,40,41].Once the X(t) can be identified, we construct a module-based network.Here we develop the variable (population mean of functional modules) selection and estimation procedure for model (7) based on the penalized profile least-squares approach as follows.
Selecting variables by penalized least squares has been widely studied in literature.See, for example, the least absolute shrinkage and selection operator (LASSO) [42], the smoothly clipped absolute deviation (SCAD) approach [43], the adaptive lasso estimator [44], the elastic-net estimator [45] and the adaptive elastic-net estimator [46].However, the variable selection problem for single-index ODE models has not been addressed in the literature.In this paper, we extend the approach proposed by [26] to the single-index ODE model (7).
Let p be the number of all modules; X i = (X 1 (t i ), . .., X p (t i )) T Let L i ¼ X T i β ½k .η k (u) can be estimated utilizing the local linear regression method [47], i.e., minimizing with respect to a k and b k , where The above estimation procedure can be used when the true model is known a priori.Because we wish to identify GRN structure and enhance the predictive power of a proposed model, we apply the penalized least-squares approach to simultaneously select modules and estimate parameters.Define a penalized profile least-squares (PPrLS) function where p λ [k](Á) is a penalty function with a regularization parameter λ [k] .The PPrLS estimator of β [k] is the minimizer of Eq (12); i.e., b For a given tuning parameter λ [k] , we can estimate β [k] by minimizing L P ðβ ½k Þ with respect to β [k] .By determining non-zero β [k] , we identify the modules having impacts on the kth module and therefore construct GRN.
There are various penalty functions in the literature of variable selection for semiparametric models.Considering the SCAD method has many good theoretical properties, we adopt the SCAD penalty function [43], and adopt BIC selector proposed by [48] to choose the regularization parameters λ [k] by minimizing the following objective function: is the number of nonzero coefficients of b β ½k l ½k , the PPrLS obtained from (12) for each λ [k] .Remark.Although the proposed method needs three steps to implement and its computational cost is high, compared to the existing methods, its gain in computational efficiency is significant.Most of dynamic network models such as dynamic Bayesian networks and random graph models require extensive computations for posterior inference.As a result, Bayesian based methods allow one to deal with only small networks.The proposed method can avoid numerically solving the differential equations directly, and does not need the initial or boundary conditions of the state variables.The method also incorporate the high-dimensional ODEs to allow us to perform variable selection and parameter estimation for one equation.These good features gain computational efficiency.

Real data analysis
We used the procedure introduced in Section of Methods to analyze a time-course yeast cell cycle gene expression data set.These 297 genes were identified as expressions across 18 time points during approximate two cell cycles; i.e., each gene has 18 time-related observations [49].
We implemented Step 1 using the MFDA function (available in the R package MFDA), and identified 12 functional modules.The population mean curves for the functional modules are given in Fig 1 .We can see that for each functional module, the genes included share a similar pattern.These time-related patterns show two cell cycles (Fig 1).The number of genes included in each functional module ranges from 9 to 53.
In order to construct a functional landscape of the genome-wide regulatory network through identifying interactions among modules, we used the Database for Annotation, Visualization and Integrated Discovery [50,51] to identify enriched functional annotations in Gene ontology and Kyoto Encyclopedia of Genes and Genomes pathways for each functional module.A modified Fisher exact test was used to test the null hypothesis that a certain function is not over-represented in the module compared to the background population.Due to space limitation, we displayed part of the selected functional annotations in Table 1.All enriched functional annotations were provided in S1 Table.
As shown in Table 1, the function annotation analysis suggested that genes in the identified functional modules participate in broad biological process such as cell cycle, DNA replication or packaging, meiosis, regulation of transcription etc.For example, module 3 was highly enriched in DNA packaging; module 7 was enriched in cell-division cycle and mitosis; and DNA metabolic process was related to module 12.Although each functional module has multiple enriched annotations, but most annotations can be grouped into one or two clusters.
After grouping genes into functional modules, we applied step 2 to all functional modules, and obtained X i (t) and b X 0 i ðtÞ; i ¼ 1; Á Á Á ; 12. Following the data augmentation strategy used in [13,52,53] and [54], we selected 300 time points from X i (t) and the first derivative b X 0 1 ðtÞ for the module.Therefore, the sample size is N = 300.After substituting the estimates into singleindex models, we built the full model for module 1, for instance, as follows.
where the response variable y 1 ¼ b X 0 1 ðtÞ, the estimated first derivatives; X(t) = (X 1 (t), . .., X 12 (t)) T are the population mean estimates of 12 functional modules; and β ½1 0 ¼ ðb ½1 01 ; . . .; b ½1 012 Þ T .Applying the PPrLS procedure given in step 3 to model (15), we detected significant variables X(t) and obtained nonzero b β ½1 .As a result, we identified the modules related to the gene-expression changes of the module 1.For a comparison, we also fitted y to X(t) by using a linear ODE model [13] We also selected XðtÞ and estimated β ½1 L0 ¼ ðb ½1 L01 ; . . .; b ½1 L012 Þ T by applying the SCAD method to the linear ODE model (16).
Applying the procedure to all functional modules, we constructed a regulatory network among modules (Figs 2 and 3) and estimated their corresponding dynamic coefficients by both single-index and linear ODE models.
To compare the results provided by the single-index ODE and linear ODE models, we summarized the inward (significantly impact on) and outward (impacted by) regulatory relationships between modules in Table 1.The number of genes in each module was displayed in the parentheses.One can see that the residuals of sum squares (RSS) of single-index ODE models were smaller than those of the linear ODE models.We can also observe that the single-index ODE models selected more modules than the linear ODE models did.For example, the singleindex ODE model indicated that module 2 was impacted by modules 1, 2, 7, 8, 9 and 12 of which only modules 1, 7 and 9 were selected by the linear ODE model.Both linear ODE and single-index ODE indicated that modules 3, 7 and 8 were important because they regulated more than 50% modules.We also noted that module 8 only included 9 genes.Further experiments are needed to explore these new discoveries in biological progression.

A simulation study
In this part we conducted Monte Carlo simulation studies to validate the proposed procedure for the single-index ODE models.Due to the intensive computational cost, we designed a system with 7 ODEs, which include following linear and nonlinear forms.The simulation settings cellular carbohydrate biosynthetic process (0.006),

10E-03
2.67E-04 module8 (9) cell cycle (0.007),regulation of cell cycle (0.025) 3, 5, 7, 8 Glycosylation (< 0.001), mitotic cell cycle (< 0.001), nuclear division (< 0.001) regulation of cell cycle (< 0.001), regulation of cell cycle process (0.001) cell cycle (0.043),nuclear migration along microtubule (0.012) mitotic recombination (< 0.001), DNA metabolic process (< 0.001) are data-driven because the gene expression pattern show sine and cosines patterns (Fig 1) Given initial values X p0 , p = 1, . .., 7, we can numerically solve the above ODE system and obtain the numerical solution X p (t), p = 1, . .., 7.In this simulation study, we first generated where e p follows N(0, 1) and X 0 = (0.7628, 0.6789, 1.2351, 0.6170, 2.7800, 0.2906, 0.4441).We then numerically solved the ODE system (17) and output X p (t), p = 1, . .., 7, using three different schedules: equally spaced time points on the ranges of [0, 18], but three different intervals between time points.As a result, we simulated seven population means X p (t i ), p = 1, . .., 7, i = 1, . .., N with sample sizes N = 180, 288, 360.After generating the population mean curves, we used the spline method to estimate the first derivatives, which are denoted by b X 0 p ðtÞ, p = 1, . .., 7.For notation simplicity, we gave the model structure and estimation procedure for the first ODE (k = 1).The same procedure can be applied to the rest of the ODEs.Substituting the generated X p (t), p = 1, . .., 7 and estimated b X 0 1 ðtÞ into the single-index ODE models, we obtained the largest model for the first ODE as follows: where Applying the procedure given in Step 3, we selected X(t) and estimated β ½1 0 for the first ODE.Applying the same procedure to the other six ODEs, we estimated β ½k 0 , k = 2, . .., 7.As a result, we constructed GRN for the simulated functional modules.We repeated the same procedure 100 time and summarized and ARE q ¼ P 100 j¼1 j b b qj À b 0q j jb 0q j ; q ¼ 1; . . .; 14, where b b qj is the estimated β q for j th iteration.In Table 2, "overfitted (O)" represents extra variables; "underfitted (U)" represents incorrectly deleting necessary variables.We can see that the PPrLS method can correctly select the variables for most cases in terms of the number of correctly fitted model.Larger sample sizes lead to better performance.For ODEs with a linear form, namely ODE1, ODE4, and ODE7, both variable selection and parameter estimation procedures have good performance when the sample size is 180.For the nonlinear case, with the increase of the sample size, both variable selection and parameter estimation tend to work better.In addition, we reported the 10% trimmed MSE and ARE (discarding 5% of the lowest and the highest values).Meanwhile, we constructed networks among simulated functional modules for each iteration (see Figs 4,5 and 6).The thick lines represents true connection, and the numbers present the times which were found by our method in 100 iterations.From Figs 4, 5 and 6, we can see that the constructed GRN match the true network in most cases.

Conclusions and discussions
In this paper, we have proposed single-index ODE models and developed a procedure to select variables and estimate parameters.The procedure has further been used to analyze a timecourse data set with the aim of exploring the module based and regulator-regulator interactions.We found the interactions identified by using single-index ODE were more accurate, i.e., the linear ODE models overlooked some confirmed regulator-regulator interactions [55].We took module 12 as an example.MBP1 is a DNA-binding protein that forms MBF complex; a protein complex that binds to the Mlu1 cell cycle box promoter element.[56,57] showed that MBP1 is topologically related to transcription factors, including SWI4 in Saccharomyces cerevisiae.In addition, there is physical and genetic evidence that MBP1 interacts with SKN7, a transcription factor [58].These two interactions are identified as potential interactions in module 12 by single-index ODE models, but are overlooked by the linear ODE models.The advantages of our method include (i) single-index ODE models can fit the data better than linear ODE models; (ii) the interactions found by single-index ODE models can cover most of the interactions identified by linear ODE models for some of the modules; and (iii) our method is computationally efficient because we can select significant modules and estimate index coefficients simultaneously.
Similar to the linear ODE model, our method needs estimated population means and their corresponding first derivatives, which may be treated as the limitation of the proposed procedure.The PPrLS estimator has good performance in identifying significant modules.But new stable techniques are still needed to group genes to reduce the gene cluster uncertainty because cluster assignment still plays an important role in enhancing the usefulness of this research.It is worth noting that the PPrLS estimates may not be most efficient in terms of estimation accuracy because PPrLS estimation is a nonparametric method and inherits error if the data  contain a large noise.Regulator-regulator interaction exploration depends on the knowledge of gene-regulator relationship, which we study.So the proposed method may provide valuable insights into complicated biological processes with understanding gene-gene and gene-regulator relationships.Overall, our procedure is useful to single out high level (module based) and potential regulator-regulator interactions which are helpful to provide guidance for tedious and costly experiments.

Fig 1 .
Fig 1.The scatterplot of gene expressions against time (the same color for each individual gene in a module) and the population mean curve (solid line) of 12 modules for the time course yeast cell data set.https://doi.org/10.1371/journal.pone.0192833.g001 be the vector representations of the mean curves of p function modules and the estimates of the first derivative dX k (t)/dt for the k-th module.Assume the functional data of the kth module follows the single-index ODE model /doi.org/10.1371/journal.pone.0192833.t001

Fig 2 .
Fig 2. The GRN identified by the linear ODE models for the time course yeast cell data set.Each node represents a module and the arrows presents the direction of influence.https://doi.org/10.1371/journal.pone.0192833.g002

Fig 3 .
Fig 3.The GRN identified by the single-index ODE models for the time course yeast cell data set.Each node represents a module and the arrows presents the direction of influence.https://doi.org/10.1371/journal.pone.0192833.g003

Fig 4 .
Fig 4. The constructed gene regulatory networks for simulation studies with N = 180 and 100 iterations.Solid lines: the true connections, numbers present: the times correctly identified using our procedure in 100 iteration, dots line: incorrectly identified connections.https://doi.org/10.1371/journal.pone.0192833.g004

Table 2 . The simulation results for the SCAD method for scenarios with different sample sizes based on 100 replications.
The simulation results for the SCAD method for scenarios with different sample sizes based on 100 replications.Correctly fitted (C); underfitted (U); overfitted(O).