System identification of signaling dependent gene expression with different time-scale data

Cells decode information of signaling activation at a scale of tens of minutes by downstream gene expression with a scale of hours to days, leading to cell fate decisions such as cell differentiation. However, no system identification method with such different time scales exists. Here we used compressed sensing technology and developed a system identification method using data of different time scales by recovering signals of missing time points. We measured phosphorylation of ERK and CREB, immediate early gene expression products, and mRNAs of decoder genes for neurite elongation in PC12 cell differentiation and performed system identification, revealing the input–output relationships between signaling and gene expression with sensitivity such as graded or switch-like response and with time delay and gain, representing signal transfer efficiency. We predicted and validated the identified system using pharmacological perturbation. Thus, we provide a versatile method for system identification using data with different time scales.


Introduction
In intracellular signaling systems, information of an extracellular stimulus is encoded into combinations of distinct temporal patterns of phosphorylation of intracellular signaling molecules that are selectively decoded by downstream gene expression, leading to cell fate decisions such as cell differentiation, proliferation, and death [1,2]. For instance, in rat adrenal pheochromocytoma PC12 cells, nerve growth factor (NGF) induces cell differentiation mainly through sustained phosphorylation of ERK [3][4][5][6], whereas pituitary adenylate cyclase-activating polypeptide (PACAP) induces cell differentiation mainly through cAMP-dependent CREB phosphorylation [7][8][9][10][11]. We showed that cell differentiation in PC12 cells can be divided into two processes: a latent process (0-12 h after the stimulation) in preparation for neurite extension and a subsequent neurite extension process (12-24 h) [12]. We identified the three genes essential for cell differentiation, Metrnl, Dclk1, and Serpinb1a, which are induced during the latent process and required for subsequent neurite extension, and named LP (latent process) genes [9]. Although NGF and PACAP selectively induce the different combinations and temporal patterns of signaling molecules, both growth factors commonly induce the LP genes [9]. The expression levels of LP genes, but not the phosphorylation level of ERK, correlate with neurite length regardless of growth factors [9], indicating that the LP genes are the decoders of neurite length. Thus, how the distinct patterns of signaling molecules are decoded by LP gene expression is critical for understanding the unknown mechanism underlying cell differentiation in PC12 cells. Decoding the combinations and temporal patterns of signaling molecules by downstream gene expression is a general mechanism underlying various cellular functions [1,2,13].
Mathematical modeling is useful for the analysis of decoding mechanisms [14]. If the signaling pathways are well characterized, kinetic modeling based on biochemical reactions reported in the literature is often used [15][16][17]. For example, growth factor-dependent ERK activation in PC12 cells has been modeled by the kinetic model based on prior knowledge of pathway information [18][19][20][21][22][23]. In general, however, decoding by downstream genes involves more complex processes such as transcription and translation and information on the precise pathway is not available.
To identify decoding mechanisms by gene expression, the system identification method (also referred to as data-driven modeling) was developed for identifying quantitative inputoutput relationships from time series data without detailed knowledge of signaling pathways [15][16][17][24][25][26]. We previously developed a system identification method based on time series data of signaling molecules and gene expression, denoted as the nonlinear autoregressive exogenous (NARX) model, and applied it to the signaling-dependent immediate early gene (IEG) expression during cell differentiation in PC12 cells [10]. The NARX model involves the determination of lag-order numbers and use of the Hill equation and the linear autoregressive exogenous (ARX) model [10]. Determination of lag-order numbers infers the selection of input molecules (Input) for an output molecule (Output), which is referred to as the Input-Output (I-O). The Hill equation characterizes sensitivity with a nonlinear dose-response curve [27]. The linear ARX model characterizes temporal changes with time constant and gain, the latter of which is an I-O amplitude ratio, and indicates signal transfer efficiency [28]. The advantages of the NARX model rather than kinetic model is systematically presentation of model candidates without prior knowledge of signaling pathway and less computational cost for parameter estimation. However, the NARX model requires equally spaced dense time series data. If the time scale between upstream and downstream are similar, such as signaling molecules (scale of tens of minutes) and IEG expression (a few hours) in PC12 cells, it is not difficult to acquire a sufficient number of equally spaced dense time series data [10]. However, if the time scale of upstream and downstream molecules is largely different, such as signaling molecules (tens of minutes) and LP gene expression (a day) [29], it is technically difficult to obtain sufficient equally spaced dense time series data because of experimental and budget limitations.
Measuring gene expression often requires a longer time scale than measuring protein phosphorylation. Obtaining equally spaced dense time series data with a longer time scale is labor and cost intensive, because, unlike live-cell imaging experiments, snapshot experiments such as western blotting, RT-PCR, and quantitative image cytometry (QIC) [12] require the same number of experiments as the number of time points. In addition, experimental noise and variation increases as the number of experiments increases because differences in experimental conditions such as plates, gels, reagents, and cell culture conditions increase as well. Therefore, in reality, for a longer time scale experiment, unequally spaced sparse time series data rather than equally spaced dense time series data are desired. For example, under conditions in which stimulation by cell growth factors triggers rapid and transient phosphorylation and slow and sustained gene expression, time series data should be obtained with dense time points during the transient phase and eventually with sparse time points. The timing and dynamic characteristics of temporal changes may differ between upstream and downstream molecules, such that time points and intervals for measuring upstream and downstream molecules may be different. Thus, a system identification method using unequally spaced sparse time series data with different time scale needs to be developed.
To solve this problem, here we used the signal recovery technique based on a low-rank approach proposed in the field of compressed sensing to generate a sufficient number of time points for equally spaced dense time series data from unequally spaced sparse time series data with different time points and intervals. We applied this nonlinear system identification method to the signaling-dependent gene expression underlying cell differentiation in PC12 cells and identified the signaling-decoding system by gene expression.
Unequally spaced sparse time series data can be regarded as equally spaced dense time series data with missing time points, and therefore we can generate equally spaced dense time series data by applying a signal recovery technique, which has been studied in the field of compressed sensing [30,31]. Compressed sensing is a signal processing method for efficient data acquisition by recovering missing signals/images from a small number of randomly sampled signals including unequally spaced sparse data based on sparseness of a vector [32] or low rankness of a matrix [33]. Both the sparse approach and the low-rank approach have been applied to various fields, such as sampling and reconstructing magnetic resonance images [34,35], super-resolution imaging [36,37], image inpainting [38,39], and collaborative filtering [40]. In this study, we applied a matrix rank minimization algorithm [41] to recover missing time points from unequally spaced time series data, and we generated equally spaced time series data with the same time points from signaling and gene expression data with different time scales. We previously developed a system identification method from equally spaced dense time series data of signaling and gene expression using the NARX model [10]. We developed a new system identification method from unequally spaced sparse time series data with different time scales by integrating this signal recovery method using the matrix rank minimization algorithm [41] and the NARX model [10]. We applied the method to system identification of signaling-dependent gene expression in cell differentiation in PC12 cells, revealing a selective signaling-decoding mechanism by gene expression.

Results
Signal recovery using compressed sensing from unequally spaced data In this study, we regarded unequally spaced sparse time series data as equally spaced dense time series data with missing time points, and equally spaced time series data were generated by restoring missing time points using a low-rank approach [41]. In the low-rank approach for image recovery, we assumed that the value of each pixel is represented by a linear combination of its neighbor pixels, which is mathematically represented by an autoregressive (AR) model. Then a Hankel-like matrix composed of pixel values has a low rank because each column is represented by the linear combination of the other columns (Fig 1 and S1A Fig). This means that the Hankel-like matrix is a low-rank matrix whose rank is determined by the system order. Missing data can be recovered by estimating missing elements of the matrix so that the rank of this matrix [Y] is r. When system order r is unknown, based on the idea that the system can be described with as few parameters as possible, missing elements of this Hankel-like matrix are recovered so as to minimize the rank of the matrix [Y]. Based on the low rankness of the Hankel matrix, the signal recovery problem of the missing pixels can be formulated as a matrix rank minimization problem, and we can restore an image by solving this problem [38,39] (Fig 1).
We performed system identification from unequally spaced time series data of input molecules (Inputs) and output molecules (Outputs). Although an AR model is used for image recovery, we used an ARX model where the value at a time point is represented by a linear combination of two kinds of signals, Inputs and Outputs. Therefore, we modified the rankminimization-based signal recovery method of the AR model to the ARX model and performed system identification (Fig 1 and S1B Fig). Several methods for system identification using a linear ARX model with signal recovery of missing points of input and output based on matrix rank minimization have been proposed [42]. They can recover missing time series input-output data even when missing time points of input are not equal to those of output.
However, we cannot directly apply the method because we used the NARX model rather than the ARX model due to the nonlinearity of signaling-dependent gene expression [10,43]. Therefore, by combining the nonlinear ARX system identification method [10] and the signal recovery method based on the matrix rank minimization problem [41], we derived the signal recovery algorithm applicable to the nonlinear ARX system and performed system identification using recovered equally spaced time series input-output data (see "NARX Model and Data Representation" and "Extension ARX system identification from unequally spaced time series data to the NARX system" sections in Materials and methods).

System identification by integrating signal recovery and the NARX model
In the NARX model used in our previous work, time series data of Inputs are nonlinearly transformed using the Hill equation, which are then used as inputs for the ARX model [10] (Fig 2A). The Hill equation, which is nonlinear transformation function f(x) widely used in biochemistry [27], can represent sensitivity with a graded or switch-like response by the values of n and K (Fig 2A). The ARX model in the NARX model can represent how the Output efficiently responds to the temporal change of the nonlinearly transformed Inputs by the time constant and gain (Fig 2A). Thus, from the estimated parameters of the Hill equation and ARX model, the sensitivity with graded or switch-like response and the time constant and gain are obtained, respectively. In this study, the parameters of this NARX model were estimated using a signal recovery scheme based on a low-rank approach [41], as follows ( Fig 2B, see details in "Procedure for system identification by integrating signal recovery and the NARX model" section in Materials and methods). An example of recovered equally spaced image data from unequally spaced image data by the signal recovery technique using rank minimization of the Hankel-like matrix Y composed of signals in pixels. We assume that the value of each pixel is represented by a linear combination of those of its neighboring pixels, which is mathematically represented by an autoregressive (AR) model. The original picture is published under the Creative Commons Zero license in https://www.pexels.com/photo/animal-black-and-white-cute-funny-164703/. (Bottom) An example of recovered equally spaced time series data from unequally spaced time series data by the signal recovery technique using rank minimization of the Hankel-like matrices Y and U, composed of time series data of input molecules (Inputs) and output molecules (Outputs), respectively. We assume that the value at a certain time is represented not only by the linear combination of values of the output molecule at past points but also by the linear combination of the values of the input molecule at past points, which is mathematically represented by an autoregressive exogenous (ARX) model. The recovered time series input-output data have the equally spaced time series data with the same time points even if the missing time points of input and output are different.  To estimate the I-O relationship, we selected a combination of Inputs for each Output and prepared a data set of all combinations of Inputs for each Output. Each data set was divided into test dataset for one stimulation condition and training data set for the rest of two stimulation conditions, leave-one-out (LOO) cross-validation was performed. We estimated the parameters of the NARX model (the NARX parameters) for the training data set in each Input-Output combination using the following method. First, the initial values of n and K are given by n = 1 and a random number, respectively, and nonlinear transformation of input unequally spaced time series data by using the Hill equation was performed (Fig 2B, step i). A Hankel-like matrix was constructed from unequally spaced time series Output data and from unequally spaced time series Inputs nonlinearly transformed by the Hill equation. Next, signal recovery was performed with an iterative partial matrix shrinkage (IPMS) algorithm to minimize the rank of the Hankel-like matrix composed of Output and Inputs transformed by the Hill equation [41] (Fig 2B, step ii). The rank of the recovered Hankel-like matrix corresponds to the lag order; from the recovered Hankel-like matrix, the parameters of the ARX model a and b were uniquely obtained ( Fig 2B, step iii; S1B Fig). Further estimation of n and K was performed by using recovered data, ARX parameters obtained until step iii, and other combination of n and K given random numbers ( Fig 2B, step iv). By using the inverse function of the Hill equation, we recovered the missing time points data of input before transformation by the Hill equation. For the other 200 combinations of n and K given by random numbers, we performed simulation of the NARX model using the recovered data, ARX parameters obtained until step iii, and the given combination of n and K.
We calculated the Akaike information criterion (AIC) from the residual sum of squares between the experiment and simulation, number of parameters, and number of data to determine the parameters n and K. AIC is a measure of the relative quality of statistical models based on the trade-off between the goodness-of-fit of the model and the complexity of the model [44]. In step iv, we selected the combination of n and K with the minimum AIC and carried out signal recovery again using these n and K. We repeated steps i-iv 500 times, and selected the n and K and ARX parameters that minimize AIC for the training data set AIC training in total ( Fig 2B, step v). Let parameters with minimum AIC training be parameters obtained from the training data set ( Fig 2B, step v). Once these parameters were obtained, test data (still unequally spaced time series data) was added to the recovered Hankel-like matrix and signal recovery of the test data was performed ( Fig 2B, step vi). With the parameters of the NARX model estimated from the training data set, we simulated the NARX model for test data and calculated RSS s LOO , the residual sum of squares between experiment and simulation for test data set by stimulation condition s (NGF, PACAP, or PMA) ( Fig 2B, step vii).
Because RSS s LOO was obtained for each combination of training and test data set s, we took the sum of RSS s LOO for test data set s as RSS LOO (Fig 2B, step viii). We obtained the combination of Inputs as the identified I-O relationship that minimizes RSS LOO for all combination of Inputs ( Fig 2B, step ix). The I-O relationship indicates that a set of Inputs are selected as upstream molecules for each Output. In the final step, using this combination of input molecules, the parameters of the final NARX model were estimated by the procedure from step i to step v using all stimulation conditions as training data sets ( Fig 2B, step x). Note that we used two different criterions AIC training and RSS LOO ; AIC trainig to determine n, k, and ARX parameters, and RSS LOO to select Inputs in order to save computational cost.
These estimated NARX parameters were used for further study (Figs 3-6). The sensitivity with graded or switch-like response was obtained from the parameters of the Hill equation, and the gain and time constant were obtained from the parameters of the ARX model ( We applied this method to identify the signaling-decoding system by gene expression underlying cell differentiation in PC12 cells using unequally spaced time series data with different time scales.

System identification of signaling-dependent gene expression
We stimulated PC12 cells by NGF, PACAP, and PMA and measured the amount of phosphorylated ERK1 and ERK2 (pERK) and CREB (pCREB) and protein abundance of products of the IEGs, such as c-Jun, c-Fos, Egr1, FosB, and JunB by using QIC [45] (Fig 4). We chose these growth factors because they use different signaling pathways: NGF, PACAP, and PMA use Ras-, cAMP-, and PKC-dependent signaling pathways, respectively [7,11,46,47]. We also measured mRNA expression of LP genes such as Metrnl, Dclk1, and Serpinb1a using qRT-PCR (Fig 4). We measured the signaling molecules and gene expression with different sets of the time points because of the different time scales of temporal changes in signaling molecules and gene expression (Fig 4). Using the unequally spaced time series data with the different sets of the time points, we performed the system identification using integration of signal recovery and the NARX model (Fig 5A-5C).
Using these time series data sets, we selected three sets of Inputs-Outputs combinations from upstream to downstream and performed system identification for each set ( Fig 5A). The system identification consists estimating the I-O relationship, dose-response by the Hill equation, and gain and time constant by the linear ARX model (Fig 3, see "Calculation of gain and time constant from the linear ARX model" section in Materials and methods). We selected pERK and pCREB as Input candidates for each Output, c-Jun, c-Fos, and Egr1, based on previous studies [8][9][10] (Fig 5A). We selected pERK, pCREB, c-Jun, c-Fos, and Egr1 as Input candidates for each Output, FosB and JunB [8][9][10] (Fig 5A). We selected pERK, pCREB, c-Jun, c-Fos, Egr1, FosB, and JunB as Input candidates for each Output, Metrnl, Ser-pinb1a, and Dclk1 [9] (Fig 5A).
For c-Jun and Egr1, pERK was selected as an Input, and for c-Fos, pERK and pCREB were selected as Inputs (Fig 5). For FosB, c-Jun and c-Fos were selected as Inputs, and for JunB, pCREB was selected as Input ( Fig 5B). For Metrnl, FosB, c-Fos and JunB were selected as an Inputs; however, contributions of c-Fos and JunB were negligible (S2 Fig), indicating that FosB is a main Input for Metrnl. For Serpinb1a and Dclk1, JunB was selected as an Input (Fig 5B). It is noteworthy that FosB and JunB, but not signaling molecules and other IEGs, were mainly selected as Inputs of the LP genes and the inputs for Metrnl and Dclk1 were different despite their similar temporal patterns.
We characterized the dose-response by the Hill equation and gain and time constant by the linear ARX model (Fig 5C, Table 1). The dose-responses from c-Jun and c-Fos to FosB showed typical switch-like responses, whereas others showed graded or weaker switch-like responses. Note that the gain from the converted c-Jun to FosB was much smaller than that from the converted c-Fos (Table 1), indicating that FosB is mainly regulated by c-Fos but not c-Jun. The time constants for c-Jun, c-Fos, Egr1, Metrnl, and Dclk1 were less than 1 h, whereas those for FosB, JunB, and Serpinb1a were more than 100 min (Table 1), indicating that induction of FosB and JunB temporally limit the overall induction of the LP genes from signaling molecules. The transformation of Inputs by the Hill equation followed by the ARX model is shown in S2 Fig. In addition, when we integrated these three sets of the NARX

Prediction and validation of the identified system by pharmacological perturbation
We validated the identified system by pharmacological perturbation. One of the key issues in PC12 cell differentiation is whether ERK or CREB phosphorylation mediates expression of the downstream genes [9,11,47]. Therefore, we selectively inhibited ERK phosphorylation by a specific MEK inhibitor, trametinib [48][49][50] in PACAP-stimulated PC12 cells. We found that PACAP-induced ERK phosphorylation, but not CREB phosphorylation, was specifically inhibited by trametinib (Fig 6A, black dots).
For c-Jun, c-Fos, Egr1, and JunB, we recovered signals of the unequally spaced time series data of Inputs and Output. For c-Jun, c-Fos, and Egr1, we simulated Outputs responses using these recovered data and the identified NARX model (Fig 6A black lines, see also "Simulation of the integrated NARX model" section in Materials and methods). For other downstream molecules, FosB, JunB, Metrnl, Serpinb1a, and Dclk1, we used the recovered data of pCREB and the simulated time series data of c-Jun, c-Fos, and Egr1 as Inputs for the identified NARX model (Fig 6A, black lines). The simulated time courses of Outputs were similar with those in experiments, except those of FosB and Metrnl (Fig 6A, black lines and black dots). In the simulation, FosB and Metrnl did not respond to PACAP in the presence of trametinib, whereas in the experiment both molecules did so, suggesting the possibility of failure of the system identification of FosB and/or Metrnl. Therefore, we investigated whether FosB and Metrnl can be reasonably reproduced when experimental and recovered data of c-Fos and c-Jun and of FosB, respectively, were used rather than the simulated ones ( Fig 6B). When experimental and recovered data were used as Inputs, Metrnl, but not FosB, responded to PACAP in the presence of trametinib both in the simulation and experiment (Fig 6B), indicating that the failure of the system identification of FosB and Metrnl in Fig 6A arose from the failure of the system identification of FosB. Thus, all Outputs except FosB showed similar responses in the experiment and simulation when the experimental and recovered data were used as Inputs, indicating that in most cases the identified system is validated by pharmacological perturbation.

Discussion
In this study, we identified the system from signaling molecules to gene expression using the unequally spaced time series data for 720 min after the stimulation. Given that expression levels of the LP genes were highly correlated with neurite length regardless of growth factors [9] and expression continues for 720 min after the initial addition of NGF [12], the identified system is the selective growth factor-signaling decoding system for neurite length information, one of the most critical steps for cell differentiation in PC12 cells. We found that the LP genes depends only on the IEGs (c-Fos, FosB and/or JunB) but not other upstream molecules, and that the decay rates of the LP genes are fast. This means that the timing of the final decoding System identification of signaling dependent gene expression step for neurite elongation is not directly determined by the IEGs and LP genes, rather by the steps from growth factors to the late IEGs.
We previously identified the systems leading from pERK and pCREB to the IEGs using the equally spaced dense time series data with a uniform 3-min interval during 180 min [10]. The identified I-O relationships in this study are the same, except for the inputs of FosB and JunB. In this study, for FosB c-Jun was selected as an Input in addition to c-Fos. However, the gain from the converted c-Jun to FosB was much smaller than that from the converted c-Fos (Table 1), indicating that the effect of c-Jun is negligible. For JunB, c-Fos was not selected as an Input in this study, whereas c-Fos was selected in the previous study [10]. The gain from the converted c-Fos to JunB at lower frequency was much smaller than that from the converted pCREB [10], indicating that the effect of c-Fos is negligible in the previous study. Thus, the identified I-O relationships in this study are consistent with our previous work.
The estimated NARX parameters were also generally consistent with those in our previous study [10]. The peak of c-Fos by NGF stimulation was approximately 0.9, whereas it was approximately 0.6 in our previous study [10]. The difference may come from the difference in the algorithm for parameter estimation, because of the procedure of signal recovery is included in this study. Overall, the inferred the I-O relationship, the Hill equation, and the linear ARX model in this study are consistent with our earlier observations, indicating that system identification using unequally spaced time series data can give the same performance as using equally spaced time series data and that the system is time invariant during 720 min. Furthermore, the identified system can reasonably reproduce the time series data using extrapolated data with dots, experimental and recovered data. Experimental and recovered data of pERK and pCREB, and the simulated data of c-Jun, c-Fos, Egr1, FosB, and JunB are given as Inputs, and simulation was performed using the NARX model in Fig 5 (see "Simulation of the integrated NARX model" section in Materials and methods). In the experiment, PC12 cells were treated in the absence (blue dots) or in the presence (black dots) of trametinib (10 μM) added at 30 min before stimulation with PACAP (100 nM). Note that the PACAP stimulation data are used, as in Fig 4. (B) Simulation using experimental and recovered data as Inputs. For each set of the Inputs (left panel for each) and Outputs (right panel for each), the unequally spaced time series data were recovered (pluses) (right panel for each), and the responses of Outputs were simulated by the NARX model identified in Fig 5A-5C  trametinib, except for FosB. Taken together, these results demonstrate the validity of the predicted response of the identified systems. The reason for the failure of system identification of FosB is unclear, but it may reflect the failure of parameter estimation due to the insufficient number of experimental data points, the limitation of the NARX model structure, or the existence of unknown regulatory molecules. Further studies are necessary to address these issues. Some of the recovered time series data is noisy such as Metrnl (Fig 5B), and these are possibly remedied by transforming the output of ARX model with the Hill equation instead of the inputs. However, doing so may result not only in an increasing of computational cost, but also in mathematically another two problems. First, the signal recovery cannot be applied within the present framework based on IPMS. Second, the parameter estimation of ARX model becomes harder. The problems are essentially due to the difficulty of optimization, caused by the fact that the tuning parameters are inside nonlinear function. We carried out multipleinput and single-output (MISO) system identification in this study. As another improvement approach of this study method, there is multiple-input and multiple-output (MIMO) system identification. Although MIMO can be applicable, there is the combination problem between the outputs. In this study, our MISO system identification was calculated in 452 combinations, but the combinations for MIMO will be increased to 1003 combinations. Also, as the number of parameters increases about twice, the cost of parameter estimation becomes more expensive. Therefore, we employed MISO system identification in this study. Both the MIMO approach and transforming outputs of ARX model are future works.
One key issue is whether ERK and/or CREB mediates cell differentiation through downstream gene expression in PC12 cells [9,11,47]. We previously found that the LP genes are not induced by NGF in the presence of U0126, another MEK inhibitor [9], and that the MEK inhibitor blocks NGF-induced phosphorylation of both ERK and CREB in PC12 cells [8,9,51]. By contrast, the MEK inhibitor blocked phosphorylation of ERK, but not CREB, in PACAP-stimulated PC12 cells [8,51] (Fig 6), suggesting that PACAP induces phosphorylation of CREB through a cAMP-dependent pathway, rather than the ERK pathway. These results demonstrate that NGF selectively uses the ERK pathway, whereas PACAP selectively uses the cAMP pathway for induction of the LP genes. Considering that LP genes are the common decoders for neurite length in PC12 cells regardless of growth factors [9], the identified system in this study (except for FosB) reveals the selective NGF-and PACAP-signaling decoding mechanisms for neurite length information.
Recently, fluorescence resonance energy transfer probes, optogenetics, and microfluidic devices have been developed to achieve observation and time control of ERK phosphorylation temporal patterns. These methods allow us to focus on quantitative relationships between various ERK phosphorylation temporal patterns and phenotypes such as cell differentiation [13,21,29,[52][53][54][55]. Although the relationship between signal transduction and phenotype has been extensively studied, it remains unclear how the signaling molecules quantitatively regulate the downstream gene expressions over a longer time scale, leading to cell fate decisions. In this study, we revealed the quantitative regulatory mechanism between signaling activation at a short time scale (tens of minutes) and gene expression at a longer time scale (day) by using a system identification method integrating a signal recovery technique and the NARX model based on compressed sensing.
A linear or spline interpolation is often used to convert unequally spaced time course data into equally spaced time course data in biological data analysis. However, such interpolation methods are not likely to be reliable because the interpolation methods ignore biochemical property of molecular network. By contrast, the interpolation used in this study is based on the NARX model, which reflects biochemical property. Thus, the proposal method in this study is biologically more plausible than a linear or spline interpolation.
There are obscure points for an application of this method to biological data analysis. The relationship between a number of observed time points and accuracy of signal recovery is theoretically unknown. In addition, how to select time points is also unknown. Intuitively, dense time points may be required for transient response, while sparse time points may be sufficient for sustained response. Further study is necessary to address this issue.
In molecular and cellular biology, molecular networks-the I-O relationship in this studyare generally examined by gene disruption or pharmacological perturbation experiments, meaning that the I-O relationship is examined using static and qualitative data. In this study, we used Inputs-Outputs time series data for system identification, allowing us to determine the I-O relationship using dynamic and quantitative data. Our system identification method does not require detailed knowledge of pathway information, which means that it can be used as a pathway finder directly from time series data. Moreover, additional information such as sensitivity with graded or switch-like response, time delay, and gain can be obtained. One of the advantages of using time series data rather than static perturbation experiments is simultaneously obtaining the I-O relationship, sensitivity, and time constant, which characterize the system behavior. This is based on the idea that input-output time series data implicitly include information on the I-O relationship. However, we must note that the I-O relationship obtained by using this method may be an apparent relationship inferred from time series data and is not necessarily the direct I-O relationship. Therefore, the obtained I-O relationship should be validated by gene disruption or pharmacological perturbation experiments, as shown in Fig 6. In addition, synergy induced by cross talk between signaling molecules is one of the most important properties in signaling network. The NARX do not assume synergy induced by cross talk. However, incorporation of synergy causes combinatorial problem and increases computational cost. Such synergistic effect should be incorporated in the future. To summarize the above points, as caveats when using NARX model, estimated I-O relationships is phenomenological relationship, rather than direct interaction (Table 2). Furthermore, the possibility that unknown molecules are upstream inputs cannot be ruled out. On the other hand, if the pathway is unknown and the prior knowledge is not available, the model candidates can be prepared systematically in the NARX model, which is suitable for estimating I-O relationships. In particular, in the case of estimating I-O relationships including nonlinear biochemical reactions without prior knowledge like in this study, the estimation of I-O relationships by ODE model is difficult and NARX model is useful. We also compared the parameter estimation cost between the NARX model for c-Jun, c-Fos and Egr1 and ODE model in our previous study because the model size can be considered similar [56], indicating that the parameter estimation cost of NARX model in this study is relatively small. In addition to that, it is possible to apply to unequally spaced time-series data in the NARX model developed in this study. Although it is difficult to obtain the direct mechanistic interactions by the NARX model, the result of the I-O relationships and time constants provide biological insight of decoding mechanism of the upstream molecules by the LP genes. In conclusions, we have devised a system identification method using unequally spaced sparse time series data by signal recovery. Because of technical and budget limitations in biological experiments, it is generally difficult to obtain sufficient numbers of equally spaced dense time series data of molecules with different time scales. Thus far, system identification based on time series data has been limited to phenomena with similar time scales. However, our system identification method can solve this time-scale problem and can be applied to any biological system with different time scales, such as the cell cycle, development, regeneration, and metabolism involving ion flux, metabolites, phosphorylation, and gene expression.

Reagent or resource
Information about reagents and resources can be found in Table 3.

Cell culture, QIC and qRT-PCR analysis
Culture and stimulation of PC12 cells by growth factors [10] and QIC [45] were performed as previously described. Quantitative reverse transcription-polymerase chain reaction (qRT-PCR) was performed as previously described [9]. The sequences of the primers for the LP genes are shown in S2 Table [9].

Parameter estimation
Parameter estimation was performed by using 2.6 GHz CPU (Xeon E5 2670) of the super computer system of the National Institute of Genetics (NIG), Research Organization of Information and Systems (ROIS). The CPU time was 7.9 hours for the parameter estimation of the NARX model for c-Jun, c-Fos and Egr1, and that of ODE model in our previous study which can be considered similar model size was 25 hours in the same condition [56] (Table 2).

NARX model and data representation
Assuming that the input molecules (Input) and output molecules (Output) signals satisfy the following NARX model, Eqs (1) and (2), the system identification is performed by estimating unknown parameters in the NARX model, where u q;s k and y p;s k are experimental values of Input and Output at time step k, p and q respectively denote indices of Output and Input defined in the following sets, M p is the index set of Input defined for each Output p 2 P (Fig 5A). The nonlinear function f (x) in Eq (2) is the Hill equation that is one of the steady state solutions of biochemical reaction and widely used in the field of biology [27]. The coefficients a p i and b p j , the orders m y and m u in Eq (1), n p and K p in Eq (2), and set M p are unknown parameters. For each molecule under stimulation condition s (NGF, PACAP, and PMA), the unequally spaced time series data are obtained in this study. We consider them as equally spaced time data u q;s k and y p;s k with missing time points and identify the unknown NARX parameters after recovering missing time points based on the low rankness of the Hankel-like matrix, which is described in the next section. System identification of signaling dependent gene expression Extension of ARX system identification from unequally spaced time series data to the nonlinear ARX system To deal with the nonlinear ARX system, we extend ARX system identification from unequally spaced time series data to the nonlinear ARX system. First consider the simple case of the linear ARX model and then extend it to the NARX model. To perform system identification from unequally spaced time series data, equally spaced time series data are generated by signal recovery of unknown values of missing time points. Because the Input and Output data of a linear system are missing and the order of the system is unknown as in this study, the system identification using the recovered Input and Output data based on the low rankness of the Hankel-like matrix has been proposed [42]. This is a method to recover missing data by solving the matrix rank minimization problem and to generate equally spaced sampled data. In this study, we apply this matrix rank minimization approach to simultaneously identify the NARX model and recover missing data. First, for simplicity, let us consider the case of the linear ARX model with single Input and single Output described by where y k and u k are the Output and Input at time step k, and v k is the noise. When only fu k g k2O u and fy k g k2O y are obtained, that is, the part of the Input and Output data fu k g N k¼1 and fy k g N k¼1 , we consider the problem of recovering unknown Input and Output data. Here, O u and O y are index sets and are a subset of the set {1,2,. . .,N}. We define Hankel-like matrices Y and U by Eqs (7) and (8), where we assume that N is sufficiently larger than r.  : Hankel-like matrices Y and U are matrices called Hankel matrices if they are square matrices, and they are matrices in which the same components are entered from the lower left to the upper right in the matrix. Considering v k = 0 in Eq (6), that is, considering an ideal case without noise, Eq (9) holds for the matrix [Y U] in which the matrices Y and U are arranged horizontally (S1B Fig).
Thus, the matrix [Y U] is a low-rank matrix whose rank is determined by the order of the system. If m y is known in Eq (9), the missing data can be recovered by restoring the unknown components of the matrix so that the rank of the matrix [Y U] becomes m y + r. Because the order m y is unknown in this study, we recover the unknown components so as to minimize the rank of the matrix [Y U] based on the idea that it is better to describe the system with as few parameters as possible. That is, the missing data are recovered by solving the matrix rank minimization problem as follows, where " y k and " u k are observed values. Eq (10) is a nonconvex optimization problem, which is generally a Non-deterministic Polynomial time (NP)-hard problem in the field of the computational complexity theory. Therefore, we relax this problem in Eq (11) in which the objective function is replaced by the nucleus norm, the sum of the singular values of the matrix, and obtain a low-rank matrix by solving this optimization problem with the iterative partial matrix shrinkage (IPMS) algorithm [41].
where kÁkÃ ,r represents the sum of singular values that are smaller than the rth greater singular value. The IPMS algorithm is a technique to provide a low-rank solution of Eq (10) by solving Eq (11) repeatedly for increasing r by 1, starting at r = 0, and provides recovered data with small energy loss after recovery and less distortion by preferentially estimating from a singular value of a large value [41].
In the case of a multi-Input system, for each Input, a Hankel-like matrix U l corresponding to the matrix U is generated, and by solving the matrix rank minimization problem of matrices arrayed side by side such as [Y U 1 . . . U L ], Inputs and Output data can be similarly recovered. Also, when data under multiple stimulation conditions are obtained, Input and Output data can also be recovered by arranging the matrices vertically for each stimulation condition. For example, when there is a data set of NGF stimulation and PACAP stimulation and simulation condition s is s 2 {NGF,PACAP}, a matrix composed of y s k and u s k is vertically arranged for each stimulation condition s to construct Y and U, and Input and Output data can be recovered by solving the matrix rank minimization problem for [Y U].  (14) and (15) is recovered again by the IPMS algorithm. Note that during IPMS process, AIC not but RSS is used because numbers of lag order change due to the change of matrix rank.
Step v: Select NARX parameters with the minimum AIC training . Repeat steps i to iv 500 times. Select n p and K n p p and ARX parameters that minimize AIC training .
Step vi: Signal recovery of test data. Using the n p , K n p p and ARX parameters selected in step v, add test data to the recovered matrix [Y U] in Eqs (14) and (15) like in Eqs (16) and (17). Test data are also recovered by solving the test data added matrix [Y U] rank minimization problem with the IPMS algorithm. Note that training data sets have already been recovered until step v. Therefore, with the training data fixed, IPMS was applied to the matrix combining the test data, and signal recovery of only test data is performed in this step.  (6), which can be re-described as follows, y k À a 1 y kÀ 1 À a 2 y kÀ 2 À Á Á Á À a m y y kÀ m y ¼ b 1 u kÀ 1 þ b 2 u kÀ 2 þ Á Á Á þ b kÀ m u ; ð18Þ

½Y
and its Z-transform are given by ðy k À a 1 z À 1 À a 2 z À 2 À Á Á Á À a m y z À m y ÞyðzÞ Then a discrete-time transfer function, a function to convert Input to Output through the system, G(z) can be described using these ARX parameters, To consider the frequency response function and calculation of gain and phase, z is substituted by iω, where i is an imaginary unit and ω is frequency. Therefore, gain and phase can be calculated from ARX parameters. The frequency response curve and phase diagram at each Input and Output of the identified linear ARX model are shown in S4 Fig. Note that gain indicated in Table 1 is steady-state gain. From the frequency response function, cutoff frequency f cutoff , an inverse of time constant τ, is obtained by calculating the frequency at which the gain corresponds to 1 ffi ffi 2 p of the steady-state gain. Because Eq (23) is established between f cutoff and the time constant τ, we can obtain τ from the ARX parameters through the above procedure.
Simulation of the integrated NARX model The simulation of the integrated NARX model was performed as follows. Experimental and recovered data of pERK and pCREB, and the simulated data of c-Jun, c-Fos, Egr1, FosB, and JunB were given as Input data and simulation was performed using the NARX model in Fig 5. Supporting information S1 Fig. Estimation of AR or ARX parameters by rank minimization of Hankel-like matrix.
(A) Estimation of AR parameters by rank minimization of Hankel-like matrix. When the signal y follows an AR model, y 1 is represented by the linear sum from y 2 to y r where r is lag order. This relationship can be expressed using AR parameters as the first row of the above matrix equation. Similarly, y 2 is represented by the linear sum from y 3 to y r+1 as expressed in the second row. Therefore, for number of data N, the matrix equation is established as shown in the above formula. Given that this matrix equation holds, in the Hankel-like matrix below, the first column is represented by the linear sum of the second and subsequent columns, indicating that the matrix is a low-rank matrix. Note that Hankel-like matrix is a one in which the same components are entered from the lower left to the upper right here. Therefore, assuming that y follows an AR model, in case that there is missing data in y, it can be recovered by minimizing the rank of this Hankel-like matrix. Once the Hankel-like matrix is obtained, AR parameters can be estimated by solving the above matrix equation. (B) Estimation of ARX parameters by rank minimization of a Hankel-like matrix. In the case of ARX model with input u added, the similar Hankel-like matrix composed of y and a Hankel-like matrix composed of u, and the matrix equation of the above equation are established. Assuming that y and u follow the ARX model, missing data in y and u can be recovered by minimizing the rank of this Hankel-like matrix, and ARX parameters can be estimated by solving the above matrix equation. Note that this method can be applicable for multiple inputs by increasing the number of Hankel-like matrices composed of u.  Table 1. Filter characteristics of frequency response curves of these outputs showed lowpass filter characteristics. (TIF) S1