Identification of a Metabolic Reaction Network from Time-Series Data of Metabolite Concentrations

Recent development of high-throughput analytical techniques has made it possible to qualitatively identify a number of metabolites simultaneously. Correlation and multivariate analyses such as principal component analysis have been widely used to analyse those data and evaluate correlations among the metabolic profiles. However, these analyses cannot simultaneously carry out identification of metabolic reaction networks and prediction of dynamic behaviour of metabolites in the networks. The present study, therefore, proposes a new approach consisting of a combination of statistical technique and mathematical modelling approach to identify and predict a probable metabolic reaction network from time-series data of metabolite concentrations and simultaneously construct its mathematical model. Firstly, regression functions are fitted to experimental data by the locally estimated scatter plot smoothing method. Secondly, the fitted result is analysed by the bivariate Granger causality test to determine which metabolites cause the change in other metabolite concentrations and remove less related metabolites. Thirdly, S-system equations are formed by using the remaining metabolites within the framework of biochemical systems theory. Finally, parameters including rate constants and kinetic orders are estimated by the Levenberg–Marquardt algorithm. The estimation is iterated by setting insignificant kinetic orders at zero, i.e., removing insignificant metabolites. Consequently, a reaction network structure is identified and its mathematical model is obtained. Our approach is validated using a generic inhibition and activation model and its practical application is tested using a simplified model of the glycolysis of Lactococcus lactis MG1363, for which actual time-series data of metabolite concentrations are available. The results indicate the usefulness of our approach and suggest a probable pathway for the production of lactate and acetate. The results also indicate that the approach pinpoints a probable strong inhibition of lactate on the glycolysis pathway.


Introduction
Understanding metabolic pathways allows us to control metabolism, design a better metabolic system and optimise productivity. In vitro, in vivo and in silico research has been used to reconstruct the set of reactions that compose metabolic networks and their regulatory structure. However, it is still challenging to predict an unknown metabolic reaction network both experimentally and theoretically. For example, an in vitro experimental technique based on enzyme assays [1] can elucidate whether enzymes are inhibited or activated via interaction with metabolites, resulting in the clarification of a metabolic reaction network. However, this technique is costly, tedious and timeconsuming because each enzyme activity needs to be measured individually in in vitro experimental systems specifically optimized for respective enzymes. Thus, it may be difficult to apply this technique to a large-scale metabolic system. On the other hand, time-dependent changes of metabolite concentrations can be determined in vivo [2] and a large amount of metabolomics data have been reported from the utilisation of high-throughput analytical instruments [3]. Canonical correlations and multivariate analysis are often used to analyse those metabolomics data.
However, while correlations of metabolites have been successfully acquired, a network structure of the correlated metabolites remains unidentified.
Because of the experimental constraints, systems biology approaches are recently considered to be one of the alternatives for handling metabolomics data and analysing metabolic systems. Specifically, the mathematical modelling approaches have been exploited to analyse metabolic reaction networks [4]. In reality, however, information on metabolic reaction networks, metabolite concentrations and parameters such as rate constants and kinetic orders are required to construct an appropriate model. A wellknown method is the utilisation of Michaelis-Menten type equations that express rates of enzymatic reactions [5]. However, it is not easy to identify each type of reaction because of in vitro experimental constraint mentioned above. Biochemical systems theory (BST) is an alternative method of analysing enzymatic reactions in network systems [6][7][8]. This theory provides a simple method for constructing a mathematical model once a network structure is available as a metabolic map, and it requires fewer parameters. Several techniques for estimating better parameter values have been proposed [9][10][11]. However, these techniques require a known metabolic reaction network for parameter estimation.
To overcome these difficulties, therefore, the present study explores a new approach for identifying a metabolic reaction network and simultaneously constructing a mathematical model. The approach consists of statistical and mathematical modelling techniques. The main concept of this approach is to employ timeseries data to determine the structure of the metabolic reaction network. In principle, metabolites probably relate to others in a complicated network. A perturbation of a metabolite concentration causes changes in other metabolite concentrations. Thus, if changes in the time courses of metabolite concentrations are analysed, it becomes possible to predict and understand their metabolic reaction network. The present work therefore proposes such a new approach based on this idea.

Generic inhibition and activation model
The proposed algorithm is presented in Figure 1. As the real experimental data usually contain both biological and analytical errors, the analysis starts with smoothing noisy time-series data using locally estimated scatter plot smoothing (LOESS). Then, bivariate Granger causality is calculated to examine causal relationships between all pairs of metabolites, and unrelated metabolite pairs are removed from further consideration. A mathematical model is then formulated in S-system representation in the framework of biochemical systems theory (BST) by taking into consideration effects between all remaining metabolite pairs, followed by parameter estimation using nonlinear least-square method, namely Levenberg-Marquardt algorithm (LMA). The iterations from the mathematical modelling step to parameter estimation (BST to LMA) are simulated and a most insignificant metabolite is removed one by one in each iteration step. Finally, a probable metabolic network is identified.
To validate if the algorithm is applicable, we start the study using a known metabolic reaction network, i.e., the generic inhibition and activation model ( Figure 2A). This model has been widely employed as a metabolic case study in the development of parameter estimation techniques [12] because it imitates characteristics of a real metabolic pathway which includes a branching point and both inhibition and activation effects. Firstly, the timeseries data for the metabolites X 1 -X 4 were generated at 51 time points by using the mathematical model with parameter values described in equations 1-4. For the preliminary study, we consider a case without noise to properly evaluate the performance of the proposed approach. Therefore, the step of data smoothing by LOESS was not used.
where X i are metabolite concentrations. a i and b i are rate constants of net influxes and effluxes, and g ij and h ij are their kinetic orders.
The time-series data in silico generated are plotted in Figure 3. The behaviour of metabolite concentrations is quite different to each other. It is therefore difficult to predict the relationship between the metabolite concentrations. To calculate correlation coefficient between metabolite concentrations, the normality distribution of each metabolite concentration was tested (data not shown). The result shows that the time-series data do not have normality (p-value,0.05), and the Spearman's rank correlation coefficient should be used to calculate the correlation coefficient.  However, to broadly observe correlations among metabolite concentrations, we simply calculated both the Pearson's correlation coefficient and Spearman's rank correlation coefficient according to common methods for acquiring correlations of metabolite concentrations. The results, shown in Table S1, S2 in Information S1, indicate good positive and negative correlations between several metabolites. However, this finding does not indicate causal relationships between metabolites and effects of one metabolite on the other.
To obtain more information on network properties, the bivariate Granger causality test was executed to investigate relationships among metabolites. Table 1 tabulates the result of the bivariate Granger causality test for the generic inhibition and activation model. In theory, the Granger causality of a metabolite on itself cannot be calculated; hence, these data are not available. The result indicates that all p-values are much lower than a significance level of 0.01. This may be partly because we did not adjust the time lag (u) in equation 13 (see Methods) but retained its value as u = 1, implying that the present data point was used for predicting the value at the next time point. On the other hand, changes in the values of the time lag may have an effect on the Granger causality result. However, the p-values are still lower than the significance level of 0.01, although their value increases as the time lag increases (Table S3, S4, S5, S6, S7 in Information S1). Thus, we consider the time lag that maximises significance and set this lag to unity throughout the simulations. Only the data point at time t-1 was considered for predicting the value of the data point at time t. Furthermore, the Granger causality may give some false positive interactions if a network is very large, but it is not considered to be a serious problem here, since we perform this calculation only for finding the highest causality and removing unnecessary metabolites.
From Table 1, it seems that each metabolite is Granger-caused by other metabolites. Hence, all metabolites must be considered in the next step calculation. The S-system equations were constructed and all metabolites were considered in the equations for both influxes and effluxes. It is possible to fit the metabolite concentrations using polynomial equations or sigmoidal curves and then calculate the slope values from the derivative of their equations. However, it should be noted that the concentrations are functions of time. This implies that even though one can calculate such slopes, these values may be different from their exact slope values directly calculated from S-system differential equations, because the exact slope values are functions of time and other metabolites. To make our approach practical, we calculated the slope values from neighbouring time-series data of metabolite concentrations. The differential equation for each metabolite was individually set as an objective function for parameter estimation.
The performance of LMA for estimating parameters in a wellknown model was investigated before it is used in our algorithm. The results are given in Table S8 in Information S1. LMA finds only a local minimum, not a global one. It is therefore necessary to verify whether this non-linear regression method can successfully converge when power-law equations are used. Exact slope values from S-system equations were selected for this validation. The initial values for both rate constants and kinetic orders were set at unity. The results show that the parameters that converge using the exact slopes (estimated parameters b) are identical to their respective actual parameter values. This indicates that convergence behaviour of our parameter estimation procedure performs very well, especially for this system. In contrast, when the slopes were calculated from neighbouring data points, the converged parameter values (estimated parameters c) are slightly different from their true values. This is natural because these slopes were directly calculated and are not a function of other metabolites, unlike in the former case. In actual experiments, however, such exact slopes are not obtained and only the metabolite concentrations are available. Nevertheless, both sets of estimated parameters provide similar characteristics in terms of the behaviour of metabolite concentrations.
LMA provided fast convergences although the initial parameter values which were set to be unity are far from the true parameter values. The convergence times were calculated using GNU octave version 3.2.4 on Windows 7 platform with 2.93 GHz CPU. The convergence times of X 1 , X 2 , X 3 and X 4 with the exact slope values were 0.119, 0.176, 0.319 and 0.087 s, respectively, whereas those with the slope values calculated from neighbouring data were 0.120, 0.169, 0.382 and 0.090 s, respectively.
Once the performance of LMA was successfully elucidated, we exploited it to our algorithm. Assuming that the network is unknown, the S-system equations (equation 15) were set up and all parameters for all metabolites were primarily considered. Table 2 shows the first parameter iteration values obtained by LMA. It is clear that absolute values of some kinetic orders are very low compared with other parameters. The low absolute parameter values are considered to have little effect on the current system. The metabolites with such kinetic orders were thus removed one by one. New equations were re-organised and the parameter estimation by LMA was iterated. The results are shown in Information S2. Again, parameters quickly converged to their solutions. For the first iterations, the convergence times of X 1 , X 2 , X 3 and X 4 were 6.08, 11.9, 5.28 and 3.85 s, respectively. The  Table 1. Bivariate Granger causality test for the generic inhibition and activation model a . convergence times also decreased with a decrease in the number of parameters.
The significantly large kinetic orders in Table 2 (more detail in Information S2) can be used to identify a metabolic reaction network. Although the metabolites with smaller kinetic orders may have some effect, the metabolites having large effects will probably be neighbouring metabolites in the metabolic pathway or metabolites that strongly inhibit or activate the metabolite of interest. Thus, the metabolites having large effects were selected for identification of an actual metabolic reaction network. Figure 2B shows the metabolic reaction network identified from the converged results. The predicted network structure with the equations derived using our procedure is consistent with the original network structure in Figure 2A [9,12]. A correct mathematical model and its parameters were also obtained simultaneously, as shown in Information S2. This suggests that our approach not only identifies a metabolic reaction network but also provides an appropriate mathematical model.
Although there is a constraint for using the bivariate Granger causality and also the parameter estimation using slopes may give slight calculation errors in the model construction, the above result clearly shows that our approach is theoretically consistent. Furthermore, it can provide a mathematical model for system analysis, although most of the systems biology approaches focus on either data analysis or model construction. On the other hand, actual experimental data contain biological and analytical errors and it may be difficult to obtain a large amount of time-series data. To evaluate the performance of our approach in practical application, therefore, the number of the time-series data for each metabolite concentration in the generic inhibition and activation model was decreased to 11 points and each data was allowed to include a noise in the range of 0-5% (Information S3). The result shows that it is still possible to estimate the metabolic reaction network if the time-series data set possesses clear characteristics and behaviours. It is therefore clear that our approach more depends on the quality of data than the quantity of data.

Simplified model of glycolysis of Lactococcus lactis
We next discuss the glycolysis pathway of Lactococcus lactis because a number of metabolite concentrations have been reported for several types of micro-organisms genetically modified or perturbed both in vitro and in vivo [14][15][16]. The time-series data of metabolite concentrations for Lactococcus lactis MG1363 were taken from a number of studies [2,17,18].
According to these studies, several metabolite concentrations, such as phosphoenolpyruvate and phosphoglyceraldehyde, contain significant experimental errors, and it is difficult to validate the results. Consequently, these experimental data were neglected. In contrast, metabolites that have clear metabolic behaviours despite containing large experimental errors were considered here. The current system thus consists of five metabolites, including three extra-and two intra-cellular metabolites.
We fitted the measured time-series data of metabolite concentrations obtained in Neves et al. [2,17,18] by LOESS. The parameters that control the degree of smoothing were arbitrarily adjusted (Table S18 in Information S4). The estimated time-series data of metabolite concentrations were produced from the results fitted by LOESS at time intervals of 1 min. Fifty-one data points for each metabolite concentration can be seen as lines in Figure 4.
The bivariate Granger causality for these estimated time-series data was calculated. The results are listed in Table 3. It is clear that some metabolites do not Granger-cause other metabolites (pvalues.0.01) whereas others do. For instance, there exist high Granger-causes of X 2 to X 1 , X 4 to X 2 , X 1 to X 3 , X 2 to X 4 and X 3 to X 5 .
A procedure to construct a metabolic reaction by Granger causality is as follows. First, the influx to X 1 (glucose) is not considered because it is the starting compound. Second, effluxes from X 4 (lactate) and X 5 (acetate) are also not considered because they are end products. Third, the metabolites that have insignificant Granger causalities are removed. Fourth, the metabolites having first and second Granger causalities are considered. As a result, it is possible to predict a pathway from the Granger causality, as illustrated in Figure 5, where the solid lines express the most significant causality for each metabolite and the broken lines express the second most significant causality. Although the Granger causality is useful for approximately understanding the metabolic network structure, it is still not enough to identify the actual structures because of insufficient information.
To more accurately predict the metabolic network structure, the mathematical modelling approach was repeatedly used right after the time-series data were statistically analysed. S-system equations (equation 15) were set up and parameters were estimated using LMA. In this case, the parameters for the metabolites which do not significantly Granger cause the other metabolites (underlined in Table 3) were removed or set to be zero. The remaining metabolites in the Granger causality test were then included in the influx terms of the S-system equations, whereas all metabolites were included in the efflux terms. The metabolites with the lowest kinetic order were removed one by one at each iteration step. Although a particular metabolite may have little effect on its efflux, it must be considered in the efflux term because the efflux is influenced by the metabolite. In addition, the metabolites with the highest Granger causality must be considered in the entire calculation because they are statistically significant. Table 4 presents the parameters that were determined by LMA in the glycolysis model and the predicted model (equations 5-9) are described as follows: where X i are metabolite concentrations. a i and b i are rate constants of total influxes and effluxes, respectively, and g ij and h ij are their kinetic orders. Y i represent unknown parameters assigned for both rate constants and kinetic orders. Table 4 also includes the R-squared values in each removal process. As the number of iterations increases, the R-squared value usually decreases from unity (Information S4). A low R-squared value implies that the values calculated using the reconstructed equations do not fully agree with the experimental data. This is natural because the degree of fitness is lowered as a result of the reduction in the number of parameters. Figure 6A shows the metabolic reaction network predicted using the remaining parameters. Glucose (Glu) is converted to glucose-6phosphate (G6P), which is successively converted to fructose-1,6bisphosphate (FBP). This agrees with the structure of the actual glycolysis pathway acquired from KEGG ( Figure 6B). Interestingly, our approach suggests that G6P has a pathway, allowing it to be converted to lactate (Lac) and acetate (Ace). This pathway could be regarded as the part of pentose phosphate pathway, although the flux through this pathway is not high [19]. Our approach further suggests that Lac strongly inhibits the formation of G6P. This interaction is related to the inhibition of acids on cells [20].
Identification of the probable metabolic reaction network automatically leads to the formulation of a mathematical model in the S-system. As shown in Figure 7, the values calculated by the mathematical model are in agreement with the experimental ones, implying that our approach has good performance.
To further verify whether the mathematical model is appropriate, we calculated the instantaneous bottleneck ranking (BR) indicator defined as This indicator is a product of the logarithmic gain L(X i ,Y j ) and the metabolite concentration X i and provides the time-transient response of the dependent variable X i to an infinitesimal percentage change in the independent variable Y j [21]. A positive value of the instantaneous BR indicator indicates that an increase in an enzyme activity increases the relevant metabolite concentration from its initial concentration, whereas a negative value indicates that an increase in the relevant enzyme activity decreases the relevant metabolite concentration. Figure 8 shows the time courses of the instantaneous BR indicators for lactate (X 4 ) and acetate (X 5 ) after the individual rate constants and kinetic orders (Table 4) increases at t = 0 (additional information is available in Table S24 in Information S4). Overall, the BR indicators for the lactate concentration increase or decrease more significantly than do those for the acetate concentration, suggesting that lactate is more easily formed than acetate. The difficulty in the formation of acetate arises because the flux for lactate formation is higher than that for acetate formation. Moreover, ranking of enzymes based on the BR indicators reveals that the bottleneck enzyme for lactate formation is Y 2 when it is increased and Y 7 when it is decreased, while that for acetate formation is Y 3 when it is increased and Y 8 when it is decreased. These finding are supported by the previously reported experimental data [22]. Thus, the analytical results using the instantaneous BR indicators indirectly support the reliability of our network identification approach.
Unlike other statistical approaches using correlation or causality, our approach can not only identify a metabolic reaction network but also provide a mathematical model simultaneously. Furthermore, it provides kinetic parameters which allows us to straightforwardly analyse the metabolic system using the obtained mathematical model.

Conclusions
The present study investigated an approach to identify a metabolic reaction network structure from time-series data of metabolite concentrations and simultaneously obtain its mathematical model in the S-system equations within the framework of BST. The Granger causality test was used to statistically identify interactions among metabolites and then remove metabolites which have insignificant causality to the considered metabolite. This result was used to form a mathematical model in the S-system representation. The parameters, namely, rate constants and kinetic orders in this mathematical model, were estimated by the Levenberg-Marquardt method. This estimation process was iterated to remove the least significant metabolite of each total influx and efflux according to the magnitudes of the kinetic orders. Consequently, the final form of the mathematical model was used to predict a probable structure for the metabolic reaction network system. A series of theoretical analyses clearly show that our approach is effective in identifying a metabolic reaction network. In the future, an in vitro experiment for measuring individual enzyme activities may also be performed on the basis of the prediction to reconstruct a newly possible metabolic pathway.

Methods
To efficiently identify a metabolic reaction network using timeseries data of metabolite concentrations, we use statistical and mathematical modelling techniques as described below.

Locally estimated scatter plot smoothing
The locally estimated scatter plot smoothing (LOESS) method is a non-parametric statistic which does not require any specific function to fit a mathematical model. Hence, it is very flexible in fitting experimental data containing noise or experimental errors. The regression function is locally approximated by the value of a function in some specified parametric class [23]. Such a local approximation is obtained by fitting a regression surface to data points within a chosen neighbourhood of the point: where y i is the ith measurement of the response y, x i is the corresponding measurement of predictors, g is the regression or the smooth function and e i is the random error. The weights are given by the tricube function: The value of weight function is low when x k is distant from x i . If its value is increased, the influence from the data points located in the neighbourhood will be increased. This results in increased smoothness of the smoothed points. A piecewise function is used to handle the data that cannot be properly fitted.

Bivariate Granger causality
The Granger causality test is a statistical hypothesis test used to determine whether one time series causes another. It is widely used in economics and has recently been employed to integrate omics data, i.e. transciptomics and metabolomics [13]. The present study introduces this test to evaluate causality among metabolites. Direct relationships between two metabolites were evaluated using the bivariate Granger causality test [24] on the basis of the following equations: y t~a0 za 1 y t{1 z:::za m y t{m zresidual t y t~a0 za 1 y t{1 z:::za m y t{m zb u x t{u z:::zb v x t{v zresidual 0 t ð13Þ where x and y are the stationary time series for testing the null hypothesis that x does not Granger-cause y. Appropriate lagged values of y are found and included in a univariate autoregression of y. The symbol m denotes the largest lag length for which the lagged dependent variable is significant, u is the shortest lag length and v is the longest length for which the lagged value of x is  significant. The time lag was set to be unity, denoting that the value at the present data point was used to predict the value at the next time point.
An F-test for equality of variances is used to verify whether the residuals are significant. An index measuring the strength of the causal interaction is defined as where RSS 1 and RSS 2 are the sums of squared residuals of residual t and residual t ', respectively. The null hypothesis is rejected if the F calculated from the data is greater than the critical value of the F distribution for some desired false rejection probability; the present study used 0.01 for the significant value.

Biochemical systems theory (BST)
Biochemical systems theory (BST) provides a powerful procedure for characterising biochemical systems [6][7][8]. BST describes nonlinear systems in terms of power-law functions. The present study uses the S-system representation within the framework of BST: where X i (i = 1,…, n) are the metabolite concentrations for n dependent variables and a i and b i are the rate constants of influxes and effluxes, while g ij (j = 1,…, n) and h ij (j = 1,…, n) are kinetic orders of influxes and effluxes, respectively. The S-system describes a metabolic reaction network by individually aggregating influxes and effluxes for each metabolite pool, reducing the number of parameters.

Levenberg-Marquardt algorithm
Non-linear least-squares methods use parameter estimation iterations to reduce the sum of the squared errors between each function value and a measured data point. LMA is a combination of the gradient descent method and the Gauss-Newton method [25][26][27].
The chi-squared error criterion is given as where y(t i ) is the measured value, y(t i ;p) is the curve fitting function and w i is the measure of error in measurement y(t i ).
The gradient of the chi-squared objective function with respect to the parameters is given as follows: where the weighting matrix W is diagonal with W = 1/w i 2 The Gauss-Newton method denotes the perturbed model parameters that are locally approximated by first-order Taylor series expansion asŷ y(pzh)&ŷ y(p)z Lŷ y Lp The Levenberg algorithm adaptively varies and updates the parameters between the gradient descent method and Gauss-Newton as where l represents the algorithm parameter. LMA implemented here uses the following modified equation [28]: Small values of l lead to a Gauss-Newton update, whereas large values of l lead to a gradient descent update.

Proposed algorithm
Our proposed algorithm is illustrated in Figure 1. The calculation starts with data smoothing by the LOESS method, followed by using the Granger causality test for removing an unnecessary metabolite and then estimating parameters in the Ssystem equations by LMA. In the iteration of this series of methods, parameters having the least effect are removed one by one under the criterion that each term in the equation has at least one metabolite or the R-squared value does not remarkably decrease within a satisfactory degree of fitness.