Model Classification for Uncertainty Reduction in Biochemical Kinetic Models

A persistent obstacle for constructing kinetic models of metabolism is uncertainty in the kinetic properties of enzymes. Currently, available methods for building kinetic models can cope indirectly with uncertainties by integrating data from different biological levels and origins into models. In this study, we use the recently proposed computational approach iSCHRUNK (in Silico Approach to Characterization and Reduction of Uncertainty in the Kinetic Models), which combines Monte Carlo parameter sampling methods and machine learning techniques, in the context of Bayesian inference. Monte Carlo parameter sampling methods allow us to exploit synergies between different data sources and generate a population of kinetic models that are consistent with the available data and physicochemical laws. The machine learning allows us to data-mine the a priori generated kinetic parameters together with the integrated datasets and derive posterior distributions of kinetic parameters consistent with the observed physiology. In this work, we used iSCHRUNK to address a design question: can we identify which are the kinetic parameters and what are their values that give rise to a desired metabolic behavior? Such information is important for a wide variety of studies ranging from biotechnology to medicine. We used the proposed methodology to find parameters that ensure a rate improvement of the xylose uptake (XTR) in a glucose-xylose co-utilizing S. cerevisiae strain. Our results indicate that only three kinetic parameters need to be accurately characterized to describe the studied physiology, and ultimately to design and control the desired responses of the metabolism. This framework paves the way for a new generation of methods that will systematically integrate the wealth of available omics data and efficiently extract the information necessary for metabolic engineering and synthetic biology decisions. Author Summary Kinetic models are the most promising tool for understanding the complex dynamic behavior of living cells. The primary goal of kinetic models is to capture the properties of the metabolic networks as a whole, and thus we need large-scale models for dependable in silico analyses of metabolism. However, uncertainty in kinetic parameters impedes the development of kinetic models, and uncertainty levels increase with the model size. Tools that will address the issues with parameter uncertainty and that will be able to reduce the uncertainty propagation through the system are therefore needed. In this work, we proposed a method called iSCHRUNK that combines parameter sampling and machine learning techniques to characterize the uncertainties and uncover intricate relationships between the parameters of kinetic models and the responses of the metabolic network. The proposed method allowed us to identify a small number of parameters that determine the responses in the network regardless of the values of other parameters. As a consequence, in future studies of metabolism, it will be sufficient to explore a reduced kinetic space, and more comprehensive analyses of large-scale and genome-scale metabolic networks will be computationally tractable.


Introduction
Kinetic models are one of the cornerstones of rational metabolic engineering as they allow us to capture the dynamic behavior of metabolism and to predict dynamic responses of living organisms to genetic and environmental changes. With reliable kinetic models, metabolic engineering and synthetic biology strategies for improvement of yield, titer, and productivity of the desired biochemical can be devised and tested in silico (1). The scientific community has acknowledged the utility and potential of kinetic models, and efforts towards building large-and genome-scale kinetic models were recently intensified (2)(3)(4)(5)(6)(7)(8)(9). Nevertheless, the development of these models is still facing challenges, such as partial experimental observations and large uncertainties in available data (10)(11)(12).
The major difficulty in determining parameters of kinetic models are uncertainties associated with: (i) flux values and directionalities (13-15); (ii) metabolite concentration levels and thermodynamic properties (13-15); and (iii) kinetic properties of enzymes (2,16). As a result of interactions of metabolite concentrations and metabolic fluxes through thermodynamics and kinetics, these uncertainties make parameter estimation difficult.
Quantifying these uncertainties and determining how they propagate to the parameter space is essential for identification of parameters that should be measured or estimated to reduce the uncertainty in the output quantities such as time evolution of metabolites or control coefficients (17,18).
In biological systems, large uncertainties and partial experimental data commonly result in a population instead of in a unique set of parameter values that could describe the experimental observations. Such population of parameter sets is typically computed using Monte Carlo sampling techniques (3-5, 8, 9, 11, 19-26). However, the problem is when certain properties differ among models in a model population. For example, one such property is flux control coefficients (FCCs)(17, 18,27). We used here the ORACLE framework (3,4,8,10,11,26,28,29) to compute a population of kinetic models along with the corresponding flux control coefficients with the aim of improving xylose uptake rate (XTR) of a glucose-xylose co-utilizing S. cerevisiae strain (30). We have found that in the same population of models that are consistent with the observed physiology FCCs can be different due to lack of data about kinetic parameters. This can lead to erroneous or conflicting conclusions and decisions about the system in metabolic engineering and synthetic biology studies.
In order to address such issues, we propose to formulate these problems as parameter classification: identify which of the parameters, if any, should be constrained so that the values of studied properties, such as FCCs, are in predefined ranges. For this purpose, we modified iSCHRUNK (in Silico Approach to Characterization and Reduction of Uncertainty in the Kinetic Models), a recently introduced machine learning approach that characterizes uncertainties in parameters of kinetic models, and identifies accurate and narrow ranges of parameters that can describe a studied physiological state (16). In iSCHRUNK, machine learning is combined with methods that generate populations of kinetic models (3-5, 8, 9, 11, 19-26) to data-mine the integrated data and observed physiology together with the kinetic parameters. We applied iSCHRUNK to identify the enzymes and their kinetic parameters that determine consistent FCC values related to XTR. Our results showed that by constraining only three parameters, corresponding to xylose reductase (XRI) and ATP synthase (ASN), consistent FCCs can be obtained for models computed around multiple steady-state metabolite concentrations. We further showed how the parameter classification and machine learning algorithms can be improved to more accurately identify the parameter subspaces that lead to welldetermined model properties.

Uncertainty in the xylose uptake responses to genetic manipulations
We analyzed the improvement of the xylose uptake rate (XTR) during mixed glucosexylose utilization in a recombinant Saccharomyces cerevisiae strain (30). The kinetic model of S. cerevisiae metabolic network was built around the reference steady-state of metabolic fluxes and metabolite concentrations (Methods). The model contains 258 parameters and describes 102 reactions and 96 intracellular metabolites distributed over cytosol, mitochondria and extracellular environment. The experimentally determined values of kinetic parameters were missing, and the analyzed system was undetermined, i.e., we had 102+96 computed values for steady-state fluxes and metabolite concentrations versus 258 unknown parameters. This meant that a multitude of parameter sets could reproduce the observed physiology, and we used the ORACLE framework that employs Monte Carlo sampling to generate a population of 200'000 kinetic models. We computed FCCs of the metabolic network and used them to rank enzymes according to their control over XTR, i.e., the highest ranked enzymes were the ones with the largest magnitude FCCs with respect to XTR. Among the top ranked enzymes, hexokinase (HXK), non-growth associated maintenance (ATPM), and NADPH reductase (NDR) had ambiguous control over XTR ( Fig 1A). The distributions of the control coefficients of XTR with respect to HXK, ATPM and NDR ( "#$ #%& , '%() #%& , and +,& #%& , respectively) were extensively spread around zero, and we could not deduce with certainty whether the control of these enzymes over XTR was positive or negative.  and chemical species is provided in S1 File.
The population of control coefficients "#$ #%& was nearly symmetric around zero with a mean of 0.007 and 47% of samples had negative values ( Fig 1B). We split the population of kinetic models based on the sign of "#$ #%& , and we analyzed the two populations with a negative (Fig 1C, left) and a positive (Fig 1C, right) control of HXK over XTR. The split in the population did not have a substantial effect on the majority of the control coefficients.
Interestingly, the exceptions were precisely the other enzymes with the ambiguous control over XTR, i.e., ATPM and NDR, which exhibited a negative correlation with HXK ( Fig 1C). This suggested that there were two distinct populations of control coefficients, and consequently that there were different values of some kinetic parameters that determined such metabolic responses.

Identification of significant parameters determining control of HXK over XTR
We used the Classification and Regression Trees (CART) algorithm (31,32) to identify significant parameters that determine responses of XTR to changes in HXK activity. The CART algorithm partitions the parameter space into hyper-rectangles determined by the ranges of parameters that satisfy the studied property. Here, we used as parameters the degree of saturation of the enzyme active site, σA (10), because this quantity is constrained in a well-defined range between 0 and 1 (Methods), and the desired property was the negative control of HXK over XTR. The inputs of parameter classification were: (i) the information for each out of 200'000 parameter sets whether or not it gave rise to the negative control of HXK over XTR; and (ii) parameter values of 200'000 parameter sets. A set of ranges of enzyme saturations that determine a hyper-rectangle computed by the CART algorithm we subsequently refer to as a rule.
To measure the performance of parameter classification we defined the performance index (PI), which quantifies a portion of parameter sets giving rise to the studied property.
In this work, out of all parameter sets that satisfy rules (or a rule) inferred by parameter classification, PI quantifies how many of them are giving rise to the negative control of HXK over XTR. For example, if 40% of models within a population of models satisfying a rule give rise to the negative control of HXK over XTR, then PI of this rule is 0.4.

Reduced number of parameters determine control of HXK over XTR
We performed parameter classification on 200'000 parameter sets of 258 parameters, and the algorithm identified 76 rules. In the identified rules, only 46 out of 258 parameters were constrained, whereas the remaining parameters had no effect on the control of HXK over XTR, i.e., their σA values could take any value between 0 and 1 (Methods). Kinetic subspaces defined by these rules had a portion of parameter sets giving rise to the negative control ranging from PI=0.50 to 0.78, i.e., 50-78% of parameter sets satisfying these rules resulted in the negative control (S2 File). This was a noteworthy improvement compared to the overall kinetic space with 47% of such parameter sets.

Preselection and identification of significant parameters
Our finding that a reduced number of parameters determined control of HXK over XTR  for Rule 1, Rule 2 and Rule 3, respectively (Fig 2). These results demonstrate that the parameter classification algorithm can reliably be used to identify the significant parameters and their ranges that give rise to the negative control of HXK over XTR.

Top 3 significant parameters
A closer inspection of the top ranked rules revealed that there were a few parameters such as ASN@pi_m or XRI@nadh_c (for notation see the caption of Table 1) that appeared rather consistently throughout the rules (Table 1 and S2 File). The appearance of a reduced number of parameters throughout the inferred top ranked rules suggested that these parameters were essential for a negative control of HXK over XTR. We hence ranked the parameters based on the number of their occurrences in the rules and by how much their ranges were constrained (Methods).
We first considered the top rule (Rule 1 in Table 1 and S2 File), and we computed the ranking score for the associated parameters. We then ranked the parameters for the top 2 rules (Rules 1 and 2), for the top 3 rules (Rules 1, 2 and 3), and so forth, and observed how the ranking score of the parameters evolved as we considered a growing number of rules ( Fig 3A). There was a clear separation in the ranking scores of a small number of parameters from the remaining parameters ( Fig 3A). Indeed, the three highest ranked parameters, XRI@nadh_c, ASN@pi_m, and TPI@t3p_c, were consistent for a large number of considered rules. This result suggested that it would be sufficient to constrain a combination of the ranges of the three parameters to ensure the negative control of HXK over XTR.

Qualitative dependency of negative control on top 3 significant parameters
We constructed a subspace of parameters by constraining the range of the top significant parameter, XRI@nadh_c, according to Rule 1 (Table 1), while the other parameters were unconstrained and could take any value between 0 and 1. Within this subspace, there were 53% of parameter sets giving rise to the negative control of HXK over XTR, i.e., PI=0.53 (Fig 3B, top). In such a way, we constrained the ranges of XRI@nadh_c based on the remaining top 10 ranked rules, and we analyzed how these ranges affected PI ( Fig   3B). There was a clear qualitative relationship between the ranges of XRI@nadh_c and PI.
Indeed, the values of PI ranged from 0.55 (Rule 2) up to 0.57 (Rules 8 and 9) for low values of XRI@nadh_c, whereas they were as low as 0.46 for middle range values of this parameter ( Fig 3B). We repeated this analysis for ASN@pi_m and TPI@t2p_c, and for higher values of these two parameters, PI was as high as 0.62 (ASN@pi_m) and 0.51 This analysis suggested that a subspace defined by constraining XRI@nadh_c to low values and ASN@pi_m and TPI@t2p_c to high values was likely to have a high PI.

Constraining top 3 significant parameters ensures the negative control of HXK over XTR
To combine the distributions of top 3 parameters that ensure a high PI in an unbiased way, we performed another parameter classification (Methods). The parameter classification algorithm inferred 66 rules on these three parameters, and the top rule enclosed 9389 samples with PI of 0.73 (S2 File). The PI value of 0.73 was close to the maximal PI value of 0.78, which was computed for the rules formed with all parameters.
As expected, the ranges of the three parameters defined by the top rule ( Fig 4C) were consistent with the analysis presented in the previous section.
We proceeded with the validation of the ranges of the top 3 parameters on a new population of models. We imposed the ranges of the top 3 parameters derived from the top rule of the parameter classification and generated a population of 100'000 models (Methods). We then computed the control coefficients of the top enzymes over XTR ( Fig   4). The control coefficient "#$ #%& was distinctively negative with a mean value of -0.09 ( Fig   4C), and its distribution was clearly shifted toward negative values compared to that of the original population of models ( Fig 1B). More than 72% of models had negative values of "#$ #%& compared to 47% in the original population of models. The value of PI of 0.72 obtained from the validation set was strikingly close to the predicted value of 0.73 from the second tree training. ASN@pi_m, and TPI@t2p_c that determined the negative control of HXK over XTR.

Robust significant parameters for multiple concentrations
In Metabolic Control Analysis (MCA), it is considered that the control coefficients depend only on elasticities, however this holds only when the reactions are irreversible and there are no conserved moieties. It has been shown that metabolite concentrations affect displacements of reactions from thermodynamic equilibrium, which in turn influence the control over fluxes and concentrations in the network (35,36). We hence investigated the possibility to find significant parameters that ensure a well-determined control over XTR for different metabolite concentration configurations. In a similar manner, we could cross sample the metabolite concentrations space and use the proposed method to identify regions in the space of thermodynamic displacement from the equilibrium that give rise a well-determined control over XTR.
We constructed two populations of 200'000 kinetic models for each of the two extreme metabolite concentration vectors, Extreme1 and Extreme2 (Methods). Overall, together with 200'000 parameter sets previously computed for the reference metabolite concentration (Reference), we had 600'000 parameter sets for parameter classification.

Significant parameters for negative control of HXK over XTR for three concentrations
For each of three populations of models, we preselected parameters based on the Fisher's linear discriminant score, and we performed the parameter classification to find parameter ranges that guarantee a negative control of HXK over XTR (S2 File). We used the inferred rules from the three parameter classifications to rank the top parameters over the three metabolite concentration vectors (Methods). Interestingly, the top seven parameters from the Reference case remained in the group of top seven parameters over the three cases (Fig 3 and S6 Fig). Moreover, the top two parameters (XRI@nadh_c, ASN@pi_m) from the Reference case (Fig 3) were the top two also for the Extreme1 and To refine the distributions of top 3 parameters for each of the three metabolite concentrations we performed additional three parameter classifications (Methods).
However, the refined distributions of top parameters that ensure a negative control of HXT over XTR might or might not coincide for the three concentrations. Therefore, to reconcile the parameter distributions for the three cases, we used the parameter sets defined by the top 3 rules for each of these cases as input for an additional parameter classification (Methods). The top rule obtained from this parameter classification enclosed 11'801 out of 600'000 parameter sets, and 70.9% of these models had a negative control of HXK over XTR (Table 2). We imposed the robust ranges for the top 3 parameters from Table 2 and generated a population of 100'000 models for each of the three metabolite concentration vectors, a total of 300'000 models. We then computed the control coefficients of HXK over XTR, and a significant improvement of PI was obtained for all three metabolite concentration vectors. Indeed, the Reference, the Extreme1 and the Extreme2 cases, had 72%, 72% and 67% of models with a negative "#$ #%& , respectively (compared to the 47%, 46% and 39% of models with unconstrained parameters). The average PI (0.703) over three populations of models was remarkably close to the predicted PI (0.709) ( Table 2).

Significant parameters for positive control of HXK over XTR for three concentrations
We repeated the procedure from the previous section using the same set of top 3 parameters, but we imposed a positive control of HXK over XTR as the objective for the parameter classification. The top inferred rule enclosed a significantly higher number of models (67482) compared to the case with a negative control of HXK over XTR (11801) ( Table 2). Assuming that the parameter space was sampled uniformly, this also suggested that the parameter subspace that ensures a positive control of HXK over XTR was larger than the one that ensures a negative control. Interestingly, there were no overlaps between parameter ranges for a positive and a negative control ( Table 2). The predicted PI for a positive control (0.776) was also higher than the one predicted (0.709) for the negative control.
We generated a population of 100'000 models for each of the three metabolite concentration vectors by imposing the robust distributions of the top 3 parameters ensuring a positive control ( These results showed that few parameters determine whether the control of HXT over XTR is negative or positive, meaning that the operational states of few enzymes are vital to responses of the metabolic network upon perturbations.

Improved parameter classification through reassignment
We found no rule with PI equal to 1 in performed parameter classification studies. This suggested that the parameter subspaces leading to a negative and a positive control of HXK over XTR were not distinctly separated. To improve the parameter classification for the problems where the separation between the classes is fuzzy, we propose to employ the k-nearest neighbors (k-NN) algorithm (Methods). The k-NN algorithm allows us to identify the parameter sets from one class that are surrounded by the parameter sets of the other class and reassign them to the latter. In the context of finding parameter values that give rise to a certain property, this means that the parameter classification algorithm will find only those parameter sets that are surrounded by a majority of the parameter sets of the same class. This way, the separation between the classes will be increased at the expense of neglecting parameter sets from the regions with a heavy overlap of the classes.
We reconsidered the classification for parameters determining a negative control of HXK over XTR in the Reference case, and we applied the k-NN algorithm with k=10 over the set of initial 200'000 parameters in order to find the surrounding for each of parameter sets, and to perform the reassignment (Methods). If in the group of 10 closest neighbors of a parameter set the percentage of parameter sets from the same class was less than a reassignment threshold, r, then that parameter set was reassigned. We performed two parameter classification studies for two different reassignment thresholds, r, of 30% and 50% (Methods).
We found that as the reassignment threshold was increasing the tree training algorithm was inferring a smaller number of rules (73 for r = 30% versus 31 for r = 50%).
Furthermore, the inferred rules were enclosing a smaller number of parameter sets for higher values of r, i.e., for r = 30% and 50%, the top rules enclosed respectively 13427 and 1339 parameter sets (Fig 5 and S2 File). In contrast, the obtained PIs, were higher for r = 50% than for r = 30% (Fig 5). For example, PI of the top rule for r = 50% was 0.83, whereas the one for r = 30% was 0.73 (Fig 5 and S2 File). A comparison between the original method with preselection, which is identical to the reassignment method with r = 0% (corresponding to no reassignment), and the reassignment methods for r = 30% and 50% showed a general tendency of the latter for obtaining rules with improved PI and that enclose a smaller number of parameter sets (Fig 5). The rules are obtained with: the original method with preselection (red diamonds), the reassignment method with 30% threshold (blue crosses) and the reassignment method with 50% threshold (green asterisks).
We also tested the reassignment procedure for parameters determining a positive control of HXK over XTR in the case of the reference metabolite concentration with k=10 and r = 60%. The classification algorithm inferred 19 rules with PIs ranging from 0.75 to 0.90.
The rules were defined by only 28 parameters (S2 File). The top rule enclosed 1711 parameter sets with PI of 0.90, and it was defined by 6 parameters.
To validate the proposed improvement to the parameter classification, we imposed the distributions of the parameters defined by the top rules for the negative control case with r=50%, and for the positive control case with r=60% (S2 File). We generated for each study a population of 100'000 models, and we computed the control coefficients in the network.
In the case of negative control, the distribution of the control coefficient "#$ #%& was biased toward negative values with mean -0.13 (Fig 6A and 6B). More than 79% of the computed control coefficients "#$ #%& were negative. (Table 4). Similarly, in the case of positive control, the distribution of the control coefficient "#$ #%& was shifted toward positive values with the mean of 0.21 and a remarkable PI of 0.89 (Fig 6D and 6F, and Table 4).
For the negative and positive cases, the top rules were defined by 6 parameters each, where three parameters, XRI@nadh_c, ASN@pi_m, and TPI@t3p_c, were common for both cases (Fig 6C and 6E). These three parameters were also ranked as the top 3 parameters in the parameter classification with the original algorithm (Fig 4). This result clearly demonstrated that the reassignment procedure allows for more precise identification of the subspaces leading to a desired control of HXK over XTR. We observed improvement of both PI and the mean "#$ #%& value compared to the results obtained with the unaltered parameter classification algorithm.  iSCHRUNK also allows us to combine several requirements simultaneously. For example, we can use iSCHRUNK to identify and quantify the parameters that simultaneously satisfy conditions on "#$ #%& > 0 and that enforce a desired level of yield or specific productivity of some compounds of interest. Given that desired responses are feasible, we can perform this kind of analysis for an arbitrary number of requirements.
iSCHRUNK lends itself to a broad range of applications in systems biology regarding both the analysis of metabolism and the design and control of the desired behavior of metabolic networks. It allows us to analyze the relationships between the inferred parameter ranges and the measurements acquired on the actual biological system, and, consequently, to create hypotheses regarding the operating states of enzymes. For example, a follow-up experiment testing ambiguous control of HXK over XTR was performed in (30), and it was shown that HXK2 deletion improves xylose uptake rate.
Based on this experimental observation, one can hypothesize the operating ranges of top enzymes such as ATP synthase (ASN), triose phosphate isomerase (TPI) or xylose reductase (XRI). In principle, iSCHRUNK can provide to experimentalists information about saturations of all enzymes in the network. Information on enzyme properties and their operating ranges can be used as guidance to design the desired metabolic behavior in applications ranging from sustainable production of biochemicals to medicine.

Identifying significant parameters that determine studied properties
The computational method for characterization and reduction of uncertainty, iSCHRUNK, was proposed in (16). iSCHRUNK involves a set of successive computational procedures that can help us to ascertain and quantify the kinetic parameters that correspond to a given physiology. iSCHRUNK can be used with any method that generates populations of kinetic models describing given physiology such as ensemble modeling (23) or ORACLE (3,4,8,10,11,28,29), and can be used to find parameters that give rise to a wide variety of properties. Here, we modified iSCHRUNK to identify distribution of kinetic parameters that determine the sign in ambiguous distributions of control coefficients as follows ( Fig   7A): (TCA), electron transport chain (ETC) and XR/XDH xylose assimilation pathway (30,37). Based on the physiological information on the cellular compartmentalization the intracellular metabolites were categorized as cytosolic or mitochondrial, and the extracellular metabolites were modeled as well. We integrated the thermodynamic constraints based on the information about the Gibbs free energies of reactions (38)(39)(40)(41) together with the fermentation data from Miskovic et al. (30), and we then performed the Thermodynamics-based Flux balance Analysis (TFA) (4, 13-15, 42, 43) to compute the thermodynamically consistent steady-state flux (S8 File).

II.
We sampled the space of metabolite concentrations that is consistent with: (i) the directionalities of the steady-state flux obtained in step I; and (ii) the available observations of metabolite concentration ranges (4,8).  (10). Without prior information, we sampled σA values between 0 (non-saturation) and 1 (full saturation). Otherwise, we performed the stratified sampling where we imposed the σA distributions obtained from the classification algorithm in Step VII (Fig 7A and 7B). An alternative to sampling σA values would be to sample the enzyme states (26).

IV.
We verified the local stability of the steady-state (10), and we rejected the kinetic parameters corresponding to unstable steady states and the ones that are not consistent with the experimentally observed data and literature.

V.
In this step, we analyze whether or not the studied property is satisfied. If yes then we proceed to step VII, otherwise we perform the parameter classification in Step VI We then verified if the control of HXK over XTR was well determined. We defined the control of an enzyme over the analyzed quantity as being well determined if 50% of control coefficients around the mean control coefficient had the same sign. For example, in the population of the control coefficients of XTR with respect to xylulokinase (XK) all the samples between the 1 st and the 3 rd quartile were negative (Fig 1A), and hence we considered that XK had a well-determined negative control over XTR. In contrast HXK, ATPM and NDR had an ambiguous control over XTR ( Fig   1A). If HXK had well-determined control over XTR, we proceeded to Step VII.
Otherwise, we went to Step VI.

VI.
We fed back to the classification algorithm the population of the analyzed control coefficient from Step V together with the corresponding values of the parameters (degree of saturation of the enzyme active site σA ) from Step III. The classification problem was defined to find the ranges of the σA values (and consequently the ranges of the corresponding Km values) that determine the sign, positive or negative, of the analyzed control coefficient. We solved this parameter classification problem using the CART algorithm (31, 32) from the MATLAB software package.
We then used the output of the parameter classification, the distributions of σA, for the sampling in Step IIIb (Fig 7A). More details about the parameter classification are presented in the next section.

VII.
In this step, we can postulate hypotheses and design systems biology strategies. In Step V we entered an iterative loop for identifying the ranges of σA (or equivalently Km) for which the analyzed control coefficients were well determined (Fig 7). The iteration started by passing the invalidated σA values from this step to the classification algorithm in Step VI. We then used the refined σA distributions from Step VI in the sampling procedure in Step III. Next, the refined samples of σA were next checked for consistency in Step IV, and finally, we constructed a new population of control coefficients in Step V and verified it. At each iteration, the σA values (Km values) that reduced the ambiguity in the population of the analyzed control coefficients were refined and used for stratified sampling in Step III.

Parameter classification
We carried out the parameter classification in several steps (Fig 7B). We first removed from the consideration the parameters that were not affecting the control over the analyzed flux. We then used the CART algorithm with the preselected parameters for three populations of kinetic models where each population was computed with a different metabolite concentration vector (see Step II of the framework discussed above). In the third step, we ranked the parameters over three concentrations, and we chose the top parameters to continue. We next refined the distributions of the top parameters for each concentration individually, and we then used this information to determine the consistent distributions of top parameters over all concentrations. We detail the parameter classification steps below.

Preselect parameters
Our preliminary results indicated that only a subset of kinetic parameters affected the sign of the analyzed control coefficient. The reduction in the parameter space was in agreement with our previous study (16), and inspired us to assess which parameters had a negligible effect on the computed control coefficients, to discard them, and then to proceed with the parameter classification. The benefits of preselecting the parameters are twofold. First, applying a computationally inexpensive method for preselecting the parameters and then using the CART algorithm on the reduced space can significantly reduce computational requirements of iSCHRUNK. Second, the parameters with a negligible effect on the control coefficients can introduce a bias in the estimates of key parameters. We can eliminate this bias by discarding the irrelevant parameters.
We used the Fisher's Linear Discriminant score (33, 34) to preselect the parameters: where E and F denote the mean values of the parameter populations that result in negative (or positive) and non-negative (or non-positive) control coefficients, while E F and F F are the corresponding variances. The higher S was, the larger was the influence of the analyzed parameter in discriminating between a positive and a negative control. We ranked the parameters according to this score, and we kept the parameters whose scores were at least 1% of the highest obtained score.

Train classification tree and rank inferred classification rules
For each of the three metabolite concentration vectors, we trained a classification tree.
The classification algorithm inferred classification rules based on the values of the preselected σA parameters and the outcomes, e.g., negative and non-negative control coefficients, obtained with these σA parameters. Each rule corresponds to a set of inequalities defined on different parameters. Thus, a rule is a hypercube in the parameter space. The number of inferred rules depends on the properties of the parameter space to be classified and also on the number of parameter sets that are used to train the algorithm.
In order to prevent the overfitting, we fixed to 200 the minimal number of parameter sets that the algorithm can use to construct a rule (16,32,51). The rules defined by a large number of parameter sets are "more certain". Besides, assuming that we sampled the parameter space uniformly, the "more certain" rules will likely enclose a larger volume of the parameter space with the well-determined control. Therefore, for each metabolite concentration vector, we ranked the inferred classification rules according to the number of parameter sets they contained.

Rank parameters across classification rules and over all concentrations
To rank the parameters of the models obtained for a concentration vector such as reference concentration vector or one of the extreme concentration vectors, we defined the following score for parameter : . We used this score to rank the parameters; the higher the score was for a parameter, the higher was its ranking.

Determine robust distributions of top parameters over all concentrations
The refined distributions of top parameters that correspond to a well-determined control might well mismatch among the three cases (Reference, Extreme1 and Extreme2).
Therefore, some parameter values can correspond to a well-determined control for one metabolite concentration vector and to an ambiguous control for the other metabolite concentrations. To obtain the consistent distributions of the top parameters over all concentrations in an unbiased way we performed the third tree training as follows.
For each of the three metabolite concentrations, we took as the input to parameter classification the parameter sets whose ranges of top parameters were defined according to top 3 rules. We used in parameter classification the parameter sets from the subspace of top rules as these parameter sets are likely to have well-determined control at least for one of the concentrations. We then verified for each top parameter set if it corresponds to a well-determined control for the three metabolite concentration vectors. If a parameter set corresponded to a well-determined control for the three concentrations (S12 Fig, red stars), we considered it consistent; otherwise, it was considered inconsistent (S12 Fig, blue and yellow stars). We fed this information as the second input to the classification algorithm and performed the training.
The obtained consistent distributions of top parameters over the three cases were used in Step III to perform a stratified sampling.

Reassignment procedure for improved tree training
In the cases when the space of parameter sets leading to a negative and the one leading to a positive control over analyzed quantities are overlapping, the separation between parameter classes is fuzzy. To enhance the separation between the classes, we propose here utilization of the k-nearest neighbors (k-NN) algorithm in the parameter classification as follows (52).
For each of the parameter vectors, we first assessed whether or not they were determining, e.g., a negative control, and we assigned them to two distinct sets. The first set, SN, contained parameter vectors that gave rise to a negative control, whereas the second set, SP, contained the ones that gave rise to a non-negative control. We then ran the k-nearest neighbors (k-NN) algorithm, and for each parameter vector from the set SN, we computed how many out of its k-nearest neighbors belonged to the same set (SN). For each of these parameter vectors, if the percentage of k-nearest neighbors that belonged to the set SN was higher than a pre-specified reassignment threshold, r, we then retained that vector in the set SN. For instance, for r = 50%, if more than 50% of k-nearest neighbors of the analyzed parameter set belonged to the set SN, that parameter set remained in the set SN. Otherwise, we re-assigned that parameter vector to the set SP.
With the proposed reassignment procedure, we emphasized the regions of the parameter space that have a higher proportion of parameter vectors belonging to the set SN.
The reassignment procedure introduced two new parameters: the reassignment threshold, r, and the number of nearest neighbors, k. The values of r were chosen on the basis of the initial, unbiased, sampling that was performed in Step III. Specifically, from the initial sampling we could assess the average percentage of SN parameter vectors in the set of all vectors. We then set r to be a larger than the average percentage so that the parameter classification algorithm could identify the regions in the parameter space with the above than average proportion of SN vectors. Assuming that the parameter space was sampled uniformly, we use the parameter k to choose the larger or smaller part of the parameter space around the analyzed parameter vector for a possible reassignment. Very large values of k are not recommended as the reassignment procedure would consider the overall parameter space and no samples would be retained in the set SN as r is chosen to be larger than the average percent of SN vectors in the overall set of parameter vectors.

Bayesian inference and parameter classification
Bayesian inference relies on use of Bayes theorem to compute the conditional distribution of a parameter vector given observed data x: For this type of studies, the ABC rejection algorithm (54) can be used as follows. First, the prior distribution of kinetic parameters is generated using the ORACLE framework or any other method that uses Monte Carlo sampling of uncertain parameters for constructing populations of kinetic models (3-5, 8, 9, 11, 19-26). The corresponding control coefficients are next computed, and the parameter classification algorithm is then used to discard parameter vectors from the prior that gave rise to ambiguous control over analyzed quantities. As a result, the retained sampled are distributed according to the approximate posterior distribution of kinetic parameters that give rise to welldetermined control over analyzed quantities.

Conclusions
Monte Carlo sampling techniques have traditionally been used for analysis of the metabolic flux space of networks, and recently they are extensively employed to address the uncertainty in parameters of kinetic models. As the size of the models and complexity of studies increases, sampling a kinetic space becomes increasingly difficult and even intractable. We have recently proposed a methodology that allows us to identify relevant kinetic parameters that correspond to the observed physiology (16). One of the key results of that work was that a small set of parameters corresponding to a few enzymes is sufficient to characterize a physiology. This finding has a significant impact as it will allow sampling of a reduced kinetic space, thus alleviating issues with high computational requirements of sampling.
In this work, we modified iSCHRUNK to find the parameters and their values that give rise to a chosen property. As the studied property we choose the control of hexokinase over xylose uptake. Our study showed that it is possible to identify only a few parameters whose values can make a control coefficient to be negative or positive irrespectively of the values of the other parameters in the model. This information is crucial for biotechnology studies where living cells need to be engineered for improved performance, or for drug discovery studies where, e.g., we want to overproduce a compound toxic to a pathogen.
In summary, the proposed method can be used for a wide gamut of applications in systems biology. It can be used in conjunction with multiple criteria, which can be both quantitative and qualitative, and can be applied not only to identify distributions of kinetic parameters but also to determine distributions of the metabolic fluxes and metabolite concentrations satisfying given requirements.        for ' = 0, the whole range, i.e., ' ∈ (0,1), of a parameter was considered, and the corresponding PI was calculated. For ' = 1, PI was calculated for a fixed value ' = 1.