Assessment of FBA Based Gene Essentiality Analysis in Cancer with a Fast Context-Specific Network Reconstruction Method

Motivation: Gene Essentiality Analysis based on Flux Balance Analysis (FBA-based GEA) is a promising tool for the identification of novel metabolic therapeutic targets in cancer. The reconstruction of cancer-specific metabolic networks, typically based on gene expression data, constitutes a sensitive step in this approach. However, to our knowledge, no extensive assessment of the influence of the reconstruction process on the obtained results has been carried out to date. Results: In this article, we study context-specific networks and their FBA-based GEA results for the identification of cancer-specific metabolic essential genes. To that end, we used gene expression datasets from the Cancer Cell Line Encyclopedia (CCLE), evaluating the results obtained in 174 cancer cell lines. In order to more clearly observe the effect of cancer-specific expression data, we repeated the same analysis using randomly generated expression patterns. Our computational analysis identified some essential genes that are fairly common in the reconstructions derived from both gene expression and randomly generated data. However, we also found a subset of essential genes, though of limited size, that are very rare in the randomly generated networks but recurrent in the sample-derived networks, and that would presumably constitute relevant drug targets for further analysis. In addition, we compared the in silico results to high-throughput gene silencing experiments from Project Achilles, with conflicting results, which leads us to raise several questions, particularly regarding the strong influence of the selected biomass reaction on the obtained results. Notwithstanding, using previous literature in cancer research, we evaluated the most relevant of our targets in three different cancer cell lines, two derived from Glioblastoma Multiforme and one from Non-Small Cell Lung Cancer, finding that some of the predictions are on the right track.

To properly define v_i^max for each reaction, we perform a Flux Variability Analysis (FVA) [3] under constraints (1)-(3). Uptake reaction bounds from the growth medium under consideration are included in Eq. 1. We also limit the maximum flux through any reaction for which no bound is given to 1000 before applying FVA, in order to avoid unbounded fluxes. We also define a continuous variable z_i for each reaction, bounded between 0 and 1 (Eq. 4), which may force a minimum flux through its associated reaction v_i (Eq. 5). δ is a strictly positive constant with a maximum value of 1 that fixes the lower bound on v_i as a fraction of v_i^max, in relation to the value of z_i. The inclusion of v_i^max in Eq. 5, as calculated by FVA, allows us to set an activation threshold independent of the stoichiometric representation. We remark that this set of variables is continuous, as in [4], and not binary, as in a number of previous works [5,6].
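As a minimal sketch of this FVA step on a toy two-reaction network (the bounds are hypothetical, and scipy's linprog stands in for a dedicated FBA solver), v_i^max could be computed as:

```python
import numpy as np
from scipy.optimize import linprog

# Toy network: one metabolite A, two irreversible reactions
#   r1: -> A   (uptake; bound taken from the growth medium, here 10)
#   r2: A ->   (no bound given, so capped at 1000 before FVA)
S = np.array([[1.0, -1.0]])              # 1 metabolite x 2 reactions
bounds = [(0.0, 10.0), (0.0, 1000.0)]    # hypothetical bounds for illustration

def fva_max(i):
    """Maximise v_i subject to S v = 0 and the flux bounds (linprog
    minimises, so we negate the objective)."""
    c = np.zeros(S.shape[1])
    c[i] = -1.0
    res = linprog(c, A_eq=S, b_eq=np.zeros(S.shape[0]),
                  bounds=bounds, method="highs")
    return -res.fun

v_max = [fva_max(i) for i in range(S.shape[1])]  # both limited by the uptake
```

Here the steady-state constraint forces v_1 = v_2, so both maxima are set by the 10-unit uptake bound rather than the 1000-unit default cap.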
Our objective is to minimize the number of reactions in L while maximizing those in H. For that, our objective function minimizes the sum of fluxes through reactions belonging to L with a weight W_L, as well as the flux through reactions in M with a weight W_M, while maximizing the number of reactions in H using the z variables with a weight W_H (Eq. 6). The term δ·v_i^max in Eq. 6 allows us to avoid the flux bias introduced by the specific stoichiometric representation of reactions. Note that blocked reactions (v_i^max = 0) are removed and can be left out of Eq. 6. Different criteria to establish these weights are discussed in the Results section of the main paper.
min  W_L · Σ_{i∈L} v_i/(δ·v_i^max) + W_M · Σ_{i∈M} v_i/(δ·v_i^max) − W_H · Σ_{i∈H} z_i   (6)

As noted above, it is common to set z_i as a binary variable, but relaxing that constraint, as done here, achieves the same desired "flux diversification" effect [4]. Minimizing the sum of fluxes for L and M is not the same as minimizing the number of reactions in L and M, but it allows a linear formulation of the problem without negatively influencing the quality of the final solution. Overall, with these features, we avoid a mixed binary formulation, which is harder to solve because of the integrality constraints on some of the variables [7].
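Under our reading of the normalisation described above (each L/M flux weighted by the inverse of δ·v_i^max, with blocked reactions left out of Eq. 6), the objective coefficients could be assembled as in this sketch; the tuple-based reaction encoding is purely illustrative:

```python
def objective_coefficients(reactions, w_l, w_m, w_h, delta):
    """Build (flux-coefficients, z-coefficients) for the minimisation in
    Eq. 6, assuming each L/M flux is normalised by delta * v_max (our
    reading of the normalisation term).

    reactions: list of (set_label, v_max) tuples, label in {"H", "M", "L"}.
    """
    c_v, c_z = [], []
    for label, v_max in reactions:
        if v_max == 0:                 # blocked reaction: left out of Eq. 6
            c_v.append(0.0)
            c_z.append(0.0)
            continue
        if label == "L":
            c_v.append(w_l / (delta * v_max))
        elif label == "M":
            c_v.append(w_m / (delta * v_max))
        else:                          # H: no flux penalty
            c_v.append(0.0)
        # Maximising z for H reactions means a negative cost in a minimisation.
        c_z.append(-w_h if label == "H" else 0.0)
    return c_v, c_z
```

Note how the normalisation makes the penalty on an L or M reaction independent of how large its FVA bound happens to be, which is the stated purpose of the δ·v_i^max term.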
Since we have split the reversible reactions into two irreversible steps, but have added no constraint guaranteeing that only one of them is active at a time, solving this problem (Eq. 6 subject to Eqs. 1-5) will give us a solution where all forward and backward steps of reversible reactions in H are active, even if their net flux (v_f − v_b) is zero. Note that this does not occur with reversible reactions in L or M, because minimizing the sum of fluxes already enforces that reversible reactions are used, if necessary, in only one direction. For this reason, we need an iterative procedure that determines whether these reversible reactions in H can indeed be included in the reconstructed network.
On the other hand, the solution resulting from this step directly provides us with the subset of irreversible reactions from H that will be involved in our final reconstruction. For this reason, the flux of irreversible reactions from H that have not been activated in this first step is set to zero for the rest of the iterative process (Eq. 7).
Overall, this first step provides a first draft network D that will be expanded in the next steps. Reversible reactions in H with net flux equal to zero cannot be directly included in D and require further analysis to evaluate their presence in the final reconstruction.

Iterative process for reversible reactions in H
The aim of this iterative process is to determine which reversible reactions in H will be part of the final reconstructed network, in particular those with net flux equal to zero in the previous step. During the iterative process, we will gradually increase the number of reactions included in our draft network D. In each iteration, we will set the W_L and W_M penalties of those reactions already included in the solution in previous iterations to zero, as once a reaction is included in the draft, there is no need to penalize it further. Similarly, we will set the W_H bonus of reversible reactions in H already included in the draft from previous iterations to zero. Note that the W_H bonus of irreversible reactions in H is kept to guide the addition of reactions to D. These variations lead to a new objective function, represented by Eq. (8).
In order to evaluate whether a reversible reaction from H, currently not included in D, must be added to the reconstruction, we need to solve the linear program defined by Eq. 8 subject to Eqs. 1-5 and 7 in two different scenarios: one with flux equal to zero in the forward direction, and the other with flux equal to zero in the backward direction. If v_b > 0 in the first scenario and/or v_f > 0 in the second scenario, then this reversible reaction takes part in the final reconstruction, together with any additional reactions from the sets H, M and L required to operate at steady state. This step may also add other reversible reactions from H currently not in D, whose individual analysis will therefore no longer be required. In case v_b = 0 in the first scenario and v_f = 0 in the second scenario, the reaction is discarded from the final reconstruction. We will refer to this process as Iteration A.
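The two-scenario test of Iteration A can be sketched as follows, with solve_lp standing in for a hypothetical solver of Eq. 8 s.t. Eqs. 1-5 and 7 that returns the optimal flux of the non-blocked step of the tested reaction:

```python
def iteration_a(rev_H_candidates, solve_lp):
    """Sketch of Iteration A with a hypothetical solver interface:
    solve_lp(rxn, blocked=...) solves the LP with the given direction of
    rxn fixed to zero and returns the flux of the opposite step."""
    keep, discard = [], []
    for rxn in rev_H_candidates:
        v_b = solve_lp(rxn, blocked="forward")   # scenario 1: v_f = 0
        v_f = solve_lp(rxn, blocked="backward")  # scenario 2: v_b = 0
        # The reaction enters the reconstruction if it can carry flux in at
        # least one direction; otherwise it is discarded.
        (keep if v_b > 0 or v_f > 0 else discard).append(rxn)
    return keep, discard
```

This makes explicit why the general strategy needs two LP solves per remaining reversible H reaction, which motivates the reduced-cost shortcut described next.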
The strategy described above, though general, may require a large number of linear programs, as we need to individually check each reversible reaction. In order to reduce computation time, we introduce an intermediate algorithmic step, based on the concept of reduced cost from linear programming, which allows us to minimize the number of linear programs to be solved. Full details are provided below.

Efficient implementation of iterative process for reversible reactions in H
If we knew in which direction each reversible reaction was going to operate in a possible solution, we could block the reactions in the opposite direction and solve the previous problem to recover that solution. However, as this is not the case, we will make a guess and then use linear programming theory to improve upon it (see Figure A1 and Figure A2). Note that in Figure A2 the forward direction of reactions r3 and r4 has been chosen opposite to that in Figure A1. Applying Iteration B would activate all the reactions in Figure A2 if the pair (r1, r2) or (r3, r4) is assigned to the set J. If that is not the case, each reversible reaction would need to be checked individually using Iteration A.
We will first solve the linear program defined by Eq. (8) subject to Eqs. (1)-(5) and (7) in two different scenarios. In particular, for all the reversible reactions in H not included in D, we will first set their fluxes to zero in one direction (v_b = 0) and later in the other (v_f = 0), similarly to what is done in the FastCC algorithm [4]. In addition, we will relax the bounds in the other direction, as shown in Eq. (9). The solution to this linear program may add new reactions to the draft network D (see Figure A1), but the solution might have been better had we selected some reversible reactions in the opposite direction.
Here, we can make use of linear programming theory to improve upon our solution and eventually reduce the number of reactions that will need to be checked individually. Specifically, we will make use of the concept of reduced cost of a variable. This value indicates how much the objective value would theoretically change if we modify the value of a variable by one unit.
It is important to clarify that, for the determination of reduced costs, it was assumed that the proposed linear problem is solved handling the bounds on variables implicitly. In fact, most available linear programming solvers implicitly handle variable bounds, meaning that those bound constraints are not explicitly added to the constraint matrix. Under these circumstances, the non-basic variables are no longer necessarily zero and their reduced cost can take any real value.
For readers unfamiliar with linear programming, variables are classified as basic or non-basic. Non-basic variables are independent variables and their value is set equal to their upper or their lower bound. By contrast, basic variables are dependent variables and their value is obtained by solving the corresponding system of equations [7].
In the optimal solution, the reduced cost of basic variables is zero, while it is usually nonzero for non-basic variables; only when the problem has alternative optimal solutions may non-basic variables also have a reduced cost of zero. For a minimization problem, the reduced cost of non-basic variables at their lower bound is non-negative, and that of non-basic variables at their upper bound is non-positive. These reduced costs can also be interpreted as the shadow prices of the lower and upper bound constraints, respectively, on these variables.
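A small scipy example illustrates these sign conventions: in the minimisation below, x2 ends non-basic at its lower bound with a positive reduced cost, which the HiGHS backend reports through res.lower.marginals (available in scipy >= 1.7).

```python
from scipy.optimize import linprog

# min x1 + 2*x2  s.t.  x1 + x2 = 1,  0 <= x1 <= 2,  0 <= x2 <= 1
res = linprog(c=[1.0, 2.0], A_eq=[[1.0, 1.0]], b_eq=[1.0],
              bounds=[(0.0, 2.0), (0.0, 1.0)], method="highs")

# Optimum: x = (1, 0). x1 is basic, so its reduced cost is zero; x2 is
# non-basic at its lower bound and, for a minimisation, its reduced cost is
# positive: raising x2's lower bound would worsen (increase) the objective.
print(res.x)                # optimum at x1 = 1, x2 = 0
print(res.lower.marginals)  # reduced costs w.r.t. the lower bounds
```

The marginals are exactly the shadow prices of the bound constraints mentioned above: the partial derivative of the optimal objective with respect to each bound.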
With the solution to our modified problem in hand, we focus on the reduced costs of the z_b variables associated with the v_b variables for which we have relaxed the lower bounds. If their reduced cost is positive, a decrease in their value implies a reduction in the objective function value. In this case, if we allow a small negative value for z_b, the corresponding v_b would be able to take a negative value (see Eq. (9)). This may be sufficient to activate another reaction from H, allowing another z variable to take a positive value and, thus, improving the objective function value. If their reduced cost is zero and they are non-basic variables, there are alternative optimal solutions, implying that a small change in the lower bound of that variable could lead to a different optimal solution; therefore, we also need to look at these variables.
The z_b variables (and their associated backward reactions) that are non-basic and have a non-negative reduced cost are stored in the set J. Then, the backward reactions in J are fixed to zero, and a positive flux through their associated forward reactions is enabled, as shown in Eqs. (10) and (11).
Therefore, we solve the following optimization problem: Eq. (8) subject to Eqs. (1)-(5), (7) and (10)-(11). We then repeat this process starting with the reactions in the opposite direction, that is, switching f and b in Eqs. (9)-(11). The whole procedure is repeated until no new reaction from H is added to the network. We refer to this procedure as Iteration B (see Figure A2).
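The outer loop of Iteration B can be sketched as follows; solve_restricted_lp is a hypothetical wrapper around Eq. (8) s.t. Eqs. (1)-(5), (7), (10)-(11) that returns the set of H reactions active under the current direction choices:

```python
def iteration_b(solve_restricted_lp, initial_draft):
    """Sketch of Iteration B's outer loop (hypothetical solver interface):
    repeatedly solve the direction-restricted LP and stop as soon as an
    iteration adds no new H reaction to the draft network."""
    draft = set(initial_draft)
    while True:
        new = solve_restricted_lp(draft) - draft  # newly activated H reactions
        if not new:
            return draft
        draft |= new
```

Termination is guaranteed because the draft only grows and the set of H reactions is finite.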
Once Iteration B has ended, we may have reactions in J that are not included in D. However, some of them could still belong in the reconstruction: the reason they were not included during Iteration B is that only a subset of them should have been reversed. Thus, the final step is to apply the procedure described in Iteration A to those reactions that remain in J but not in D.

Reaction classification
Reactions are classified as highly, medium or lowly expressed using gene-protein-reaction rules and the gene expression classification [8] as mentioned in the main paper. These rules establish a relationship between the enzymes that catalyze each reaction and the genes that code for those enzymes. A reaction may be catalyzed by a single enzyme, different isozymes or a protein complex. Reactions having OR rules associated can be catalyzed by different isozymes, while those having AND rules involve protein complexes.
If a reaction is associated to a single gene, it is classified as H, M or L depending on the classification of the corresponding gene. If it involves an OR rule, it is classified as H if one of the genes is classified as H. On the contrary, it is classified as L if all the genes are classified as L. If a reaction involves an AND rule, it is classified as H if all the genes are classified as H, while as L if any of the genes is classified as L.
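This classification logic can be sketched as follows, assuming a simplified GPR encoding (a single gene name, or an ('OR'/'AND', genes) tuple); the encoding is illustrative, not the format used in our implementation:

```python
def classify_reaction(gpr, gene_class):
    """Classify a reaction as 'H', 'M' or 'L' from its gene classes.

    gpr: either a single gene name, ('OR', genes) for isozymes, or
    ('AND', genes) for a protein complex (hypothetical format).
    gene_class: dict mapping gene name -> 'H', 'M' or 'L'.
    """
    if isinstance(gpr, str):                       # single gene
        return gene_class[gpr]
    op, genes = gpr
    classes = [gene_class[g] for g in genes]
    if op == "OR":                                 # isozymes: best gene wins
        if "H" in classes:
            return "H"
        if all(c == "L" for c in classes):
            return "L"
    else:                                          # "AND": worst gene wins
        if all(c == "H" for c in classes):
            return "H"
        if "L" in classes:
            return "L"
    return "M"                                     # everything else
```

The OR rule behaves like a maximum over gene classes and the AND rule like a minimum, which matches the intuition that any isozyme suffices while every complex subunit is required.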
Those reactions for which no gene expression is available or that are not related to any gene (e.g. spontaneous reactions) are classified as medium expressed.

Gene Essentiality Analysis
A gene whose knock-out decreases the maximum flux through the biomass reaction below 10^-6 is considered essential. Based on empirical computational evidence, we found that this threshold has limited influence on the results.
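A minimal sketch of this criterion, with biomass_flux_after_knockout standing in for a hypothetical FBA wrapper that removes the reactions disabled by the gene's GPR rule and returns the maximum biomass flux of the reduced model:

```python
ESSENTIALITY_THRESHOLD = 1e-6  # biomass flux below this => gene is essential

def essential_genes(model_genes, biomass_flux_after_knockout):
    """Return the genes whose knock-out drops the maximum biomass flux
    below the essentiality threshold. biomass_flux_after_knockout is a
    hypothetical FBA wrapper, not part of the original text."""
    return [g for g in model_genes
            if biomass_flux_after_knockout(g) < ESSENTIALITY_THRESHOLD]
```

In practice, libraries such as cobrapy provide an equivalent single-gene-deletion routine, but the decision rule reduces to this threshold comparison.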

iMAT reconstruction coverage comparison
There are several reconstruction algorithms proposed in the literature, such as MBA [5], MIRAGE [9], GIMME [10], GIM3E [11], INIT [12], iMAT [6] and MADE [13]. Most of these algorithms rely on Mixed Integer Linear Programming (MILP) to select the active reactions for the contextualized reconstruction according to some predefined optimality criteria. Usually, each reaction is assigned a score reflecting how likely it is to be present in the reconstruction under consideration. This score can be obtained from genomics, transcriptomics, proteomics or other sources of data, or even from a combination of them. All of these reconstruction methods have their place, as some may be better suited than others for the integration of a particular type of data. Likewise, the results obtained from each of them are not easily comparable, as each one aims for slightly different things.
In our case, our algorithm is closest (in the way it treats the inclusion of reactions) to iMAT [6]. The original optimization model proposed by iMAT for network reconstruction is as follows:

max  Σ_{i∈H} (y_i^+ + y_i^-) + Σ_{i∈L} y_i^+

subject to:

S·v = 0
v_i^min ≤ v_i ≤ v_i^max
v_i + y_i^+ (v_i^min − ε) ≥ v_i^min,  for i ∈ H
v_i + y_i^- (v_i^max + ε) ≤ v_i^max,  for i ∈ H
v_i^min (1 − y_i^+) ≤ v_i ≤ v_i^max (1 − y_i^+),  for i ∈ L
y_i^+, y_i^- ∈ {0, 1}

This optimization problem aims to strike a balance between the inclusion of H reactions and the exclusion of L reactions. It does not directly consider the inclusion of reactions that are neither H nor L. Our algorithm includes a term to control the inclusion of those reactions (the M set) and promote compact solutions.
Aware of the possible existence of alternative solutions, iMAT proposes an iterative solution scheme to assign a confidence score for the inclusion or exclusion of each reaction. This step, however, is not compulsory for our task, and we will only solve the optimization problem once.
iMAT requires neither a definition of the growth medium nor the specification of a biomass function, although they can be included in the formulation if desired [6]. When comparing iMAT to our algorithm, we set the same medium conditions and require a minimum biomass production, in order to place both methods under the same conditions. Another modification we introduce to iMAT concerns the definition of ε, which is selected depending on the reaction bounds, as we do in our algorithm.
In particular, we set ε_i = δ·v_i^max, with δ = 0.10, the same value we use in our algorithm. When retrieving the solution, we consider as active any reaction with a flux greater than 10^-8. If ε_i is lower than 10^-8, we set it to that quantity instead.
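Assuming the bound-dependent rule described above (ε_i proportional to v_i^max, floored at the 10^-8 activity cutoff; this is our reading of the text), the threshold selection might look like:

```python
def imat_epsilon(v_max, delta=0.10, floor=1e-8):
    """Per-reaction activation threshold for the modified iMAT run
    (assumed rule: epsilon scales with the FVA bound and is never
    allowed below the 1e-8 activity cutoff)."""
    return max(delta * v_max, floor)
```

This keeps the activation threshold proportional to each reaction's attainable flux, mirroring the role of δ·v_i^max in our own formulation.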
We use CPLEX to solve the optimization problem. In addition, we stop the optimization process when the relative optimality gap (the gap between the relaxed problem solution and the best integer solution found) falls below 0.5%, as closing the gap completely can be extremely memory- and time-consuming while adding little to the solution quality.
We used this implementation of iMAT and our algorithm to reconstruct cancer networks with gene expression obtained from the Cancer Cell Line Encyclopedia (CCLE), using Recon2 as the base network, as in the main paper. The median time to obtain a network with iMAT was 57.15 seconds, while it was 2.14, 2.94 and 1.81 seconds for schemas 1, 2 and 3 of our algorithm, respectively. This is a clear improvement in the computation time needed to obtain a reconstruction.
To assess how similar the reconstructions produced by each algorithm are, we can compare the percentage of H and L reactions included in each case. In Figure B, we plot the difference between the percentage of L reactions included by iMAT and by our algorithm using schema 2, versus the corresponding difference for H reactions. We chose schema 2 because it is the one that behaves most similarly to iMAT. It can be observed that the number of L reactions included is very similar, while the number of H reactions included by our algorithm is somewhat lower. Overall, both methods obtain similar reconstructions in terms of the number of H and L reactions they include; thus, we consider our algorithm a valid tool for the task at hand. Regarding the other schemas, with schema 3 our algorithm includes more H and L reactions than iMAT, and with schema 1 it includes fewer, as expected.

Model parameters and reconstruction
Here, we replicate the same analysis done in the main paper using Recon1 instead of Recon2. The computation time is lower than in the case of Recon2, mainly because Recon1 is smaller (see Figure C).
In addition to replicating the analysis with Recon1, we also substituted the values of some parameters one by one in order to test the robustness of our approach. In particular, starting from the values fixed in the main paper (a maximum flux bound of 10^3 and δ = 0.10), we replaced δ = 0.10 by δ = 0.01. Whenever we changed one of these parameters, we left the others at the values used in the main paper. Furthermore, we also tested a very general medium in substitution of the RPMI-1640 medium used in the main paper. This general medium was composed of those metabolites with an annotated input reaction in the reference network. We provide the mean computation times and H and L coverage in Table A, finding a robust solution for each schema. As partially expected, the major difference was found when using the general growth medium.

Gene essentiality analysis
The selected CCLE cell lines include samples from 20 different cancer types: bladder, bone sarcoma, breast, colon, endometrial, esophageal, glioblastoma, gastric, leukemia, liver, lung mesothelioma, lung NSCLC, lung SCLC, melanoma, multiple myeloma, ovarian, pancreas, prostate, renal cell carcinoma and soft tissue sarcoma. As stated in the main text, we found a small number of genes that appear exclusively in samples from one cancer type. With the settings used in the main paper, schema 1 yields a total of 22 genes that appear in only one cancer type, schema 2 yields 21 and schema 3 yields 6. For the parameter settings used in Table A, a similar conclusion can be reached (see Table B). Figures E, F, G and H are analogous to Figure 3 in the main text, but focus on the GBM samples present in CCLE, the U251 samples from GEO, the U87 samples from GEO and the A549 samples from GEO, respectively. See S1 Table for a list of the samples used.

Comparison to high-throughput gene silencing experiments
We replicate here the same analysis done in the main paper for comparing the obtained set of essential genes with the reported scores from Project Achilles, using Recon1 instead of Recon2. The results are very similar to those obtained in the main paper. For other parameter settings, the median values of the KS-test p-values and the percentage of essential genes with a negative score can be found in Table C.
As mentioned in the main text, we tried to leverage the frequency of appearance of the computationally obtained essential genes in the randomly reconstructed networks and in the networks reconstructed from samples of the same cell-line type with respect to the experimental essentiality data, but we did not extract any conclusive results, most likely due to the small sample size. However, when we merged the results for the different CCLE samples obtained from the reconstructions with schema 3, we observed the following: focusing only on essential genes with a frequency of appearance in the randomly reconstructed networks lower than a given threshold, the percentage of calculated essential genes with a negative Achilles-based score underwent a moderate increase as we decreased this threshold (Figure K and Figure L).

Figure K.
Percentage of computationally obtained essential genes satisfying the Achilles-based threshold of essentiality as a function of their frequency of appearance in randomly reconstructed networks. For a given threshold value (x-axis), we took all the calculated essential genes that appeared with a lower frequency than that threshold in the networks reconstructed from random expression, and calculated the percentage of genes associated with a negative Achilles-based score (y-axis). The networks from which the essential genes were calculated were reconstructed from the CCLE samples using Recon2 as the base network and the default set of parameters used in the main text.

Figure L.
Percentage of computationally obtained essential genes satisfying the Achilles-based threshold of essentiality as a function of their frequency of appearance in randomly reconstructed networks. For a given threshold value (x-axis), we took all the calculated essential genes that appeared with a lower frequency than that threshold in the networks reconstructed from random expression, and calculated the percentage of genes associated with a negative Achilles-based score (y-axis). The networks from which the essential genes were calculated were reconstructed from the CCLE samples using Recon1 as the base network and the default set of parameters used in the main text.

Figure M shows the results of 10 experiments, each with a different random alteration of the biomass reaction. As can be observed, the results are extremely similar, which indicates that the specific values of the coefficients in the biomass reaction are not of critical importance in this case.

Random alterations of the biomass reaction
Next, we decided to randomly set to zero some of the coefficients of the metabolites participating in the biomass reaction. In each experiment, we selected a random number between 0.05 and 0.50 and used it as the probability with which each coefficient would be set to zero (again, we excluded from these modifications the metabolites h2o[c], atp[c], adp[c], h[c] and pi[c], linked to ATP maintenance). Figure N shows the results of 10 experiments following this strategy, where there are cases with clearly worsened results. This shows that changing the composition of the biomass reaction does indeed have a strong effect on the results: dropping some components clearly worsened the results, while dropping others had no appreciable consequence. In order to improve the results, it is likely that new metabolites will need to be included in the biomass reaction equation. Note that a similar result was found when we determined the proportion of essential genes satisfying the Achilles-based threshold of essentiality, which reinforces the relevance of the selection of the biomass equation.

Figure M.
KS test p-value results after altering some of the coefficients of the metabolites participating in the biomass reaction by up to ±50%. The experiment consisted of 10 different random variations. The base network used was Recon2 and the networks were reconstructed from the CCLE samples with matched Achilles experiments using our algorithm with schema 3 and the default parameters used in the main paper.