Integer Programming-Based Method for Designing Synthetic Metabolic Networks by Minimum Reaction Insertion in a Boolean Model

In this paper, we consider the Minimum Reaction Insertion (MRI) problem for finding the minimum number of additional reactions from a reference metabolic network to a host metabolic network so that a target compound becomes producible in the revised host metabolic network in a Boolean model. Although a similar problem for larger networks is solvable in a flux balance analysis (FBA)-based model, the solution of the FBA-based model tends to include more reactions than that of the Boolean model. However, solving MRI using the Boolean model is computationally more expensive than using the FBA-based model since the Boolean model needs more integer variables. Therefore, in this study, to solve MRI for larger networks in the Boolean model, we have developed an efficient Integer Programming formalization method in which the number of integer variables is reduced by the notion of feedback vertex set and minimal valid assignment. As a result of computer experiments conducted using the data of metabolic networks of E. coli and reference networks downloaded from the Kyoto Encyclopedia of Genes and Genomes (KEGG) database, we have found that the developed method can appropriately solve MRI in the Boolean model and is applicable to large scale-networks for which an exhaustive search does not work. We have also compared the developed method with the existing connectivity-based methods and FBA-based methods, and show the difference between the solutions of our method and the existing methods. A theoretical analysis of MRI is also conducted, and the NP-completeness of MRI is proved in the Boolean model. Our developed software is available at “http://sunflower.kuicr.kyoto-u.ac.jp/~rogi/minRect/minRect.html.”


Introduction
Metabolism is one of the most important biological processes in organisms. Relations between reactions and chemicals in the metabolism are often represented by metabolic networks [1]. Since many of these metabolic processes can produce commodity and specialty chemicals, the manipulation of metabolisms has been extensively studied in the field of metabolic engineering. One of the most successful applications of metabolic engineering is production of industrially valuable products using a microbial host with recombinant technologies [2][3][4]. Techniques for production of desired chemicals using a microbial host are roughly classified into the following three types [5]: (a)combinations of existing pathways, (b)engineering of existing pathways, and (c) de novo pathway design. In (a), partial pathways can be recruited from independent organisms and co-localized in a single host. For example, 1,3-propanediol is synthesized by Nakamura et al. in which pathways from Saccharomyces cerevisiae and Klebsiella pneumonia were assembled in E. coli [6] and another example is the production of artemisinic acid, a precursor to the plant-based anti-malarial drug artemisinin in yeast [7]. In (b), new non-natural chemicals can be produced by engineering existing routes [8,9]. (c) is realized by the combination of (a) and (b), that is, the recruitment of partial pathways from different species and the use of engineered enzymes for extensions of pathways. It is to be noted that (a) focuses on the topology of the given metabolic networks, while (b) and (c) utilize the information of the structures of chemicals as well.
The ''pathway prediction system'' (PPS) of the University of Minnesota Biocatalysis and Biodegradation Database (UM-BBD) is designed to predict routes for the biodegradation of xenobiotic compounds [10][11][12]. From a set of previously defined biotransformation rules, the PPS guides the user through potential pathways one step at a time, requiring the selection of a new target metabolite at each step [5]. Biochemical Network Integrated Computational Explorer (BNICE) is a computational framework for generating every possible biochemical reaction from a given set of enzyme reaction rules and source or target compounds [13,14].
However, since the number of predicted novel pathways is huge in many cases, some prioritization is necessary to choose the most promiscuous ones [15]. For example, one measure of such prioritization is to minimize the number of enzymatic steps [16].
In the type (a) problem, it seems that there are three major models for judging the producibility of target compounds, that is, connectivity model, flow model, and Boolean model. For each of them, Minimum Reaction Insertion (MRI) problem can be defined for finding the minimum number of additional reactions from a reference metabolic network to a host metabolic network so that a target compound becomes producible in the revised host metabolic network. In the connectivity model such as [16], the producibility of target compounds is judged by the connectivity between the source and the target compounds. After the source and the target compounds are connected by the additional reactions, the producibility is often evaluated by such a flow model as flux balance analysis (FBA) or an elementary mode [17], in which the sum of incoming flows must be equal to the sum of outgoing flows for each compound and the ratio of the amount of substrates and products must satisfy the coefficients given in each chemical reaction formula. In the Boolean model, each reaction occurs if all its substrates are producible whereas each compound is producible if one of its producing reactions occurs [18]. The source compounds are called seeds and the producible compounds are called the scope of the seed. In this model, a Boolean function of ''AND'' is attached to each reaction node and ''OR'' is attached to each compound node in the metabolic networks.
For example, suppose that there is a chemical reaction ''A+BRC+D'', where A and B are called substrates whereas C and D are called products. In the connectivity model, either A or B is necessary to produce C and D, whereas both A and B are necessary for the Boolean model. In the flow model including FBA, in addition to the condition that both A and B must exist, both C and D are necessary to be consumed by other reactions. Thus, each model outputs a different solution for producing desired compounds.
From the view point of computational complexity, although the connectivity model is very simple and then applicable even to very large networks, its logical analysis ability is not strong since it cannot detect the lack of necessary substrates. The good point of the flow model is its computational efficiency since problems in the flow model can often be formalized by linear programming, for which there exist polynomial time algorithms [19]. However, these polynomial time algorithms are not applicable for MRI since discrete variables are necessary for representing additional reactions, although it is solvable by mixed integer programming [20].
Although the computational time of the FBA-based method for MRI is very small and scalable for genome-scale metabolic reconstruction [20], Boolean methods also have attractive features and are expected to complement the FBA-based method. Indeed, for the analysis of metabolic networks, many studies have been conducted to develop Boolean models. For example, Lemke et al. [21] studied the effect of deletion of each enzyme in the metabolic network of a Boolean model, and Smart et al. [22] considered almost the same problem from the viewpoint of the Boolean aspect of the flux balance model. Li et al. [23] and Sridhar et al. [24] have developed methods for finding a set of enzymes whose inhibition stops the production of the target compounds with a minimum elimination of the non-target compounds. Lee et al. [25] and Takemoto et al. [26] estimated the distribution of the size of the effect of the deletions of enzymes using a branching process.
As for the shortcoming of the FBA-based method for MRI, it tends to be considerably affected by the redundancy of the given metabolic network since each node is affected not only by the incoming flows but also by the outgoing flows. For example, suppose that a metabolic network of Fig. 1 (A) is given, where circles and rectangles represent compounds and reactions respectively. In order to produce the target compound from the source compounds, {R1, R2, R3, R4} is necessary in the flow model including FBA, whereas either {R1, R4} or {R1, R2, R3} is sufficient for the Boolean model. Moreover, in the metabolic network of Fig. 1 (B), {R1,R2,R3} is necessary for FBA whereas {R2} is sufficient for the Boolean model. Therefore, in this research, we study the problem of designing a pathway for producing target compounds in metabolic networks of the Boolean model since its logical analysis ability is more stable than that of the FBA, particularly when the flexible parts of the metabolic networks are large. Our approach is based on (a), that is, the combination of existing pathways. In our problem setting, a base metabolic network of a host organism, which we call the host network, is given; it cannot produce the target compound in its initial form. However, an integrated metabolic network of many other organisms are given as the reference network from which we should find the minimum number of additional reactions so that the target compound becomes producible. We prove that this problem is NP-complete.
Although both the FBA-based model and the Boolean model for MRI are considered to be NP-complete, the former is likely to have a faster exponential time algorithm than the latter since FBA has fewer integer variables. Although the computational complexity of the Boolean model is large, we develop an efficient method based on integer programming (IP) [27,28], which is often used as a formalization of NP-complete problems and there is an efficient  free solver for IP called CPLEX [29]. We also conducted four computer experiments in which the metabolic network of E. coli is used as the host network and the reference pathway of the KEGG database [30] is used as the reference network, and propanol, butanol, sedoheptulose 7-phosphate, and maleic acid are used as the target compound in each experiment. The results of the experiments show that (1) our IP-based method can appropriately solve MRI in the Boolean model; (2) solutions of MRI in the Boolean model are more suitable than those by connectivity based methods; (3) our IP-based method is applicable to large-scale networks where an exhaustive search does not work; and (4) solutions of MRI in the Boolean model tend to be smaller than those in the FBA-based model based on [31]. Our developed software is available at ''http://sunflower.kuicr.kyoto-u.ac.jp/ ,rogi/minRect/minRect.html''.

Problem Definition
In this section, the main problem Minimum Reaction Insertion (MRI) in a Boolean model is first explained with an example and then mathematical formalization is described.
Suppose that a metabolic network shown in Fig. 2 is given, where each rectangle (resp., circle) corresponds to a reaction (resp., chemical compound). For example, v r4 is a reaction, its substrates are v c3 and v c5 and its products are v c6 and v c7 . Black circles v c1 and v c2 denote the source nodes and are assumed to be provided by the external environment. On the other hand, a gray circle v c7 represents a target compound and the purpose of MRI is to make the target compound producible. However, initially only the host network, which is shown by the dotted rectangle, is available. Since only v c1 ,v c2 ,v c3 and v r1 are included in the host network, the target compound v c7 is not producible. Instead the entire network is called the reference network and reactions not included in the host network can be added later. In MRI, the minimum number of additional reactions should be determined to make the target compound producible. In this example, the addition of fv r2 ,v r3 ,v r4 g is the optimal solution. The difficult point of MRI is how to deal with the effect of cycles. In the example of Fig. 2, the addition of fv r4 ,v r5 g looks like the optimal solution. However, this solution is not appropriate since it relies on the cycle consisting of fv c6 ,v r5 ,v c5 ,v r4 g and v c7 is not producible unless the initial amount of v c6 is sufficiently large.
MRI is mathematically defined as follows: A metabolic network can be represented by a directed graph G~(V ,E). There are two types of node sets V c and V r , where V c denotes a set of compound nodes and V r represents a set of reaction nodes. V~V c |V r and V c \V r~f g hold. The neighbors of compound nodes must be reaction nodes, and the neighbors of reaction nodes must be compound nodes. Let V s (V c be a set of source nodes and v t [V c be a target node. Source nodes have no incoming edges and correspond to the seed compounds of [18]. In this study, we assume that source nodes are producible at any time. Suppose that a host network G 1~( V 1 ,E 1 ) and a reference network G 2~( V 2 ,E 2 ) are given where G 1 and G 2 are metabolic networks, and G 1 is a subgraph of G 2 induced by V 1 . V 0 c (resp., V 0 r ) is a set of compound nodes (resp., reaction nodes) in V 2 {V 1 and is called the set of additional compound nodes (resp., additional reaction nodes).
Let V a (V 0 r be a set of additional reaction nodes. In the Boolean model, each node is assigned either ''0'' or ''1''. For a compound node, ''1'' means producible and ''0'' means not producible. As for a reaction node, ''1'' means active and ''0'' means inactive. Let A be such an assignment (that is A is a function from V to f0,1g). For each node v[V , we write v~0 (resp., v~1) if 0 (resp., 1) is assigned to v. A is called a valid assignment if the following conditions are satisfied: and u~1 holds for all u such that (u,v)[E. This implies that each reaction node corresponds to an ''AND'' node and each compound node corresponds to an ''OR'' node.
If G 2 has no directed cycles, a valid assignment is uniquely determined for each V a . However, if G 2 has a directed cycle, multiple valid assignments may exist. Let us call v i [V s and v j [V c {V s source connected if there is a directed path from v i to v j , and the values of the nodes included in the path are all 1. There exist valid assignments where the values of nodes in a directed cycle are 1 even if these nodes are not source connected. In order to avoid such a case, we use the notion of minimal valid assignment, which is similar to the notion of maximal valid assignment defined in [32]. A valid assignment A is called minimal if A is valid and fvjv~1,v[V g is minimal with respect to the inclusion relationships for sets. Now we define the Minimum Reaction Insertion as follows: , and a target compound v t .
N Output: A minimum cardinality set of V a for which v t~1 is satisfied in the minimal valid assignment of the induced subgraph of G 2 by V 1 |V 0 c |V a .
As mentioned in the section of Theoretical Results, a minimal valid assignment is uniquely determined if V a is given. However, solving MRI is not easy since the number of candidate V a is 2 jV 0 r j and MRI is proved to be NP-complete. Since utilizing software   packages of Integer Programming (IP) is efficient for solving NPcomplete problems, we develop a method of IP formalization for solving MRI. Since the computational time of the IP-based method is considered to be exponential in terms of the number of variables, it is important to develop an IP formalization of MRI with a small number of variables. To do so, our previously developed method for Minimum Reaction Cut (MRC) [32] may be useful although many modifications are necessary.
MRC is a problem to find a minimum set of reactions that interfere with the production of target compounds [32] and is known to be NP-complete. Let m (resp., n) be the number of compound (resp., reaction) nodes. If we use mzn time steps to calculate the maximal valid assignment in MRC, the number of variables in IP is O((mzn) 2 ). The feedback vertex set (FVS) is a node set whose removal makes a network cycle-free. In [32], we succeeded in reducing the number of variables to O(f (f zmzn)), where f is the size of the feedback vertex set and f is considerably smaller than m or n. If use of O((mzn) 2 ) variables is allowed in MRI, almost the same method as in MRC can be used. However, to reduce the number of variables in IP to O(f (f zmzn)), many modifications are necessary since minimal valid assignment and maximal valid assignment have different features.

Integer Programming-Based Method for Minimum Reaction Insertion
Here, we show IP formalization methods for MRI in the Boolean model. To apply IP, problems must be formalized to maximize or minimize a given objective function which is a linear function of integer variables and constraints must also be given as linear equations or inequations of integer variables. Suppose that the host network and the reference network are given as shown in Fig. 2. The simplest IP formalization IP-MRI-A for solving Minimum Reaction Insertion is as follows where the time step increases by 1 when the Boolean calculation is synchronously conducted for every node: Subject to TC7 mzn ð Þ~1 ð2Þ TXzFX~1 for any X where every variable takes either 0 or 1. v ri~1 (resp., v ri~0 ) at time step t is represented by TRi(t) = 1 (resp. FRi(t) = 1) and TRi(t)+FRi(t) = 1 holds for any i and t. For example, TR2(1) = 0 means that v r2~0 at time step 1, and FR2(1) = 1 automatically holds at the same time. In the implementation, FRi(t) is replaced with 1-TRi(t) to reduce the number of variables. Similarly, the values of compound nodes are represented by TCi(t) and FCi(t).
For a compound node with indegree 1, the value of the predecessor node is just copied. For example, since v c3 has only one predecessor v r1 , v c3 (tz1) is just copied from v r1 (t) as shown in (8). Similarly, v c4 (tz1) is just copied from v r2 (t) as shown in (9).
For a compound node with indegree more than 1, it is necessary to convert the ''_'' relation into linear equations or equations. (10) represents the Boolean relation , and then (10) is obtained.
As for the reaction nodes not included in the host network, TERi(t) and FERi(t) are used to represent whether v ri is activated. We use a virtual node v ei as one of the predecessors of v ri . Since v ri is represented by an AND node, v ei~0 keeps v ri inactive even if all other predecessors of v ri are 1. For example, v r2 in Fig. 2 has only one predecessor v c1 . However, since v r2 is not included in the host network and v ei~1 is necessary for v r2~1 , v r2 (tz1)~v c1 (t)^v e2 (t) must hold, and then (4) is obtained.
Since we assume minimal valid assignment, at t~0, the source compound nodes are assigned 1, but the other compound nodes and reaction nodes are assigned 0.
mzn is the largest number of time steps necessary for the 0-1 assignment to converge. (1) means that the number of additional reactions should be minimized. (2) means that the target compound v c7 should become 1 after the 0-1 assignment converges. (3)-(7) represent the constraints by v r1 to v r5 respectively. Note that v e1 does not exist since v r1 is included in the host network and then v r1~1 holds for any V a . (8)- (12) represent the constraints by v c3 to v c7 respectively. (13) represents that V a does not change by time transition. (14) means that v c1 and v c2 are source nodes. (15)- (16) represent that all nodes but source nodes are assigned 0 in the initial state. (17) means that ''T'' and ''F'' represent ''true (1)'' and ''false (0)'' respectively, and complement each other.
The above formalization can clearly solve MRI and obtain the correct solution V a~f v r2 ,v r3 ,v r4 g, however O((mzn) 2 ) variables are necessary. To reduce the number of variables, it is necessary to reduce the number of time steps. If time is not taken into account at all, the following inappropriate IP formalization IP-MRI-B is obtained.  In IP-MRI-B, the solution of IP is V a~f v r4 ,v r5 g since (v r1 , . . . ,v r5 ,v c1 , . . . ,v c7 )~ (1,0,0,1,1, 1,1,1,0,1,1,1) is a valid assignment and satisfies v c7~1 . Note that v r2 and v r3 are forced to be 0 since they are not included in either the host network or V a . Although it satisfies all constraints and jV a j is minimum, this assignment is not appropriate since fv r4 ,v c6 ,v r5 ,v c5 g forms a cycle and all of them are assigned 1 without the influence of source nodes. To avoid such an inappropriate assignment, it is necessary to consider minimal valid assignment with respect to the number of 1 s for each V a . As shown in the section of Theoretical Results, the minimal valid assignment is uniquely determined for each V a .
Thus, IP-MRI-A can solve MRI, but mzn time steps are necessary, while IP-MRI-B, which does not use the notion of time, cannot solve MRI. The feedback vertex set (FVS) is a set of nodes whose removal makes the network acyclic. Since IP-MRI-B can solve MRI if there is no cycle, it is reasonable to apply IP-MRI-B for the acyclic network obtained by the deletion of FVS and use the notion of time as in IP-MRI-A to nodes included in F based on the idea developed in [32].
In the improved method, IP-MRI-C, we firstly find an FVS F consisting of reaction nodes and then decompose each v ri [F into two nodes v ri and v si so that v ri has only in-edges and v si has only out-edges. For example, in the network of Fig. 2, since F~fv r4 g is a feedback vertex set, v r4 is decomposed into v r4 and v s4 as shown in Fig. 3. Furthermore, we put an additional constraint that v si (tz1)~v ri (t). The number of time steps of IP-MRI-C is f z1 while that of IP-MRI-A is mznz1, where f~jF j. Therefore, the numbers of variables in IP-MRI-C and IP-MRI-A are O(f (mznzf )) and O((mzn) 2 ) respectively. Since f is considerably smaller than mzn in most metabolic networks and the computational time of IP exponentially increases with the number of variables, we can expect a significant improvement from the view point of the computational time.
Although finding the minimum FVS is known to be NPcomplete, it is not necessary to use the minimum FVS in our problem setting. We use a simple greedy algorithm to choose FVS as follows: Procedure GreedyFVS(G~(V ,E)), where V~fv 1 , . . . ,v n g and E~fe 1 , . . . ,e m g for i~1 to n do v½i~0; for i~1 to m do e½i~0; i/1; while there exists e k [E such that e½k~0 v½i~1; if there exists e k~( v i ,v j ) such that e½k~0 and v½j~1 then E 0 /E{fe k g; Since the reaction nodes for FVS are chosen by a greedy algorithm, the size of FVS is not always optimal. However, it is important to note that even if the size of FVS is not optimal, the solution of MRI calculated by IP-MRI-C is always optimal. If there are multiple optimal solutions in MRI, there is a possibility that the solutions are different since IP outputs only one solution. However, it may be possible to enumerate all optimal solutions of MRI by iteratively solving IP with a constraint to avoid the already chosen solutions.
For example, IP-MRI-C for Fig. 2 is as follows, where v r4 is decomposed into v r4 and v s4 , and time step increases by 1 only when the value of v r4 is copied to v s4 .
Subject to TXzFX~1 for any X ð49Þ where each variable takes either 0 or 1.  (14), and (17), respectively. fv r4 g is chosen as a feedback vertex set, and then decomposed into v r4 and v s4 as shown in Fig. 3.
Note that the number of time steps is 2 = f +1, and TSR4(t) = 1 represents v s4 = 1. In (42)-(43), the constraints for v c6 and v c7 are represented by the variable corresponding to v s4 instead of that to v r4 . (45) represents that the time step increases by 1 when the value of v r4 is copied to v s4 in Fig. 3. (48) represents v s4 (0)~0 to obtain the minimal valid assignment.
Additionally, if we use the FVS-based method and no cycles are included in G 1 and G 2 , the number of necessary time steps is only one. For example, suppose that G 1 and G 2 are as shown in Fig. 4 (A). In this case, the correct solution of MRI is V a~f v r2 ,v r3 ,v r4 ,v r5 g. However, if we set TC1(0) = 1 and TC6(0) = 0, IP can output no solution since the condition TC6(0) = 1 is never satisfied. On the other hand, if we set TC1(0) = 1 and TC6(0) = 1, an inappropriate solution V a~f g is obtained by IP. To avoid such a case, in our method, if one of the predecessors of an additional reaction node v r is included in the host network, we decompose v r as if it were included in FVS. For example, in the network of Fig. 4 (A), v r2 is decomposed into v r2 and v s2 as shown in Fig. 4 (B) so that the values of the source nodes and the target node are calculated in different time steps.

Computer Experiments
We conducted computer experiments for solving MRI with data downloaded from the KEGG database. The experiment was conducted on a PC with an Intel(R) Xeon(R) 3.33 GHz CPU and 10 GB RAM having the SUSE Linux (version 12.2) operating system, where CPLEX (version 12.4.0.0) was used as the solver of integer programming.
In this study, a reference network consists of the central metabolism and the related modules necessary for producing the target compound. A map of the KEGG PATHWAY is a minimum unit, and three or four maps of the KEGG PATHWAY are chosen and integrated as the reference network in each of our experiments. As for species, a reference network includes the chemical reactions of all species, whereas the metabolic networks of E. coli are used for the host networks. The major difference between the pathway alignment methods by KEGG and our developed method is that our method is based on a Boolean model, whereas the pathway alignment methods consider only the topology of networks.
In synthetic biology, it is of great interest to construct a minimal genome that realizes the desired functions [33][34][35]. Since glycolysis, gluconeogenesis, citrate cycle and pentose phosphate pathway are considered to be essential even in artificial organisms, it is reasonable to assume that the host networks in the computer experiments have some of these pathways in one of the simplest organisms, E. coli. Because the purpose of this study is not focused on the reconstruction of genome-scale metabolic network model, but the design of a minimal genome in addition to the existing pathways to produce a desired compound, each reference network consists of the maps of the KEGG pathway located between the central metabolism and each target compound.
In the first computer experiment, the target compound is propanol (C00479 in KEGG ID), the host network is glycolysis and gluconeogenesis of E. coli (eco00010.xml), and the reference network covers glycolysis, gluconeogenesis and glycerolipid metabolism of other species (ko00010.xml and ko00561.xml). The numbers of compound and reaction nodes are 58 and 85, respectively, where 30 reactions are reversible. The source nodes are D-glucose (C00031), oxaloacetate (C00036), salicin (C01451), arbutin (C06186), UDP-glucose (C00029), acyl-CoA (C00040), and diglucosyl-diacylglycerol (C06040), which are represented by black circles in Fig. 5. It took 0.19 s to solve MRI. The obtained additional reactions are V a~f R01514, R01752, R01036, R01048, R02577, R02376g, where these reactions produce propanol from 3-phospho-D-glycerate (C00197) via glycerol (C00116) as shown in Fig. 5. Since 3-phospho-D-glycerate (C00197) is producible by glycolysis and gluconeogenesis of E. coli and works as a connection between glycolysis and glycerolipid metabolism, the obtained V a can be considered an appropriate solution of MRI.

Difference between Developed Model and Shortest Path-Based Model
To show the difference between the developed model and the shortest path-based models, we conducted the second experiment where PathComp of KEGG (''http://www.genome.jp/tools/ pathcomp/'') was used to calculate the solution of the shortest path-based model. In the experiment, the host network consists of glycolysis, gluconeogenesis and citrate cycle of E. coli (eco00010.xml and eco00020.xml), and the reference network consists of glycolysis, gluconeogenesis, citrate cycle and pentose phosphate pathway of other species (ko00010.xml, ko00020.xml and ko00030.xml). The numbers of compound and reaction nodes are 64 and 108, respectively, where 59 reactions are reversible. There are four source nodes, D-glucose(C00031), arbutin(C06186), salicin(C01451), and acetate (C00033), and the number of candidates for the additional reactions is 66. When the target compound is sedoheptulose 7-phosphate (C05382), as shown in Fig. 6, the solution of MRI is V a~f R01827, R01830g, where the substrates of R01827 are beta-D-fructose 6-phosphate (C05345) and D-erythrose 4-phosphate (C00279). It took 32.58 s to obtain the solution. Since D-erythrose 4-phosphate (C00279) is not included in the host network, it is necessary to add R01830 in which substrates are beta-D-fructose 6-phosphate (C05345) and D-glyceraldehyde 3-phosphate (C00118) and the products are Dxylulose 5-phosphate (C00231) and D-erythrose 4-phosphate (C00279). It is to be noted that both beta-D-fructose 6-phosphate (C05345) and D-glyceraldehyde 3-phosphate (C00118) are producible by the host network.
On the other hand, PathComp just connects the producible compounds and the target compound adds only R01827 since R01827 is adjacent to both beta-D-fructose 6-phosphate (C05345) and sedoheptulose 7-phosphate (C05382). However, it is clear that R01827 does not occur if D-erythrose 4-phosphate (C00279) does not exist. Thus the difference between the shortest path-based method and the developed method is that the developed method considers Boolean constraints for each reaction and compound whereas the shortest path-based method only considers the connectivity of nodes.

Scalability
Next, we conducted the third experiment to show the scalability of our method. The host network consists of the source nodes of glycolysis and gluconeogenesis of E. coli (eco00010.xml), that is, Dglucose(C00031), arbutin(C06186), salicin(C01451), oxaloacetate(C00036) and acetate (C00033). The reference network consists of glycolysis, gluconeogenesis, citrate cycle, pentose phosphate pathway and butanol metabolism of other species (ko00010.xml, ko00020.xml, ko00030.xml and ko00650.xml), where R01172 is treated as a reversible reaction. The target compound is butanol (C06142). The numbers of compound and reaction nodes are 93 and 150, respectively, where 87 reactions are reversible. It took 919.79 s (15m19s) for the developed method to solve MRI and the solution was V a~f R00235, R00238, R01977, R03027, R01171, R01172, R03545g. These seven reactions form a path from acetate to 1-butanol via acetyl-CoA, acetoacetyl-CoA, crotonoyl-CoA and butanoyl-CoA, which satisfies the Boolean constraints. Since the number of reactions in the reference network is 150, it is necessary to examine 150 C 7 cases if an exhaustive search is conducted. Since examining 150 C 7^2 :941|10 11 cases is almost impossible, it is seen that the IP-based method is useful for solving MRI, particularly when the given networks are not small.

Difference between Developed Model and FBA-Based Model
Finally, we conducted an experiment to show the difference between the developed model and the FBA-based model. We assume that the reference network consists of glycolysis, gluconeogenesis, citrate cycle, pentose phosphate pathway and butanol metabolism of other species (ko00010.xml, ko00020.xml, ko00030.xml and ko00650.xml), and the host network includes only one reaction R04394 between salicin (C01451) and salicin 6phosphate (C06188) as shown in Fig. 7. Therefore, the source node is only salicin (C01451). Note that reversible reactions are decomposed into two reactions, and denoted by P and Q. The target compound is maleic acid (C01384

Theoretical Results
Although solving IP is NP-complete, a problem that can be formalized as IP is not always NP-complete. Therefore, in the following paragraphs, we prove that MRI is NP-complete and show the appropriateness of formalizing MRI of the Boolean model as IP.
Theorem 1: Minimum Reaction Insertion is NPcomplete even when the maximum indegree and outdegree are bounded by 2.
Proof: Since the problem is clearly in NP, it suffices to show NPhardness. The proof is by a polynomial time reduction from minimum vertex cover (MVC), which is a problem for a given graph to find the minimum number of nodes so that each edge is incident to at least one of the selected nodes. For example, for the graph shown in Fig. 8 (A), fv 1 ,v 5 ,v 6 g is an optimal solution of MVC.
In the following paragraphs, we show that MVC for G has a solution of size z if and only if MRI has a solution in which jV a j~zz1 holds. When G has a vertex cover of size z, V a~f r i jv i is included in the vertex coverg|fr t g satisfies c t~1 in the minimal valid assignment and jV a j~zz1 holds. On the other hand, suppose that V a satisfies c t~1 in the minimal valid assignment and jV a j~zz1. Since r t [V a is necessary for c t~1 , z nodes are included in V a from fr 1 , . . . ,r n g. Since each c j must be 1 to satisfy r t~1 , at least one predecessor of each c j must be included in V a for each j. Since there is an edge between r i and c j if and only if v i is incident to e j , fv i jr i [V a g is a vertex cover of size z. Nodes whose degrees are more than 2 can be converted by the methods shown in Fig. 9.
Theorem 2: Given a host network, a reference network and a set of additional reactions, a minimal valid assignment is uniquely determined.
Proof: For any valid assignment A, the assignment obtained by assigning 0 to all nodes that are not source connected is also a valid assignment. On the other hand, for any valid assignment A, the assignment obtained by assigning 0 to a source connected node is not a valid assignment. Since source connected nodes V sc are uniquely determined for V a , V sc is a minimal valid assignment and uniquely determined.

Discussion
In this paper, we formalized an optimization problem MRI in a Boolean model with a notion of minimal valid assignment. We proved that MRI in the Boolean model is NP-complete and the minimal valid assignment is uniquely determined when V a is given. Since an exhaustive search cannot be used to solve MRI when the given networks are not small, we developed an IP-based method for MRI. To improve the scalability of the developed method, it is necessary to reduce the number of variables appearing in IP formalization since the computational time of IP is considered to be exponential to the number of variables. Although the simple IP formalization with the notion of time is useful for solving MRI, it needs O((mzn) 2 ) variables in IP formalization. If the notion of FVS is used, the number of necessary time steps reduces to f , where f denotes the size of FVS, and the number of variables in IP is O(f (mznzf )). Although the idea of using FVS is similar to [32], many modifications are necessary since the minimal valid assignment and the maximal valid assignment have many different properties.
We also conducted four computer experiments in which data were downloaded from the KEGG database, CPLEX was used as the IP solver, and propanol, butanol, sedoheptulose 7-phosphate, and maleic acid were used as the target compound for each experiment. The host network was a metabolic network of E. coli and the reference network of KEGG was used as the reference network. The results of the computer experiments confirmed the correctness and the scalability of the developed method, and the appropriateness of the problem setting of MRI.
An important advantage of our Boolean model is its capability of detecting the lack of substrates, whereas the connectivity-based methods cannot appropriately handle this point. An extended type of connectivity-based method is BNICE, which enumerates all possible pathways from the source nodes to the target compound, and uses thermodynamical feasibility and pathway length to evaluate each candidate pathway. In contrast, the developed method evaluates each candidate pathway based on the number of additional reactions. Another advantage of the developed model is its capability of handling branches and/or cycles in a pathway from the source compounds to the target compound, whereas BNICE considers only the non-branching paths. However, since BNICE nicely evaluates each pathway by the thermodynamic free energy of the included compounds and length, considering the thermodynamic free energy in a Boolean model represents an important direction of our future work.
It is to be noted that the solution of MRI in the FBA-based model is different from that in the Boolean model. In particular, if the reference network includes a large redundant part, the FBAbased model tends to output a larger solution than the Boolean model, although the FBA-based model is very fast when compared to the Boolean model. Therefore, one of our future works is to develop a hybrid method combining the FBA-based method and the Boolean-based method. Petri-net-based methods [36] are also interesting since they may extract the good points of both Booleanbased methods and FBA-based methods.