A Method for Finding Metabolic Pathways Using Atomic Group Tracking

A fundamental computational problem in metabolic engineering is to find pathways between compounds. Pathfinding methods using atom tracking have been widely used to find biochemically relevant pathways. However, these methods require the user to define the atoms to be tracked. This may lead to failing to predict the pathways that do not conserve the user-defined atoms. In this work, we propose a pathfinding method called AGPathFinder to find biochemically relevant metabolic pathways between two given compounds. In AGPathFinder, we find alternative pathways by tracking the movement of atomic groups through metabolic networks and use combined information of reaction thermodynamics and compound similarity to guide the search towards more feasible pathways and better performance. The experimental results show that atomic group tracking enables our method to find pathways without the need of defining the atoms to be tracked, avoid hub metabolites, and obtain biochemically meaningful pathways. Our results also demonstrate that atomic group tracking, when incorporated with combined information of reaction thermodynamics and compound similarity, improves the quality of the found pathways. In most cases, the average compound inclusion accuracy and reaction inclusion accuracy for the top resulting pathways of our method are around 0.90 and 0.70, respectively, which are better than those of the existing methods. Additionally, AGPathFinder provides the information of thermodynamic feasibility and compound similarity for the resulting pathways.


Introduction
Finding and analyzing metabolic pathways that may span multiple organisms helps biologists to understand the metabolism, reconstruct metabolic network and discover candidate pathways for synthesis of useful biomolecules [1,2]. The quantity and quality of metabolic data has greatly increased in the last decades [2], for instance, the metabolic databases KEGG (Kyoto Encyclopedia of Genes and Genomes) [3] and MetaCyc [4] had an exponential growth. Research on metabolic pathways on this vast quantity of metabolic data requires new computational methods in order to find and analyze biochemically relevant metabolic pathways [2]. a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 Such computational methods can be a powerful means in discovering novel or alternative metabolic pathways that could not have been found manually [2]. Therefore, it is important to utilize novel computation methods to search and analyze alternative metabolic pathways in genome-scale database.
The efforts on studying metabolic pathways can be divided into two complementary types [5], namely stoichiometric methods and graph-based pathfinding methods. Stoichiometric methods build stoichiometry-balanced optimization models based on integer linear programming (ILP) to search for the metabolic pathway that transforms a source metabolite to a target metabolite with high yield. The stoichiometric methods are well-defined in mathematics and enable biotechnological analysis of pathways to increase the yield of important metabolites [6]. A number of the stoichiometric methods, such as CFP [7][8][9], PathTracer [10], OptStrain [11], OptStoic [12], NCGA [13], and RetroPath [14], has been proposed.
Graph-based pathfinding methods find possible metabolic pathways converting a given start compound to a given target compound through the connectivity of the reactions and the compounds in the metabolic networks. Some commonly used graph-based methods search metabolic pathways based on machine learning [15,16], evolutionary algorithms [17,18], tailored heuristic search strategy [5,19,20], retrosynthetic model [21][22][23][24], minimized pathway switching [25], and subgraph extraction technique [26]. Graph-based pathfinding methods complement stoichiometric methods as they focus on different aspects of modeling and understanding metabolism [2,[27][28][29]. Most stoichiometric methods search the pathways that obey the pseudo steady-state constraint, and therefore require assigning the internal and external metabolites [2]. This may lead to failing to find those feasible biochemical pathways that do not obey the pseudo steady-state constraint. Moreover, how to accurately assign the external metabolites which are excluded from the pseudo steady-state constraint remains a challenge [2,27,30]. However, both types of methods are important ways for searching and analyzing metabolic pathways [2].
A significant feature of previous graph-based pathfinding methods is that these methods select reactions and compounds based on the connectivity. However, in most cases, chemical reactions usually contain cofactors and hub metabolites such as ATP, NAD, H 2 O and H + [31]. Such highly connected hub metabolites often occur in the shortest paths, and the shortest path between two compounds in metabolic networks is not always a biochemically meaningful pathway [32][33][34]. A possible solution to overcome the problem of hub metabolites is to remove hub metabolites from the metabolic network [19,35,36]. But this solution requires the user to have specialized knowledge and experience and to manually curate the networks. Moreover, if hub metabolites are removed, it is impossible to find the pathways synthetizing these compounds. Some methods were proposed to solve this problem by adding weights based on the degree of the nodes [37][38][39] or using structural similarity between compounds [40,41] to guide the search of pathways. However, this does not completely avoid spurious connections occurring in the found pathways.
By providing a specific mapping from the atom in the input compounds to the atom in the output compounds of a reaction, atom mapping data offers a systematic way for understanding biochemical reactions [2]. In the past few years, the quantity and quality of atom mapping information have been steadily increasing, with one of the main sources being the KEGG RPAIR database [3,42]. Recently, people use atom mapping data to avoid spurious connections when searching pathways [32,43]. Based on the observation that the same atom-mapping pattern between two compounds often appears in multiple reactions [32], some researchers [44,45] utilized atom mapping data to find metabolic pathways by allowing only connections through reactions where at least one atom is being transferred from the input to the output compounds. However, the pathways that conserve the atoms from start to target compounds will be more biochemically relevant [46]. Some pathfinding methods using atom tracking have been developed to find such pathways. ReTrace [30] and LPAT [2,46] use atom mapping information from the KEGG RPAIR database to search metabolic pathways that conserve at least a given number of atoms from the start to the target compounds, their experimental results showed that atom tracking significantly improves the performance of metabolic pathfinding. Different from the methods using atom mapping data from databases, MetaRoute [1,6] automatically computes atom mapping rules based on enzyme EC numbers and compound SMILES, and uses the computed atom mapping data to avoid finding pathways that lose all conserved atoms from the start compound. MetaRoute correctly returned textbook-like routes, e.g. it recovered a major part of glycolysis. RouteSearch [47] uses a branch-and-bound algorithm to compute the optimal metabolic pathways, where optimality is based on the number of reactions used, the provenance of the reactions and the atoms conserved by the route from the start to target compounds. RouteSearch successfully found the known pathways with a larger efficiency than previous methods.
Heath et al. [46] pointed out that atom tracking is a very important feature in finding meaningful metabolic pathways since it essentially excludes spurious connections and reactions that do not correspond to useful or real biochemical pathways. However, in order to track the movements of target atoms, the pathfinding methods using atom tracking require the user to define the specific atoms to be tracked in advance. This may lead to failing to predict the pathways that do not conserve the user-defined atoms.
A synopsis on the pathfinding tools for metabolic pathway is listed in Table 1.
In this article, we present a pathfinding method called AGPathFinder to find biochemically relevant metabolic pathways. Our method differs from the atom tracking methods by tracking the movement of atomic groups through metabolic networks and implementing a shortestpath-based algorithm that uses combined information of reaction thermodynamics and compound similarity both to direct its search for the pathways between two desired compounds and to rank the resulting pathways. Atomic group tracking enables our method to avoid hub metabolites and search pathways without requiring the user to define the atoms to be tracked. Meanwhile, atomic group tracking, when combined with the information of reaction thermodynamics and compound similarity, can further improve the quality of the found pathways. The experimental results show that AGPathFinder is capable of finding both known pathways and thermodynamically feasible alternative pathways. Compared with other previous methods, our method finds alternative pathways with a higher accuracy and lower error in genome-scale database.
The remaining of the article is organized as follows. Section "Method" introduces the weighted atomic group transfer graph and presents our method AGPathFinder. Section "Results" describes the experimental setup and study cases, compares the results with other existing methods. Section "Discussion and Conclusion" concludes the article.

Atomic group transfer graph
Zhou and Nakhleh argued that two atoms in a compound are considered to be in the same atomic group if they are linked by covalent bond(s) that does not break during the chemical reactions [45]. Accordingly, an atomic group is a group of atoms transferred between a substrate and a product in the reaction, where the covalent bonds between the atoms in the group do not break during the reaction. The size of an atomic group is determined by the number of atoms in the group. Due to the fact that any atom could be a member of an atomic group, we can use atomic groups, instead of specific atoms, as the targets and track the movements of atomic groups through metabolic networks to find biochemically relevant pathways. This will not require the user to define the specific atoms to be tracked and allows the user to find pathways without even knowing the atoms of the compound. Moreover, since the amount of chemical content is measured in terms of the number of functionally independent atomic groups instead of the absolute number of non-hydrogen atoms [45], the pathways that conserve at least one atomic group from the start to target compounds will be more biochemically relevant. During the pathway inference, a conserved atomic group in the pathway is a group of atoms transferred from start compound to current compound, where the covalent bonds between the atoms in the group do not break during the reactions in the pathway.
In this work, we use the atom mapping data of reactions in the KEGG RPAIR database to compute the atomic group transferred between reactants and products. Each KEGG RPAIR entry contains the structural information for each compound, an alignment mapping atoms between the two compounds, and a list of associated reactions [42,46]. The KEGG RPAIR data do not contain typical molecular symmetry information. If a compound is known to be symmetric, a new atom mapping entry can be generated to illustrate symmetry of the molecules [46]. When we need to process symmetry of the molecules, we only add those atom mapping entries explicitly appeared in the KEGG RPAIR data.
A chemical compound can be represented as an attributed relational graph K, whose set of nodes V(K) correspond to atoms and set of edges E(K) correspond to chemical bonds [49]. A Table 1. A synopsis on the pathfinding tools for metabolic pathway.
An atomic group transfer graph can be represented as a directed metabolite graph, whose nodes are compounds and edges represent reactions linking an input compound and an output compound. Each edge contains at least one atomic group transferred from the input compound to the output compound.  of compound C00065 and the bonds between these atoms form an atomic group that is transferred from C00065 to C00398 through R00674 and R00685.
During the pathway inference, we use the reactions and compounds that contain conserved atomic groups to construct an atomic group transfer graph from start to target compounds. Then we search biochemically relevant pathways that transfer the conserved atomic groups from start compound to target compound in the graph.

Weighting schemes
In addition to using atomic group tracking to find biochemically relevant metabolic pathways, we also introduce the weighting schemes based on the associated context-specific knowledge including reaction thermodynamics and structural similarity between reactant and product. We can use such weighting schemes to guide the search process towards more feasible pathways and better performance, and to find meaningful pathways even without the option of tracking atomic groups. In our weighting schemes, each edge in an atomic group transfer graph will be assigned with a weight that reflects the impact of the reaction thermodynamics and the structural similarity between reactant and product on the alternative pathways.

Thermodynamic information on reactions
Gibbs free energy is usually used to determine whether a reaction or metabolic pathway is thermodynamic feasible [50]. We use DG 0 r to denote the Gibbs free energy change of reactions in KEGG RPAIR database. The corresponding values of DG 0 r of the reactions are obtained from the literature [50], which are downloaded from "Group Contribution Data" in the table "Reaction Energies" at http://equilibrator1.milolab.webfactional.com/download. The value of DG 0 r of a reaction is an essential part of the edge weight and it also provides a means of ranking the results. For example, from Fig 2 we can see that R00674 and R00685 are represented as two edges of an atomic transfer graph. The values of DG 0 r of R00674 and R00685 are -64.5 and -19.9. These values of DG 0 r are used as a part of the weights for R00674 and R00685 (for more details see section "Weight computation"). In the process of finding candidate pathways in atomic group transfer graph, we can calculate the sum of the weights of all edges of each pathway from the start to target compound, and rank the pathway by the sum (for more details see section "Constructing atomic group transfer graph and finding candidate pathways"). User can analyze thermodynamic feasibility of each pathway by the values of DG 0 r of reactions. In this article, the values of DG 0 r of reactions under the conditions of pH = 7.0, ionic strength = 0.1, and T = 298.15K are downloaded from "Group Contribution Data" in the table "Reaction Energies" at http://equilibrator1.milolab.webfactional.com/download.

Compound similarity
In addition to reaction thermodynamics, the structural similarity between two compounds is widely used to measure the diversity of the chemical space and analyze the metabolic networks [20,41,51]. For example, the SMSD tool [51] has been applied to compute the structural similarity between two compounds. In this article, we use SMSD to compute the similarity scores between the input compounds and output compounds in all reactions in a pathway. This similarity score is used as a part of the edge weight to guide the search process, which will be further described in the section "Weight computation".

Weight computation
AGPathFinder uses the combined information of structural similarity between compounds and reaction thermodynamics to weight the edges, and AGPathFinder moves to the edges that are more thermodynamically favorable and/or link with more structurally similar nodes. Given an atomic group transfer graph G ag = (V ag , E ag ) with node set V ag and edge set E ag , nodes v i and v j 2V ag denote two compounds in G ag . An edge e ij 2E ag linking v i and v j represents a reaction r ij , where reaction r ij contains the atomic group transferred between compounds v i and v j . We represent the compound similarity between v i and v j by sim(v i ,v j ) and the DG 0 r of reaction r ij by fe(r ij ), and compute the weight W ij of edge e ij as follows: where α is a parameter adjusting relative weights of compound similarity and Gibbs free energy, and the constants 3200 and 10000 are used to normalize the value of fe(r ij ). In Eq (1), the value of sim(v i ,v j ) is between 0 and 1, and the value of fe(r ij ) downloaded from the table "Reaction Energies" at http://equilibrator1.milolab.webfactional.com/download is between 10194.7 and -2233.7. That is to say, the difference between sim(v i ,v j ) and fe(r ij ) is very large. The normalization of fe(r ij ) in Eq (1) adjusts the value of fe(r ij ) to the range [0.09663, 1.33947] and brings the values of sim(v i ,v j ) and fe(r ij ) into alignment. In Fig 2, according to Eq (1), when α = 0.5, the values of weight W ij for reactions R00674 and R00685 are 0.469275 and 0.259005 respectively; when α = 1, the weight W ij only depends on the similarity between v i and v j , and the values of W ij for R00674 and R00685 are 0.625 and 0.2 respectively; when α = 0, the weight W ij only depends on DG 0 r of reaction r ij , and the values of W ij for R00674 and R00685 become 0.31355 and 0.31801 respectively.

Constructing atomic group transfer graph and finding candidate pathways
To construct an atomic group transfer graph from the start compound to the target compound, we need to compute the information for the atomic group transferred from substrate to product through reaction. Given substrate G and product H in reaction R and a user-specified size of atomic group, the following CAGM algorithm finds all conserved atomic groups of the user-specified size or larger transferred from G to H through reaction R [Algorithm 1].
Algorithm 1: CAGM Input: substrate G in reaction R, product H in reaction R, conserved atomic group set R g of G from start compound, edge mapping h for reaction R, user-specified size L of atomic group; Output: conserved atomic group set S of H; subgraph M of H; Initially, if G is a start compound, then G is the only molecular in R g . Let h be an edge mapping from G to H. At the beginning, CAGM finds all mapping edges from R g to H by using h, and then uses these edges to construct subgraph M of H (lines 2-5). For each unvisited node m in M, the connected component MC containing m in M is determined by the depth-first search algorithm (lines 6-7), and each node in MC is then marked as visited (line 8). If the number of nodes in MC!L, then MC is added to S (lines 9-10). Repeat this procedure until all nodes in M are visited.
In the following we use an example to explain the algorithm in finding the conserved atomic group transferred from substrate to product through reaction R02722 in Fig 1. Example 2.1: Compound C00065 is the substrate G and compound C00078 is the product H. Let L = 2. The partition encircled with dotted line in C00065 is the conserved atomic group transferred from a start compound to C00065. This conserved atomic group constructs the conserved atomic group set R g of C00065. At the beginning, we find all mappings of the edges {e1,e2} of R g in C00078 by edge mapping h:{e1!f1, e2!f2, e3!f3, e4!f4, e5!f5}, and the resulting edge mappings are {e1!f1, e2!f2}. We then use f1 and f2 to construct a subgraph M of C00078. For each unvisited node m2{O1t,C2t,O3t} in M, we find the connected component MC containing m in M by a depth-first search algorithm, and mark each node of MC as visited. From Fig 1, we can see that the atom set {O1t,C2t,O3t}V(C00078) and the bond set {f1,f2}E(C00078) form the MC. It is obvious that the number of nodes in MC!2, thus MC is added to S, the algorithm terminates here since all nodes in M are visited.
The algorithm CAGM finds potential atomic groups transferred from substrates to products through reaction. Given start compound Sm and target compound Tm, the following CAGTG algorithm creates a weighted atomic group transfer graph G ag between Sm and Tm, and finds the top k-shortest paths C p with the smallest weight from Sm to Tm in G ag [Algorithm 2].
Algorithm 2: CAGTG Input: start compound Sm, target compound Tm, conserved atomic group set R g from start compound, Boolean vector ψ(Sc,Td), where Sc denotes compound similarity, Td denotes thermodynamic feasibility; Output: weighted atomic group transfer graph G ag between Sm and Tm, top k-shortest paths C p with the smallest weight from Sm to Tm in G ag ;

8.
Compute the conserved atomic group set S transferred from v i to v j by using algorithm CAGM; 9.
Mark v j as visited; 10.
If S is not empty then 11.
Compute the weight of edge Concatenate Replace the conserved atomic group set R g of v j with S. end if end for end if end while 15. Determine the top k-shortest paths C p with the smallest weight between Sm and Tm in G ag .

Return C p
In an iterative manner, algorithm CAGTG removes node v i from queue Q (lines 4-5), where Q is the set of candidate nodes and these candidate nodes are used to create G ag . If node v i is not the target compound (line 6), for each unvisited node v j adjoining to v i , CAGTG executes algorithm CAGM to compute the conserved atomic groups transferred from v i to v j (lines 7-8), and mark node v j as visited (line 9). If S is not empty (line 10), CAGTG computes the weight of edge (v i , v j ) by (Eq (1) in Section "Weight computation")according to the value of ψ(Sc,Td) (lines 10-11), add node v i and edge (v i , v j ) to G ag (line 12), put node v j in Q (line 13), and replace the conserved atomic group set R g of v j with S (line 14). CAGTG repeats this procedure until Q is empty. When Q is empty, the atomic group transfer graph between Sm and Tm has been created. Finally, CAGTG has found the top k-shortest paths C p with smallest weight from Sm to Tm in G ag as candidate paths (lines 15-16).
Our algorithm CAGTG provides two user-defined searching parameters Sc and Td, which allow the user to manipulate the parameter α in Eq (1) to guide the search for specific pathways of interest. For example, when we want to find the pathways that consist of reactions with low DG 0 r , we can set ψ(Sc, Td) = ψ(false, true). If ψ(Sc, Td) = ψ(false, true), AGPathFinder uses α = 0 and it means that the weight of edge in Eq (1) is determined by Gibbs free energy and the search will be driven by Gibbs free energy. When we want to find the pathways that consist of similar compounds, we can set ψ(Sc, Td) = ψ(true, false). If ψ(Sc, Td) = ψ(true, false), AGPath-Finder uses α = 1and it means that the weight of edge in Eq (1) is determined by compound similarity and the search will be driven by compound similarity. When we want to find the pathways that consist of reactions with low DG 0 r and similar compounds, we can set ψ(Sc,Td) = ψ(true, true). If ψ(Sc,Td) = ψ(true, true), AGPathFinder uses α = 0.5 and it means that the weight of edge in Eq (1) is determined by compound similarity and Gibbs free energy and the search will be driven by compound similarity and Gibbs free energy.
The following example illustrates the process of creating weighted atomic group transfer graph between two compounds. Example 2.2: Fig 3 shows an abstract representation of a weighted atomic transfer graph G ag between start compound C1 and target compound C6. At the beginning of algorithm CAGTG, there is only one node in G ag . We put C1 in queue Q. In an iterative manner, we remove a node from Q each time. The first node removed from Q is C1, which is not the target compound. For the unvisited nodes C2, C3 and C7 adjoining to C1, we use algorithm CAGM to compute the atomic groups transferred from C1 to C2, C3 and C7 respectively, the resulting atomic groups are S2, S3 and S7.  An abstract representation of weighted atomic group transfer graph G ag . A square rectangle represents a compound node. The atoms are represented as circles. C1, C2, C3, C4, C5, and C6 are the compound identifiers. The edges linking atoms represent chemical bonds, and the rounded rectangles represent reactions that contain the atom mappings between compounds, with the reaction identifiers being R1, R2, R3, R4, R5, R6, R7 and R8. W1, W2, W3, W4, W5, W6 and W7 are the weights of the edges R1, R2, R3, R4, R5, R6 and R7 respectively. Both red atoms and blue atoms are the conserved atoms from start compound C1. In compound C6, the group of red atoms with associated bonds and the group of blue atoms with associated bonds are two conserved atomic groups transferred from start compound C1 to target compound C6. Since the conserved atoms transferred from C7 to C8 through R8 do not form atomic group in C8, R8 and C8 are shown with arrows in dotted line to indicate that R8 and C8 do not exist in G ag . doi:10.1371/journal.pone.0168725.g003 Finding Metabolic Pathways Using Atomic Group Tracking and C7 as visited. Since the resulting atomic groups S2, S3 and S7 are not empty, we use Eq (1) to compute the weights W1, W2, W7 of edges R1, R2 and R7 respectively. Then we add nodes C2, C3, C7 and edges R1, R2, R7 to G ag , and put C2, C3, C7 in queue Q. We replace the conserved atomic group sets of C2, C3 and C7 with S2, S3 and S7 respectively. Next, the node to be removed from Q is C2 which is (again) not the target compound. For the unvisited node C4 adjoining to C2, we use CAGM to compute the conserved atomic groups transferred from C2 to C4, the resulting atomic group is S4 with the atom set {2,3}V(C4) and the bond set {(2,3)} E(C4). Node C4 is marked as visited. Since S4 is not empty, we compute the weight W3 of edge R3 by Eq (1). Then we add node C4 and edge R3 to G ag , and put C4 in queue Q. We replace the conserved atomic group set of C4 with S4. This procedure is repeated until Q is empty. When Q is empty, the atomic group transfer graph between C1 and C6 has been created. Now the top kshortest paths can be determined from G ag . For instance, if k = 2, we determine the top 2 shortest paths with the smallest weight from C1 to C6 in G ag as candidate metabolic pathways, and these 2 pathways are C1!R1!C2!R3!C4!R5!C6 and C1!R2!C3!R4!C5!R6!C6.

Results
From the KEGG LIGAND database, we obtained 5848 compound structures and 7340 reactions which have corresponding KEGG RPAIR entries. We used the SMSD tool to compute the similarity between compounds. The atomic group transfer graph is built based on the KEGG RPAIR database. We have implemented AGPathFinder in Java.
In order to evaluate the performance of AGPathFinder, the results are compared with several available metabolic pathfinding methods using atom tracking and an available graphbased method Tinker [19]. These atom tracking methods are RouteSearch [47], LPAT [46] and ReTrace [30] which are the software available and currently maintained. Tinker [19] is a recently developed method that finds pathways based on tailored heuristic search strategy and requires excluding hub metabolites. In the experiments, we use a set of 42 known pathways (as detailed in S1 Text) that were derived from the aMAZE database [52] and were commonly used for the evaluation of pathfinding methods in the literature [46]. The five methods AGPathFinder, RouteSearch, Tinker, LPAT and ReTrace are used to compute the pathways between the start and target compounds of each of the 42 known pathways. Then we compare the computed pathways with the corresponding known pathways to evaluate the performance of the methods. Furthermore, three study cases will be carried out to learn more about the characteristics of these methods.
RouteSearch is a web-based pathfinding tool. We used RouteSearch to search pathways on Biocyc.org. We downloaded Tinker, LAPT and ReTrace from http://osslab.lifesci.warwick.ac. uk/ tinker.aspx, http://www.kavrakilab.org/atommetanet and http://www.cs.helsinki.fi/group/ sysfys/software/ReTrace respectively. AGPathFinder, LPAT and ReTrace were run on the Sugon 5000A parallel computer at Guangxi University, using a single computing node with a quad-core Intel(R) Xeon(R) CPU E5620 @ 2.40GHz and 40GB RAM. The running operating system is Linux. Tinker was implemented in C# and runs on a PC with Intel(R) Pentium(R) CPU G3240 @ 3.10GHz and 8GB RAM, and the running operating system is Windows 7. When Tinker was run to search pathways, the hub metabolites listed in [19] (see also S1 Table) are excluded in advance.

Comparing computed pathways to known pathways
For each pathway, we use measures defined previously in the literature [46] to compute the accuracy Ac, sensitivity Sn and positive prediction value PPV to evaluate the biochemical Finding Metabolic Pathways Using Atomic Group Tracking performance of pathways computed by AGPathFinder, RouteSearch, Tinker, LPAT and ReTrace. To describe these measures, we need to define the correct compounds in the computed pathway. The compounds in the computed pathway are called correct compounds if these compounds satisfy the following two conditions: (1) The compounds can be found in both computed and known pathways, which are called included compounds. (2) The order of the included compounds in the computed pathway is the same as the order of the included compounds in the known pathway.
Then the values of Sn and PPV are defined as follows: Sn = TP/(TP+FN) and PPV = TP/(TP +FP), where true positives (TP) are the correct compounds found in the computed pathway, false negatives (FN) are the compounds in the known pathway but not in the computed pathway, and false positives (FP) are the compounds not in the known pathway but in the computed pathway [46]. Because true negatives do not exist in this comparison, Ac = (Sn+PPV)/2 [46]. We use cross-validation [53] to estimate the error Er for the compounds between computed pathway and known pathway. The smaller the error Er is, the more similar the computed pathway and the known pathway are. We can use the error Er to analyze the ability of pathfinding methods in recovering known pathways. Besides Ac, Sn, PPV and Er, we also use F-measure Fm for compound to evaluate the performance of pathfinding methods, where Fm = (2×PR×RC)/(PR+RC), PR is the precision and PR = PPV, RC is the Recall and RC = Sn, and Recall is the proportion of positive cases [54].
In addition to measuring the performance of the computed pathway based on compound, we also measure the performance of the computed pathways based on reaction. By analogy with the definition of correct compound, we derive the definition of the correct reaction in the computed pathways. The reactions in the computed pathway are called correct reactions if these reactions satisfy the following two conditions: (1) The reactions can be found in both computed and known pathways, which are called included reactions. (2) The order of the included reactions in the computed pathway is the same as the order of the included reactions in the known pathway.
The values of the sensitivity Sn_R and positive prediction value PPV_R for reaction are defined as follows: Sn_R = TP_R/(TP_R+FN_R) and PPV_R = TP_R/(TP_R+FP_R), where true positives (TP_R) are the correct reactions found in the computed pathway, false negatives (FN_R) are the reactions in the known pathway but not in the computed pathway, and false positives (FP_R) are the reactions not in the known pathway but in the computed pathway. The accuracy for reaction is Ac_R = (Sn_R+PPV_R)/2. By analogy with Er, we also use cross-validation [53] to estimate the error Er_R for the reactions between computed pathway and known pathway. Besides Ac_R, Sn_R, PPV_R and Er_R, we also use F-measure Fm_R for reaction to evaluate the performance of pathfinding methods, where Fm_R = (2×PR_R×RC_R)/(PR_R +RC_R), PR_R is the precision and PR_R = PPV_R, RC_R is the Recall and RC_R = Sn_R.

AGPathFinder versus other methods
In this section, for each pair of start and target compounds of the 42 known pathways in the test, we use AGPathFinder, RouteSearch, Tinker, LPAT and ReTrace to find the top ten pathways. These top ten pathways are then compared to the known pathways and the results are shown in Tables 2 and 3.
As can be seen from Tables 2 and 3, when we focus on the top path computed by each method, AGPathFinder showed improved performance compared to four other methods in most cases, the only exception is that the Ac, Sn and Fm values of AGPathFinder in the case of ψ(Sc,Td) = ψ(false, true) are a bit lower than those of RouteSearch, and the Er value of AGPath-Finder in the case of ψ(Sc,Td) = ψ(false, true) is a bit higher than that of RouteSearch.
Regarding the performance of the best of top ten paths, in Table 2, we can see that AGPath-Finder obtained higher values of Ac, PPV, Sn and Fm and lower values of Er than Tinker, LPAT and ReTrace whereas the performance of AGPathFinder is comparable with the performance of RouteSearch except for the sensitivity Sn. As can be seen from Table 3 The results from Tables 2 and 3 demonstrate that inferring metabolic pathways by tracking atomic group and using combined information of reaction thermodynamics and compound similarity improves the quality of computed pathways. The ability of AGPathFinder in recovering known pathways is better than the other four methods. Table 4 shows the values of S-Paths and A-length of the pathways computed by each method where S-Paths is the number of the computed pathways and A-length is the average length of the computed pathways. It can be seen from Table 4 that the average length of the pathways found by AGPathFinder is much shorter than the other four methods. Moreover, the pathways found by our method are more similar to the known metabolic pathways (Tables 2 and 3), that is, the pathways found by our method contain more reactions that are the same as in the known pathways. The reason for the shorter lengths of the pathways found by AGPathFinder is because the distances between reactions within the same metabolic pathway are significantly shorter than those between pairs of reactions selected at random [37] and more reactions in the pathways found by our method are involved in the same known pathways.
Note that, for the computed pathways in Table 4, no hub metabolites listed in [19] are involved in the computed pathways of AGPathFinder, Tinker, LPAT and ReTrace. However, some hub metabolites listed in [19] are involved in 272 out of 412 computed pathways of RouteSearch.
The above results demonstrate that compared with RouteSearch, Tinker, LPAT and ReTrace, AGPathFinder can find shorter pathways with better biochemical performance in general.

The impact of parameter setting on the performance of AGPathFinder
In the following, we investigate the impact of the size of atomic group, the compound similarity and the reaction thermodynamics on the performance of AGPathFinder. The performance of AGPathFinder under different parameter settings is shown in Tables 5 and 6. Tables 5 and 6 show that, for each setting of ψ(Sc,Td), the computed pathways that conserve the maximal size of atomic group have the highest values of Ac, Sn, PPV, Fm, Ac_R, PPV_R, Table 4 Sn_R, and Fm_R. This confirms that the transfer of atomic groups from the start compound to the target compound is an important feature for metabolic pathways, and the size of an atomic group has direct impact on the biochemical performance of the pathways found by our method. It can also be observed from Tables 5 and 6 that the parameter setting of ψ(Sc,Td) directly influences the performance of AGPathFinder. For example, when we search pathways by tracking "Maximal size of atomic group", for the top path, the setting ψ(Sc,Td) = ψ(true,false) gives the highest values of Ac, Sn, PPV and Fm, and the setting ψ(Sc,Td) = ψ(true,true) gives the highest values of Ac_R, PPV_R, Sn_R, and Fm_R. While the setting ψ(Sc,Td) = ψ(true,true) produces the highest values of Ac, Sn, PPV, Fm, Ac_R, PPV_R, Sn_R, and Fm_R in the best of top ten paths.

S-Paths A-length
Moreover, the use of combined information of compound similarity and reaction thermodynamics ensures that our method is still capable in finding meaningful pathways even without the option of tracking atomic groups. For example, without atomic group tracking and from a total of 42 known pathways in the test, AGPathFinder successfully recovered 28, 27 and 28 known pathways that are returned as the best of top ten paths in the cases of ψ(Sc,Td) = ψ (false,true), ψ(Sc,Td) = ψ(true,false) and ψ(Sc,Td) = ψ(true,true) respectively, and recovered 18, 21 and 22 known pathways that are returned as the top path in the cases of ψ(Sc,Td) = ψ(false, true), ψ(Sc,Td) = ψ(true,false) and ψ(Sc,Td) = ψ(true,true) respectively.

Study cases
The above analysis clearly demonstrates that our method AGPathFinder improves the biochemical relevance of the computed pathways. In order to investigate the factors that may influence the biochemical relevance of the found pathways, we perform three representative test cases to analyze the results obtained by AGPathFinder, RouteSearch, Tinker, LPAT and ReTrace. The aim of this section is not to demonstrate the average or overall performance of pathfinding (these were already discussed in section "Comparing computed pathways to known pathways"), but to gain insight into the characteristics of the methods through analysis.
L-serine biosynthesis. The biosynthesis of L-serine starts with 3-phospho-D-glycerate to L-serine. 3-phospho-D-glycerate contains 11 atoms, 3 of which are carbons, and L-serine contains 7 atoms, 3 of which are carbons. An atomic group containing 5 atoms (C, C, C, O, O) is transferred from 3-phospho-D-glycerate to L-serine in the biosynthesis of L-serine.   [55]. The top ranking pathway returned by ReTrace is r 3 in KEGG. The top ranking pathway returned by Tinker is r 4 in the Rhea database [56].
Recall that if ψ(Sc,Td) = ψ(false,true), the search of the pathways will be driven by Gibbs free energy of the reactions. To investigate the impacts of the utilization of energy on finding pathways from 3-phospho-D-glycerate to L-serine, Fig 5 shows the top three pathways found by AGPathFinder in the search of L-serine biosynthesis in KEGG when ψ(Sc,Td) = ψ(false,true). These three pathways are represented by p 1 , p 2 and p 3 respectively.
As we can observe from Fig 4, compared with pathways r 1 , r 2 , r 3 and r 4 , the pathway r_c for the biosynthesis of L-serine consists of reactions with minimal DG 0 r and highly similar  compounds. By tracking atomic group and using the combined information of compound similarity and reaction thermodynamics, our method found r_c in all cases with different settings of ψ(Sc,Td). AGPathFinder is thus useful for retrieving known metabolic pathways. In addition, r 1 , r 2 , r 3 and r 4 are obviously different from r_c except for the start and target compounds. r 2 is the shortest, but hub metabolite H + occurs in its pathway.
In Fig 5, the search of the pathways is driven by Gibbs free energy of the reactions and the pathways that involve the reactions with low energy are ranked ahead. For example, compared with pathway p 3 , the energies of the corresponding reactions in pathway p 2 are lower and therefore p 2 is ahead of p 3 . Note that our method is a shortest-path-based method, although the energies of the reactions in pathway p 1 are not the lowest, p 1 is the top ranking pathway since it is the shortest pathway. Moreover, the energies of the reactions in p 2 are negative. This indicates that when we try to find the pathways releasing energy from 3-phospho-D-glycerate to L-serine, we can choose p 2 . On the other hand, the energies of the reactions in p 1 and p 3 can be negative or positive, and the user can find the pathways that either require or release energy such as p 1 and p 3 .
Glycolysis. Glycolysis starts from beta-D-Fructose 6-phosphate to phosphoenolpyruvate in aMAZE [52]. To study the impacts of the utilization of energy on finding pathways from beta-D-Fructose 6-phosphate to phosphoenolpyruvate, Fig 7 shows the top three pathways found by AGPath-Finder in the search of glycolysis in KEGG when ψ(Sc,Td) = ψ(false,true). These three pathways are represented by p 1 , p 2 and p 3 respectively.
As can be seen from Fig 6, the similarity between two compounds involved in each reaction in Fig 6 is high, most of these similarities are higher than 0.4. We can observe that r 5 is similar to r_c, and their difference is that r 5 does not bypass beta-D-Fructose 1,6-bisphosphate. Compared to r_c, the pathways r 1 , r 2 and r 3 do not go through beta-D-Fructose 1,6-bisphosphate and 3-Phospho-D-glyceroyl phosphate. In addition, r 2 and r 3 consist of an alternative pathway connecting beta-D-Fructose 6-phosphate with D-Glyceraldehyde 3-phosphate via reaction R01827. This alternative pathway is shorter than the pathway between beta-D-Fructose 6-phosphate and D-Glyceraldehyde 3-phosphate in r_c.
Furthermore, in r 1 , r 2 , r 3 and r 4 , there is a shortcut linking D-Glyceraldehyde 3-phosphate and 3-Phospho-D-glycerate via R07159 and R01058 respectively, these shortcuts are annotated in the corresponding KEGG map00010 (Glycolysis/Gluconeogenesis). The difference between r 4 and r_c is large except for the pathway from 3-Phospho-D-glycerate to Phosphoenolpyruvate. r 4 goes through the pathway from beta-D-Fructose 6-phosphate to D-Glyceraldehyde 3-phosphate as a result of the high similarities between the compounds contained in this part of r 4 , for example, both similarity between beta-D-Fructose 6-phosphate and Sorbitol 6-phosphate and similarity between D-Glucose 6-phosphate and D-Fructose 6-phosphate are 1. For the same reason, r 4 bypasses the pathway from D-Glyceraldehyde 3-phosphate to 3-Phospho-D-glycerate. This indicates that one can use compound similarity to filter pathway assignments for feasibility.
In addition, none of the reactions and compounds in r 6 and r 7 is common with those of r_c, except for the start and target compounds. We can also see that, r 2 and r 3 are very similar to r 1 , and they only differ in one reaction. These results show that AGPathFinder and LPAT are capable in finding similar alternative pathways of glycolysis.
Seen from Fig 7, pathway p 1 is the shortest pathway and therefore it is the top ranking pathway. Although the length of pathways p 2 and p 3 is the same, p 2 is ahead of p 3 since the sum of the energies of the reactions in p 2 is lower. In p 1 and p 2 , the energies of the reactions are negative, which demonstrates that we can search for the pathways releasing energy such as p 1 and Computed pathways for glycolysis: r 1 , r 2 , r 3 , r 4 , r 5 , r 6 and r 7 . Round rectangles represent compound nodes, rectangles represent reaction edges, and the data in parentheses denote the value DG 0 r of reaction and compound similarity respectively. "*" means that the Gibbs free energy of the corresponding reaction is not available.   by p 1 , p 2 and p 3 respectively). Round rectangles represent compound nodes, rectangles represent reaction edges, and the data in parentheses denote the value DG 0 r of reaction and compound similarity respectively. doi:10.1371/journal.pone.0168725.g007 p 2 from beta-D-Fructose 6-phosphate to phosphoenolpyruvate. In p 3 , the energy of reaction R01070 is positive whereas the energies of other reactions are negative, thus we can find the pathway that either require or release energy like p 3 .
L-Methionine biosynthesis. The biosynthesis of L-Methionine starts with L-Aspartate to L-Methionine. L-Aspartate contains 9 atoms, 4 of which are carbons, and L-Methionine contains 9 atoms, 5 of which are carbons. Two variants of this pathway are characterized in the yeast S. cerevisiae and the bacteria E. coli, respectively. An atomic group with 7 atoms (C, C, C, N, O, O, C) is transferred from L-Aspartate to L-Methionine in the biosynthesis of L-Methionine. Fig 8 shows the known pathways r_ce (the biosynthesis of L-Methionine for bacteria E. coli in aMAZE) and r_cs (the L-Methionine biosynthesis pathway for the yeast S. cerevisiae in aMAZE), and the pathways found by AGPathFinder, RouteSearch, Tinker, LPAT and ReTrace. The top ranking pathway returned by RouteSearch is r_ce in EcoCyc for E. coli K-12 substr. MG1655. The top ranking pathway returned by LPAT is r 1 in KEGG. For all three settings of ψ(Sc,Td), the top ranking pathway returned by AGPathFinder is r 2 in KEGG. Note that r 2 is the only pathway found by AGPathFinder in the case of ψ(Sc,Td) = ψ(false,true). The top ranking pathway returned by RouteSearch is r 2 in EcoCyc for S. cerevisiae. The secondranked pathway returned by AGPathFinder is r 3 in KEGG with the setting ψ(Sc,Td) = ψ(true, true). The top ranking pathway returned by ReTrace is r 4 in KEGG. The top ranking pathway returned by Tinker is r 5 in Rhea.
From Fig 8 it can be observed that, r 2 corresponds closely to r_cs, and their difference is that r 2 does not go through L-Homocysteine and reaction R04405. Because r 2 via reaction R00651 is shorter, AGPathFinder chooses reaction R00651 in the process of inferring r 2 . In r 2 , the energies of the reactions can be negative or positive, and we can find the pathway that either require or release energy like r 2 in the case of ψ(Sc,Td) = ψ(false,true).
Alternative pathway r 3 is very similar to r_cs except for one reaction. The value of DG 0 r of R01287 in r 3 is lower than the value of DG 0 r of R00651. Therefore, AGPathFinder chooses reaction R01287 in the process of inferring r 3 . Furthermore, r 1 is similar to r_cs, but their difference is far greater than the difference between r 3 and r_cs. In addition, r 4 and r 5 are completely different from r_ce and r_cs except for the start and target compounds. These alternative pathways demonstrate how AGPathFinder and other four methods can be used to expand the metabolism of L-Methionine synthesis. Through efficient search in the extensive spaces in designing synthetic metabolic pathways, the computational pathfinding methods can find new pathways producing the same target compound through different mechanisms than those already known. These pathways need to be further tested for biological and biochemical consistency before implementation. However, the results show promising alternatives to generate valuable products.

Discussion and Conclusion
This article presents a pathfinding method AGPathFinder for finding metabolic pathways. The main feature of AGPathFinder is its integration of atomic group tracking and combined information of reaction thermodynamics and compound similarity into the search of metabolic pathways. This feature distinguishes AGPathFinder from existing atom tracking pathfinding methods, which are restricted to track the user-defined atoms in the search for alternative pathways.
In section "Results", in most cases, we have shown that the average compound inclusion accuracy and reaction inclusion accuracy for the top resulting pathways of our method are around 0.90 and 0.70, respectively, which are better than those of RouteSearch, Tinker, LPAT and ReTrace. Atomic group tracking, when combined with weighted metabolite graph, improves the quality of the found pathways. With the introduction of atomic group tracking, our method does not require the user to define the atoms to be tracked neither to exclude hub metabolites in advance. On the other hand, the use of combined information of reaction thermodynamics and compound similarity ensures that our method is still capable in finding meaningful pathways even without the option of tracking atomic groups. The results have demonstrated that AGPathFinder successfully recovers the known pathways and finds the thermodynamically feasible pathways that avoid spurious connections. Moreover, Round rectangles represent compound nodes, rectangles represent reaction edges, and the data in parentheses denote the value DG 0 r of reaction and compound similarity respectively. "*" means that the Gibbs free energy of the corresponding reaction is not available. AGPathFinder allows the user to define biochemical parameters to search for specific pathways of interest, and it returns pathways with the information of reaction thermodynamics and compound similarity.
AGPathFinder infers pathways across all of the data in KEGG, but for some applications, researchers may be only interested in the metabolic network of a single organism or several related organisms. A possible solution for this issue is to use alternative weighting schemes. For example, we try to find organism-specific pathways by weighting the reactions depending on the possibility that the organism of interest performs the corresponding reactions. This may help to find feasible candidates for in vivo experimentation.
In the current work, we have not considered the substrate availability or toxicity, and did not take the different reaction conditions such as pH and temperature into account when estimating thermodynamics. In future work, we intend to include these factors into AGPathFinder. Another interesting extension is to combine constraint programming methods with atomic group tracking in searching metabolic pathway.

Availability of the Software
AGPathFinder is fully implemented in Java. The data and the program can be downloaded from http://210. 36 Supporting Information S1 Text. The test set of metabolic pathways. (PDF) S1 Table. The hub metabolites listed in Tinker. (DOC)