Fig 1.
The CANYUNs pipeline integrates biochemical, phenotypic, and genomic data to quantitatively identify reactions that are likely catalyzed by an organism.
(A) Genomic annotation data and phenotypic growth data for a specific organism are used to influence the flux distribution through a curated universal biochemical network to build an organism-specific metabolic network model. Parallel growth simulations using Data Guided Flux Balance Analysis for each known experimental growth condition allow for a model-building process that is not influenced by the order in which growth conditions are integrated. This process allows for the explicit quantification of reaction Certainty Values, defined as the ratio of the number of condition-specific solutions in which a reaction carries flux to the total number of conditions. (B) The universal biochemical network used in this study consists of reactions from the CarveMe dataset as well as novel reactions added from the manually curated E. coli metabolic network, iML1515. (C) The phenotypic data used in this study include Biolog minimal media growth data from ~275 different conditions. (D) The sequence-to-reaction dataset used to calculate reaction annotation evidence consists of over 4,000 reactions, each associated with 1 to 800 sequences. (E) The distribution of reaction bitscores for E. coli K-12 shows that there are reactions in the universal network with high evidence that are not included in iML1515. There are also many reactions with low evidence that are not included in iML1515, as expected. The annotation evidence generated for E. coli K-12 shows that 1,460 reactions in the universal biochemical network have no genetic evidence associated with them (left of the dashed orange line); 260 of these reactions are in iML1515 and 1,200 are not.
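The Certainty Value described in panel A can be illustrated with a minimal sketch. It assumes each condition-specific solution is available as a mapping from reaction ID to flux. The function name, the tolerance `tol`, and the toy reaction and condition names are hypothetical and are not taken from the CANYUNs code. For simplicity the sketch ignores reaction direction, whereas the pipeline reports forward and reverse values separately.

```python
from collections import Counter


def certainty_values(flux_by_condition, tol=1e-9):
    """Return {reaction_id: fraction of conditions in which it carries flux}.

    flux_by_condition maps a condition name to a {reaction_id: flux} dict.
    A reaction "carries flux" when |flux| exceeds tol (an assumed cutoff).
    """
    n_conditions = len(flux_by_condition)
    if n_conditions == 0:
        return {}
    counts = Counter()
    for fluxes in flux_by_condition.values():
        for rxn, flux in fluxes.items():
            if abs(flux) > tol:
                counts[rxn] += 1
    return {rxn: n / n_conditions for rxn, n in counts.items()}


# Hypothetical toy solutions for four growth conditions.
solutions = {
    "glucose": {"R1": 10.0, "R2": 0.0, "R3": 2.5},
    "acetate": {"R1": 4.0, "R2": 1.0, "R3": 0.0},
    "succinate": {"R1": 6.0, "R2": 0.0, "R3": 0.0},
    "lactate": {"R1": 3.0, "R2": 0.0, "R3": 1.0},
}
cv = certainty_values(solutions)
```

Because each condition contributes one count independently, the result does not depend on the order in which the conditions are processed.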
Fig 2.
Data Guided Flux Balance Analysis.
(A) Distribution of reaction bitscores for E. coli K-12. (B) Visual representation of the transformation function used to calculate reaction weights from reaction bitscores. A reaction bitscore of 500 corresponds to zero in the weight space. (C) Distribution of the calculated weights for forward reactions. (D) The distribution of weights for reverse reactions shows that there are far fewer reactions that allow flux in both directions or only in the reverse direction. (E) Data Guided Flux Balance Analysis optimization problem. Flux through reactions with a positive weight is maximized and flux through reactions with a negative weight is minimized, in proportion to the value of the weight. (F) Toy network example demonstrating the flux-carrying reactions that would result from the pictured annotation evidence distribution and media inputs.
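The caption states only that a bitscore of 500 maps to a weight of zero; the exact transformation is not given here. The sketch below therefore uses an assumed piecewise-linear mapping clipped to [-1, 1] purely for illustration. The function names, `zero_point`, and `max_bitscore` are hypothetical. The objective assumes non-negative fluxes after splitting reactions into forward and reverse directions.

```python
def bitscore_to_weight(bitscore, zero_point=500.0, max_bitscore=1000.0):
    """Map a bitscore to a weight in [-1, 1] (assumed piecewise-linear form).

    Bitscores above zero_point give positive weights (flux is rewarded);
    bitscores below it give negative weights (flux is penalized).
    """
    if bitscore >= zero_point:
        return min((bitscore - zero_point) / (max_bitscore - zero_point), 1.0)
    return (bitscore - zero_point) / zero_point


def weighted_objective(weights, fluxes):
    """Sum of weight * flux over reactions; the quantity to be maximized."""
    return sum(w * fluxes.get(rxn, 0.0) for rxn, w in weights.items())


# Hypothetical reactions: one well supported, one without evidence.
weights = {
    "RXN_A": bitscore_to_weight(750.0),  # positive weight
    "RXN_B": bitscore_to_weight(0.0),    # negative weight
}
objective = weighted_objective(weights, {"RXN_A": 2.0, "RXN_B": 1.0})
```

Under this assumed form, a solver maximizing the objective prefers routes through well-supported reactions and uses unsupported ones only when growth requires them.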
Fig 3.
Data Guided Flux Balance Analysis breaks parsimony and identifies fewer unique reactions required for simulated growth on all experimental growth conditions for E. coli K-12.
(A) The number of flux-carrying reactions (FCRs) in each growth condition is visualized for parsimonious FBA (pFBA) and Data Guided FBA (dgFBA) to quantify the degree to which dgFBA breaks parsimony. (B) The number of reactions with bitscores above 500 that carry flux in a dgFBA solution is greater than the number in a pFBA solution. (C) The cumulative number of unique FCRs identified by dgFBA is lower than that identified by pFBA. The complete range in the number of unique FCRs is indicated by the shaded regions.
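The cumulative count of unique FCRs in panel C can be sketched as a running union over conditions. This is a minimal illustration assuming each condition's FCRs are available as a set of reaction IDs; the function name and the toy sets are hypothetical.

```python
def cumulative_unique_fcrs(fcrs_per_condition):
    """Return the running number of unique FCRs as conditions are added.

    fcrs_per_condition is a sequence of sets of reaction IDs, one per
    growth condition, in the order in which they are accumulated.
    """
    seen = set()
    totals = []
    for fcrs in fcrs_per_condition:
        seen.update(fcrs)
        totals.append(len(seen))
    return totals


# Hypothetical FCR sets for three growth conditions.
toy_fcrs = [{"R1", "R2"}, {"R2", "R3"}, {"R1", "R2", "R3", "R4"}]
curve = cumulative_unique_fcrs(toy_fcrs)
```

The final value of the curve is independent of condition order, but the intermediate values are not.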
Fig 4.
E. coli K-12 CANYUNs model generation and draft processing.
(A) Ranked scatter plot of forward reaction Certainty Values. (B) Reverse reaction Certainty Values. (C) Certainty Values for reversible reactions that carry flux in both directions. (D) The initial accuracy of the CANYUNs model before curation of the universal biochemical network is 80%, with a Matthews Correlation Coefficient of 0.45. (E) Simulation of conditionally essential reactions allows the user to identify reactions that can be selectively removed from the resulting model to improve the overall predictive accuracy. The net benefit is the number of false positives that would be corrected minus the number of true positives that would be lost by removing a given reaction. RuBisCO is the forward reaction in the top left corner of the plot, with maximum net benefit and minimum genetic evidence.
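The net benefit in panel E is simple arithmetic, sketched below together with one possible way to rank removal candidates. The ranking rule (highest net benefit first, lowest bitscore as tie-breaker) is an assumption for illustration. The reaction IDs and counts are hypothetical and are not results from the study.

```python
def net_benefit(false_positives_corrected, true_positives_lost):
    """False positives corrected minus true positives lost on removal."""
    return false_positives_corrected - true_positives_lost


def rank_removal_candidates(candidates):
    """Rank reactions by net benefit (descending), then bitscore (ascending).

    candidates maps reaction_id -> (fp_corrected, tp_lost, bitscore).
    Returns a list of (reaction_id, net_benefit, bitscore) tuples.
    """
    scored = [
        (rxn, net_benefit(fp, tp), bitscore)
        for rxn, (fp, tp, bitscore) in candidates.items()
    ]
    return sorted(scored, key=lambda item: (-item[1], item[2]))


# Hypothetical candidates: (false positives corrected, true positives lost, bitscore).
toy_candidates = {
    "RXN_A": (12, 0, 0.0),
    "RXN_B": (12, 0, 640.0),
    "RXN_C": (5, 7, 30.0),
}
ranking = rank_removal_candidates(toy_candidates)
```

A reaction with a negative net benefit, such as `RXN_C` here, would cost more true positives than it corrects false positives and is a poor candidate for removal.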
Fig 5.
The E. coli CANYUNs Model performs better than iML1515 and CarveMe when simulating growth on all known phenotypic data.
(A) The CarveMe model without gapfilling has a base accuracy of 52% and a Matthews Correlation Coefficient (MCC) of 0.09. (B) The CarveMe model we trained using all of the phenotypic data achieves an accuracy of 76% and an MCC of 0.29. However, there is a strong bias toward false positive predictions. (C) The manually curated E. coli K-12 model, iML1515, was not trained using all of the growth conditions. Nevertheless, it achieves 75% accuracy and an MCC of 0.40 while maintaining a relatively even split between false positive and false negative predictions. (D) The CANYUNs model we generated achieves 92% accuracy and an MCC of 0.78. The increased accuracy is primarily due to an improvement in the true negative prediction rate.
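Accuracy and MCC are both computed from the counts of true and false growth predictions. The sketch below uses the standard definitions. The confusion-matrix counts are hypothetical and chosen only to exercise the formulas; they are not the counts behind any panel.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of growth/no-growth predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts.

    Returns 0.0 when the denominator is zero, a common convention.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical counts: 6 true positives, 2 true negatives,
# 1 false positive, 1 false negative.
toy_accuracy = accuracy(6, 2, 1, 1)
toy_mcc = mcc(6, 2, 1, 1)
```

Unlike accuracy, MCC penalizes a model that predicts growth almost everywhere, which is why a model biased toward false positives can have reasonable accuracy but a low MCC.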
Fig 6.
CANYUNs reaction Certainty Values accurately identify reactions found in iML1515.
The manually curated metabolic network, iML1515, provides a point of comparison to determine whether CANYUNs accurately identifies reactions for inclusion in the network. (A) By comparing the CANYUNs model with iML1515, we were able to place reactions into four categories: FCRs with genetic evidence that are in iML1515 (confirmed), FCRs without genetic evidence that are in iML1515 (true discovered), FCRs with genetic evidence that are not in iML1515 (likely additions), and FCRs without genetic evidence that are not in iML1515 (false discovered). The total amount of genetic evidence used to generate a CANYUNs model influences the accuracy of the FCRs. (B) When we use pFBA instead of dgFBA in the CANYUNs pipeline, there are far more reactions that lack genetic evidence and are not in iML1515. (C) The percent overlap of FCRs with reactions present in iML1515 increases from 62% when no genetic evidence is used (pFBA) to 76% when all of the available genetic evidence is used.
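The four categories in panel A follow from two yes/no questions per FCR, as in the minimal sketch below. The function names and the toy reaction sets are hypothetical. Treating "genetic evidence" as membership in a set of reaction IDs is an assumption made for illustration.

```python
def categorize_fcr(has_genetic_evidence, in_reference):
    """Assign an FCR to one of the four categories named in the caption."""
    if has_genetic_evidence and in_reference:
        return "confirmed"
    if in_reference:
        return "true discovered"
    if has_genetic_evidence:
        return "likely addition"
    return "false discovered"


def percent_overlap(fcrs, reference_reactions):
    """Percentage of FCRs that are also present in the reference model."""
    if not fcrs:
        return 0.0
    return 100.0 * len(fcrs & reference_reactions) / len(fcrs)


# Hypothetical reaction sets.
fcrs = {"R1", "R2", "R3", "R4"}
with_evidence = {"R1", "R3"}
reference = {"R1", "R2"}  # stands in for the reference model's reactions

categories = {
    rxn: categorize_fcr(rxn in with_evidence, rxn in reference)
    for rxn in sorted(fcrs)
}
overlap = percent_overlap(fcrs, reference)
```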
Fig 7.
Reaction Certainty Values correlate with accurate reaction inclusion and comparison with CarveMe.
(A) The percentage of reactions identified by CANYUNs that align with the iML1515 model correlates with the associated Certainty Value. Reactions with a Certainty Value greater than or equal to 0.99 have a 94% chance of being in the iML1515 model. (B) The accuracy of discovered reactions, with inclusion in iML1515 as the reference, increases with the Certainty Values assigned using CANYUNs. (C) Although the accuracy of the discovered reactions increases with the Certainty Value, the number of reactions drops substantially as the Certainty Value increases.
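The trade-off between accuracy and reaction count can be sketched by thresholding on the Certainty Value. This is an illustrative sketch, not the authors' analysis code; the function name and the toy values are hypothetical.

```python
def fraction_in_reference(cv_by_reaction, reference_reactions, threshold):
    """Return (fraction of selected reactions in the reference, number selected).

    A reaction is selected when its Certainty Value is >= threshold.
    The fraction is None when no reaction meets the threshold.
    """
    selected = [rxn for rxn, cv in cv_by_reaction.items() if cv >= threshold]
    if not selected:
        return None, 0
    hits = sum(1 for rxn in selected if rxn in reference_reactions)
    return hits / len(selected), len(selected)


# Hypothetical Certainty Values and reference set.
toy_cvs = {"R1": 1.0, "R2": 0.99, "R3": 0.5, "R4": 0.2}
toy_reference = {"R1", "R2", "R3"}

strict = fraction_in_reference(toy_cvs, toy_reference, 0.99)
loose = fraction_in_reference(toy_cvs, toy_reference, 0.2)
```

In this toy case the stricter threshold selects fewer reactions but a larger share of them is in the reference, which is the pattern panels B and C describe.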
Fig 8.
(A) Phenotypic data used to build the model. (B) The final accuracy of the model is 92% with no false positive predictions. The model has an MCC of 0.83. (C) From the universal network, 1,746 reactions have Nissle-specific genetic evidence associated with them; of those, 466 reactions receive Certainty Values (CVs) from the CANYUNs pipeline. A further 176 reactions do not have Nissle-specific genetic evidence, yet receive CVs and are thus included in the final CANYUNs model for Nissle. (D) These reactions received CVs but do not have associated genetic evidence; they are candidates for manually finding sequences to add to the reference dataset. (E) There is also a set of 103 reactions with CVs and low bitscores (below 500). Reactions with a high CV and a bitscore above 200 are likely candidates for direct additions to the sequence-to-reaction dataset.
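The two candidate lists in panels D and E can be sketched as filters on Certainty Value and bitscore. The caption does not say what counts as a "high" Certainty Value, so `min_cv` is a hypothetical parameter. Representing "no genetic evidence" as a bitscore of 0 is also an assumption. The function names and reaction IDs are illustrative.

```python
def sequence_search_candidates(reactions):
    """Reactions with a Certainty Value but no genetic evidence (bitscore 0)."""
    return sorted(rxn for rxn, (cv, bitscore) in reactions.items()
                  if cv > 0 and bitscore == 0)


def dataset_addition_candidates(reactions, min_cv=0.9, low=200.0, high=500.0):
    """Reactions with a high Certainty Value and a bitscore between low and high.

    min_cv is an assumed cutoff; low and high follow the 200 and 500
    bitscore values mentioned in the caption.
    """
    return sorted(rxn for rxn, (cv, bitscore) in reactions.items()
                  if cv >= min_cv and low < bitscore < high)


# Hypothetical reactions: reaction_id -> (Certainty Value, bitscore).
toy_reactions = {
    "RXN_A": (1.0, 350.0),
    "RXN_B": (1.0, 120.0),
    "RXN_C": (0.3, 400.0),
    "RXN_D": (0.95, 0.0),
}
to_search = sequence_search_candidates(toy_reactions)
to_add = dataset_addition_candidates(toy_reactions)
```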