Fig 1.
Database curation and feature extraction methodology.
Fig 2.
Feature additions via genome-scale model simulations and data augmentation based on case studies described in the literature.
Fig 3.
Summary of curated database showing distribution of titers (units in g/L) for 25 different products from the bacterium E. coli.
Table 1.
Metabolic engineering design factors template used for feature extraction.
Sample values are taken from [19]. Features that refer to a list of genes are entered as vectors of ones and zeros (categorical values). For example, in the sample values, ‘het_gene’ (whether the gene inserted/overexpressed was heterologous) is entered as 1,0,0, meaning that alsS is heterologous while ilvC and ilvD are not. YE stands for yeast extract.
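The binary encoding described for gene-list features can be illustrated with a minimal sketch. The helper name `flag_vector` is hypothetical and not taken from the original work; only the gene names and the resulting 1,0,0 vector come from the sample values above.

```python
def flag_vector(genes, flagged):
    """Return a 0/1 list marking which entries of `genes` appear in `flagged`."""
    flagged = set(flagged)
    return [1 if gene in flagged else 0 for gene in genes]


# Genes inserted/overexpressed in the sample entry; alsS is the heterologous one.
genes = ["alsS", "ilvC", "ilvD"]
het_gene = flag_vector(genes, ["alsS"])
print(het_gene)  # [1, 0, 0]
```

The vector keeps the same order as the gene list, so each position can be matched back to its gene.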
Fig 4.
Comparison of production metrics (titer, rate, and yield).
The size of the dots corresponds to the rate values (in g/L/h, scaled by the minimum and maximum values, 0.000043 and 10.83 g/L/h, respectively). The molecular weight of each product (g/mol) is shown by the color gradient of the dots (color bar).
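One plausible reading of the dot-size scaling is a linear min-max rescaling of the rate, sketched below. The linear form, the function names, and the marker-size range (10 to 300) are assumptions made for illustration; only the minimum and maximum rates are taken from the caption.

```python
RATE_MIN = 0.000043  # g/L/h, minimum rate stated in the caption
RATE_MAX = 10.83     # g/L/h, maximum rate stated in the caption


def scale_rate(rate, lo=RATE_MIN, hi=RATE_MAX):
    """Map a rate onto [0, 1] by min-max scaling (assumed to be linear)."""
    return (rate - lo) / (hi - lo)


def dot_size(rate, min_size=10.0, max_size=300.0):
    """Convert a rate into a marker size; the size range is hypothetical."""
    return min_size + scale_rate(rate) * (max_size - min_size)


print(scale_rate(RATE_MIN), scale_rate(RATE_MAX))  # 0.0 1.0
```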
Fig 5.
Inferring possible influential factors on metabolic engineering design performance.
A. First two principal components from multiple correspondence analysis (MCA). The labels correspond to titer values in g/L. The shaded areas for each point show the predicted area within which all points have a high probability of belonging to the specified titer range. B. Impact of different influential factors on the first two principal components from principal component analysis (PCA). The PCA plot is shown in S1 Fig in S2 File. Carbon sources 1, 2 and 3 are used to capture the cases in which more than one carbon source was used. If only one was used, the corresponding entries of carbon sources 2 and 3 were set to zero. E. coli MG1655 was taken as the reference strain, and all modifications made to obtain the background strain used in each study were captured as ‘background modifications’. The scores describe the relative contribution of each feature to the principal components.
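The zero-filling of unused carbon source entries can be sketched as follows. The feature names (`carbon_source_1` and so on) and the helper `carbon_source_features` are hypothetical; the caption only states that up to three carbon sources are recorded and that unused entries are set to zero.

```python
def carbon_source_features(sources, n_slots=3):
    """Spread a list of carbon sources over fixed slots, padding unused slots with 0."""
    if len(sources) > n_slots:
        raise ValueError(f"at most {n_slots} carbon sources are supported")
    padded = list(sources) + [0] * (n_slots - len(sources))
    # Slot names are illustrative, not the column names of the curated database.
    return {f"carbon_source_{i + 1}": value for i, value in enumerate(padded)}


single = carbon_source_features(["glucose"])
print(single)
# {'carbon_source_1': 'glucose', 'carbon_source_2': 0, 'carbon_source_3': 0}
```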
Fig 6.
Prediction of the production metrics titer, rate, and yield (TRY).
R²: coefficient of determination. The solid diagonal lines show where all points would fall for a perfect prediction. A scaled version of Fig 6 is presented in S4 Fig in S2 File (enabling the fit to be visualized without the outlier effects). The data points are scaled based on the maximum value (titer, rate or yield) for the particular product in our curated database.
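The coefficient of determination and the per-product scaling mentioned in the caption can be sketched as below. This is an illustrative reading, not the authors' code: the function names and the toy values are hypothetical, and the sketch assumes every product has a positive maximum value.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def scale_by_product_max(products, values):
    """Divide each value by the maximum observed for its product."""
    max_by_product = {}
    for product, value in zip(products, values):
        max_by_product[product] = max(max_by_product.get(product, value), value)
    return [value / max_by_product[product] for product, value in zip(products, values)]


# Toy data (hypothetical): two products with different titer ranges.
products = ["a", "a", "b"]
titers = [2.0, 4.0, 5.0]
print(scale_by_product_max(products, titers))  # [0.5, 1.0, 1.0]
```

After scaling, every product spans at most the interval up to 1, so a single high-titer product no longer dominates the plot.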
Fig 7.
A. Quantification of the effect of COBRA (Constraint-Based Reconstruction and Analysis)-based features on model performance. CV stands for the best cross-validation accuracy (R² values). Higher scores imply a better fit. B. Comparison of the performance of individual machine learning models with that of the ensemble model. TS stands for test scores (R² values). CV stands for the best cross-validation accuracy (R² values). Higher scores imply a better fit.
Fig 8.
Titer learning curve as a function of the size of the training data set.
The training scores (R²) and cross-validation (CV) scores (also R²) are shown. Below 800 training examples, the variation in the cross-validation accuracies was too large. The hybrid model can fit the training data set (red points) well irrespective of the number of training examples. The cross-validation scores improve slightly with more data points. This implies that more feature engineering (and not necessarily more data) would be necessary to significantly improve model performance.
Fig 9.
Ensemble learning using stacked regressors.