Predicting bacterial growth conditions from mRNA and protein abundances

Cells respond to changing nutrient availability and external stresses by altering the expression of individual genes. Condition-specific gene expression patterns may thus provide a promising and low-cost route to quantifying the presence of various small molecules, toxins, or species-interactions in natural environments. However, whether gene expression signatures alone can predict individual environmental growth conditions remains an open question. Here, we used machine learning to predict 16 closely-related growth conditions using 155 datasets of E. coli transcript and protein abundances. We show that models are able to discriminate between different environmental features with a relatively high degree of accuracy. We observed a small but significant increase in model accuracy by combining transcriptome and proteome-level data, and we show that measurements from stationary phase cells typically provide less useful information for discriminating between conditions as compared to exponentially growing populations. Nevertheless, with sufficient training data, gene expression measurements from a single species are capable of distinguishing between environmental conditions that are separated by a single environmental variable.


Introduction
Environmental conditions across the planet vary in terms of their capacity to support microbial life. Individual environments can also change rapidly over time, and these changes are likely to impact the composition of microbial communities and ecosystem functions in unpredictable ways [1,2]. To measure various properties of the environment, microbial cells can be engineered to act as biosensors via rational design of synthetic genetic circuits [3]. In contrast to gold standard approaches that are comparatively labor intensive and expensive, microbial cells can be engineered, for instance, to rapidly screen for the presence of heavy metals in aquatic environments [4]. Such applications can provide a useful, low-cost diagnostic for monitoring environmental changes and detecting pollutants and/or toxins [5], but individual synthetic biology applications take time and resources to develop. Additionally, there is an everpresent concern about potential dangers associated with releasing genetically engineered species into natural environments.
By contrast, prior work has shown that the natural species composition of an environment may be sufficient to serve as a rapid and low-cost biosensor to indicate the presence of various contaminants according to the species abundances identified via meta-genomic sequencing [6][7][8][9]. However, many bacterial species within a community are generalists that are capable of thriving in diverse environments and must therefore sense and respond to various environmental signals [10]. For instance, Escherichia coli grows inside the comparatively warm, nutrient rich digestive tract of host organisms [11] but spends another portion of its life-cycle exposed to harsh environmental conditions upon being excreted and before finding another host. The mere presence of generalist species in an environment may provide little value for understanding past or current environmental conditions because their gene and expression diversity permits growth across variable environments [12]. The extent to which gene expression patterns of individual generalist species can be used to discriminate between environmental conditions-or to supplement species composition-based methods-remains unknown. Gene expression profiles for individual cells or populations contain a wealth of information about their current physiological state, but measurements for thousands of genes across numerous conditions are challenging to integrate under traditional statistical methods. Further, combining different 'omics'-scale technologies has been shown to provide more valuable information compared to monitoring only mRNA abundances alone, but integrating datasets is challenging due to the biases of individual methods [13] and the inevitability of batch-level effects that occur when datasets are generated across multiple labs and platforms [14,15]. Machine learning methods, by contrast, are frequently applied to such data-rich applications, for example to differentiate between cancerous and normal cells/tissues [16][17][18][19][20] using a variety of different machine learning models [21,22].
In microbiology applications, machine learning has been frequently applied to infer regulatory networks and molecular pathways from gene expression data [23][24][25], and from this knowledge to predict the growth capabilities of cells in different environments [26][27][28]. However, the primary focus in many of these studies has been to understand aspects of the cellular physiology. In this framework, environmental change serves as a perturbation that can be used to provide insight into internal cellular mechanisms/pathways [29]. While explicitly representing a cell's internal state may help to predict cellular phenotypes such as growth capabilities across environments [30][31][32], it is unclear whether explicit representation of cellular metabolic pathways, for instance, are necessary to distinguish between cells growing in different environmental conditions [33,34]. Few studies have focused on using the abundance of cellular macromolecules to predict external environmental features across a range of partially-overlapping conditions and cellular growth states.
Here, we are interested in determining whether gene expression patterns can be leveraged to discriminate between environmental conditions in the absence of prior knowledge about the role and function of individual genes or explicit representation of cellular metabolism. Our study leverages a large dataset of transcriptomic and proteomic measurements of E.coli growth under multiple distinct but closely-related conditions [35]. We use mRNA and protein composition data to train several distinct machine learning models and find that highly similar environmental conditions can be discriminated with a high degree of accuracy. We also investigate which conditions are more-and less-challenging to discriminate and find that prediction accuracies decrease for stationary phase cells, indicating the importance of cellular growth for discriminating between conditions. Finally, we caution that the overall accuracy of our models may be limited by training set size; we found that the most difficult conditions to predict are the conditions for which we have the smallest amount of training data. This suggests that our findings may represent a lower bound on the predictive power that is achievable given a greater availability of training data.

Data structure and pipeline design
We used a previously generated dataset of whole-genome E. coli (strain REL606) mRNA and protein abundances, measured under 34 different conditions [35,36]. This dataset consists of a total of 155 samples, for which mRNA abundances are available for 152 and protein abundances for 105 (Fig 1). For 102 samples, both mRNA and protein abundances are available. The 34 different experimental conditions were generated by systematically varying four parameters: carbon source, growth phase, Na + concentration, and Mg 2+ concentration. Here we further simplified the experimental conditions into a total of 16, by grouping similar conditions together (e.g., 100, 200, and 300mm Na + were all labelled as "high Na + "). For the remainder of this work (unless otherwise noted) we use the term "growth condition" to refer to the Overview of available gene expression data. Our study uses a previously published dataset consisting of 155 samples [13,14]. 152 samples have whole-transcriptome RNA-seq reads and 105 have mass-spec proteomics reads. 102 of the 155 samples have both mRNA and protein reads. Bacteria were grown on four different carbon sources (glucose, glycerol, gluconate, and lactate), two sodium concentrations (base and high), and three magnesium concentrations (low, base, and high). Samples were taken at multiple time points during a two-week interval, and they can be broadly subdivided into exponential phase, stationary phase, and late stationary phase samples.
four-dimensional vector of categorical variables defining: i) growth phase (exponential, stationary, late stationary), ii) carbon source (glucose, glycerol, gluconate, lactate), iii) Mg 2+ concentration (low, base, high), and iv) Na + concentration (base, high). While we note that growth phase is not strictly an environmental feature, we suspected that this indicator of cellular state would be an important feature to consider since prior research has shown that the macromolecular composition of cells varies substantially between exponentially growing and stationary phase cells [35,36]. With these data and features, the question we set out to answer is: to what extent are machine learning models capable of discriminating between the known growth parameters given only knowledge of gene expression levels?
We first split samples into training/validation and test datasets using a semi-random approach that randomly splits data while preserving class balances. We performed several data processing steps, including batch correction and Principal Component Analysis (PCA), to reduce the dimensionality of the data (see Materials and Methods for details). We analyzed the top 10 genes contributing to the dominant principal components (PC1 and PC2, in both mRNA and protein datasets) and found that they all have orthologs in both B and K strains suggesting that data collection/extrapolation across different strains may not be particularly problematic for future studies (S1 Table). Additionally, PC1 was enriched for highly expressed genes in both mRNA and protein datasets (elongation factors, RNA polymerase subunits, outer membrane proteins, etc.), with the protein datasets also consisting of important chaperones (dnaK and groEL).
During the model tuning phase, we optimized hyperparameters in the machine learning pipeline by further splitting the training/validation data into training and validation sets, fitting models to the labeled training set, and optimizing for model accuracy on the validation set. We performed cross-validation by making 10 unique splits of the training/validation samples-with 75% of samples in training and 25% in validation sets-and searched across a parameter grid to select the hyperparameters that gave the highest F 1 score on the validation set. Finally, we tested the accuracy of model predictions on the test dataset using the optimized hyperparameters from the tuning phase. To assess the overall robustness of our findings, we used repeated testing to replicate our entire pipeline 60 times and report the mean and range of variation in our final test set accuracies. Our pipeline is illustrated in Fig 2 and

Growth conditions can be predicted accurately from both mRNA and protein abundances
After constructing our analysis pipeline, we first asked whether there were major differences in the performance of different machine learning approaches. Since our overall goal was to demonstrate the feasibility and limitations of using machine learning on gene expression data to predict environmental features, we wanted to: i) ensure that our choice of machine learning algorithm did not substantially affect our results/conclusions and ii) determine the best method for this particular application since prior work has shown that the choice of machine learning model can substantially affect the accuracy of best fitting models [21,22]. We tested four different machine learning models: three based on Support Vector Machines (SVMs) with different kernels (radial, sigmoidal, and linear) and a fourth using random forest classification. We trained our models to predict [12,37] the entire four-dimensional condition vector at once for a given sample, and used the multi-class macro-F 1 score [38] to quantify prediction accuracy.
We note that various metrics can be applied to quantify model accuracy during classification tasks-each with particular strengths and limitations. The multi-class macro-F 1 score is the harmonic mean of precision (of all the positive predictions made by a model, "what fraction are correct?") and recall (of all the possible positive predictions, "what fraction does the model return?"). This quantity approaches zero if either quantity approaches zero, and it approaches one if both quantities approach one (representing perfect prediction accuracy). We further emphasize that our scoring scheme will classify a prediction as incorrect if even a single variable is incorrectly predicted, even if the predictions for the remaining three variables of interest are correct. We made this choice, rather than binary classification of individual variables, so that our findings would be conservative and represent a lower bound on the prediction accuracy for this task. Our pipeline can be separated into three parts: (i) initial data preparation, (ii) training and prediction, and (iii) model tuning. After (i) initial data preparation, the samples are (ii) semi-randomly (preserving sub-sample ratios) separated into 2 parts, the training/validation set and the test set. After applying fSVA and PCA to the training/validation data, we train supervised SVM or random forest models on the training/validation set. After obtaining the tuned model we make predictions on the test data that has been batch corrected (via fSVA) and rotated (via PCA). This whole process is repeated 60 times to collect statistics on model performance. For model tuning (iii), the training/validation data set is similarly divided semi-randomly into training and validation datasets to optimize hyperparameters using a grid search approach. The tuning procedure is repeated 10 times and the parameter set that performs best-on average-during the 10 repeats is considered the winning model and is used for prediction on the test set data. We assessed model performance during the tuning stage of our pipeline by recording which model and hyper-parameter set had the best macro-F 1 score for the validation set (S1 and S2 Figs). During this tuning stage, we found that the SVM model with a radial kernel clearly outcompeted the other models when fit to mRNA data, and the random forest model outcompeted the other models when fit to protein data (Table 1).
We next compared the F 1 scores for model predictions applied to the test set. When using mRNA abundance data alone, the distribution of F 1 scores from repeated testing of 60 independent replications were centered around a value of~0.55 (Fig 3). The F 1 score distributions were virtually identical for the three SVM models and was lower for the random forest model. Model performance on test data using only protein abundance measurements was slightly worse than what was achieved with mRNA abundance data. However, it is important to note that the protein abundance data contains fewer samples overall, which may partially explain Table 1. Winning-model distributions at the tuning stage. Numbers show the number of times out of 60 independent runs that each given model had the highest F 1 score in the tuning process. Results are shown separately for predictions on the mRNA and the protein data. The ties are counted for all the "winner" models as a result the sums are bigger than 60.  Distributions of multi-class macro F 1 score for prediction of growth conditions from mRNA or protein abundances, using four different machine-learning algorithms (SVM with radial, sigmoidal, or linear kernel, and random forest [RF] models). For each model type, 60 independent models were trained on 60 independent subdivisions of the data into training/validation and test sets. We found that random forest models consistently performed worse than SVM models, and predictions based on mRNA data were slightly better than predictions based on protein data. The black dots represent the mean F 1 scores.

Model
the decreased predictive accuracy of the protein-only model-a point to which we return to later.
In addition to assessing the overall accuracy of our predictive models using F 1 scores, we also recorded the percentage of times specific growth conditions were accurately or erroneously predicted. We report these results in the form of a confusion matrix (Fig 4). Here, the Test set prediction accuracy for specific growth conditions. In each matrix, rows represent true conditions and columns represent predicted conditions. The numbers in the cells and the shading of the cells represent the percentage (out of 60 independent replicates) with which a given true condition is predicted as a certain predicted condition. (A) Predictions based on mRNA abundances. Results are shown for the SVM with radial kernel, which was the best performing model in the tuning process on mRNA data, where it won 55 of 60 independent runs. In this panel, the average of the diagonal line is 60.5% and corresponding multi-class macro F 1 score is 0.61. (B) Predictions based on protein abundances. Results are shown for the SVM with sigmoidal kernel, which was the best performing model in the tuning process on protein data, where it won 41 of 60 independent runs. In this panel, the average of the diagonal line is 55.1% and corresponding multi-class macro F 1 score is 0.56.
https://doi.org/10.1371/journal.pone.0206634.g004 column headings at the top show the predicted condition from the model on the test set and the rows show the true experimental condition. The numbers and shading in the interior of the matrix represent the percentage of cases that a given experimental condition was predicted to be a certain growth condition (numbers within each row add up to 100). The large numbers/dark colorings along the diagonal highlight the high percentage of true positive predictions whereas any off-diagonal elements represent incorrect predictions. We found that the erroneous off-diagonal predictions are partially driven by the uneven sampling of different conditions in the original dataset. Even though we used sample-number-adjusted class weights in all fitted models, we observed a trend of increasing fractions of correct predictions with increasing number of samples available during the training stage (S3 Fig).
As we previously noted, the F 1 score quantifies accuracy by only considering perfect predictions (i.e. when all 4 features are correctly predicted); a sample that is incorrectly classified for all four features is thus treated the same as one that only differs from the true set of features by a single incorrect factor. In practice, however, we observed that the majority of incorrect predictions differed from their true condition vector by only a single value (S4 Fig).

Joint consideration of mRNA and protein abundances improves model accuracy
We next asked whether predictions could be improved by simultaneously considering both mRNA and protein abundances. To address this question, we limited our analysis to the subset of 102 samples for which both mRNA and protein abundances were available and ran our analysis pipeline for mRNA abundances only, protein abundances only, and for the combined dataset containing both mRNA and protein abundances. For all four machine-learning algorithms, protein abundances yielded significantly better predictions than mRNA abundances ( Fig 5, Table 2). This is in contrast to Fig 3, where we saw increased accuracy using mRNA abundance data. However, as previously noted, our dataset contains more mRNA abundance samples, which results in a larger amount of training data for the results presented in Fig 3. When compared on the same exact conditions-as depicted in Fig 5-protein abundance data appears more valuable for discriminating between different growth conditions. Notably, the combined dataset consisting of both mRNA and protein abundance measurements yielded the best overall predictive accuracy, irrespective of machine-learning algorithm used (Fig 5, Table 2).
When considering the confusion matrices for the three scenarios (mRNA abundance, protein abundance, and combined), we found that many of the erroneous predictions arising from mRNA abundances alone were not that common when using protein abundances and vice versa (S5 and S6 Figs). For example, when using mRNA abundances, many conditions were erroneously predicted as being exponential phase, glycerol, base Mg 2+ , base Na + ; or as stationary phase, glucose, base Mg 2+ , high Na + ; these particular erroneous predictions were rare or absent when using protein abundances. By contrast, when using protein abundances, several conditions were erroneously predicted as being stationary phase, glycerol, base Mg 2+ , base Na + , and these predictions were virtually absent when using mRNA abundance data. For predictions made from the combined dataset, erroneous predictions unique to either mRNA or protein abundances were suppressed, and only those predictions that arose for both mRNA and protein abundances individually remained present in the combined dataset (S7 Fig).

Prediction accuracy differs between environmental features
We next assessed the sources of inaccuracy in our models. As previously noted, the majority of incorrect predictions differed by only a single factor (S4 Fig). The environmental features that accounted for most of these single incorrect predictions were Mg 2+ concentration for the protein-only data and carbon source for mRNA-only data. Despite the importance of growth phase to macromolecular abundances, we reasoned that growth (e.g. exponential, stationary, late-stationary) is not an environmental variable and using this as a feature may partially skew our results if the goal is to predict strictly external conditions. Models trained on both mRNA and protein data perform better than models trained on only one data type. The 102 samples for which we have both protein and mRNA abundances were used to compare the performance of machine learning models based on only mRNA, only protein, and mRNA and protein data combined (left to right, respectively). Regardless of the machine learning model used, prediction performance was higher for models that use protein data compared to mRNA data. Further, using both mRNA and protein data resulted in higher predictive power compared to either alone. Statistical significance of these differences is reported in Table 2.
https://doi.org/10.1371/journal.pone.0206634.g005 Table 2. Statistical significance of comparisons shown in Fig 5. Distributions of multi-class macro F 1 scores were compared using t-tests. The adjusted P value reports the false discovery rate (FDR). All comparisons are statistically significant after correction for multiple testing via FDR. We thus trained and tested separate models using only exponential or only stationary phase datasets and asked to what extent these models could predict the remaining 3 environmental features (carbon source, [Mg 2+ ], and [Na + ]). We found that prediction accuracy was consistently better for models trained on exponential-phase samples compared to models trained on stationary-phase samples, irrespective of the machine-learning algorithm used or the data source (mRNA, protein abundances, or both) (Fig 6). This observation implies that E. coli gene expression patterns during stationary phase are less indicative of the external environment compared to cells experiencing exponential growth. Despite the lower accuracies, however, predictive accuracy from models trained solely on stationary phase cells was still much higher than random expectation, highlighting the fact that quiescent cells retain a unique signature of the external environment for the conditions studied.

Model
To better understand which conditions were the most problematic to predict, we constructed models to predict only individual features rather than the entire set of 4 features. This is an easier task when compared to predicting all 4 dimensions simultaneously, and this ease is reflected in the relatively accurate confusion matrices that we observed (S8 Fig). For predictions based on mRNA abundances only, models were most accurate in predicting growth phase and least accurate for carbon source, with Mg 2+ and Na + concentration falling between these two extremes. By contrast, for predictions based on protein abundances, the most predictable feature was carbon source, the least predictable was Mg 2+ concentration with Na + concentration and growth phase fell in-between these two extremes (Fig 7, S8 Fig). Finally, for the combined mRNA and protein abundance dataset, we found that accuracy for carbon source and Mg 2+ concentration fell between the accuracies observed using mRNA and protein abundances individually. By contrast, accuracies for the Na + concentration and growth phase were as good as-or better than-the prediction accuracies of the individual datasets ( S9 Fig). Together, these findings highlight that mRNA and protein abundances differ in their ability to discriminate between particular environmental conditions. Prediction accuracy systematically declines from exponential to stationary phase growth. We separated data by growth phase and then trained separate models to predict carbon source, magnesium level, and sodium level within each growth phase. Regardless of the data source, prediction accuracy was substantially lower for stationary-phase samples than for exponential-phase samples. For each model and growth phase, dots show the mean F 1 score over 60 replicates and lines connect mean F 1 scores calculated for the same model. Predicting bacterial growth conditions from mRNA and protein abundances

Model validation on external data
The samples that we studied throughout this manuscript are fairly heterogeneous and were collected by different individuals over a span of several months/years. However, different sample types were still analyzed within the same labs, by the same protocols, and thus may be more consistent than one might expect from data collected and analyzed independently by different labs-which would be an ultimate goal of future applications of this methodology. We thus applied our best-fitting protein abundance model to analyze protein data with similar conditions that was independently collected and analyzed [12]. However, the largest external comparison dataset that we could find consisted of measurements for only~2,000 proteins, which is substantially less the 4196 proteins that we measured and constructed our models on. Further, the particular bacterial strain (BW25113, a "K" strain) used in this external dataset was distinct from ours (REL606, a "B" strain), so not all of the proteins from our model have direct orthologs in this external dataset. Based on our analysis of the dominant genes contributing to the principal components (S1 Table), however, this strain level-variation may be less important than the missing data values. We tested two alternative approaches of applying our model to the external data. For the first approach, we filled the missing parts of the external data with the median values of our in-house data before making predictions (Table 3). In the second approach, we restricted our training dataset to only include proteins that appeared in the external validation data set (Table 4). These two approaches lead to comparable results. Predicting bacterial growth conditions from mRNA and protein abundances Notably, our model made mostly correct predictions on this entirely independent dataset. The model was most accurate at distinguishing between different growth phase data, and moderately accurate at distinguishing Na + concentration and carbon source. The external data did not consist of samples with variable Mg 2+ concentrations, however, and we note that our model incorrectly predicted several samples to have high Mg 2+ .

Discussion
Our central goal here was to determine whether gene expression measurements from a single species of bacterium are sufficient to predict environmental features. We analyzed a rich dataset of 152 samples for mRNA data and 105 samples for protein data across 16 distinctly classified laboratory conditions as a proof-of-concept. We showed that E. coli gene expression is responsive to external conditions in a measurable and consistent way that permits identification of environmental features from gene signatures alone via supervised machine learning techniques.
While E. coli is a well-characterized species, our analysis relies on none of this a priori knowledge. Previous approaches have focused on modeling cellular biology and metabolism in order to predict the growth capabilities of individual species in various environments [27][28][29]. Rather than using varied environmental conditions to interrogate cellular regulation [23,25], we instead determined that the abundances of cellular macromolecules themselves are sufficient to provide accurate information about environmental conditions.  Table 3, here we show the accuracy of predictions based on a model that was trained only on the subset of proteins from our dataset that were present in the external test data.
Regular text indicates a correct prediction for the sample in the given column, the ‡ symbol indicates an incorrect prediction, and the † symbol indicates a prediction where the external data falls between two categories in our data.
https://doi.org/10.1371/journal.pone.0206634.t004 Interestingly, we found that consideration of mRNA and protein datasets alone is sufficient to produce accurate results, but that joint consideration of both datasets results in superior predictive accuracy. This finding implies that post-transcriptional regulation is at least partially controlled by external conditions, which has been observed by previous studies that have investigated multi-omics datasets [13,37,39,40]. Such regulation may result from post-translational modifications [41], stress coping mechanisms [42], differential translation of mRNAs, or protein-specific degradation patterns.
Our results show that cellular growth phase places limits on the predictability of external conditions, with stationary phase cells being particularly difficult to distinguish from one another irrespective of their external conditions. A possible explanation for this behavior may be endogenous metabolism, whereby stationary phase cells start to metabolize surrounding dead cells instead of the provided carbon source. This new carbon source, which is independent of the externally provided carbon source, may suppress differences between cells growing on different external carbon sources [43,44]. Another reason for this behavior might be related to strong coupling between gene expression noise and growth rate. Multiple studies have concluded that lower growth rates are associated with higher gene expression noise, which might be a survival strategy in harsh environments [45]. Negative correlations between population average gene expression and noise have been shown for E. coli and Saccharomyces cerevisiae, lending support for this theory [46,47]. Finally, we note that stationary phase cells are likely to have depleted the externally supplied carbon sources after several days of growth. The similarity of stationary phase cells to other stationary phase cells may be a consequence of them actually inhabiting more similar chemical environments to one another compared to during exponential growth where nutrient concentrations are more varied across conditions. Despite these caveats with regard to cellular growth phase, discrimination of external environmental factors in stationary phase cells was still much better than random-indicating that these populations continue to retain information about the external environment despite their overall quiescence.
Another relevant finding to emerge from our study is that different features of the environment may be more or less easy to discriminate from one another and this discrimination may depend on which molecular species is being interrogated. Growth phase, for instance, can be reliably predicted from mRNA concentrations but similar predictions from protein concentrations were less accurate. A possible explanation for this observation may be the differences in life cycles between mRNAs and proteins [36,48]. Given the comparably slow degradation rates of proteins, a large portion of the stationary-phase proteome is likely to have been transcribed during exponential-phase growth. As another example, carbon sources can be reliably predicted from protein concentrations, but the accuracy of carbon source predictions from models trained on mRNA concentrations was more limited. Carbon assimilation is known to be regulated by post-translational regulation [49][50][51], which may be a possible reason for this finding (Fig 7, S9 Fig).
We investigated over 150 samples spanning 16 unique conditions, but a limitation of our work and conclusions is nevertheless sample size (though our study is comparable to or larger than similar multi-conditional transcriptomic and/or proteomic studies [12,[52][53][54]). The comparison between all available data with the more limited set that includes only the samples for which we have both mRNA and protein abundances indicates that prediction accuracy decreases as the size of our training sets gets smaller (152 vs 102 mRNA samples, Fig 3 compared to Fig 5), strongly implying that training set sizes limit overall model accuracy for at least a portion of our results. A second but related possible issue with our study is associated with sample number bias [55][56][57]. We made corrections with weight factors [58,59] and used the multi-class macro-F 1 score [60] to account for the fact that some conditions contained more samples than others, but the predictability of individual conditions nevertheless increased with the number of training samples for that particular condition (S3 Fig). Accuracy limitations could be more thoroughly evaluated through the use of learning curves to determine whether test set accuracies plateau with increasing training set size, but the class imbalance problem and fairly low number of overall samples per condition in our data make it difficult to evaluate accuracies across a broad range of training set sizes. Future work with larger sample numbers will be useful to interrogate whether accuracies are ultimately limited by training set sizes or by some other features inherent to the data and/or methods.
Another caveat of our study is our choice of score that we used to both optimize hyperparameters during the training phase and report for our test set accuracies. The most comprehensive and intuitive evaluation of our results is contained within confusion matrices (Fig 4); collapsing these data-rich matrices into a single number is convenient but can also be problematic. Quantifying the accuracy of multi-class classifiers (simultaneously predicting 4 separate vectors) is challenging and standards are generally lacking but the multi-class macro-F 1 score provides an intuitive scale (ranging from 0 to 1, with 1 representing perfect accuracy) and should account for all possible errors by averaging across predictions for each class. We recognize that the use of other scoring schemes, such as multi-class AUROC [61,62], could alter the model fits during the training phase and the final reported accuracies but the magnitude of these differences should be minor.
We also chose to evaluate different machine learning models throughout this manuscript to ensure the robustness of results and to determine if model choice had a substantial impact on classification accuracy. Overall, we found that the three SVM models performed equivalently to one-another and outperformed random forest models on most tasks. While machine learning models can be difficult to interrogate owing to data transformations, linear kernel SVM models return interpretable output that can be used to determine the most important features and therefore would be preferred for future work in this space given the seeming equivalence between linear, sigmoidal, and radial kernel models. The differences between all models were minor, however, and this finding shows that the accuracy of our classification task is robust to different assumptions.
Our study is a proof-of-principle, demonstrating that gene expression patterns of natural species may provide useful information for assessing various aspects of the environment. Other research has shown that the microbial species composition, derived from meta-genomic sequencing, may be useful for determining the presence of particular contaminants [6]. Our results suggest that further incorporation of species-specific gene expression patterns can likely improve the accuracy of such methods. While genetically engineered strains may play a similar role as low-cost environmental biosensors, we show that-with enough training data-the macromolecular composition of natural populations may provide sufficient information to accurately resolve past and present environmental conditions.

Data preparation and overall analysis strategy
We used a set of 155 E. coli samples previously described [35,36]. Throughout this study, we used different subsets of these samples in different parts of the analysis. For "mRNA only" and "protein only" analyses we used all 152 samples with mRNA abundances and all 105 samples with protein abundances, respectively. For performance comparison of machine learning models between mRNA and protein abundances we used the subset of 102 samples that have both mRNA and protein abundance data. After selecting appropriate subsets of the data for a given analysis, we added abundances from technical replicates, normalized abundances by size factors calculated via DeSeq2 [63], and applied a variance stabilizing transformation [64,65] (VST).
For each separate analysis, we divided the data into two subsets, (i) the training/validation set and (ii) the test set, using an 80:20 split (Fig 2). This division was done semi-randomly, such that our algorithm preserved the ratios of different conditions between the training/validation and the test subsets. We retained the condition labels in the training/ validation data (thus our learning was supervised) but we discarded the sample labels for the test set. We then applied frozen Surrogate Variable Analysis [66] (fSVA) to remove batch effects from the samples. This algorithm can correct for batch effects in both the training & tune and the test data, without knowing the labels of the test data. After fSVA, we used principal component analysis [67] (PCA) to define the principal axes of the training/validation set and then rotated the test data set with respect to these axes. We then picked the top 10 most significant axes in the training/validation dataset for learning and prediction. Finally, we trained and tuned our candidate machine learning algorithms with the dimension reduced training/validation dataset and then applied those trained and tuned algorithms on the dimension-reduced test dataset to make predictions. This entire procedure was repeated 60 times for each separate analysis (Fig 2).
We used four different machine learning algorithms: SVM models with (i) linear, (ii) radial, and (iii) sigmoidal kernels, and (iv) random forest models. We used the R package e1071 [68] for implementing SVM models and the R package randomForest [69] for implementing random forest models. SVMs with radial and sigmoidal kernels were set to use the c-classification [70] algorithm.

Model scoring
Our goal throughout this work was to predict multiple parameters (i.e., growth phase, carbon source, Mg 2+ concentration, or Na + concentration) of each growth condition at once. Therefore, we could not measure model performance via ROC or precision-recall curves, which assume a simple binary (true/false) prediction. Instead, we assessed prediction accuracy via F 1 scores, which jointly assess precision and recall. In particular, for predictions of multiple conditions at once, we scored prediction accuracy via the multi-class macro F 1 score [38,60,71] that normalizes individual F 1 scores over individual conditions, i.e., it gives each condition equal weight instead of each sample. There are two different macro F 1 score calculation that have been proposed in the literature. First, we can average individual F 1 scores over all conditions i [60]: where h� � �i indicates the average and the individual F 1 scores are defined as: Alternatively, we can average precision and recall and then combine those averages into an F 1 score [38]: Between these two options, we implemented the first, because it is not clear that individually averaging precision and recall before combining them into F 1 appropriately balances prediction accuracies from different conditions with very different prediction accuracies.

Model training and tuning
For training, we first divided the training/validation data further into separate training and validation datasets, using a 75:25 split (Fig 2). As before, for the subdivision between training/validation and test data, we did this semi-randomly while trying to preserve the ratios of individual conditions. We repeated this procedure 10 times to generate 10 independent pairs of training and validation datasets. Next, we generated a parameter grid for the tuning process. We optimized the "cost" parameter for all three SVM models and the "gamma" parameter for the SVM models with radial and sigmoidal kernels (S1 Fig). For the random forest algorithm, we optimized three parameters; "mtry", "ntrees", and "nodesize".
We trained each of the four machine learning models on all 10 training datasets and made predictions on the 10 validation datasets. We applied a class weight normalization during training, where class weights are inversely proportional to the corresponding number of training samples and calculated independently for each training run. We calculated macro-F 1 scores for each model parameter setting for each validation dataset and then averaged the scores over all validation datasets to obtain an average performance score for each algorithm and for each parameter combination. The parameter combination with the highest average F 1 score was considered the winning parameter combination and was subsequently used for prediction on the test dataset (Fig 2).

Model validation on external data
We validated our predictions against independently published external data [12]. This external dataset consisted of 22 conditions, of which we could match five to our conditions. For all five samples, Mg 2+ levels were held constant in the external dataset at a level that approximately matched our base Mg 2+ concentrations. The first sample used glucose as carbon source, did not experience any osmotic stress (no elevated sodium), and was collected during the exponential growth phase. The second sample used glycerol as carbon source, did not experience any osmotic stress (no elevated sodium), and was collected in the exponential growth phase. The third sample included 50mM sodium, glucose as carbon source, and was collected in the exponential growth phase. Because our high-sodium samples all included 100mM of sodium or more [35], this third sample fell in-between what we consider "base" sodium and "high" sodium. Samples four and five used glucose as carbon source, did not experience osmotic stress, and were measured after 24 and 72 hours of growth, respectively. In our samples, we defined stationary phase as 24-48 hours and late stationary phase as 1 to 2 weeks [35]. Thus, sample four matched our stationary phase samples and sample five fell in-between our stationary and late-stationary phase samples.

Statistical analysis and data availability
All statistical analyses were performed in R. All processed data and analysis scripts are available on GitHub: https://github.com/umutcaglar/ecoli_multiple_growth_conditions (permanent archived version available via zenodo: 10.5281/zenodo.1294110). mRNA and protein abundances have been previously published [35,36]. Raw Illumina read data and processed files of read counts per gene are available from the NCBI GEO database [72] (accession numbers GSE67402 and GSE94117). Mass spectrometry proteomics data are available via PRIDE [73] (accession numbers PXD002140 and PXD005721). The number of mis-predicted labels (x-axis) indicates how many of the 4 possible condition variables that an individual prediction got wrong. 0 mis-predicted labels (the majority in both cases) means that model predictions were 100% accurate. In both cases (mRNA and protein), when an incorrect prediction was made, it was most frequently due to a single variable being incorrectly predicted (number of mis-predicted labels with a value of 1) as compared to errors predicting more than one variable for a given condition (2 and 3 mis-predicted labels). (TIFF)

S5 Fig. Prediction accuracy for specific growth conditions for intersection mRNA data.
Rows represent true conditions and columns represent predicted conditions. The numbers in the cells and the shading of the cells represent the percentage (out of 60 independent replicates) with which a given true condition is predicted as a certain predicted condition. Predictions based on mRNA abundances, generated by using subset of mRNA samples which has matching protein pairs. Results are shown for the SVM with radial kernel, which was the best performing model in the tuning process on mRNA data, where it won 48 of 60 independent runs. The average of the diagonal line is 44.1% and multi class macro F1 score is 0.43. (TIFF)

S6 Fig. Prediction accuracy for specific growth conditions for intersection protein data.
Rows represent true conditions and columns represent predicted conditions. The numbers in the cells and the shading of the cells represent the percentage (out of 60 independent replicates) with which a given true condition is predicted as a certain predicted condition. Predictions based on protein abundances, generated by using subset of protein samples which has matching mRNA pairs. Results are shown for the SVM with sigmoid kernel, which was the best performing model in the tuning process on mRNA data, where it won 47 of 60 independent runs. The average of the diagonal line is 52.3% and corresponding multi class macro F1 score is 0.53. (TIFF) S7 Fig. Prediction accuracy for specific growth conditions for intersection mRNA & protein data. Rows represent true conditions and columns represent predicted conditions. The numbers in the cells and the shading of the cells represent the percentage (out of 60 independent replicates) with which a given true condition is predicted as a certain predicted condition. Predictions based on protein abundances, generated by using subset of mRNA & protein samples which has matching pairs. Results are shown for the SVM with sigmoid kernel, which was the best performing model in the tuning process on combined intersection data, where it won 27 of 60 independent runs. The average of the diagonal line is 56.1% and corresponding multi class macro F1 score is 0.57. (TIFF)