Flow Cytometric Single-Cell Identification of Populations in Synthetic Bacterial Communities

Bacterial cells can be characterized in terms of their cell properties using flow cytometry. Flow cytometry is able to deliver multiparametric measurements of up to 50,000 cells per second. However, there has not yet been a thorough survey concerning the identification of the population to which bacterial single cells belong based on flow cytometry data. This paper not only aims to assess the quality of flow cytometry data when measuring bacterial populations, but also suggests an alternative approach for analyzing synthetic microbial communities. We created so-called in silico communities, which allow us to explore the possibilities of bacterial flow cytometry data using supervised machine learning techniques. We can identify single cells with an accuracy >90% for more than half of the communities consisting of two bacterial populations. In order to assess to what extent an in silico community is representative of its synthetic counterpart, we created so-called abundance gradients, a combination of synthetic (i.e., in vitro) communities containing two bacterial populations in varying abundances. By showing that we are able to retrieve an abundance gradient using a combination of in silico communities and supervised machine learning techniques, we argue that in silico communities form a viable representation for synthetic bacterial communities, opening up new opportunities for the analysis of synthetic communities and bacterial flow cytometry data in general.


Introduction
Microbial communities are primary contributors in most biogeochemical processes on Earth [1]. As such, a large portion of microbial research has been dedicated to the study of the structure and functionality within microbial communities of various complexities. Historically, these aspects have been largely inferred from research with axenic cultures. Nowadays, the availability of next-generation sequencing technologies has shifted the focus towards the study of microbial taxa directly in their respective environment ('omics). However, both approaches suffer from either a lack of complexity (axenic cultures) or a lack of controllability ('omics) [2].
To cope with these bottlenecks, synthetic microbial communities, assembled through the selection of individual microbial populations, and studied under controlled environmental conditions, have recently been suggested as promising intermediary platforms [2][3][4][5]. Advanced cultivation methods have allowed researchers to construct defined and diverse synthetic bacterial consortia for both ecological and biotechnological research [6]. Depending on the goal of the study, these synthetic bacterial consortia may consist of several [5,7] to more than ten taxa [8,9]. The goals of these studies can be manifold; on the one hand, enhanced biotechnological conversion processes such as production of biofuels are envisioned [10,11], while on the other hand synthetic communities are used as simplified ecosystem models for developing ecological theories [7,8]. It is worth noting that the latter studies have facilitated advanced experimental design, with large microcosm studies using thousands of consortia.
With the number of synthetic ecology studies ever increasing, the analysis of low-complexity community compositions, i.e. quantifying the abundance of each constituent taxon, remains the most significant challenge. A study by Saleem et al. used traditional plate counting, which entailed the cultivation of all individual members on agar plates followed by enumeration of the colony forming units (CFU) [7]. Their counting approach benefited from the fact that each microbial population in their study had distinct morphological characteristics. However, cultivation-based enumeration inevitably suffers from a significant source of bias, since lab cultures frequently adopt a viable but non-culturable (VBNC) state [12]. This results in inflated numbers of false negatives and, as such, in severe underestimations of population densities.
Other studies, such as the one by Mee et al., applied quantitative PCR (qPCR) [9]. Yet, while successful for their Escherichia coli mutants, the analysis of complex synthetic communities that consist of diverse taxa (e.g., mixtures of Gram-positive and Gram-negative bacteria) faces considerable bias due to taxon-dependent nucleic acid extraction efficiencies, varying amplification efficiency and also primer selectivity. In extremis, this has limited studies with complex synthetic communities to relate their temporal observations only to the initial community composition [8]. Overall, there exists a lack of streamlined and validated methods to monitor the composition of synthetic consortia.
Flow cytometry (FCM) offers a multiparametric description of individual cells, which can be applied to study microbial communities [13,14]. As the speed of measurement is increasing (up to 50,000 cells per second), along with the dimensionality of the data, the number of computational and statistical methods and applications, collectively dubbed FCM bioinformatics, is growing accordingly [15].
The main goal of this paper is to explore in a systematic way the possibilities of using FCM data to identify bacterial single cells, in order to be able to characterize the composition of synthetic bacterial communities. We will do this by introducing the concept of in silico communities. These are communities created by an aggregation of FCM data coming from axenic cultures which are being measured separately through FCM. The great advantage of using this approach is that we know which cell stems from which bacterial population. This enables us to apply a supervised machine learning approach, which has shown previous success in the recognition of leukemia [16] or to find markers which are able to discriminate between tumor and normal cells in lung cancer [17]. More specifically, artificial neural networks have been used to identify various populations of phytoplankton [18,19]. Applied to bacterial populations, this approach has been used to analyze the effect of various cocktails of fluorescent staining [20] or to analyze the extent to which individual cells can be classified using multiple scatter signals [21]. However, the number of populations used in these latter studies is small, studying only pairwise combinations of two taxa.
In the first part of the paper we analyze to what extent data coming from FCM can be used in order to separate microbial populations at the single-cell level. To do so we have cultivated twenty axenic cultures and characterized them by FCM. We performed single-cell predictions using Linear Discriminant Analysis (LDA), an established method for performing multivariate analyses in microbial ecology [22], and a Random Forest classifier, a robust classifier known for its high performance in various applications [23].
In the second part we show to which degree an in silico community is able to identify a synthetic bacterial community. This is not a foregone conclusion due to the heterogeneous character of bacterial populations, which is reflected in FCM data [24]. In order to do so we created so-called abundance gradients, a combination of in vitro communities which consist of two populations in varying abundances. We will show that we are able to retrieve these relative abundances, using a classifier trained on an in silico community; this result enables researchers to perform a supervised analysis of synthetic microbial communities.
In the third part of the paper we estimate to what extent bacterial communities can be analyzed for higher population complexities, i.e., in a multiclass setting. To do so we created and evaluated in silico communities containing more than two populations. The results show that our approach is valid for communities of lower complexities; furthermore, FCM gives rise to data that should be feasible for higher complexities as well. A schematic overview of the proposed method can be found in Fig 1.

Classification performances on binary in silico communities
The performances using LDA and a Random Forest classifier were calculated for all possible pairwise combinations considering twenty populations for S = 2, S denoting the number of populations making up a community, i.e., the population richness. This results in 190 in silico communities, where the same amount of cells was sampled for each population, thus creating evenly distributed in silico communities. We calculated the mean for the area under the ROC curve (AUC) and the accuracy (acc), accompanied with their standard deviations and the percentage of communities which reported a score higher than 0.90; results are reported in Table 1.
We conclude that for a majority of in silico communities we are able to perform single-cell predictions up to high performances, especially when using Random Forests; in this case more than half of our communities report results higher than 0.90 for both AUC and the accuracy. Our highest performances top off at an AUC of 0.999 and an accuracy of 0.996. To further illustrate our findings we have visualized the AUC and the accuracy for all in silico communities, where performances have been ranked in descending order according to the results of applying a Random Forest classifier (Fig 2).
On average we see that a Random Forest classifier performs better than LDA. However, it is not always necessary to use a 'black-box' non-linear classifier such as a Random Forest. For some of the in silico communities we see that the performance of LDA is similar to the performance of Random Forests; 45% of the in silico communities report an increase in AUC of less than 0.03, 17% report an increase in accuracy less than 0.03. Moreover, note that pairwise combinations of populations give rise to performance accuracies ranging from 99% to near random guessing predictions. Hence, our dataset is highly representative, that is, we were not biased towards highly discriminative populations.

Predicting the abundance gradient
An abundance gradient consists of a set of bacterial communities containing two populations in varying abundances. We constructed these gradients for three combinations of bacterial populations, combinations for which we initially reported a low (Comb. 1), medium (Comb. 2) and high performance (Comb. 3) respectively. We created these gradients in vitro, but, because we measured the bacterial cultures separately beforehand through FCM, we were also able to construct these gradients in silico. In order to explore to what extent in silico communities can be used to identify synthetic bacterial communities, we have predicted the relative abundances of both in silico and in vitro abundance gradients, using LDA and a Random Forest classifier trained on a full evenly distributed in silico community. Ideally, a classifier which is able to achieve a high AUC and accuracy on a held-out test set of this in silico community gives rise to a well-predicted abundance gradient, both in silico and in vitro. The predicted abundance gradients are visualized in Fig 3. As expected, the predicted abundance gradients for Comb. 2 and 3 match the target abundance gradients (Fig 3C, 3D, 3E and 3F), whereas this is not the case for Comb. 1 (Fig 3A and 3B). We highlight the similar behavior for the in silico gradient (left panel) and the in vitro gradient (right panel). First, we note a systematic bias using LDA for Comb. 3; although trained on an evenly distributed in silico community, the classifier systematically favors the S. oneidensis population.
Table 1. Performances using LDA and Random Forests (RF) for S = 2. Both classifiers were trained on 70% of the data for all 190 in silico communities, after which they predicted, for the 30% held-out test sets, the population to which individual cells belong. We denote the mean AUC ($\mu_{AUC}$) and accuracy ($\mu_{acc}$), along with their standard deviation ($\sigma_{AUC/acc}$) and the percentage of communities reporting a performance of 0.90 or higher.
Second, we note that for Comb. 2 the predicted gradients largely overlap; this means that an analysis using LDA and a Random Forest gives rise to very similar results. This is however not the case for Comb. 3, where the use of Random Forests results in a gradient that lies closer to the target gradient, which is reflected in both the in silico and the in vitro analysis. These observations are reflected in the root mean squared error (RMSE), which is calculated between the predicted gradients and the known target abundance gradients (Table 2). The RMSE for the in silico analysis can be interpreted as the best value a classifier can achieve when analyzing an in vitro community. We see that the RMSE gives comparable results when performing an in silico or in vitro analysis for Comb. 3; this is however not the case for Comb. 1 or 2. This can result from experimental noise when creating in vitro gradients. Using the knowledge that FCM analyses generally do not exceed a 5% instrumental error [25], we performed a comparable analysis in terms of the Hill number of order one, i.e., the exponential of the Shannon diversity, denoted as $D_1$ [26]; this diversity index gives information concerning the evenness of a community (see Appendix: Alpha diversity analysis). Because mathematical properties of this index allow us to combine uncertainties for all relative abundances characterizing a microbial community, we can calculate confidence intervals (CI) within which our in vitro target abundance gradient should lie. Inspecting the results, we see that our predicted abundance gradients for most communities in Comb. 2 (both LDA and Random Forests) and 3 (Random Forests) lie within the 68%-CI; all of them lie within the 95%-CI (S4 Fig).
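As an illustration of the diversity index mentioned above, the Hill number of order one is the exponential of the Shannon entropy, $D_1 = \exp(-\sum_i p_i \ln p_i)$. The following is a minimal Python sketch, assuming relative abundances that sum to one; the function name `hill_number_order_one` is hypothetical and not taken from the authors' code, and the confidence-interval calculation of the appendix is not reproduced here.

```python
import math

def hill_number_order_one(abundances):
    """Exponential of the Shannon entropy for relative abundances summing to 1."""
    total = sum(abundances)
    if not math.isclose(total, 1.0, rel_tol=1e-9, abs_tol=1e-9):
        raise ValueError("relative abundances must sum to 1")
    # Terms with p == 0 contribute nothing, since p * ln(p) -> 0 as p -> 0.
    shannon = -sum(p * math.log(p) for p in abundances if p > 0)
    return math.exp(shannon)

# Two-population communities at the extremes of an abundance gradient.
even = hill_number_order_one([0.5, 0.5])      # perfectly even community
skewed = hill_number_order_one([0.01, 0.99])  # one population dominates
```

For two populations the index ranges from 1 (a single population dominates completely) to 2 (both populations equally abundant), which is why it can be read as a measure of evenness along the gradient.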
The results for the in vitro analysis of Comb. 2 and 3 are similar, although we would expect from initial performances that these values would be different. To investigate this issue, we added additional results in Table 3, for which we report the performance of a classifier on a held-out test set of the new in silico communities in terms of the accuracy and the AUC, compared to the original values calculated in the previous section (*). In order to be able to make a comparison, classifiers were trained and evaluated in exactly the same way.
We note that although the performances are similar for Comb. 3, this is not the case for Comb. 1 and 2. Whereas the performances for Comb. 1 were initially higher, the performances for Comb. 2 were initially lower. This could explain why the RMSE values for the in vitro analysis of Comb. 2 and 3 are similar. However, this implies that although our approach is fruitful for analyzing synthetic communities, performances are not yet reproducible when axenic cultures are characterized by FCM at different time points.
Table 3. Performance comparison for the in silico communities that are present in both dataset 1 and 2. Classifier performance comparison on a held-out test set for dataset 1 (denoted with *) and 2 for those in silico communities that are present in both datasets. These in silico communities are constructed and used in exactly the same way, that is, they are evenly distributed communities consisting of the same number of cells and made up of the same bacterial taxa. Classifiers are trained on 70% of the data and evaluated on the remaining 30%.

Evaluation of higher complexity in silico communities
In order to explore to what extent single-cell predictions can be made when we increase the population richness, we created in silico communities in a multiclass setting. We used the same approach as in the binary setting, but now we let S vary over S = 2, ..., 20. To keep it computationally feasible we chose 150 different in silico communities at random for every increment in S (except for S = 19 and S = 20, where we only have 20 and 1 different combinations respectively at our disposal). To quantify our results we calculated the mean accuracy for every S; results are displayed in Fig 4. For all values of S our approach is able to make single-cell predictions significantly better than random guessing. As S increases, both the mean accuracy and the size of the confidence interval decrease.
As the richness increases, the degree of overlap between populations in the multiparametric 'FCM-space' grows accordingly. Therefore it is harder for classifiers to make a distinction between populations, which results in performances that are lower and more centered.
The difference in performance between the two classifiers increases as S increases. This means that for communities with a low richness (S = 2, 3) LDA might provide a sufficient method to make single-cell predictions, but as S increases Random Forest will be a better option for most communities. This also implies that although for low S a linear combination of variables already discriminates populations quite well, predictions can be improved by

Discussion
Using the concept of in silico communities, we are able to use supervised machine learning techniques to taxonomically identify bacterial cells up to high accuracies based on FCM data. We note that this approach has not yet been adopted to analyze the composition of synthetic bacterial communities. A possible reason for this is the lack of incorporating these methods in standard FCM software [27].
Using a full combination of fluorescence and scatter signals, we demonstrated that using 'off-the-shelf' classifiers without further data manipulation already results in acceptable to high performances for low population richness. Compared to previous research, we note that Rajwa et al. were not able to use LDA in order to make proper single-cell predictions [21]. While they were limited to the combination of scatter signals, we also incorporated fluorescence parameters in our analysis, thereby improving the amount of single-cell information that is acquired. In our study, we applied a single staining approach; there exists, however, a wide array of fluorescent viability markers, all of which may harbor additional single-cell information [28,29]. Preliminary observations have already revealed the differential behavior of bacterial taxa to these staining protocols [28]. As the number of dimensions and the amount of fluorochromes describing a single-cell is increasing, we expect our approach only to gain in utility in the near future. A natural extension of this research would be to find the optimal classification method to analyze FCM data, which should be extensible to a multiclass setting; a number of possibilities exist, ranging from binary classifiers which are naturally extendable to a multiclass setting or a combination of binary classifiers using a one-versus-one (OVO) or one-versus-all (OVA) approach [30].
Although it has been briefly mentioned in literature that an in silico community can be representative for its in vitro counterpart [20], there is a lack of rigorous studies proving this observation. We feel that this question has been answered more thoroughly by systematically retrieving the composition of synthetic communities across an abundance gradient. The results imply that in silico communities form a valid representation of synthetic communities. However, although the performance of classifiers gives a good indication to what extent populations are distinguishable, it is not always possible to reproduce the classifier performance in different experiments. This observation can be attributed to two sources of variation, namely technical variability and biological variability. It has been shown that both sources give rise to heterogeneity in FCM data when studying bacterial axenic cultures [24], although it is difficult to distinguish between one another [31].
Technical variability has been suggested to arise from the time-dependent bleaching and leaking of fluorochrome molecules [32]. Its effect on the classifier performance becomes clear when conducting an in silico performance evaluation using individual replicates (instead of pooling them, as they are measured in duplicate). Creating two sets of replicates for S = 2, A & B, we see that for a significant number of combinations the difference in classifier performance is noteworthy, with a mean difference of 2% and a standard deviation of 11%. For clarity, we added the Random Forest performances (A, B and pooled) for all in silico communities in S1 File. However, referring to the results of the in vitro analysis, we note that pooling replicate samples compensates this experimental bias and is sufficient to retrieve the composition of an in vitro community. In order to reduce technical variability as much as possible, we do suggest to include a higher number of replicates for future experiments. To find this number, the strategy of Davis et al. can be followed [33], which suggests that less than five replicates (but more than two) are sufficient for most experiments.
Biological variability is another and perhaps more important factor to take into account when analyzing microbial communities with FCM. Vives-Rego et al. hypothesize that biological variability in FCM stems from cell size diversity and cell cycle variations [24]. In this study we tried to control for this variability by focusing on cultures in the stationary growth phase, so that we could directly compare the performance of the analysis. Yet overall, the multitude of biological processes that result in single cell physiological variation still remain largely undefined [34]. Results of this research comply with motivations that FCM can be used to further characterize bacterial heterogeneity and physiology [35][36][37], for which a holistic approach has been proposed [13].
To do so, a more comprehensive protocol is required to make our in silico approach fully operational, a need which has been pointed out before [38]. This protocol includes further improvement of data-analysis techniques, such as automated denoising, but also a more developed methodology to reduce sources of variability, both of instrumental and biological origin. However, we believe that the combined approach of microbial flow cytometry and machine learning supports this endeavor, and this will be the main focus of further research.
For now, in silico communities can already be exploited for various purposes. For environments where limited physiological variation in the axenic populations is expected, or where the in silico populations have been defined for all possible physiological states, our approach can be used to retrieve the community composition for low-complexity microbial communities. Furthermore, by using evaluation tools for classifiers such as the accuracy or the AUC, one can quantify which populations are distinguishable and which are not. One intuitive tool which is extensible to a multiclass setting, is the use of a confusion or misidentification matrix. This allows one to inspect which populations are likely to overlap and which are not; an example is given in S1 Fig. Secondly, as we have shown that in silico communities form a viable representation of their in vitro counterparts, we are allowed to extrapolate properties of in silico communities to in vitro microbial communities. This means that in silico communities can be used as a stand-in for in vitro communities, enabling us to use them to develop new data-driven techniques, which will ultimately lead to novel applications for microbial FCM.

In silico communities
An in silico community consists of an aggregation of data coming from axenic cultures, which are measured separately through FCM. As we have twenty axenic cultures at our disposal, the population richness (S) of an in silico community varies over S = 2, ..., 20.

Learning in silico communities
Each bacterial population was sampled in equal size. We randomly subsampled $N_{ax}$ = 5,000 cells per axenic culture. This means that an in silico community consists of $N_{tot} = S \times N_{ax}$ cells. We used 70% of an in silico community to train a classifier (the training set); the other 30% was held out and used to evaluate the performance of the classifier (the test set).
For S = 2 we evaluated the performance of LDA and the Random Forest classifier for all possible pairwise combinations, which is 190. For increasing S, i.e. the multiclass setting, we evaluated the performance for 150 randomly chosen combinations for S = 2, . . ., 18 (for S = 19 and S = 20 we chose the maximum number of combinations, which is 20 and 1 respectively), in order to keep it computationally feasible. For every increment of S we calculated the mean accuracy, averaging the accuracies for all 150 randomly chosen in silico communities.
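The counts quoted above follow from the binomial coefficient: there are 190 ways to pick 2 populations out of 20, 20 ways to pick 19, and a single way to pick all 20. The sketch below shows one way such a sampling scheme could be set up in Python. The function name `sample_communities`, the seed, and the choice to start random sampling at S = 3 are assumptions of this sketch, not details taken from the authors' code.

```python
import math
import random

N_POPULATIONS = 20
MAX_COMMUNITIES = 150

def sample_communities(richness, rng):
    """Return up to MAX_COMMUNITIES distinct communities of the given richness."""
    n_possible = math.comb(N_POPULATIONS, richness)
    n_wanted = min(MAX_COMMUNITIES, n_possible)
    chosen = set()
    while len(chosen) < n_wanted:
        # Each community is a sorted tuple of population indices, so that
        # the same set of populations is never counted twice.
        members = tuple(sorted(rng.sample(range(N_POPULATIONS), richness)))
        chosen.add(members)
    return sorted(chosen)

rng = random.Random(0)
n_pairs = math.comb(N_POPULATIONS, 2)  # all pairwise communities for S = 2
# Assumption of this sketch: random sampling is applied from S = 3 upwards.
communities = {s: sample_communities(s, rng) for s in range(3, N_POPULATIONS + 1)}
```

Drawing random subsets until enough distinct ones are found avoids enumerating all combinations, which for S = 10 would already amount to 184,756 candidate communities.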

Learning in silico communities to predict the abundance gradient
We used the concept of an abundance gradient to show that properties of in silico communities can be used for the identification of their in vitro counterparts. An abundance gradient consists of a set of microbial communities where populations have been mixed in varying abundances. We created an abundance gradient both in silico and in vitro for three combinations of two populations, with abundances ranging from 1% to 99% for the one population and vice versa for the other. This was possible as we measured the axenic cultures separately through FCM beforehand. We chose three different combinations of two populations to create abundance gradients, combinations which initially reported a low, medium and high performance respectively, based on the performance of the Random Forest classifier for S = 2. For every community in an abundance gradient we sampled 10,000 cells (both in silico and in vitro), except for one in vitro community of Comb. 3, for which we were not able to register enough cells (see further on).
We trained a classifier on an evenly sampled in silico community to predict the label of individual cells for all communities in an abundance gradient. We sampled $N_{ax}$ = 5,000 cells per bacterial population to create the in silico community upon which we trained our classifier; as the abundance gradient acts as our test set, we trained our classifier on the full in silico community. Note that we have cultivated and measured new axenic cultures in order to create both the in silico and in vitro abundance gradients.
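The step from single-cell predictions to a predicted relative abundance can be sketched as follows: every cell in a community receives a predicted label, and the predicted abundance of a population is the fraction of cells assigned to it. The Python sketch below uses toy one-dimensional measurements and a fixed threshold as a stand-in for a trained classifier; the data, the threshold and all function names are assumptions made for illustration only.

```python
import random

# Target relative abundances of population A along the gradient.
GRADIENT = [0.01, 0.05, 0.10, 0.20, 0.30, 0.40, 0.50,
            0.60, 0.70, 0.80, 0.90, 0.95, 0.99]

def sample_in_silico_community(cells_a, cells_b, fraction_a, n_cells, rng):
    """Draw n_cells from two axenic measurements at a target relative abundance."""
    n_a = round(fraction_a * n_cells)
    return rng.sample(cells_a, n_a) + rng.sample(cells_b, n_cells - n_a)

def predicted_abundance(labels, population):
    """Fraction of cells that a classifier assigned to the given population."""
    return sum(1 for label in labels if label == population) / len(labels)

rng = random.Random(0)
# Toy one-dimensional "measurements"; real FCM events have several channels.
cells_a = [rng.gauss(0.0, 1.0) for _ in range(20000)]
cells_b = [rng.gauss(4.0, 1.0) for _ in range(20000)]

def classify(x):
    # Stand-in for a trained classifier: threshold halfway between the means.
    return "A" if x < 2.0 else "B"

predictions = []
for target in GRADIENT:
    community = sample_in_silico_community(cells_a, cells_b, target, 10000, rng)
    labels = [classify(x) for x in community]
    predictions.append(predicted_abundance(labels, "A"))
```

Because misclassified cells shift the counted fractions, a classifier with imperfect separation pulls the predicted gradient towards 50% at both ends, which is one way a gap between the predicted and the target gradient can arise.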

Datasets
Dataset 1: axenic cultures. Twenty bacterial populations were gathered from publicly available culture collections, of which a full list can be found in Table 4. Populations with
To create abundance gradients, we chose three combinations (Comb.) of two bacterial populations based on Random Forest performances calculated during the in silico analysis of dataset 1 (Table 5). We measured the exact cell densities of both bacterial cultures through FCM, and used them to calculate the required volumetric proportions to construct a relative abundance gradient of 1%, 5%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 95% and 99%. After 24h of growth, cells were diluted in 0.2 μm filtered PBS for FCM measurement. Based on these cell densities both cultures were diluted in 0.2 μm filtered PBS to an equal cell density of approximately $10^{8}$ cells mL$^{-1}$, which was verified through an additional FCM measurement. The equal density suspensions were then mixed in the required proportions to final volumes of 500 μL. All samples were subsequently measured in triplicate through FCM (for Comb. 2 the axenic cultures were measured in quadruplicate). More than 10,000 cells were registered for each measurement, except for Comb. 3, where we registered 3,084 cells for 1% of S. oneidensis.
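The volumetric proportions follow from simple arithmetic. If suspensions A and B have cell densities $d_A$ and $d_B$ and are mixed to a total volume $V$, a target cell fraction $p$ of population A requires a volume $v_A = p\,d_B V / (p\,d_B + (1-p)\,d_A)$; for equal densities this reduces to $v_A = pV$. The Python sketch below works this out; the function name `mixing_volumes` is hypothetical and the general formula is a derivation added for illustration, not a description of the authors' procedure.

```python
def mixing_volumes(fraction_a, density_a, density_b, total_volume_ul=500.0):
    """Volumes (in microliters) of suspensions A and B for a target cell fraction of A."""
    if not 0.0 <= fraction_a <= 1.0:
        raise ValueError("fraction_a must lie between 0 and 1")
    denominator = fraction_a * density_b + (1.0 - fraction_a) * density_a
    volume_a = fraction_a * density_b * total_volume_ul / denominator
    return volume_a, total_volume_ul - volume_a

# Equal densities of about 1e8 cells per mL, as in the protocol described above.
va_equal, vb_equal = mixing_volumes(0.05, 1e8, 1e8)

# Hypothetical case with unequal densities, for a 50%/50% community.
va_unequal, vb_unequal = mixing_volumes(0.5, 2e8, 1e8)
```

Diluting both cultures to the same density first, as described above, makes the volume fraction equal to the cell fraction, so that a 5% community in 500 μL needs 25 μL of one suspension and 475 μL of the other.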
As we measured the populations separately beforehand through FCM, we were also able to construct an abundance gradient in silico, by sampling communities according to the same relative abundances as described above.
Data preprocessing. The FCM data were denoised from (in)organic and instrument noise by means of a reproducible digital gating strategy in the arcsinh(x) transformed FL1-FL3 bivariate space, following the guidelines by Hammes et al. [39] and Prest et al. [40]. This filtering strategy was verified by negative controls (non-stained samples) and kept fixed for all samples of the same axenic culture and within each abundance gradient. An example of the gating strategy is given for the 40%/60% abundance files for all three combinations used to create abundance gradients (S3 Fig). Filtered data files were exported as individual FCS files with

Classifiers
Linear Discriminant Analysis. Linear Discriminant Analysis (LDA) is a linear classifier which tries to find the optimal linear combination of features in order to separate objects or classes. It assumes the data are distributed according to a Gaussian distribution. It has no hyperparameters to tune and is able to handle problems in the multiclass setting in a natural way. For more information, see [41] or chapter 4.3 in [42].
Random Forests. A Random Forest classifier is an example of an ensemble method, a method in which various classifiers are trained and in which a majority vote is taken to predict the outcome of an unknown sample. In this case the ensemble consists of decorrelated unpruned trees grown on bootstrap samples. The trees are decorrelated because at every split only a random subset of the total number of K variables is available (K = 12). This results in a decrease in variance for only a slight increase in bias, hence lowering the overall classification error. For more information see [43] or chapter 15 in [42].
We grew 200 trees when training a Random Forest and chose the gini criterion when making a split. We note that there is no need to tune the number of features that are available to choose from when making a split. We applied the preset $\sqrt{K}$, which resulted in (near-)optimal results, in accordance with [44]. This has been verified by comparing the performance for twenty randomly chosen in silico communities for S = 2, ..., 19 using the preset $\sqrt{K}$ as opposed to determining this value by 10-fold cross-validation. The increase in accuracy was never higher than 0.7%.
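The classifier settings described above map onto scikit-learn roughly as sketched below. This is a minimal illustration under stated assumptions, not the authors' script: the two "axenic cultures" are synthetic Gaussian data rather than FCM measurements, the number of cells is reduced from 5,000 to 500 per culture to keep the sketch fast, and the stratified split and fixed random seeds are choices made here.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_cells, n_features = 500, 12  # K = 12 variables; fewer cells than in the paper

# Synthetic stand-ins for two axenic cultures (not real FCM data).
culture_a = rng.normal(loc=0.0, scale=1.0, size=(n_cells, n_features))
culture_b = rng.normal(loc=1.0, scale=1.0, size=(n_cells, n_features))

# In silico community: pooled events, labelled by their culture of origin.
X = np.vstack([culture_a, culture_b])
y = np.repeat([0, 1], n_cells)

# 70% training set, 30% held-out test set (stratification is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0)

classifiers = {
    "LDA": LinearDiscriminantAnalysis(),
    # 200 trees, gini criterion, sqrt(K) candidate features at each split.
    "RF": RandomForestClassifier(n_estimators=200, criterion="gini",
                                 max_features="sqrt", random_state=0),
}

scores = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    acc = accuracy_score(y_test, clf.predict(X_test))
    auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
    scores[name] = (acc, auc)
```

On Gaussian data with a shared covariance, LDA is close to the optimal classifier, so this toy example is not expected to reproduce the advantage of Random Forests reported for real FCM data.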

Performance measurement
We used various performance metrics in order to evaluate our methodology. We evaluated the in silico analysis in terms of the accuracy and the area under the receiver operating characteristic curve (AUC). The in vitro analysis is expressed in terms of the root mean squared error (RMSE).
Accuracy. The accuracy can be defined in the following way:
$$\mathrm{acc} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}(\hat{y}_i = y_i),$$
where $N$ denotes the total number of elements to predict, $\hat{y}$ the predicted label of an element, $y$ the true label and $\mathbb{1}$ the indicator function, which returns the value of 1 when its argument is true and 0 otherwise. It can also be expressed in terms of the true positives (tp), the number of correctly predicted elements belonging to a certain class j, true negatives (tn), the number of correctly predicted elements not belonging to class j, false positives (fp), the number of incorrectly predicted elements belonging to class j, and false negatives (fn), the number of incorrectly predicted elements not belonging to class j. In this setting, the accuracy can be written as:
$$\mathrm{acc} = \frac{1}{S} \sum_{j=1}^{S} \frac{tp_j + tn_j}{tp_j + tn_j + fp_j + fn_j},$$
where $S$ denotes the total number of classes.
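To make the definition concrete, the following Python sketch computes the accuracy once as the fraction of correctly predicted labels and once from the tp, tn, fp and fn counts in a binary setting, where both forms coincide. The labels and function names are made up for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of elements whose predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("label sequences must have equal length")
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

def binary_counts(y_true, y_pred, positive):
    """Return (tp, tn, fp, fn), treating `positive` as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return tp, tn, fp, fn

# Six cells from a two-population community; two of them are misclassified.
y_true = ["A", "A", "A", "B", "B", "B"]
y_pred = ["A", "A", "B", "B", "B", "A"]

acc = accuracy(y_true, y_pred)
tp, tn, fp, fn = binary_counts(y_true, y_pred, "A")
acc_from_counts = (tp + tn) / (tp + tn + fp + fn)
```

Both routes give 4 correct predictions out of 6 cells in this example.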

Area under the receiver operating characteristic curve. The AUC measures the area under the receiver operating characteristic (ROC) curve and can be used as a performance measurement for a binary classifier [45,46]. The ROC curve is constructed by plotting the tp rate against the fp rate for various thresholds. These thresholds can be determined for classifiers which assign probabilities to predictions; this is the case for both LDA (applying Bayes' theorem) and for Random Forests (applying a majority vote for the ensemble of trees).
Calculating this area results in a number between 0 and 1; the higher this number, the better the performance of a classifier. The AUC can be interpreted as the probability that a classifier will rank a randomly chosen positive higher than a randomly chosen negative. Using the AUC has a number of favorable properties. Most notable are the fact that it gives an indication of how well separated the positive and negative class are and that it is insensitive to prior skewness concerning class distributions.
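The ranking interpretation can be checked directly on a small example: counting, over all positive-negative pairs, how often the positive receives the higher score gives the AUC. The Python sketch below does this by brute force, with ties counted as one half; the scores are made up and the function name `auc_by_ranking` is hypothetical.

```python
def auc_by_ranking(scores_positive, scores_negative):
    """Probability that a random positive outscores a random negative."""
    wins = 0.0
    for sp in scores_positive:
        for sn in scores_negative:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5  # ties count as half a win
    return wins / (len(scores_positive) * len(scores_negative))

# Predicted probabilities of belonging to the positive class.
pos = [0.9, 0.8, 0.4]
neg = [0.7, 0.3, 0.2]
auc = auc_by_ranking(pos, neg)  # 8 of the 9 pairs are ranked correctly
```

Because only the ordering of the scores matters, changing the proportion of positives and negatives does not change the expected value, which is the insensitivity to class skew noted above.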
Root mean squared error. Expressing the known relative abundance as $p$, as opposed to the predicted $\hat{p}$, the RMSE becomes:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (p_i - \hat{p}_i)^2},$$
with $n$ being the total number of bacterial communities constituting an abundance gradient. Therefore, when the set of predictions is close to the ground truth, the RMSE lies close to zero.
Confusion matrix. A confusion matrix is a tool which helps to describe the performance of a classifier. It reports the tp, tn, fp and fn, and is naturally extendable to a multiclass setting. In this way one can inspect to what extent a classifier 'confuses' certain labels of classes. An example for the binary setting is given in Table 6.
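As a small worked example of the RMSE defined earlier in this section, the Python sketch below compares a hypothetical three-point target gradient with made-up predicted abundances; the numbers are illustrative and not results from the paper.

```python
import math

def rmse(target, predicted):
    """Root mean squared error between target and predicted abundances."""
    if len(target) != len(predicted) or not target:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - q) ** 2 for p, q in zip(target, predicted)]
    return math.sqrt(sum(squared) / len(target))

target = [0.10, 0.50, 0.90]     # known relative abundances
predicted = [0.12, 0.47, 0.90]  # hypothetical classifier output
error = rmse(target, predicted)
```

Here the squared deviations are 0.0004, 0.0009 and 0, so the RMSE is the square root of 0.0013/3, roughly 0.021 on the relative-abundance scale.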
Applied to the use of in silico communities, one is able to inspect which populations are easily separated by a classifier and which populations have a similar FCM fingerprint.
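A confusion matrix for a multiclass setting can be tallied in a few lines. The Python sketch below uses three made-up populations A, B and C; rows correspond to the true population and columns to the predicted one, a layout chosen for this sketch.

```python
from collections import Counter

def confusion_matrix(y_true, y_pred, labels):
    """Rows are true populations, columns are predicted populations."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]

labels = ["A", "B", "C"]
y_true = ["A", "A", "A", "B", "B", "B", "C", "C", "C"]
y_pred = ["A", "A", "B", "B", "B", "A", "C", "C", "C"]

matrix = confusion_matrix(y_true, y_pred, labels)
# Off-diagonal entries show which populations are mistaken for one another.
```

In this toy example populations A and B are confused with each other once in each direction, while population C is always recognized, which would suggest that A and B have the more similar fingerprint.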

Computational Tools
Code availability. Our code has been made available on GitHub: https://github.com/prubbens/InSilicoFlow.
Data availability. Our data has been made freely available in .fcs format on the FlowRepository database [47], and can be found using the following identifiers:
• Axenic cultures: FR-FCM-ZZSH.
flowCore. The data has been preprocessed and exported using flowCore, a package of computational tools written in R for the analysis of FCM data [48].

Scikit-learn. Scikit-learn is an open-source library of various machine learning methods, which can be used in Python [49]. We used its implementation to perform LDA and Random Forests, to calculate the AUC and accuracy, and to perform cross-validation.