Machine learning identification of Pseudomonas aeruginosa strains from colony image data

When grown on agar surfaces, microbes can produce distinct multicellular spatial structures called colonies, which contain characteristic sizes, shapes, edges, textures, and degrees of opacity and color. For over one hundred years, researchers have used these morphology cues to classify bacteria and guide more targeted treatment of pathogens. Advances in genome sequencing technology have revolutionized our ability to classify bacterial isolates and while genomic methods are in the ascendancy, morphological characterization of bacterial species has made a resurgence due to increased computing capacities and widespread application of machine learning tools. In this paper, we revisit the topic of colony morphotype on the within-species scale and apply concepts from image processing, computer vision, and deep learning to a dataset of 69 environmental and clinical Pseudomonas aeruginosa strains. We find that colony morphology and complexity under common laboratory conditions is a robust, repeatable phenotype on the level of individual strains, and therefore forms a potential basis for strain classification. We then use a deep convolutional neural network approach with a combination of data augmentation and transfer learning to overcome the typical data starvation problem in biological applications of deep learning. Using a train/validation/test split, our results achieve an average validation accuracy of 92.9% and an average test accuracy of 90.7% for the classification of individual strains. These results indicate that bacterial strains have characteristic visual ‘fingerprints’ that can serve as the basis of classification on a sub-species level. Our work illustrates the potential of image-based classification of bacterial pathogens and highlights the potential to use similar approaches to predict medically relevant strain characteristics like antibiotic resistance and virulence from colony data.


Introduction
Since Semmelweis, Lister, and Koch began studying pure cultures of microorganisms, scientists have tried to organize the immense diversity of bacteria into orderly categories to understand their behavior, including pathogenesis.In particular, Koch pioneered the method of determining how specific microorganisms cause distinct diseases and highlighted the importance of laboratory culture for the identification and targeted treatment of pathogenic bacteria [1].By identifying the specific bacterial cause of disease, we can better manage and eradicate the infection.When a sample from an environment or clinical source is plated on solid media at sufficient dilution, individual colonies can be observed, where each colony grows from a single founding bacterial cell (Fig 1A).These clonal groups have characteristic sizes, shapes, Bacterial colony morphology varies across and within species.A) Morphological identification of bacterial species from a mixed culture plated on Chocolate Agar: Rothia mucilaginosa (smallest white circles), Neisseria subflava (larger tan round circles), and Streptococcus mitis (large bullseye circles).B) A common example of 2D colony morphology features include the appearance of the colony edges.C) Morphological identification of bacterial strains, in this case called "smooth", "wrinkly spreader", and "fuzzy spreader" [4], from a culture containing only Pseudomonas fluorescens plated on King's Medium B Agar. https://doi.org/10.1371/journal.pcbi.1011699.g001PLOS COMPUTATIONAL BIOLOGY edges, textures, and degrees of opacity and color (Fig 1B) [2].These characteristics describe the morphology of a single colony and formed the basis for the earliest attempts at categorization.In modern clinical laboratories, morphological tests play a role but are increasingly replaced by biochemical and genomic methods of classification [3].
While initial classification was morphology based, advances in genome sequencing technology have revolutionized our ability to classify bacterial isolates.Widespread use of 16S rRNA amplicon sequencing not only allows for robust taxon identification to the genus scale [5,6], but also fosters phylogenetic organization of bacterial taxa [7,8].Marker gene or whole genome sequencing provides finer scale strain identification and allows specific focus on markers of pathogenicity (via identification of virulence factors), drug resistance or other characteristics of interest.Transcriptomic sequencing provides a more direct window into the behavior of pathogens, to assay for example whether virulence factor or drug resistance genes are actually being expressed [9].
While genomic methods are in the ascendancy, morphological characterization of bacterial species has made a resurgence.How a population of bacteria grow is a direct result of the interplay between their genotype, phenotype, and environment [10] and a priori contains critical phenotypic information on the behavior of individual microbes.For example, colony morphotype can indicate specific mutations in a bacteria, e.g.small colony variants and cyclic di-GMP regulation [11] or rugosity and phenazine production [12].Morphological characterization and data science have collided to provide insights into image-based prediction of microbial taxa [13][14][15][16][17][18].The earliest attempts introduced light-scattering for presence-absence detection of microorganisms on surfaces [19,20], which paved the way for image-based identification of Listeria species from light scattering patterns of Listeria colonies [13].This developed into an automated light scattering sensor called BARDOT (BActerial Rapid Detection using Optical scattering Technology) that produced scatter patterns sensitive enough to distinguish between Escherichia coli serotypes [14], virulence mutants [15], and industrial Staphylococcus contaminants [16].At the same time, imaged-based prediction diversified to use more accessible standard light microscopes instead of specialized equipment, putting the focus on new analysis methods like colony morphology ontology [18] and deep learning [17,21].
Deep learning algorithms have gained popularity [17,21,22] as machine learning becomes ubiquitous across fields.Several supervised deep learning models have demonstrated near human accuracy in classification of various image datasets [23] and in the case of bacterial biofilms, deep learning has outperformed human characterization of single and mixed species biofilms [24].Zielinksi et al. [17] investigate several models for colony classification utilizing a deep convolutional neural network (D-CNN) on a rich dataset pre-trained on the ImageNet [25] dataset, achieving a very high validation accuracy of ~97%.However, existing work using a D-CNN approach has focused on classification across species, not across strains within a species.Additionally, a ubiquitous challenge in many biology applications (including the one in this paper) is the limited number of samples.For example, the hallmark ImageNet imagebased dataset [26][27][28] contains more than 14 million images across 1000 classes.Outside a colloquial 'rule of thumb' that you need at least 1000 samples per class, there is no absolute way to determine a 'minimum' sample size [29].
We focus on within species classification of Pseudomonas aeruginosa, a gram-negative bacterium capable of causing severe chronic illness and a major cause of acute hospital-acquired infections [30][31][32][33].While it is widely viewed as a generalist and can be isolated from a wide number of environments including water and soil, it is frequently found in human/animalimpacted environments [30].P. aeruginosa has large variety of mechanisms that allow it to respond to its environment [34][35][36].There are frequently differences in behavior across strains, like type and production of secreted products [37,38], which can result in observable PLOS COMPUTATIONAL BIOLOGY morphological diversity when cultured on agar surfaces [39][40][41][42][43]. Classic examples of morphological variation in P. aeruginosa caused by underlying genetic differences include small colony variants related to cyclic di-GMP regulation [11], rugosity related to phenazine production [12], and mucoidy related to secreted polymer production [44].
In this paper, we use a collection of 69 clinical and environmental P. aeruginosa isolates which present a range of phenotypic features (e.g.exopolysaccharide production, virulence gene expression, and antibiotic resistance profiles) that have been previously described in [45][46][47].From these 69 isolates we generate a library of 266 P. aeruginosa colony images in a training / validation dataset, plus a separately generated 69-image test dataset.As expected, we document a high level of variation in the physical appearance of colonies across strains.Additionally, we see that strain morphology in P. aeruginosa is consistent across replicates.We then use a D-CNN approach and a combination of data augmentation [48,49] and transfer learning [50] to classify strains of P. aeruginosa and achieve an average validation accuracy of 92.9% and an average test accuracy of 90.7%.

Morphological variation across strains and replicates
Individual isolates from our 69-strain collection were spotted onto Luria-Bertani (LB) agar plates supplemented with Congo Red, a dye commonly used in microbiology to stain the biofilm matrix and extracellular polysaccharides (Fig 2).This was done in quadruplicate and generated a training and validation library of 266 colonies after quality control (10 colonies were removed due to debris or writing obscuring part of the image).Colonies were grown for 72 hours, which was determined during the pilot experiment to be enough time to reveal major morphological differences while minimizing outward expansion.This resulted in an imaged library of image colonies that ranged in size from 12M (approximately 0.74cm x 0.74cm) to 183M pixels (approximately 2.8cm x 2.8cm).
In our first approach to characterizing morphological diversity, we use 8 simple descriptive metrics (Fig 3A).Classic morphological descriptive variables include area, radius, perimeter, centroid (the point in which the three medians of the triangle imposed within the image intersect), eccentricity (the ratio of the distance between the foci of the ellipse and its major axis length), circularity (deviation of the perimeter from a superimposed perfect circle), bounding disk center (center of mass of the brightest pixels), and caliper diameter (the caliper dimension obtained by averaging over all orientations).
To expand on these classic descriptive variables, we next introduce complexity metrics (Fig 3B ) used in image processing and computer vision [51,52].Compression ratio is a single variable that describes how repetitive an image is, i.e. how repetitive are certain patterns in the pixel distribution and how many times do they occur?Specifically, it is a measurement of the reduction in size of pixel data produced by, in our case, the traditional JPEG compression algorithm.An alternate approach to quantifying image complexity is using the Sobel-Feldman operator, a popular method to detect edges in images due to its ability to detect gradients.At every pixel in an image, it calculates the pixel intensity of the vertical and horizontal adjacent pixels and expresses the results as vectors.You can then plot these gradient vectors creating a unique profile for every colony.These profiles can then be summarized by describing their distribution (mean, median, standard deviation, skew, root mean square, and kurtosis).Unsurprisingly, we find that many of these metrics are highly correlated (S1 Fig).

PLOS COMPUTATIONAL BIOLOGY
eccentricity and circularity metrics, we see coefficients of variation across replicates are low (standard deviation << mean), and less than the coefficient of variation across strains.This pattern of variation indicates that colony morphology under common laboratory conditions is a robust, repeatable phenotype on the level of individual strains, and therefore forms a potential basis for strain classification.

Classifying strain identity from image data
Considering our initial results indicating robust morphological traits for each strain, we reasoned that morphological data could be used to classify colonies via a deep learning approach.Deep convolutional neural networks (D-CNNs) typically require larger datasets than those found in biology-usually a minimum of thousands of examples to train properly [53].To address this issue, we applied various transfer learning [50] and data augmentation techniques [48,49] (see methods for details).This initial dataset was split into 90% training and 10% validation sets.
High performance on our validation dataset provides some confidence that the CNN approach can successfully classify previously unseen strain images.Yet validation results merit caution, as the validation images were all generated on the same agar plate, imaged on the same day and from the same overnight culture.These shared features raise the risk that a CNN is detecting technical batch effects of the experimental environment (e.g.variations in the PLOS COMPUTATIONAL BIOLOGY lighting conditions) rather than strain-specific features of interest.To address this concern we generated a separate 69-image test dataset (1 image per strain) using distinct overnight cultures and produced by a different experimenter on a different day (following the same protocol).
In a first round of computational experiments, we sought to compare the performance of four transfer learning models (Fig 5 and S1 Table), using our standard data preprocessing and augmentation choices (normalized pixel intensities; shear / zoom / flip / rotation / brightness / scaling data augmentation-see methods).To determine the best approach for our dataset, our performance metrics are accuracy (the number of correct predictions divided by the total number of predictions x100) and loss (a summation of the errors, calculated via cross-entropy, made for each sample).We chose to use cross-entropy for loss instead of negative log-likelihood since it automatically transforms the raw outputs of the neural network into probabilities (via a softmax activation).).
MobileNetV2 performs significantly worse than the other transfer learning models on both accuracy and loss metrics (Tukey HSD, p < 0.05 on all comparisons, see Tables C-F in S1   Our results in Tables 1 and 2 show that D-CNN models are capable of effective bacterial colony classification down to the sub-species (strain) level, and that performance is dependent

PLOS COMPUTATIONAL BIOLOGY
on details of data pre-processing and augmentation.This success does not rule out that other simpler models can also effectively classify strains.To establish simple model benchmarks, we next evaluate the performance of 'shallow learning' classification models (support vector machines, SVMs), trained on our colony metric features (Figs 3 and 4), or on features extracted by our deep learning ResNet 50 model (Table 2).These benchmarking results  highlight that SVM tools cannot match the performance of our D-CNN models (Fig 5 ), whether they are trained on a priori identified colony metric features (Figs 3 and 4) or on features extracted from our successful D-CNN models.

Discussion
Our results show that P. aeruginosa strains have a characteristic morphological 'fingerprint' (Figs 2-4) that enables accurate strain classification via a custom deep convolutional neural network (Fig 5).When trained on morphological data, our most successful model can classify strain identity with a test accuracy that exceeds 90% despite what is considered a "data starved" dataset.Our work points to extensions towards predicting phenotypes of interest (e.g.antibiotic resistance, virulence), and suggests that sample size limitations may be less restrictive than previously thought for deep learning applications in biology.In the following paragraphs we will review these implications, while addressing caveats and next steps for this research.
The existence of a morphological 'fingerprint' that can be classified via machine learning architecture opens the potential to relate strain identity to key phenotypes of interest, such as antimicrobial resistance (AMR) or virulence.While species differ in their levels and types of virulence [54], there are also significant differences in virulence between strains of the same species [55,56].In so far as AMR or virulence is a stable and repeatable property of a strain, our classification model could simply "look up" the recorded values in a dataset based on the strain prediction.Yet this "look up" approach is vulnerable to variation within strains due for instance to gain or loss in virulence factor and AMR genes, or mutations in the regulatory control of these genes.
To address the challenge of predicting phenotypes such as virulence or AMR, a future direction would be to extend our deep learning approach to directly predict phenotypes of interest, either through a classification (qualitative trait) or regression (quantitative trait) framework.Our current analysis shows that morphological 'fingerprints' are sufficient to identify individual strains, but this does not directly imply that we can use image data to successfully predict disease traits (analogously, human fingerprints are sufficient to identify individuals, but are not considered predictive of human disease traits).We speculate that colony 'fingerprints' are likely to allow for trait prediction, due to paths of common causality from bacterial modes of growth to colony morphology and disease phenotypes.Given the established connection between biofilm growth and antibiotic tolerance [57], it is possible that our default growth settings (Fig 2 ) will reveal differences in patterning that correlate with differences in antibiotic Table 1.Performance contribution of data pre-processing, augmentation, and training.To assess contributions, we took the trained ResNet-50 model (Fig 5) as a baseline method and assessed the impact of removing components of our methods pipeline.Each computational experiment was replicated five-fold (five separate training runs for each model), allowing statistical comparison of approaches.Across replicates we report average (+/-standard deviation) accuracy (the number of correct predictions divided by the total number of predictions x100) and loss (a summation of the errors made for each sample).Statistical comparisons are summarized in S2 PLOS COMPUTATIONAL BIOLOGY resistance (and/or tolerance).We further anticipate that our approach would become more discriminatory if imaging data was also collected in the presence/absence of relevant stressors of interest (e.g.antibiotics or immune effectors).

PLOS COMPUTATIONAL BIOLOGY
Deep learning methods are commonly described as 'black box' methodologies [58], presenting challenges for model interrogation and mathematical justification for sample size requirements.Even when classification through deep learning has a monotonically increasing prediction accuracy, the nature of the image collection process in biological research will limit the total size of the dataset.This is a contrast to machine learning applications in other fields that have continually and often exponentially increasing datasets (e.g., spending habits and surveillance cameras).Considering these challenges, arguably our best alternatives for increasing the size of biological datasets for image-based prediction in microbiology are data augmentation and transfer learning, which we combined in this study.Our algorithm transfers learning from the canonical ImageNet dataset while also using standard data augmentation techniques (rotations, shifts, zooms, shears, flips, reflections, brightness.).Future tests to explore the amount of data needed for accurate classification could be run by reducing the size of the dataset by decreasing the amount of data augmentation performed or by reducing the number of replicate colonies used.Moving forward, generative adversarial networks, a class of machine learning frameworks that learns to generate new data by competing two neural networks, is promising technique that could supplement current data augmentation efforts [59,60].
Prior to our data augmentation, we downsampled the original dataset in order to match the image dimensions of the ImageNet dataset which we pretrained on and to minimize computational time.The original dataset contains colony images of 12 million to 183 million pixels (depending on colony size) while our downsized colony images each contained a standardized 60 thousand pixels.The success of strain classification based on these heavily down-sampled images suggests that classification is possible based only on more 'macro' scale morphological features that are still discriminable at this resolution.Yet this down-sampling is essentially throwing away a rich array of more micro-scale image features (Fig 7), and so opens the possibility for future analyses to explore these features.The inset panels in Fig 7 illustrate striking local and finer-scale features that are present across strains.These panels suggest an additional

PLOS COMPUTATIONAL BIOLOGY
data augmentation option, where each colony image is processed into multiple 60K chunks, sampling at different locations within the image and at different spatial scales.Comparing model performance using 60K images taken at different spatial scales could provide valuable information on what are the most diagnostic spatial scales.
Our analysis to date implicitly assumes that our image data represent independent samples of P. aeruginosa morphology, which due to phylogenetic relationships among these strains is not the case.A simple null expectation given these phylogenetic relationships is that strains that are most morphologically similar (i.e., similar morphological metrics, Fig 3; and more commonly confused in classification, Fig 6) will tend to have a shorter phylogenetic distance.Under this phylogenetic null model, strains with minimal phylogenetic differences will have overlapping morphological metrics and will not provide dependable classification results.We do not expect this pattern to hold uniformly, as we know that single-gene changes can generate large-scale changes to colony morphology [11,12,44].Future work can systematically explore the impacts of single gene mutations through automated screening of transposon mutagenesis libraries [61], potentially aiding in the discovery of previously unrecognized genes involved in colony / biofilm formation.We also encourage future work applying our general methodology to sets of strains in other species, as we expect our approach will generalize beyond the model organism P. aeruginosa.In summary, the level of classification accuracy achieved in this work illustrates the potential of image-based prediction tools, deep learning, transfer learning, and data augmentation in the characterization of bacterial strains.

Strain culturing
This study uses a collection of 69 clinical and environmental P. aeruginosa isolates [45][46][47], each grown and imaged in quadruplicate.We used Luria-Bertani (LB) agar (1.5%) plates supplemented with 0.08% Congo Red, a dye commonly used to characterize biofilm formation which binds to compounds in the extracellular biofilm matrix.In an effort to minimize variation in plate preparation and the age of the plates, 20ml volume plates were made 24 hours before each experiment with a large, motorized pipette and kept in a sealed plastic sleeve at 4C. Strains were inoculated into LB broth and incubated shaking overnight at 37C.The day of the experiment, each plate was then sliced into 4 sections using a sterilized razor and tweezers, which both limited colony outgrowth and prevented the diffusion of small molecules between colonies.5μl of the overnight culture was then spotted onto pre-prepared, sectioned LB agar Congo Red plates.After the spots dried, plates were parafilmed to retain moisture and placed in a 37C static incubator.All strains were incubated at 37C for 72 hours, which was determined during the pilot experiment to be enough time to reveal major morphological differences while minimizing outward expansion.

Colony imaging
After 72 hours, strains were imaged on a Nikon Eclipse Ti inverted microscope at 4X using the DS-Fi2 colored camera.This results in a pixel-to-size ratio is 2.07uM per pixel.Large, stitched images were required to scan an entire colony and were generated with Nikon's built-in software.Some replicate colonies had to be excluded from analysis due to error (strains 13, 47, 78, 86, 122, 124, 132, 210 have 3 replicates and strain 329 has 2 replicates).This resulted in a dataset of 266 P. aeruginosa colony images that ranged in size from 12,644,652 pixels (approximately 0.74cm x 0.74cm) to 183,329,280 pixels (approximately 2.8cm x 2.8cm).The downsized images used in the manuscript can be found at https://github.com/GaTechBrownLab/Rattray-2023-PLOSCompBio/.

Morphological and complexity metrics
For the morphological and complexity measurements, colony images were analyzed in Mathematica.Images were scaled down by a factor of four, to make computations timelier, and the backgrounds were cropped out by using a watershed transformation, binarization, and selection of the contiguous non-zero area surrounding the colony.The morphological metrics (Area, EquivalentDiskRadius, PerimeterLength, MeanCentroidDistance, Eccentricity, Circularity, BoundingDiskCenter, and MeanCaliperDiameter) were all calculated using the native Mathematica commands.For the complexity metrics, CompRatio was calculated by comparing the number of bytes in the uncompressed and compressed (using the native Mathematica command) colony images.A Sobel-Freidman operator was applied in both x and y dimensions and Sobel histograms were generated.MeanSobel, MedianSobel, SDSobel, RootMSSobel, SkewSobel, and KurtosisSobel were all then calculated using the native Mathematica commands.

Data pre-processing for machine learning
Our experimental pipeline is an integration of several steps: preprocessing, transfer learning, and data augmentation techniques, all aimed at maximizing the predictive performance of our deep learning model.In the preprocessing phase, all images are first downsampled to a uniform size of 224x224 pixels, consistent with the image dimensions used in the pretrained Ima-geNet database.To preserve the original aspect ratio and minimize distortions during the resizing process, we employed inter-area interpolation.
In addition to resizing, another important step in our preprocessing is image normalization.This includes two primary aspects.The first is the correction of image backgrounds: the algorithm iterates over the images and replaces any black pixels with white ones.This process aids in standardizing the image backgrounds across our dataset, ensuring that the model does not get biased by variations in the background and thus enhancing its robustness to different imaging conditions.
The second aspect of image normalization involves rescaling the pixel intensities.Raw images are composed of pixels with intensity values that usually fall within a range of 0 to 255.These large values may slow down the learning process and make the model more susceptible to suboptimal solutions.To mitigate this, we applied a transformation to rescale the pixel values to a 0-1 range.This process, often termed pixel normalization, can help improve the efficiency of the learning process by preventing large input values from causing numerical instability and by ensuring that the input features have a similar data distribution, thus assisting the optimization algorithm to converge more quickly [62].
Overall, these preprocessing steps serve as an integral part of the pipeline, shaping the data into a more suitable form for the model to learn effectively and efficiently, while ensuring robustness to potential sources of variation.

Data augmentation
In order to further optimize our model's performance and ensure its robustness, we utilized various data augmentation techniques.This approach, known to improve the generalization capability of deep learning models by artificially expanding the training dataset [63], involves making stochastic transformations to the training images, thereby mimicking variations likely to be seen in the real-world application of the model.One such transformation was brightness augmentation, where the brightness of the image is varied randomly within a range.Specifically, the brightness was altered by factors between 0.2 and 0.8, effectively simulating various lighting conditions that the images might be subjected to in real-world scenarios [64].This PLOS COMPUTATIONAL BIOLOGY rescaling process not only aids in normalizing the data but also helps in mitigating the risk of gradient explosion, a phenomenon that can hinder the learning process [65].Random rotations within a range of -20 to 20 degrees and random shifts in image width and height by up to 20% of the original size are also used.These techniques simulate changes in the object's orientation and position within the frame, thus allowing our model to learn invariance to such alterations [66].The application of shear transformations and zoom operations, both by up to 20%, mimic the effects of perspective distortion and changes in the object-camera distance respectively.These techniques enhance the model's ability to recognize the subject regardless of variations in the camera perspective or object scale [49].Finally, we also incorporated horizontal flipping of images, a transformation especially useful for our dataset since it does not possess any inherent left-right bias.This method doubles the size of the dataset and helps the model generalize better by learning symmetrical features [67].

Train/validation/test split
Images from the initial 4 replicate imaging experiment were divided (following augmentation) into a 90-10 train/validation split.The imaging protocol was repeated by a separate experimenter to generate an independent test sample.

Transfer learning models
In our study, we employed four deep learning models to classify bacterial strains, leveraging the advantages of both transfer learning and architectural diversity.Each model used was pretrained on the ImageNet dataset and further fine-tuned to our specific task.This section details the deep learning models used, the modifications we made, and our rationale for these choices.Parameters for all models are detailed in S3 Table .The code can be accessed at https://github.com/GaTechBrownLab/Rattray-2023-PLOSCompBio/.
Before the training of any models, the topmost layer of each pre-trained model was removed, and replaced by a new flatten layer.This was followed by two pairs of dense and dropout layers, and finally, a softmax dense output layer.The dense layers were designed using Rectified Linear Unit (ReLU) activations [68], while the final dense layer employs a softmax activation function.This transforms the output into a probability distribution across 69 neurons, each corresponding to a unique strain identification within our dataset.The resulting confidence vector is obtained when the output layer is activated.
The decision to downsize the new dense layers from the original size (e.g., from 2048 to 1024 in ResNet-50) was influenced by the size of our dataset in contrast to ImageNet.Dropout layers were also incorporated to prevent overfitting, following the transfer learning research conducted by [69].

ResNet-50
ResNet-50 [70] is known for its residual blocks, which mitigate the vanishing and exploding gradient issues common in deep learning models.In our application, the weights of the preexisting layers were frozen, and only the newly added dense layers were trained-a method known as 'feature extraction.'This method exploits the pre-trained model's ability to extract generalized features, while the added layers learn to make predictions based on these features.
The use of ResNet-50 in bacterial strain identification tasks is well-documented.A study [71] employed a ResNet-based approach to classify three species of gram-positive bacteria, achieving an accuracy of 81%.Another study [72] compared the performance of various deep learning models, including ResNet, for bacilli detection, demonstrating the efficacy of these PLOS COMPUTATIONAL BIOLOGY architectures in bacterial identification tasks.These previous successful uses of ResNet-50 provide a robust motivation for its selection in our study.

MobileNetV2
We employed the MobileNetV2 model for its compactness and efficiency, making it suitable for mobile and embedded vision applications [73].Similar to ResNet-50, we used the feature extraction approach with this model.

VGG-19
The VGG-19 model [53] was also utilized in our study due to its proven efficacy in various classification tasks.We implemented the feature extraction strategy, leveraging the learned features from the model and fine-tuning our custom layers for our specific task.

Xception
The Xception model was incorporated due to its unique architecture, designed to recognize nuanced patterns [74].For Xception, we adopted a 'fine-tuning' strategy, training all layers of the network with a smaller learning rate.This allowed the model to adjust the learned weights more accurately to our problem.

Benchmark models
For comparison, two 'shallow learning' methods were used as benchmarks.First, a Radial Basis Function (RBF) Support Vector Machine (SVM) was employed for strain classification using colony metric data (Fig 3 ), following the successful application demonstrated by Chen et al. [75] in bacterial classification.
Additionally, a hybrid approach was adopted to exploit the strengths of both deep and shallow learning.Here, we extracted feature vectors from our top-performing deep learning model, ResNet-50, which were then used as input for a RBF SVM.This process of feature extraction involves using ResNet-50's pre-trained weights to generate a representative vector of the input image.These vectors essentially encapsulate the critical visual patterns recognized by the model, thus serving as robust predictors in the SVM.This hybrid approach involves loading the pre-trained ResNet-50 model with weights trained on the ImageNet dataset and setting the 'include_top' parameter to False.This configuration returns a model that does not include the final dense layer, making it suitable for feature extraction.An image is loaded and preprocessed to match the input requirements of ResNet-50, and the model's 'predict' function is then used to generate the feature vector.This feature vector is then used as input to the SVM for classification.This transfer learning strategy has been successfully applied in numerous studies, including Rahmayuna et al. [76], who achieved a remarkable classification accuracy of 90.33%.  A. Summary statistics.Across replicates we report average (+/-standard deviation) accuracy (the number of correct predictions divided by the total number of predictions x100) and loss (a summation of the errors made for each sample).Table B

Fig 1 .
Fig 1. Bacterial colony morphology varies across and within species.A) Morphological identification of bacterial species from a mixed culture plated on Chocolate Agar: Rothia mucilaginosa (smallest white circles), Neisseria subflava (larger tan round circles), and Streptococcus mitis (large bullseye circles).B) A common example of 2D colony morphology features include the appearance of the colony edges.C) Morphological identification of bacterial strains, in this case called "smooth", "wrinkly spreader", and "fuzzy spreader"[4], from a culture containing only Pseudomonas fluorescens plated on King's Medium B Agar.

Fig 3
summarizes variation in our metrics across both strains and replicates.InFig 4,  we assess the extent of variation using coefficients of variation (CV = mean / standard deviation), across replicates (black bars) and across strains (grey bars).With the exception of the

Fig 5
illustrates that model performance varies significantly between transfer learning models (e.g. for test accuracy: ANOVA, F = 124.2,df = 3, p < 0.05.See Table B in S1 Table ResNet and Xception do not show significantly different performance in posthoc tests (Tables C-F in S1 Table).In terms of best individual model run performance, both the top accuracy and top loss was achieved by the ResNet model (Fig 5), so we use this as our baseline in our subsequent analyses.

Fig 3 .
Fig 3. Diversity of P. aeruginosa strains in both classic morphological descriptive variables and derived complexity descriptive variables across 69 strains and 266 colonies.Histograms are built from all replicates of all strains.A) classic metrics used to describe colony appearance.B) Derived metrics from image processing and computer vision to describe colony complexity including compression ratio (relative reduction in size of data) and 6 descriptive statistics derived from the Sobel-Feldman operator.https://doi.org/10.1371/journal.pcbi.1011699.g003

Fig 4 .
Fig 4. Morphological metrics are generally stable across replicates.Coefficient of variation (mean / standard deviation) across replicates (black) and across strains (grey).With the exception of the eccentricity and circularity metrics, coefficients of variation across replicates are low (standard deviation << mean), and less than the coefficient of variation across strains.https://doi.org/10.1371/journal.pcbi.1011699.g004

Fig 5 .
Fig 5. Performance comparison of transfer learning methods.The performance of four trained transfer learning models (ResNet-50, VGG-19, MobileNetV2 and Xception, see methods) were evaluated on both validation (blue circles) and test (orange triangles) datasets.Each computational experiment was replicated five-fold (five separate training runs for each model), allowing statistical comparison of approaches.(A) Accuracy scores (the number of correct predictions divided by the total number of predictions x100).(B) Loss scores (a summation of the errors made for each sample).Statistical comparisons are summarized in S1 Table. https://doi.org/10.1371/journal.pcbi.1011699.g005

Fig 6 .
Fig 6.Reduced confusion matrix, aggregating all test data errors across five iterations of the ResNet-50 model.Of the 69 strains in this study, only 5 strains were not classified with 100% precision (the 5 rows).For brevity, this matrix includes only the strains that were misclassified by our model instead of the whole 69 x 69 confusion matrix.https://doi.org/10.1371/journal.pcbi.1011699.g006

S1Fig.
Correlation matrix for all morphological and complexity metrics.Morphological (green) and complexity (orange) metrics.(DOCX) S1 Table.Statistical comparisons of transfer learning methods (Fig 5).The performance of four trained transfer learning models (ResNet-50, VGG-19, MobileNetV2 and Xception, see methods) were evaluated on both validation and test datasets.Each computational experiment was replicated five-fold (five separate training runs for each model), allowing statistical PLOS COMPUTATIONAL BIOLOGY comparison of approaches.Table

Table ) ,
while VGG-19 performs significantly worse than ResNet and Xception on test loss (Table D in S1 Table).

Table 2 . Performance comparison with shallow learning (SVM) models.
We contrast validation and test accuracy for our trained ResNet model (see also Fig 5 and S1 Table) with accuracy metrics for radial basis function (RBF) SVMs trained on colony metric data (Figs 3 and 4) and on features extracted from the trained ResNet model.See methods for details of SVM models and feature extraction.Statistical comparisons are summarized in S4 Table. https://doi.org/10.1371/journal.pcbi.1011699.t002

S2 Table. Performance contribution of data pre-processing, augmentation and training.
. ANOVA table.Tables C-F.Post-hoc pairwise tests (Tukey HSD with alpha = 0.05).(DOCX) To assess contributions we took the trained ResNet-50 model (Fig 5 and Table A in S1 Table) as a baseline method, and assessed the impact of removing components of our methods pipeline.Five-fold replicated results are summarized in Table 1.Table A. ANOVA table of data in

S4 Table. Performance comparison with shallow learning (SVM) models.
We contrast validation and test accuracy for our trained ResNet model (see also Table 1) with accuracy metrics for radial basis function (RBF) SVMs trained on colony metric data (Fig 3 and 4) and on features extracted from the trained ResNet model.See methods for details of SVM models and feature extraction.Table A. ANOVA of Table 2. Tables B-C.Post-hoc pairwise tests (Tukey HSD with alpha = 0.05).(DOCX)