Fig 1.
A: Schematic of bacteria reproducing in a mother machine; the arrow on the right shows the growth direction. B: Phase-contrast time-lapse images of bacteria growing and reproducing in a mother machine. C: Fluorescence images captured after the final phase-contrast frame reveal the species in each trap (used as ground-truth labels), with E. coli in green, K. pneumoniae in cyan, E. faecalis in magenta, P. aeruginosa in orange, A. baumannii in yellow, P. mirabilis in red and S. aureus in blue. Examples of cropping targets (traps containing a single species) for each of the seven species are outlined with dashed lines and pointed to by arrows in corresponding colors. D: The input to the neural network is phase-contrast only, either a single frame or a time-lapse video. Borders are color-coded according to the respective species.
Fig 2.
One trap phase-contrast time-lapse and final fluorescence images.
A cropped-out trap in the mother machine, with bacteria growing for 70 minutes and imaged every two minutes, resulting in 36 phase-contrast frames in the time-lapse (numbered 0-35). The final fluorescence images for each channel are shown side by side. This sample is A. baumannii from Experiment 14 of the training set (see Table 1 in Materials and Methods).
Fig 3.
Comparing model and Video ResNet R(2+1)D species classification performance over time, gradually using more frames in the time-lapse.
Fig 4.
Video classification performance.
Classification performance of Video ResNet R(2+1)D utilizing the full crop size 52x114 pixels, using all 35 frames (full time-lapse). Left: Confusion matrix for the classifier with the median AUC across the five retrainings. Right: Receiver operating characteristic curves for the same model. Note that the plot has been zoomed in for visibility. The legend displays the mean and standard deviation of the AUC across the five retrainings.
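The selection described in this caption (the run with the median AUC, plus the mean and standard deviation across five retrainings) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `summarize_retrainings` and the AUC values are hypothetical, and the use of the sample standard deviation is an assumption.

```python
import statistics

def summarize_retrainings(aucs):
    """Return (index of the median-AUC run, mean AUC, standard deviation)."""
    # Sort run indices by AUC; with an odd number of runs the middle one is the median.
    order = sorted(range(len(aucs)), key=lambda i: aucs[i])
    median_run = order[len(order) // 2]
    # Assumption: sample standard deviation (n - 1 in the denominator).
    return median_run, statistics.mean(aucs), statistics.stdev(aucs)

# Hypothetical AUC values for five retrainings (not taken from the paper).
example_aucs = [0.991, 0.987, 0.993, 0.989, 0.990]
median_run, mean_auc, std_auc = summarize_retrainings(example_aucs)
print(f"median run: {median_run}, AUC = {mean_auc:.3f} +/- {std_auc:.4f}")
```

Here the confusion matrix and ROC curves would be drawn from the run at index `median_run`, while the legend would report `mean_auc` and `std_auc`.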
Fig 5.
Model comparison on downsampled images.
Retraining the networks with the image input spatially downsampled during both training and testing, and testing on the whole time-lapse. Left: Model comparison. Using spatiotemporal features (video classifier) is especially important at very low resolution. Right: Species comparison using Video ResNet R(2+1)D at 5x11 pixels. Four species attain classification performance of 0.99 AUC even at very low resolution.
Fig 6.
Video classification performance at 5x11 pixels.
Classification performance of Video ResNet R(2+1)D utilizing spatially downsampled images 5x11 pixels, using all 35 frames (full time-lapse). Left: Confusion matrix for the classifier with the median AUC across the five retrainings. Right: Receiver operating characteristic curves for the same model. Note that the plot has been zoomed in for visibility. The legend displays the mean and standard deviation of the AUC across the five retrainings.
Fig 7.
Classification performance at various downsampling steps.
Left: Degradation of performance during time evaluations using Video ResNet R(2+1)D trained with spatially downsampled data. Distinct performance jumps occur at specific time points for the highly downsampled data, highlighting the significance of spatiotemporal features at low resolutions. Right: Time evaluation of Video ResNet R(2+1)D operating at a spatially downsampled crop size of 5x11 pixels, visualizing species-specific AUC. Note that the y-axis of this plot starts at an AUC of 0.8.
Fig 8.
Video classification performance at various downsampling steps.
Model comparison between video and image classification at gradually lower resolutions. Video classification performs better, especially at lower resolutions.
Fig 9.
Performance was tested by removing one augmentation at a time and retraining the Video ResNet R(2+1)D. The baseline with all augmentations present is shown as a dashed red line. The plots are split up for readability because removing RandomBrightnessContrast had a larger impact than the other removals. Label smoothing was also beneficial, preventing the networks from overfitting. The experiment in which all augmentations were removed used randomly initialized weights and no label smoothing.
Fig 10.
Assessing the significance of texture, morphology, cell length, and growth and division speed by comparing static images classified using ResNet 18 with video clips classified using Video ResNet R(2+1)D.
Fig 11.
Results using data quality scoring.
The ResNet 18 and Video ResNet R(2+1)D networks were retrained 30 times using three different approaches: (1) using all data in the training set but utilizing a random weighted sampler, (2) limiting the number of traps per species to 2600 using random selection, (3) limiting the number of traps per species to 2600 but using the data quality score to select the traps.
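The two capped selection strategies in this caption can be sketched as below. This is only an illustration under stated assumptions: the tuple layout `(trap_id, species, quality_score)`, the function name `cap_per_species`, and the convention that a higher score means better quality are all hypothetical and not taken from the paper.

```python
import random
from collections import defaultdict

def cap_per_species(traps, cap, by_quality=False, seed=0):
    """Limit the number of traps per species.

    traps: list of (trap_id, species, quality_score) tuples (hypothetical layout).
    by_quality=False picks traps at random; by_quality=True keeps the
    highest-scoring traps (assumes higher score = better quality).
    """
    grouped = defaultdict(list)
    for trap in traps:
        grouped[trap[1]].append(trap)

    rng = random.Random(seed)  # fixed seed so the random selection is repeatable
    selected = []
    for species, items in grouped.items():
        if len(items) <= cap:
            selected.extend(items)  # species already under the cap: keep all
        elif by_quality:
            ranked = sorted(items, key=lambda t: t[2], reverse=True)
            selected.extend(ranked[:cap])
        else:
            selected.extend(rng.sample(items, cap))
    return selected

# Toy data: six E. coli traps with increasing scores, two S. aureus traps.
toy = [(i, "E. coli", i / 10) for i in range(6)]
toy += [(i, "S. aureus", 0.5) for i in range(6, 8)]
random_pick = cap_per_species(toy, cap=3)
quality_pick = cap_per_species(toy, cap=3, by_quality=True)
```

In the caption's setting, `cap` would be 2600 and the score would come from the data quality scoring step.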
Fig 12.
Performance when including more traps per species in the dataset. The added lines are moving averages with a rolling window size of 5, included to track the general trend. Augmentation was essential across all dataset sizes for the model’s ability to generalize.
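A moving average with a rolling window of 5, as used for the trend lines, can be computed as in the sketch below. How the window is aligned and how the edges are handled is not stated in the caption; this version only emits values where a full window is available, which is an assumption.

```python
def moving_average(values, window=5):
    """Mean over each full window of `window` consecutive values."""
    if window < 1:
        raise ValueError("window must be at least 1")
    # Assumption: no padding, so the output is shorter than the input
    # by window - 1 points.
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]

# Hypothetical series of seven values (not data from the paper).
trend = moving_average([1, 2, 3, 4, 5, 6, 7], window=5)
print(trend)  # [3.0, 4.0, 5.0]
```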
Table 1.
Data summary.
Fig 13.
Ten concatenated sample images (52x512 pixels) of each bacterial species taken from randomly selected traps. Visually distinguishing between the rod-shaped bacteria in this dataset is challenging, even for trained humans, particularly when differentiating between E. coli, P. mirabilis, and K. pneumoniae. P. aeruginosa often has a characteristic thin, curved shape, while A. baumannii is comparatively shorter and thicker; however, these are not always reliable features. S. aureus had a larger diameter than E. faecalis for these isolates.
Fig 14.
Four modifications visualized for a part of a single trap (52x256 pixels, 15 frames). In the row-wise mean images, the cell height and division speed are visible, but morphology and texture information are not. In the mask, no texture information is visible, and in the segment, all background is blocked out, aiming to reduce the risk of models overfitting to experimental settings.
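The row-wise mean modification can be illustrated with the sketch below. It assumes that each frame is a 2D grid of pixel intensities and that every row is replaced by its mean intensity, so detail across the row is discarded while the profile along the other axis is kept. The exact orientation and preprocessing used in the paper are not given in the caption, so the function names and toy frames are hypothetical.

```python
def row_wise_mean(frame):
    """Collapse each row of a 2D frame (list of rows) to its mean intensity."""
    return [sum(row) / len(row) for row in frame]

def row_mean_image(frames):
    """Stack the per-frame row means; one profile per frame of the time-lapse."""
    return [row_wise_mean(frame) for frame in frames]

# Two hypothetical 3x2 frames: the bright region extends by one row between frames.
frames = [
    [[0, 0], [10, 20], [0, 0]],
    [[0, 0], [10, 20], [30, 10]],
]
profiles = row_mean_image(frames)
print(profiles)  # [[0.0, 15.0, 0.0], [0.0, 15.0, 20.0]]
```

Each profile keeps where intensity sits along the trap and how that changes between frames, but not the within-row shape or texture of the cells.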