Fig 1.
Workflow of the data collection and training process.
(a) Culturing of E. coli as described in the text. (b) ‘Cartoon’ schematic of the holographic microscope. (c) An example of a raw frame of data, showing scattering from dust on optical elements, diffraction, etc. The raw video frames are normalized using a median image (see main text) to remove the static background. (d) A normalized image in which the cells can be seen as concentric sets of diffraction rings. (e) DHM-generated training labels, overlaid on the background-corrected image. (f) Frames and labels are saved in the format required for training a YOLOv5 network. (g) The training was performed on a GPU cluster for 100 epochs (see main text for details). (h) and (j) The trained network. (i) Normalized image of a test video frame. (k) Predicted bounding boxes overlaid on the normalized test image. (l) One of the tracks obtained from the coordinates predicted by the model, compared with its ground-truth counterpart.
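The background normalization mentioned in panels (c) and (d) can be illustrated with a minimal sketch. It assumes that each frame is divided by the pixel-wise median of the video stack; the main text may use a different operation (for example, subtraction), and the function name `normalize_frames` and the synthetic data are hypothetical.

```python
import numpy as np

def normalize_frames(frames: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Divide every frame by the pixel-wise median image of the stack.

    frames: array of shape (n_frames, height, width).
    Assumption: normalization is a division; eps guards against zeros.
    """
    frames = frames.astype(np.float64)
    median_image = np.median(frames, axis=0)  # static background estimate
    return frames / np.maximum(median_image, eps)

# Synthetic example: a static background plus a transient feature in one frame.
rng = np.random.default_rng(0)
background = rng.uniform(50.0, 200.0, size=(32, 32))
frames = np.repeat(background[None, :, :], 9, axis=0)
frames[4, 10:14, 10:14] *= 1.5  # a moving 'cell' present in frame 4 only
normalized = normalize_frames(frames)
```

Because a swimming cell occupies any given pixel for only a few frames, the median is dominated by the static background, so static features such as dust divide out to a value near 1 while moving objects remain visible.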
Fig 2.
Data frequency and training fit line.
(a) Establishment of an initial heuristic measure to train a CNN. The apparent extent of the outermost ring of the scattering pattern was measured by hand, and is plotted here against the axial position of the cells. The dashed line is the linear heuristic relationship between the axial position and the bounding box side length used to create the training data for the CNN; it is the solid line offset vertically (see text). (b) Axial distribution of cells in our sample volume, in both training and test data sets. The data are biased towards smaller axial distances owing to the effects of gravitational sedimentation and the ‘wall accumulation’ effect in swimming bacteria, observed in previous studies (see text).
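As a rough illustration of how a linear heuristic of this kind could be turned into a training label, the sketch below maps an axial position to a square bounding box and writes it in the YOLOv5 text label format (`class x_center y_center width height`, all normalized by the image size). The slope, intercept, and offset values are placeholders, not the fitted values from the paper, and the function names are hypothetical.

```python
def box_side_px(z_um: float, slope: float, intercept: float, offset: float) -> float:
    """Linear heuristic: bounding box side length (pixels) from axial position.

    slope and intercept describe the fitted line; offset is the vertical
    shift applied to it. All three are placeholder values here.
    """
    return slope * z_um + intercept + offset

def yolo_label_line(x_px: float, y_px: float, side_px: float,
                    img_w: int, img_h: int, class_id: int = 0) -> str:
    """Format one square box as a YOLOv5 label line with normalized values."""
    return (f"{class_id} {x_px / img_w:.6f} {y_px / img_h:.6f} "
            f"{side_px / img_w:.6f} {side_px / img_h:.6f}")

# Example with made-up numbers: a cell at z = 100 um, centred at (256, 128) px.
side = box_side_px(100.0, slope=0.5, intercept=20.0, offset=10.0)
label = yolo_label_line(256.0, 128.0, side, img_w=512, img_h=512)
```

In this scheme the axial position is encoded only through the box size, which is why a consistent mapping between depth and side length matters for the training data.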
Fig 3.
Box and object losses for the training and validation sets, together with the precision, recall, and mean average precision (mAP) of the network over the 100 training epochs. Box loss measures the error between the predicted and ground-truth bounding boxes; object loss measures how well the model predicts whether an object (a scattering pattern) is present in a bounding box. Precision is the proportion of detected objects that are correct; recall is the proportion of the objects in an image that are detected. A precision-recall curve is a useful measure of the trade-off between these two metrics for a model, and the area under the curve is called the average precision (AP). Intersection over union (IoU) is a measure of the overlap between predicted and ground-truth bounding boxes. Mean average precision (mAP) is the mean of the AP, usually calculated at a given IoU threshold: mAP(0.5) is the mAP for objects detected with a bounding-box IoU of at least 0.5, and mAP(0.5:0.95) is the average of the mAP calculated over a range of IoU thresholds from 0.5 to 0.95.
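The definitions of IoU, precision, and recall given above can be made concrete with a short sketch. It assumes boxes are given as corner coordinates `(x1, y1, x2, y2)`; the helper names are illustrative and are not taken from the YOLOv5 code base.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap region (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def precision_recall(tp: int, fp: int, fn: int):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

# Two 2x2 boxes overlapping in a 1x1 region: IoU = 1 / (4 + 4 - 1) = 1/7.
overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))
# 8 correct detections, 2 spurious detections, 4 missed objects.
p, r = precision_recall(tp=8, fp=2, fn=4)
```

A detection counts as a true positive at mAP(0.5) only if its IoU with a ground-truth box is at least 0.5, so the overlap of 1/7 in this example would be scored as a miss.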
Fig 4.
Cell position and bounding box predictions on unseen video frames.
(a) Dashed cyan lines show the predicted bounding boxes for objects in a frame from the dilute test videos. (b) Crosses show predicted objects in a frame from a more concentrated sample; smaller yellow crosses show small scattering patterns (objects close to the focal plane) and larger orange crosses show larger ones (more distant objects).
Table 1.
Metrics obtained when running inference with the trained model on different data sets.
The metrics are described in the caption of Fig 3.
Fig 5.
Results of inference on the validation and test videos.
Data from test videos are represented by orange triangles, and data from validation videos are plotted with blue circles. (a) Raw axial coordinates obtained by the trained YOLOv5 as a function of ‘true’ axial depth established by DHM. The red dot-dashed line is a fourth-order fit to the validation data. (b) Test video axial coordinates, as obtained in panel (a), after removal of the fourth-order correction factor (see text). (c) Selected tracks obtained by three-dimensional, AI-enabled tracking after the second round of correction (blue solid points), plotted with the original tracks obtained by DHM (red empty points).
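One possible way to apply a fourth-order calibration of this kind is sketched below. It is not necessarily the procedure used in the main text: here the polynomial is fitted in the inverse direction (true depth as a function of raw predicted depth) so that the correction can be applied to test data without knowing the true depth. The data are synthetic, in arbitrary normalized units, and the function name is hypothetical.

```python
import numpy as np

def fit_depth_correction(z_raw_val, z_true_val, order: int = 4) -> np.poly1d:
    """Fit a polynomial mapping raw predicted depth to 'true' depth.

    Assumption: the calibration is fitted on validation data only and then
    reused unchanged on test data.
    """
    coeffs = np.polyfit(z_raw_val, z_true_val, deg=order)
    return np.poly1d(coeffs)

# Synthetic validation data with a smooth systematic distortion.
z_true_val = np.linspace(0.0, 1.0, 50)
z_raw_val = z_true_val + 0.1 * z_true_val**2
correction = fit_depth_correction(z_raw_val, z_true_val)

# Apply the same correction to held-out 'test' predictions.
z_true_test = np.linspace(0.05, 0.95, 20)
z_raw_test = z_true_test + 0.1 * z_true_test**2
z_corrected = correction(z_raw_test)

raw_error = np.max(np.abs(z_raw_test - z_true_test))
corrected_error = np.max(np.abs(z_corrected - z_true_test))
```

Fitting on the validation set and evaluating on the test set keeps the calibration independent of the data used to assess it, which mirrors the split between blue circles and orange triangles in the figure.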
Table 2.
Inference time per frame using different processors.