Fig 1.
Sample processing, image acquisition and dataset information.
A: Diagram showing urine sample processing using a capillary and image acquisition with the SchistoScope (Ai). Diagram showing capillary dimensions and egg trapping (Aii). Partially created with BioRender.com. B: Example brightfield (BF) and darkfield (DF) images of two fields of view of a capillary containing S. haematobium eggs and other debris. C: Examples of S. haematobium eggs and distractor objects trapped in capillaries and imaged with the SchistoScope. D: Information about Datasets 1, 2, and 3. Dataset 1 was collected in March 2020 [23] and Dataset 2 in November 2021 [24], at different field sites in Côte d’Ivoire. Dataset 3 is a combination of Datasets 1 and 2 and was randomly split into train and test sets; the table shows information for the test set of Dataset 3. In the field studies where these samples were collected, the percentage of urine specimens found to contain S. haematobium eggs by conventional light microscopy was 20.6% for Dataset 1 and 13.4% for Dataset 2.
Fig 2.
Dataset preparation, ML model training and evaluation of Dataset 1 k-fold splits.
A: Diagram of the ML model training pipeline. First, S. haematobium eggs are annotated in dataset images. For Dataset 1, the patients are then split into 5 folds containing different subsets of training and test data. Transfer learning is done by fine-tuning the ML models (YOLOv8 pre-trained on the COCO 2017 dataset) using the training set for each split. B: Diagram of the model evaluation pipeline. After training, the test images are run through the trained model, which generates bounding boxes around detections, each assigned a confidence score by the model. The number of detections above a given confidence score threshold is counted for each patient. Each patient is represented by the first two images of a capillary. Patients are then classified as positive or negative depending on the presence or absence of detections with a confidence score above a given threshold. Sensitivity and specificity metrics are calculated at the patient population level. C: Full and zoomed-in receiver operating characteristic (ROC) curves for the first split of the data of Dataset 1, showing results for the BF and DF ML models and the area under each curve. The zoomed-in ROC curve is displayed as an inset of the full curve and shows specificity values from 95% to 100%. The vertical lines indicate the targeted specificity for the transmission interruption and surveillance (TI&S) and monitoring and evaluation (M&E) TPP use cases (99.5% and 96.5%, respectively). D: Violin plots showing the patient-level sensitivity values for the 5 splits of Dataset 1 for the TI&S (Di) and M&E (Dii) use cases. The mean sensitivity is displayed above each violin and the targeted sensitivity for each use case is shown as a vertical line. Di shows the sensitivity at a threshold that resulted in 99.5% specificity. Dii shows the sensitivity at a threshold that resulted in 96.5% specificity. BF is brightfield and DF is darkfield.
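The patient-level evaluation step described in panel B can be sketched as below. This is an illustrative sketch only, not the authors' implementation: the function names, data layout, and toy scores are assumptions.

```python
def classify_patients(detections_per_patient, threshold):
    """Classify each patient as positive if any detection in that patient's
    images has a confidence score at or above the threshold (sketch)."""
    return {
        patient: any(score >= threshold for score in scores)
        for patient, scores in detections_per_patient.items()
    }

def sensitivity_specificity(predicted, ground_truth):
    """Patient-level sensitivity and specificity from two dicts of booleans."""
    tp = sum(predicted[p] and ground_truth[p] for p in predicted)
    fn = sum(not predicted[p] and ground_truth[p] for p in predicted)
    tn = sum(not predicted[p] and not ground_truth[p] for p in predicted)
    fp = sum(predicted[p] and not ground_truth[p] for p in predicted)
    return tp / (tp + fn), tn / (tn + fp)

# Toy example: confidence scores pooled from the first two capillary images
# per patient (hypothetical values).
dets = {"p1": [0.91, 0.40], "p2": [0.20], "p3": [], "p4": [0.55]}
truth = {"p1": True, "p2": False, "p3": False, "p4": True}
pred = classify_patients(dets, threshold=0.5)
sens, spec = sensitivity_specificity(pred, truth)
```

Sweeping `threshold` over the range of observed scores and recording (sensitivity, specificity) pairs at each value is what traces out the ROC curves in panel C.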
Table 1.
Diagnostic Target Product Profile (TPP) requirements.
Fig 3.
Contrast combination rubrics and patient-level sensitivity on 5-fold splits, Dataset 1.
A: Combination pipelines. Ai: Diagram of the patient-level combination pipeline. Aii: Diagram of the object-level combination pipeline. B: Truth table for patient-level combinations, showing the four possible combinations of patient classifications based on BF and DF models individually, followed by the result after patient-level combinations. Positive patients are shown in magenta and negative patients are shown in green. C: Examples of object-level combinations on three objects in the images, showing original confidence scores assigned by BF and DF models, followed by resulting confidence scores after each combination. Green boxes represent true positive detections and magenta boxes represent false positive detections. D: Violin plots showing the sensitivity values after applying patient-level combinations to the 5 splits of Dataset 1 for the TI&S (Di) and M&E (Dii) use cases. The mean sensitivity is displayed above each violin and the targeted sensitivity for each use case is shown as a vertical line. Di shows the sensitivity at a threshold that resulted in 99.5% specificity. Dii shows the sensitivity at a threshold that resulted in 96.5% specificity. ‘BF’ is brightfield, ‘DF’ is darkfield, ‘PL AND’ is patient-level AND, ‘PL OR’ is patient-level OR, ‘OL AND’ is object-level AND, ‘OL OR’ is object-level OR. E: Violin plots showing the sensitivity values after applying object-level combinations to the 5 splits of Dataset 1 for the TI&S (Ei) and M&E (Eii) use cases. The mean sensitivity is displayed above each violin and the targeted sensitivity for each use case is shown as a vertical line. Ei: sensitivity at a threshold that resulted in 99.5% specificity. Eii: sensitivity at a threshold that resulted in 96.5% specificity.
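The two combination rubrics in panels A and B can be illustrated with a short sketch. The patient-level AND/OR follows the truth table in panel B; for the object-level fusion the caption does not state the exact rule, so the min/max fusion below is a hypothetical stand-in, not necessarily the rubric used in the paper.

```python
def patient_level_combine(bf_positive, df_positive, mode):
    """Combine the BF and DF patient classifications (truth table, panel B)."""
    if mode == "AND":   # positive only if both models call the patient positive
        return bf_positive and df_positive
    if mode == "OR":    # positive if either model calls the patient positive
        return bf_positive or df_positive
    raise ValueError(mode)

def object_level_combine(bf_score, df_score, mode):
    """Fuse confidence scores of a matched BF/DF detection pair.
    Assumed rule for illustration: AND -> min of scores, OR -> max."""
    return min(bf_score, df_score) if mode == "AND" else max(bf_score, df_score)

# Enumerate the four rows of the patient-level truth table.
table = [
    (bf, df, patient_level_combine(bf, df, "AND"), patient_level_combine(bf, df, "OR"))
    for bf, df in [(True, True), (True, False), (False, True), (False, False)]
]
```

Under this sketch, an AND combination tends to suppress false positives (raising specificity), while an OR combination tends to recover missed detections (raising sensitivity), which is the trade-off the violin plots in panels D and E quantify.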
Fig 4.
Patient-level results on Dataset 2 as a holdout.
A: Diagram of data used for training and testing. B: Results for brightfield (BF) and darkfield (DF) models trained on Dataset 1 and tested on Dataset 2. Bi: zoomed-in ROC curve showing specificity values from 95% to 100%, with TPP specificity requirements shown as vertical lines. Bii and Biii: patient-level sensitivity for BF and DF models for the TI&S (Bii) and M&E (Biii) use cases, with sensitivity values for each model displayed above each bar and target sensitivity displayed as a horizontal line. C: Results for model combinations on Dataset 2. Ci: zoomed-in ROC curve showing specificity values from 95% to 100%, with TPP specificity requirements shown as vertical lines. Cii and Ciii: sensitivity results for BF and DF models and combinations for the TI&S (Cii) and M&E (Ciii) use cases. PL AND is patient-level AND, PL OR is patient-level OR, OL AND is object-level AND, OL OR is object-level OR. D: Bootstrapping results on the holdout set for TI&S and M&E TPP use cases. The violin plots show the distribution of patient-level sensitivity values at thresholds resulting in the targeted TPP specificity. Bootstrapping was performed for 100 iterations, with a sample size of 40% of the patient population. The dashed lines inside the violins show the median of the distribution and the dotted lines show the quartiles. The median of each distribution is displayed above each violin. A Kruskal-Wallis test with Dunn’s correction for multiple comparisons was used to compare the BF model with the DF model and the combination models. We report multiplicity-adjusted p-values: “ns” is p > 0.05, * is p ≤ 0.05, *** is p ≤ 0.001, **** is p ≤ 0.0001.
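The bootstrapping procedure in panel D (100 iterations, each resampling 40% of the patient population) can be sketched as follows. This is a minimal illustration under assumed names and toy data; the statistical comparison itself (Kruskal-Wallis with Dunn's correction) is omitted.

```python
import random

def bootstrap_sensitivity(patients, n_iter=100, frac=0.4, seed=0):
    """Resample a fraction of the patient population with replacement and
    recompute patient-level sensitivity on each resample (sketch).
    `patients` is a list of (predicted_positive, truly_positive) tuples."""
    rng = random.Random(seed)
    k = max(1, int(frac * len(patients)))
    sensitivities = []
    for _ in range(n_iter):
        sample = [rng.choice(patients) for _ in range(k)]
        positives = [p for p in sample if p[1]]   # truly infected patients
        if not positives:
            continue                              # skip resamples with no positives
        tp = sum(1 for pred, true in positives if pred)
        sensitivities.append(tp / len(positives))
    return sensitivities

# Toy population: 20 truly positive patients (15 detected), 80 negatives.
population = [(True, True)] * 15 + [(False, True)] * 5 + [(False, False)] * 80
dist = bootstrap_sensitivity(population)
```

The resulting `dist` is the kind of per-model sensitivity distribution the violin plots summarize by median and quartiles.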
Fig 5.
Patient-level results on Dataset 3.
A: Patient-level results on the test set of Dataset 3. Ai: zoomed-in ROC curve with specificity values ranging from 95% to 100%, with TPP specificity requirements shown as vertical lines. Aii and Aiii: patient-level sensitivity results for BF and DF models and combinations for the TI&S and M&E use cases (at thresholds resulting in TPP specificity). The target sensitivity for each use case is shown as a horizontal line. BF is brightfield, DF is darkfield, PL AND is patient-level AND, PL OR is patient-level OR, OL AND is object-level AND, OL OR is object-level OR. B: Bootstrapping results on the test set of Dataset 3 for TI&S and M&E TPP use cases. The violin plots show the distribution of patient-level sensitivity values at thresholds resulting in the targeted TPP specificity. Bootstrapping was performed for 100 iterations, with a sample size of 40% of the patient population. The dashed lines inside the violins show the median of the distribution and the dotted lines show the quartiles. The median of each distribution is displayed above each violin. A Kruskal-Wallis test with Dunn’s correction for multiple comparisons was used to compare the BF model with the DF model and the combination models. We report multiplicity-adjusted p-values: “ns” is p > 0.05, * is p ≤ 0.05, *** is p ≤ 0.001, **** is p ≤ 0.0001.