Table 1.
Counts are given for the total number of images and patients, and for each of the train, validation and test sets. Counts of female and male images and patients are also given for each dataset and in aggregate.
Table 2.
Single-domain results for fine-tuned ResNet-152 models.
B = 1000, α = 0.05. Unless stated otherwise, p_empir ≤ 10⁻³ for all significance tests before correction for multiple comparisons, and p_adj ≤ 1.1 × 10⁻³ for all significance tests after adjustment for multiple comparisons using the Benjamini-Hochberg (BH) procedure. Significant results are marked by an asterisk. A trend toward significance (0.05 < p < 0.1) is marked by a number sign.
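The BH adjustment named in this caption can be sketched as follows. This is a minimal illustration of the standard Benjamini-Hochberg step-up adjustment, not the authors' code; the function name `bh_adjust` and the raw p-values are hypothetical.

```python
import numpy as np

def bh_adjust(pvalues):
    """Return Benjamini-Hochberg adjusted p-values (hypothetical helper)."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)                          # indices of ascending p-values
    scaled = p[order] * m / np.arange(1, m + 1)    # p_(i) * m / i
    # Enforce monotonicity, working down from the largest rank.
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.minimum(adjusted_sorted, 1.0)
    return adjusted

# Made-up raw p-values, for illustration only.
raw = [0.001, 0.008, 0.039, 0.041, 0.27]
adj = bh_adjust(raw)
significant = adj <= 0.05   # alpha = 0.05, as in the caption
print(adj, significant)
```

Each adjusted value is compared directly with α, so a test is flagged as significant when its adjusted p-value is at most 0.05.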
Table 3.
Ensembling experiment results: scores for models E1–10, developed using the DOVS-ii database, and for the ensemble classifier E*.
The unadjusted p-value obtained from bootstrap replicates satisfies p_empir ≤ 10⁻³ for each row of the table. The adjusted p-value satisfies p_adj ≤ 1.1 × 10⁻³ (B = 1000, α = 0.05, n_test = 376). The row for run E* displays statistics for the (10, 20)-ensemble classifier created from the component models E1–10. Significant results are marked by an asterisk.
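One way an empirical bootstrap p-value could be computed is sketched below. It assumes, purely for illustration, a one-sided test of the AUC against chance (0.5) and an add-one estimator (r + 1)/(B + 1); the source specifies neither, and the labels and scores are synthetic. Under that assumption the smallest attainable value with B = 1000 is 1/1001, which is consistent with the 10⁻³ bound quoted in the caption.

```python
import numpy as np

def auc_score(y_true, y_score):
    """AUC via the Mann-Whitney U statistic (ties count one half)."""
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (pos.size * neg.size)

def bootstrap_p_value(y_true, y_score, n_boot=1000, chance=0.5, seed=0):
    """Assumed estimator: (r + 1) / (B + 1), r = replicates with AUC <= chance."""
    rng = np.random.default_rng(seed)
    n = y_true.size
    at_or_below = 0
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)       # resample cases with replacement
        yt, ys = y_true[idx], y_score[idx]
        if yt.min() == yt.max():               # one class only: AUC undefined,
            at_or_below += 1                   # so count it conservatively
        elif auc_score(yt, ys) <= chance:
            at_or_below += 1
    return (at_or_below + 1) / (n_boot + 1)

# Synthetic test set of 376 cases with well-separated scores (illustration only).
rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, size=376)
y_score = y_true + rng.normal(0.0, 0.3, size=376)
p_empir = bootstrap_p_value(y_true, y_score)
print(p_empir)
```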
Fig 1.
Training-time metrics for runs D1 through D5.
(a) Accuracy (proportion correct); (b) binary cross-entropy loss; (c) AUC. Vertical dashed lines mark the best epoch selected by early stopping.
Fig 2.
Training-time metrics for runs N1 through N6.
(a) Accuracy (proportion correct); (b) binary cross-entropy loss; (c) AUC. Vertical dashed lines mark the best epoch selected by early stopping.
Fig 3.
Training-time metrics for runs C1 through C6.
(a) Accuracy (proportion correct); (b) binary cross-entropy loss; (c) AUC. Vertical dashed lines mark the best epoch selected by early stopping.
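The "best epoch" marked in the training-time figures can be illustrated with a small sketch of early stopping. It assumes the monitored quantity is validation loss and uses a hypothetical patience of 3 epochs; the captions state neither, and the loss values are made up.

```python
def best_epoch_with_early_stopping(val_losses, patience=3):
    """Return (best_epoch, last_epoch_run) for a list of validation losses.

    Training stops once `patience` epochs pass without a new minimum.
    Both the monitored metric and the patience are assumptions.
    """
    best_epoch, best_loss = -1, float("inf")
    last_epoch = -1
    for epoch, loss in enumerate(val_losses):
        last_epoch = epoch
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss    # new best: remember it
        elif epoch - best_epoch >= patience:
            break                                  # patience exhausted
    return best_epoch, last_epoch

# Made-up validation losses; the late improvement at epoch 7 is never reached.
history = [0.69, 0.61, 0.55, 0.52, 0.53, 0.54, 0.56, 0.50]
best, stopped = best_epoch_with_early_stopping(history)
print(best, stopped)
```

In a plot of such a run, the dashed vertical line would sit at the returned best epoch rather than at the epoch where training halted.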
Table 4.
B = 1000, α = 0.05. Unless stated otherwise, the unadjusted p-values satisfy p_empir ≤ 10⁻³ for all significance tests before correction for multiple comparisons, and p_adj ≤ 1.1 × 10⁻³ for all significance tests after adjustment for multiple comparisons using BH. Significant results are marked by an asterisk.
Fig 4.
A graphical representation of all models developed in this work.
The x-axis shows each model’s AUC score on the validation partition of the relevant database; the y-axis shows its AUC score on the test partition. Points labelled D1–5 correspond to models trained and evaluated on DOVS-i; E1–10, on DOVS-ii; N1–6, on ODIR-N; C1–6, on ODIR-C. The ensemble model is denoted E*. Marker size corresponds to the number of images in the database used for model development (cf. Table 1).
Fig 5.
A graphical representation of the domain adaptation results for the models developed in this work.
The x-axis shows each model’s AUC score on the test partition of the database on which it was trained; the y-axis shows its domain-adapted AUC score. Points labelled D1–5 correspond to models trained and evaluated on DOVS-i; E1–10, on DOVS-ii; N1–6, on ODIR-N; C1–6, on ODIR-C. The ensemble model is denoted E*. Marker radius corresponds to the number of images in the database used for model development (cf. Table 1). Each point is coloured according to the database on which the associated model was trained, and its shape indicates the database on which that model was evaluated for its domain-adapted AUC score.
Fig 6.
Mean Guided Grad-CAM activations for female (top row) and male (bottom row) fundus images.
(a) F227_R, (b) F22_L, (c) M218_L, (d) M273_R.