Learning from small data: Classifying sex from retinal images via deep learning

doi:10.1371/journal.pone.0289211

Table 1.

Dataset statistics.

Counts are given for the total number of images and patients, and for each of the train, validation and test sets. Counts of female and male images and patients are also given for each dataset and in aggregate.

Table 2.

Single-domain results for fine-tuned ResNet-152 models.

B = 1000, α = 0.05. Unless stated otherwise, p_empir ≤ 10⁻³ for all significance tests prior to correction for multiple comparisons, and p_adj ≤ 1.1 × 10⁻³ for all significance tests after adjustment for multiple comparisons using the Benjamini-Hochberg (BH) procedure. Significant results are marked by an asterisk. A trend toward significance (0.05 < p < 0.1) is marked by a number sign.
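As an illustration of the multiple-comparison adjustment named in this caption, the following is a minimal sketch of the standard Benjamini-Hochberg step-up adjustment, assuming NumPy. It is not the authors' code: the function name `bh_adjust` and the example p-values are hypothetical, chosen only to show how unadjusted p-values map to adjusted ones that are then compared with α.

```python
import numpy as np


def bh_adjust(p_values):
    """Return Benjamini-Hochberg adjusted p-values (hypothetical helper).

    Each sorted p-value p_(i) is scaled by m / i, then a running minimum
    is taken from the largest rank downward so that the adjusted values
    are monotone in the original ordering.
    """
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)                      # indices that sort p ascending
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity: adjusted p_(i) = min over j >= i of scaled_(j).
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(scaled, 0.0, 1.0)  # restore the input order
    return adjusted


# Illustrative values only; these are not results from the paper.
alpha = 0.05
p_empir = [0.01, 0.04, 0.03, 0.005]
p_adj = bh_adjust(p_empir)
significant = p_adj < alpha  # rows that would carry an asterisk
```

With this convention, a result is flagged as significant when its adjusted p-value falls below α = 0.05, which matches how the asterisk is described in the caption.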

Table 3.

Ensembling experiment results: Scores for E1–10 developed using the DOVS-ii database, and the ensemble classifier E*.

The unadjusted p-value obtained from bootstrap replicates satisfies p_empir ≤ 10⁻³ for every row of the table. The adjusted p-value satisfies p_adj ≤ 1.1 × 10⁻³ (B = 1000, α = 0.05, n_test = 376). The row for run E* displays statistics for the (10, 20)-ensemble classifier created from the component models E1–10. Significant results are marked by an asterisk.

Fig 1.

Training-time metrics for runs D1 through D5.

(a) accuracy score (proportion correct); (b) binary cross-entropy loss; (c) AUC score. Vertical dashed lines correspond with the best epoch as selected by early stopping.

Fig 2.

Training-time metrics for runs N1 through N6.

(a) accuracy score (proportion correct); (b) binary cross-entropy loss; (c) AUC score. Vertical dashed lines correspond with the best epoch as selected by early stopping.

Fig 3.

Training-time metrics for runs C1 through C6.

(a) accuracy score (proportion correct); (b) binary cross-entropy loss; (c) AUC score. Vertical dashed lines correspond with the best epoch as selected by early stopping.

Table 4.

Domain adaptation results.

B = 1000, α = 0.05. Unless stated otherwise, the unadjusted p-values satisfy p_empir ≤ 10⁻³ for all significance tests prior to correction for multiple comparisons, and the adjusted p-values satisfy p_adj ≤ 1.1 × 10⁻³ for all significance tests after adjustment for multiple comparisons using the Benjamini-Hochberg (BH) procedure. Significant results are marked by an asterisk.

Fig 4.

A graphical representation of all models developed in this work.

On the x-axis is a model’s AUC score on the validation partition of the relevant database; on the y-axis, its AUC score on the test partition. Points labelled D1–5 correspond with models trained and evaluated on DOVS-i; E1–10, DOVS-ii; N1–6, ODIR-N; C1–6, ODIR-C. The ensemble model is denoted E*. Marker size corresponds with the number of images in the database used for model development (cf. Table 1).

Fig 5.

A graphical representation of the domain adaptation results for the models developed in this work.

On the x-axis is a model’s AUC score on the test partition from the database on which it was trained; on the y-axis, its domain-adapted AUC score. Points labelled D1–5 correspond with models trained and evaluated on DOVS-i; E1–10, DOVS-ii; N1–6, ODIR-N; C1–6, ODIR-C. The ensemble model is denoted E*. The marker radius of the plotted points corresponds with the number of images in the database used for model development (cf. Table 1). Each point is coloured according to the database on which the associated model was trained, and the shape of each point corresponds with the database on which that model was evaluated for its domain-adapted AUC score.

Fig 6.

Mean Guided Grad-CAM activations for Female (top row) and Male (bottom row) fundus images.

(a) F227_R, (b) F22_L, (c) M218_L, (d) M273_R.
