Fig 1.
Example thyroid gland tumor images from the Tharun and Thompson dataset and the Nikiforov dataset.
The images are cropped to 512 × 512 px. The upper row shows non-PTC-like samples and the lower row shows PTC-like samples. A high-resolution version is available in S1 Fig.
Fig 2.
The feature-based classification consists of segmentation, feature extraction, and classification.
Fig 3.
Exemplary thyroid images from the Tharun and Thompson dataset and the corresponding instance segmentations.
The images are 200 × 200 px crops. The left images show a non-PTC-like sample and the right images a PTC-like sample.
Fig 4.
The deep learning-based classification performs the classification directly on the thyroid images.
Fig 5.
Confusion matrices for the Tharun and Thompson dataset.
Fig 6.
The receiver operating characteristic curves for the feature-based classification, where a true positive sample is a PTC-like sample classified as PTC-like.
The curves for all cross-validation splits are shown with their respective area under the curve (AUC), together with the resulting mean for the dataset. The standard deviation around the mean is shown in dark gray. The red dotted line corresponds to classification by chance.
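As an illustration of the per-split AUC and its mean, the following minimal sketch computes the AUC for each cross-validation split and summarizes the values. It is not the evaluation code used for the figure. The function names, the toy labels and scores, and the use of the population standard deviation are assumptions made for this example, and the sketch summarizes the AUC values only, not the spread of the curves themselves. PTC-like samples are encoded as the positive class (1).

```python
from statistics import mean, pstdev


def roc_auc(labels, scores):
    """AUC as the probability that a random positive sample receives a
    higher score than a random negative sample (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("each split needs both classes")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def summarize_splits(splits):
    """Return the per-split AUCs, their mean, and their standard deviation."""
    aucs = [roc_auc(labels, scores) for labels, scores in splits]
    return aucs, mean(aucs), pstdev(aucs)


# Hypothetical toy data: (true labels, classifier scores) per split.
splits = [
    ([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1]),  # perfectly separated
    ([1, 0, 1, 0], [0.9, 0.8, 0.3, 0.1]),  # one positive ranked too low
    ([0, 0, 1, 1], [0.5, 0.5, 0.5, 0.5]),  # uninformative scores (chance)
]

aucs, auc_mean, auc_std = summarize_splits(splits)
print(aucs, auc_mean, auc_std)
```

A split with uninformative scores yields an AUC of 0.5, which corresponds to the chance line in the figure.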
Table 1.
Number of correct predictions relative to the number of samples, split into the diagnostic groups of the initial dataset.
Results are shown for the feature-based classification (FBC) and deep learning-based classification (DLC).
Fig 7.
Confusion matrices for the Nikiforov dataset.
Fig 8.
The receiver operating characteristic curves for the feature-based classification, where a true positive sample is a PTC-like sample classified as PTC-like.
The curves for all cross-validation splits are shown with their respective area under the curve (AUC), together with the resulting mean for the dataset. The standard deviation around the mean is shown in dark gray. The red dotted line corresponds to classification by chance.
Fig 9.
(a) Evaluation on samples with a pathologist agreement level greater than a threshold. The feature-based classification (FBC accuracy, blue dots) outperforms the mean expert pathologist rating (green squares) and the accuracy achievable by knowing the data split (red crosses). The performance of the deep learning-based classification (DLC accuracy, orange triangles) varies between the FBC accuracy and the mean expert pathologist rating. (b) Number of samples evaluated for different minimal pathologist agreement levels. The PTC-like samples are shown as blue dots and the non-PTC-like samples as orange squares.
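The thresholded evaluation can be illustrated with a short sketch that keeps only samples whose pathologist agreement level exceeds a threshold and reports the accuracy together with the number of remaining samples. This is an assumption-laden example and not the code behind the figure: the function name, the tuple layout, the toy values, and the encoding of the agreement level as a fraction between 0 and 1 are all hypothetical.

```python
def accuracy_above_agreement(samples, threshold):
    """Accuracy and sample count for samples with agreement > threshold.

    samples: list of (agreement, true_label, predicted_label) tuples.
    Returns (None, 0) if no sample exceeds the threshold.
    """
    kept = [(t, p) for a, t, p in samples if a > threshold]
    if not kept:
        return None, 0
    correct = sum(1 for t, p in kept if t == p)
    return correct / len(kept), len(kept)


# Hypothetical toy data: 1 = PTC-like, 0 = non-PTC-like.
samples = [
    (0.6, 1, 0),  # low agreement, misclassified
    (0.7, 0, 0),
    (0.8, 1, 1),
    (0.9, 0, 0),
    (1.0, 1, 1),
]

results = {thr: accuracy_above_agreement(samples, thr) for thr in (0.5, 0.65, 0.85)}
for thr, (acc, n) in results.items():
    print(f"agreement > {thr}: accuracy = {acc}, samples = {n}")
```

Raising the threshold removes the samples the pathologists disagreed on most, so the accuracy in panel (a) should always be read alongside the shrinking sample counts in panel (b).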