Machine learning methods for automated classification of tumors with papillary thyroid carcinoma-like nuclei: A quantitative analysis

When approaching thyroid gland tumor classification, the differentiation between samples with and without “papillary thyroid carcinoma-like” nuclei is a daunting task with high inter-observer variability among pathologists. Thus, there is increasing interest in the use of machine learning approaches to provide pathologists with real-time decision support. In this paper, we optimize and quantitatively compare two automated machine learning methods for thyroid gland tumor classification on two datasets to assist pathologists in decision-making regarding these methods and their parameters. The first method is a feature-based classification originating from common image processing; it consists of cell nucleus segmentation, feature extraction, and subsequent thyroid gland tumor classification utilizing different classifiers. The second method is a deep learning-based classification, which directly classifies the input images with a convolutional neural network without the need for cell nucleus segmentation. On the Tharun and Thompson dataset, the feature-based classification achieves an accuracy of 89.7% (Cohen’s kappa 0.79), compared to 89.1% (Cohen’s kappa 0.78) for the deep learning-based classification. On the Nikiforov dataset, the feature-based classification achieves an accuracy of 83.5% (Cohen’s kappa 0.46), compared to 77.4% (Cohen’s kappa 0.35) for the deep learning-based classification. Thus, both automated thyroid tumor classification methods can reach the classification level of an expert pathologist. To our knowledge, this is the first study comparing feature-based and deep learning-based classification regarding their ability to classify samples with and without papillary thyroid carcinoma-like nuclei on two large-scale datasets.

Table: Medical features and the corresponding color features, together with a short feature description. The ratio, GLCM, and Shannon features are calculated solely on the grayscale version of the image.

Medical features | Color features | Feature description
irregular contours, grooves, folds, intranuclear cytoplasmic inclusions, chromatin clearing, margination to the nuclear membranes, glassy nuclei | r_mean, g_mean, b_mean, gray_mean | mean of the pixel values inside the nucleus for the red (r), green (g), and blue (b) color channels and the grayscale image (gray)
(as above) | r_std, g_std, b_std, gray_std | standard deviation of the pixel values inside the nucleus for the red (r), green (g), and blue (b) color channels and the grayscale image (gray)
intranuclear cytoplasmic inclusions, chromatin clearing, margination to the nuclear membranes | ratio features (mean and std; see below) | relations between the local coloring of the border, middle, and center zones of the nucleus

Color features
For each color channel (r, g, b), the mean and standard deviation (std) over all pixels inside each individual segmented cell nucleus are calculated (r_mean, r_std, g_mean, g_std, b_mean, b_std). In addition, the mean and standard deviation of each grayscale cell nucleus are calculated (gray_mean, gray_std).
The intranuclear cytoplasmic inclusions, chromatin clearing, and margination to the nuclear membranes are detected by calculating relations between the local coloring of the nucleus. To do so, the RGB image of the nucleus is converted to a grayscale image. As seen in Figure S1, the nucleus shown in (a) has a darker border compared to the inside of the nucleus. To quantify this feature, the ratio between the border and the center or middle areas of the nucleus is calculated. The areas shown in (c) are derived from the binary segmentation (b) by calculating the Euclidean distance map of the segmented nucleus. Afterwards, the pixel values of the distance map are grouped into three zones (border (b), middle (m), and center (c)) using the 33% and 66% percentiles (excluding the value 0 outside the cell). Each area, depicted in (c) as white, light gray, and dark gray, contains the same number of pixels. The pixels of the grayscale nucleus image (a) inside each area (c) are reduced by calculating the mean and standard deviation. This results in the auxiliary features z_b,mean, z_b,std, ..., z_m,std, which are not used directly as classification features. Instead, the auxiliary features are used to calculate the relations between the local coloring, separately for the mean and standard deviation auxiliary features, with ratio_BC = z_b/(z_b + z_c), ratio_BM = z_b/(z_b + z_m), and ratio_MC = z_m/(z_m + z_c). The reverse combinations ratio_CB, ratio_MB, and ratio_CM are redundant to these features (e.g., ratio_BC = 1 − ratio_CB) and are therefore not included in the feature set. Using the mean and std for each region together with the introduced ratios results in six local coloring relation features.
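A minimal Python sketch of this zoning and ratio computation, assuming a 2-D grayscale nucleus image gray and a boolean segmentation mask mask; function and feature names are illustrative, not taken from the original implementation:

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def local_coloring_ratios(gray, mask):
    """Six local coloring ratio features for one segmented nucleus."""
    dist = distance_transform_edt(mask)            # distance of each nucleus pixel to the background
    inside = dist[mask]                            # exclude the value 0 outside the cell
    p33, p66 = np.percentile(inside, [33, 66])     # three zones with equal pixel counts

    zones = {
        "b": mask & (dist <= p33),                 # border zone
        "m": mask & (dist > p33) & (dist <= p66),  # middle zone
        "c": mask & (dist > p66),                  # center zone
    }
    feats = {}
    for stat, fn in (("mean", np.mean), ("std", np.std)):
        z = {k: fn(gray[zone]) for k, zone in zones.items()}  # auxiliary features z_b, z_m, z_c
        feats[f"ratio_BC_{stat}"] = z["b"] / (z["b"] + z["c"])
        feats[f"ratio_BM_{stat}"] = z["b"] / (z["b"] + z["m"])
        feats[f"ratio_MC_{stat}"] = z["m"] / (z["m"] + z["c"])
    return feats
```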
The features incorporating texture patterns are the gray level co-occurrence matrix (GLCM) features. The GLCMs are calculated from the pixels inside a rectangular bounding box containing the grayscale image of a nucleus, for the angles 0°, 45°, 90°, and 135°. Since cell nuclei appear in all orientations and the orientation itself is not a meaningful feature, the GLCMs of all angles are added together for each cell before calculating the texture properties. This can be interpreted as increasing the cell sample size by rotating the cell nucleus. The pixel distances 1 and 2 are used. From the GLCMs, the properties dissimilarity, correlation, energy, and homogeneity are calculated [1]. Combining the four texture properties with the two pixel distances yields eight GLCM features. Furthermore, the Shannon entropy of the grayscale pixels inside a rectangular bounding box containing the cell nucleus is calculated [2]. In total, 23 color features are generated.
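A hedged sketch of the GLCM feature computation using scikit-image; the quantization to the full 8-bit range and the feature naming are assumptions:

```python
import numpy as np
from skimage.feature import graycomatrix, graycoprops

def glcm_features(gray_patch):
    """Eight GLCM features from the uint8 bounding-box patch of one nucleus."""
    angles = [0, np.pi / 4, np.pi / 2, 3 * np.pi / 4]      # 0°, 45°, 90°, 135°
    glcm = graycomatrix(gray_patch, distances=[1, 2], angles=angles, levels=256)
    # Sum over the angle axis: nucleus orientation carries no information.
    glcm = glcm.sum(axis=3, keepdims=True)
    feats = {}
    for prop in ("dissimilarity", "correlation", "energy", "homogeneity"):
        vals = graycoprops(glcm, prop)                     # shape: (n_distances, 1)
        for d, v in zip((1, 2), vals[:, 0]):
            feats[f"glcm_{prop}_d{d}"] = v
    return feats
```

graycoprops normalizes each co-occurrence matrix internally, so summing the raw counts over the angles before computing the properties matches the described procedure.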

Shape features
The shape features are extracted solely from the instance segmentation. The area feature is the number of pixels inside the instance segmentation of each nucleus. The perimeter is likewise calculated directly from the instance segmentation. For the eccentricity, the ratio of the focal distance to the major axis length of an ellipse with the same second moments as the nucleus is calculated. Eccentricity therefore lies in the interval [0, 1), where 0 represents a circle. The last shape feature is the solidity, calculated as the ratio of the number of pixels inside the nucleus to the number of pixels inside the convex hull of the nucleus. In total, four shape features are calculated.
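These four features map directly onto scikit-image's regionprops; a minimal sketch, assuming instance_mask is a labeled instance segmentation (0 = background):

```python
from skimage.measure import regionprops

def shape_features(instance_mask):
    """Area, perimeter, eccentricity, and solidity for each labeled nucleus."""
    feats = []
    for region in regionprops(instance_mask):
        feats.append({
            "area": region.area,                  # number of pixels in the nucleus
            "perimeter": region.perimeter,
            "eccentricity": region.eccentricity,  # 0 = circle, approaching 1 = elongated
            "solidity": region.solidity,          # area / convex hull area
        })
    return feats
```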
Spatial features
For the spatial features, an example image is shown in Figure S2, where the centroid of each nucleus is marked by a red dot. As an example, the three radii x · r_mean for x ∈ {3, 5, 7} are shown as red circles. Crowding of nuclei results in a larger number of nuclei inside the inner circles.
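One plausible implementation of such crowding features counts the neighboring centroids within each radius, e.g. with a k-d tree; a sketch under the assumption that the nucleus itself is excluded from the count (the names centroids, r_mean, and spatial_features are illustrative):

```python
import numpy as np
from scipy.spatial import cKDTree

def spatial_features(centroids, r_mean):
    """Per nucleus: number of neighboring nuclei within radius x * r_mean, x in {3, 5, 7}."""
    tree = cKDTree(centroids)                     # centroids: (N, 2) array of nucleus centers
    feats = np.zeros((len(centroids), 3), dtype=int)
    for j, x in enumerate((3, 5, 7)):
        counts = tree.query_ball_point(centroids, r=x * r_mean, return_length=True)
        feats[:, j] = np.asarray(counts) - 1      # subtract the query nucleus itself
    return feats
```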

Deep learning-based classification
The deep learning-based classification is implemented in PyTorch Lightning utilizing the classification models provided with torchvision. To improve performance and reduce training time, versions of the models pretrained on ImageNet are used. The last layer is replaced by a fully connected linear layer with two output features corresponding to the two classes PTC-like and non-PTC-like. The Adam optimizer is used in combination with a cross-entropy loss. The initial learning rate is set by the PyTorch Lightning learning rate finder. The learning rate is halved if the loss on the validation data has not improved over the last 25 epochs, and training is stopped if the loss on the validation data has not improved for 50 epochs. The weights of the epoch with the best validation loss are chosen as the final model. To compare the results to the feature-based classification, the outer splits of the nested cross-validation from the feature-based classification are used as cross-validation splits for the deep learning-based classification, and the first inner split of the nested cross-validation is used as the training and validation split. Therefore, the test data is exactly the same for each split in both methods. Inside each split, five models are trained and the one performing best on the validation data is chosen as the final model for the split. The final validation and test accuracy is the mean accuracy of the chosen models. The models are trained on random crops of 512 × 512 px. The batch size is set to 32 for ResNet18 and ResNet50 and to 20 for ResNet101, which enables training on 16 GB of VRAM.
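A minimal PyTorch Lightning sketch of this setup; dataloaders, augmentation, and the learning rate finder are omitted, and the placeholder learning rate, module name, and callback configuration are assumptions rather than the authors' exact code:

```python
import pytorch_lightning as pl
import torch
import torch.nn as nn
from pytorch_lightning.callbacks import EarlyStopping, ModelCheckpoint
from torchvision import models

class PTCClassifier(pl.LightningModule):
    def __init__(self, lr=1e-4):  # in practice set by the Lightning learning rate finder
        super().__init__()
        self.model = models.resnet18(weights="IMAGENET1K_V1")     # ImageNet pretrained
        self.model.fc = nn.Linear(self.model.fc.in_features, 2)   # PTC-like / non-PTC-like
        self.lr = lr
        self.loss = nn.CrossEntropyLoss()

    def forward(self, x):
        return self.model(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return self.loss(self(x), y)

    def validation_step(self, batch, batch_idx):
        x, y = batch
        self.log("val_loss", self.loss(self(x), y))

    def configure_optimizers(self):
        opt = torch.optim.Adam(self.parameters(), lr=self.lr)
        # Halve the learning rate after 25 epochs without validation loss improvement.
        sched = torch.optim.lr_scheduler.ReduceLROnPlateau(opt, factor=0.5, patience=25)
        return {"optimizer": opt,
                "lr_scheduler": {"scheduler": sched, "monitor": "val_loss"}}

callbacks = [
    EarlyStopping(monitor="val_loss", patience=50),  # stop after 50 stale epochs
    ModelCheckpoint(monitor="val_loss"),             # keep the best epoch's weights
]
trainer = pl.Trainer(max_epochs=-1, callbacks=callbacks)
```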

Feature-based classification
Each classifier used in the nested cross-validation has its own set of classifier hyperparameters. For SVC, the kernel (linear, polynomial of degree three) and the regularization parameter C (0.5, 1.0, 2.0) are optimized. Because the SVC did not converge for data preprocessing hyperparameter combinations with neither quantile transformation nor standard scaling (102 data preprocessing hyperparameter combinations), these combinations are excluded from the optimization of the SVC. For KNN, the number of neighbors (2, 5, 10), the algorithm (BallTree, KDTree), and the leaf size (15, 30, 60) are optimized. For GNB and DT, no hyperparameters are optimized. For LogReg, the regularization hyperparameter C is optimized over 0.5, 1.0, and 2.0, and the maximum number of iterations is fixed to 1000. The number of estimators for RF is optimized over 2, 5, 10, 50, 100, and 200.
A grid search over all possible combinations of data preprocessing hyperparameters, feature selection method together with the number of features to select, and classifier results in 2346 different training setups. The setup performing best on the validation data is the final model, and its accuracy on the test data is reported as the final accuracy.
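For illustration, a scikit-learn sketch of one slice of such a grid search (quantile transformation, χ² feature selection, standard scaling, SVC); the pipeline ordering and the candidate values for the number of selected features are assumptions, and the full grid in the paper spans many more preprocessing, selection, and classifier combinations:

```python
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import QuantileTransformer, StandardScaler
from sklearn.svm import SVC

pipe = Pipeline([
    ("quantile", QuantileTransformer()),   # maps features to [0, 1], as chi2 needs non-negative input
    ("select", SelectKBest(chi2)),         # chi-squared feature selection
    ("scale", StandardScaler()),
    ("clf", SVC()),
])
param_grid = {
    "select__k": [10, 22, 30],             # number of selected features; values illustrative
    "clf__kernel": ["linear", "poly"],     # linear or polynomial of degree three
    "clf__degree": [3],
    "clf__C": [0.5, 1.0, 2.0],             # regularization parameter
}
search = GridSearchCV(pipe, param_grid, scoring="accuracy", cv=5)
# search.fit(X_train, y_train) selects the best setup on the validation folds.
```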

Domain gap and generalization
In this section, we examine whether a domain gap between the Nikiforov dataset and the Tharun and Thompson dataset exists. This is done to improve the interpretability of the generalization ability experiments carried out afterwards. The generalization ability of the classification methods is analyzed by training the methods on one dataset and applying them to the other one.

Domain gap
Although the images in both datasets are hematoxylin and eosin stained and show thyroid gland tumors, there is still a domain gap, e.g., due to the microscopes and imaging settings used. Furthermore, the tumor subtypes in the two datasets can differ. The images in the Nikiforov dataset have a resolution of 0.49 µm/px, while the images in the Tharun and Thompson dataset have a resolution of 0.23 µm/px. For the feature-based classification, the Nikiforov dataset was upscaled by a factor of two so that the same segmentation network could be used for both datasets. Upscaling by a factor of two leads to a resolution of 0.245 µm/px, which still differs by 6.5% from the Tharun and Thompson dataset.
To examine whether a domain gap is present, we used a ResNet50 pretrained on ImageNet and applied it to the images of both datasets. The 2048 output features of the second-to-last layer are used to describe each image. To visualize the features, the dimensionality is reduced from 2048 features to two features per image with the unsupervised dimension reduction approach UMAP [3].
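A minimal sketch of this feature extraction and reduction, assuming images is a batch of appropriately resized and normalized image tensors (names are illustrative):

```python
import torch
import torch.nn as nn
import umap
from torchvision import models

resnet = models.resnet50(weights="IMAGENET1K_V1")
resnet.fc = nn.Identity()                 # expose the 2048 penultimate-layer features
resnet.eval()

with torch.no_grad():
    features = resnet(images).numpy()     # shape: (n_images, 2048)

# Unsupervised reduction to two dimensions for visualization.
embedding = umap.UMAP(n_components=2).fit_transform(features)  # shape: (n_images, 2)
```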
To match the resolutions of the two datasets, the experiment is performed once with the Tharun and Thompson dataset downscaled by a factor of two and once with the Nikiforov dataset upscaled by a factor of two. The result is shown in Figure S3. Although the ResNet50 was not trained on histopathological data, UMAP was able to clearly separate both datasets.
Both datasets are mostly represented by compact, separate distributions in the reduced feature space. Some outliers exist in Figure S3a for the Tharun and Thompson dataset, and in Figure S3b some samples of the Nikiforov dataset fall within the Tharun and Thompson distribution. Visual inspection of these samples yielded no abnormalities compared to the remaining dataset. Therefore, it can be concluded that a domain gap is present between the two datasets.

Generalization ability
To evaluate the accuracy of the trained classification approaches on new datasets, the Tharun and Thompson dataset is used for training while the Nikiforov dataset is used for evaluation and vice versa.
Feature-based classification
For the evaluation of the generalization ability, we use the same trained classifiers that performed best in the evaluation on a single dataset. For the Tharun and Thompson dataset, this is an SVC with quantile transformation, no standard scaling, and sequential feature selection with a total of 22 selected features. For the Nikiforov dataset, this is an SVC with quantile transformation, standard scaling, and χ² feature selection with a total of 22 selected features. Because the classifier hyperparameters are optimized with a nested cross-validation with five splits, the final accuracy on a single dataset is the average performance of five different trained models. For the evaluation of the generalization ability, we apply these same five trained models to all samples of the other dataset, and the final prediction is defined by the majority vote of the five models. The results are shown in Table S4a.
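The majority vote itself reduces to a simple threshold over the five model predictions; a minimal sketch, assuming predictions is a (5, n_samples) array of binary labels (0 = non-PTC-like, 1 = PTC-like):

```python
import numpy as np

def majority_vote(predictions):
    # A sample is predicted PTC-like if at least 3 of the 5 models say so.
    return (predictions.sum(axis=0) >= 3).astype(int)
```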
Using the models trained on the Tharun and Thompson dataset and applying them to the Nikiforov dataset results in an accuracy of 66.2%, which is 17.3 percentage points lower than the best feature-based classification trained and evaluated directly on the Nikiforov dataset.
Using the models trained on the Nikiforov dataset and applying them to the Tharun and Thompson dataset results in an accuracy of 71.2%, which is 18.5 percentage points lower than the best feature-based classification trained and evaluated directly on the Tharun and Thompson dataset.
Since the pixel resolution of the two datasets still differs by 6.5% after scaling, and shape features like the area are among the most important features, this could be one reason for the decrease in accuracy. In the future, this effect could possibly be reduced by scaling the final extracted features according to the training dataset resolution. Furthermore, factors such as image scaling, microscope properties, illumination, and sample preprocessing can affect the color features. A change in color can subsequently change the cell segmentation and therefore the shape and spatial features. Reducing this impact is not trivial, and we aim to investigate it in the future.
Deep learning-based classification
As in the feature-based classification, the accuracy on a single dataset is the average of five different trained models, because a cross-validation with five splits is used for the deep learning-based classification. Again, all five models trained on the different splits of one dataset are applied to the other dataset, and the final prediction is defined by the majority vote of the five models. Since the models are trained on different resolutions, the test dataset is scaled according to the training dataset. The results are shown in Table S4b. The models trained on the Tharun and Thompson dataset and applied to the upscaled Nikiforov dataset reach an accuracy of 62.4%, which is 15.0 percentage points lower than the accuracy of the deep learning-based classification models trained directly on the unscaled Nikiforov dataset.
The models trained on the Nikiforov dataset and applied to the downscaled Tharun and Thompson dataset reach an accuracy of 70.5%, which is 18.6 percentage points lower than the accuracy of the deep learning-based classification models trained directly on the unscaled Tharun and Thompson dataset.
A possibility to increase the generalization ability of the deep learning-based classification could be test-time augmentation [4]. Furthermore, we expect that the deep learning-based classification in particular would benefit greatly from a more diverse training dataset covering a larger feature space, enabling it to generalize better to new data.