A comparative study on polyp classification using convolutional neural networks

Colorectal cancer is the third most common cancer diagnosed in both men and women in the United States. Most colorectal cancers start as a growth on the inner lining of the colon or rectum, called a polyp. Not all polyps are cancerous, but some can develop into cancer. Early detection and recognition of the type of polyp is critical to prevent cancer and improve outcomes. However, visual classification of polyps is challenging due to varying illumination conditions in endoscopy, variation in texture and appearance, and overlapping morphology between polyp types. More importantly, evaluation of polyp patterns by gastroenterologists is subjective, leading to poor agreement among observers. Deep convolutional neural networks have proven very successful in object classification across various object categories. In this work, we compare the performance of state-of-the-art general object classification models for polyp classification. We trained a total of six CNN models end-to-end using a dataset of 157 video sequences composed of two types of polyps: hyperplastic and adenomatous. Our results demonstrate that state-of-the-art CNN models can successfully classify polyps with an accuracy comparable to or better than that reported among gastroenterologists. The results of this study can guide future research in polyp classification.


Introduction
Colorectal cancer is the third most common cancer diagnosed in both men and women in the United States [1]. According to the American Cancer Society, a total of 101,420 new cases of colon cancer and 44,180 new cases of rectal cancer occurred in 2019. The lifetime risk of developing colorectal cancer is about 4.99% for men and 4.15% for women [1]. Colorectal cancer is the second leading cause of cancer-related deaths. Colon cancer is expected to cause about 51,020 deaths in the United States during 2020.
Polyps are considered the harbinger of colorectal cancer. Early detection and recognition of polyps can reduce deaths caused by colorectal cancer. Broadly speaking, colorectal polyps can be divided into two categories: non-neoplastic (hyperplastic) and neoplastic (adenomatous) [2]. Hyperplastic polyps do not predispose to cancer, whereas adenomatous polyps are considered pre-cancerous, as they account for approximately 85% [3] of sporadic colorectal cancers via the adenoma-carcinoma pathway. Adenomatous polyps are therefore removed during colonoscopy to prevent future cancer, and differentiating the two types of polyp histology is critical to determine which patients need close follow-up at shorter intervals and which patients can be surveyed every 10 years. Colonoscopy is the main diagnostic procedure to detect and recognize polyps located on colorectal walls. Accurate detection and correct classification depend on the skills and experience of the endoscopist. However, even for experienced endoscopists, performing conventional colonoscopy for long hours leads to mental and physical fatigue and degrades analysis and diagnosis. Other factors that may affect the classification results include varying illumination conditions, variation in texture and appearance, and occlusion. Moreover, different types of polyps are hard to differentiate since they may look very similar, with only subtle differences, as shown in Fig 1. Distinguishing one category from the other requires a thorough examination of fine details. Therefore, an accurate and effective automatic computer-aided system for colonoscopy is needed to help endoscopists detect and classify polyps. Such an automated recognition mechanism can also be used as a second opinion to determine whether a further biopsy is required for diagnosis, which in turn can greatly reduce the cost of diagnosis.
In addition, such an intelligent system can also be used as an educational resource for gastroenterology trainees to reduce the learning curve and cost.
In recent years, deep learning algorithms have shown outstanding performance on various generic datasets [4]. In some tasks, including strategic board games, Atari games, and generic object recognition, deep learning even outperforms humans. However, there is a significant difference between generic images and medical images, as medical images contain more quantitative information and the objects have no canonical orientation. In addition, acquiring medical data is expensive, and labeling it requires the involvement of domain experts. In this work, although we have used a total of 27,048 images to train our models, they are extracted from only 119 video sequences, each containing one polyp. In short, we have only 119 different polyps, imaged from various viewpoints under varying lighting conditions, to train our models.
Based on the results of our previous studies [5,6] and the results of the MICCAI Endoscopic Vision Challenge [7], state-of-the-art object detection models can already achieve very high precision in polyp detection. In this study, we therefore assume that the polyps have been detected and focus only on classification.
In our previous work [6], we collected and annotated an endoscopic dataset that contains 157 video sequences and a total of 35,981 frames. We also labeled the ground truth of the polyp location and histology class. In order to evaluate the performance of different classification models, we generate two polyp datasets from the annotated endoscopic dataset. As shown in Fig 2, one dataset (set-1) contains only the cropped polyp patches from the original video frames; the other dataset (set-2) contains not only the cropped polyps but also around 55% background around the polyps. As described in [8], the surrounding and vascular patterns, and the color of the vessels and background, differ according to the type of polyp. Therefore, we generate set-2 to study the effect of background features [8] on polyp classification. Fig 2 illustrates the difference between the two generated datasets. We have evaluated and compared the performance of six classification models on these two datasets. Our results show that there is no significant difference in classification accuracy between the two datasets. We have also analyzed the performance based on both individual frames and individual sequences. The major contributions of this work include:
• We have generated two datasets for polyp classification. To the best of our knowledge, no such datasets are available in the literature.
• We have implemented six state-of-the-art deep learning-based image classification models and compared their performance on the two datasets. This is the first comparative evaluation for polyp classification using different convolutional neural network (CNN) models.
• This study can serve as a baseline for future studies on polyp classification.
The trained classification models, as well as the test dataset, will be made freely available to the research community on the author's website.

Related work
Various approaches and models have been proposed for polyp detection in colonoscopy. A previous comparative validation study on the MICCAI 2015 polyp detection challenge covers models using handcrafted features as well as deep learning models. However, to the best of our knowledge, most previous work has focused on polyp detection rather than classification, due to the unavailability of datasets. Very few models have been proposed for classifying polyps into hyperplastic and adenomatous types. Previous polyp classification approaches can be broadly divided into two categories: handcrafted feature based models and deep learning based models.

Conventional computer vision approaches:
Most of the polyp classification work in the literature is based on handcrafted features. Some approaches employ a pit pattern classification scheme to classify polyps [9] into two classes: normal mucosa and hyperplastic. Hafner et al. [10] went beyond the conventional pit pattern approach and exploited a local fractal dimension (LFD) based strategy. Uhl et al. proposed a blob-adapted local fractal dimension (BA-LFD) approach [11] to classifying polyps. The maximal-minimal filter bank strategy proposed in [12] outperformed the BA-LFD based approach.
Neural network based approaches: The study [13] provided a first review of various deep learning based models for polyp classification. The authors compared the performance of VGG-VD [14], CNN-F [15], CNN-M [15], CNN-S [15], AlexNet [16], and GoogLeNet [17] on the i-Scan1, i-Scan2, and i-Scan3 databases. The paper [18] used a CNN model to classify polyps, but its experiments employed whole-slide images instead. The study [19] classified polyps into informative and non-informative categories instead of hyperplastic and adenomatous.
Deep learning models: Inspired by the success of AlexNet [16] in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2012, convolutional neural networks (CNN) have attracted a lot of attention and been successfully applied to image classification [20][21][22], object detection [4,23,24], depth estimation [25,26], image transformation [27,28], and crowd counting [29]. VGGNet [14] and GoogLeNet [17], top performers in ILSVRC 2014, proved that deeper models could significantly increase the ability of representations. ResNet [30] proposed a skip connection based residual module to solve the vanishing gradient problem of very deep models. Highway networks [31] proposed a gating mechanism to regulate the flow of information through shortcut connections. ResNeXt [32] proposed a multi-branch architecture and showed that cardinality is an essential factor in CNN architecture. Huang et al. proposed DenseNet [33], where each layer is connected to all subsequent layers. The winner of ILSVRC 2017, SENet [34], achieved 82.7% top-1 accuracy by modeling channel interdependencies at almost no computational cost. Recently, EfficientNet [35] has been proposed, which introduced a new scaling method for CNNs and achieved improved performance.
Most of the proposed CNN models are based on the following three approaches: (1) increasing the depth (number of layers) and/or width of the block architecture; (2) introducing an attention module; and (3) using a neural architecture search mechanism. The models chosen in this work are classical models that together cover all three approaches. In the task of object detection, classification models are used as the backbone network, and detection performance relies largely on the backbone. The most widely adopted backbone networks include VGG, ResNet, and DenseNet. Therefore, we include all three of these models in our study. In addition, we include SENet and MnasNet. SENet employs a novel channel-wise attention mechanism, while MnasNet uses neural architecture search. These models demonstrate the performance of state-of-the-art CNN models in polyp classification.

Materials and methods
Convolutional neural networks have been widely applied to various computer vision tasks, including object detection and classification. A general CNN consists of different blocks, including an input layer, an output layer, and a number of hidden layers made up of convolution layers, pooling layers, and activation layers. CNNs adaptively learn spatial hierarchies of features via backpropagation through these building blocks. In this section, we briefly review the classical object classification models used in this comparative study. These models include VGG [14], ResNet [30], DenseNet [33], Squeeze-and-Excitation Network (SENet) [34], and MnasNet [36].

VGG
VGGNet [14] was proposed by Simonyan and Zisserman to improve classification performance by adding more convolutional layers to increase the depth of the network. This is made possible by replacing large filters (11 × 11 and 5 × 5) with multiple stacked 3 × 3 filters. A max pooling layer is used to reduce the spatial dimensions every few layers. The stacked 3 × 3 convolution layers are followed by three consecutive fully connected layers and a softmax layer at the end. VGG is the first network structure that adopts a block-based architecture. ReLU non-linearity is applied to all hidden layers. The number of weight parameters in VGG is larger than in the previously proposed AlexNet, though it takes fewer epochs to converge because of the implicit regularization imposed by its depth and small convolution filter size.
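The trade-off behind stacking small filters can be checked with simple arithmetic. The following is a minimal sketch, not code from this study: it assumes stride-1 convolutions with the same number of input and output channels and ignores biases, and the channel count of 64 is a hypothetical value chosen only for illustration.

```python
def stacked_receptive_field(num_layers, kernel=3):
    """Receptive field of `num_layers` stacked stride-1 convolutions."""
    return 1 + num_layers * (kernel - 1)


def conv_weights(kernel, channels):
    """Weights of one kernel x kernel convolution with `channels`
    input and output channels (biases ignored)."""
    return kernel * kernel * channels * channels


channels = 64  # hypothetical channel count, for illustration only

# Each pair: (size of one large filter, number of stacked 3 x 3 layers
# that cover the same receptive field).
for large_kernel, num_small in [(5, 2), (11, 5)]:
    assert stacked_receptive_field(num_small) == large_kernel
    single = conv_weights(large_kernel, channels)
    stacked = num_small * conv_weights(3, channels)
    print(f"{large_kernel}x{large_kernel}: {single} weights, "
          f"{num_small} stacked 3x3: {stacked} weights")
```

Under these assumptions, the stacked 3 × 3 layers cover the same receptive field with fewer weights, while adding one non-linearity per layer.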

ResNet
To address the problem of vanishing gradients in deep neural networks, He et al. [30] proposed ResNet, which is built from residual blocks with skip connections that pass the input of a block to its output without modifying it. In addition, the residual block was restructured for the deeper variants, ResNet-50 and ResNet-101, using a bottleneck design. In these variants, each residual block uses a stack of 3 layers instead of 2, with a 1 × 1 convolution layer before and after the 3 × 3 layer. Here the 1 × 1 layers are responsible for adjusting the dimensions. Though ResNet is deeper than VGGNet, it has fewer filters and lower complexity. ResNet-34 has 3.6 billion FLOPs, which is only 18% of VGG-19.
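The skip connection and the bottleneck design can be illustrated with a short sketch. This is not the implementation used in this study: the feature vector is a plain Python list, `transform` stands in for the learned layers, and the channel counts (256 reduced to 64) are assumptions chosen for illustration; biases and normalization layers are ignored.

```python
def residual_block(x, transform):
    """Return transform(x) + x: the skip connection adds the unmodified
    input to the output of the learned layers."""
    fx = transform(x)
    return [f + xi for f, xi in zip(fx, x)]


def basic_block_weights(channels):
    """Two 3 x 3 convolutions at full width (biases ignored)."""
    return 2 * 3 * 3 * channels * channels


def bottleneck_block_weights(channels, reduced):
    """1 x 1 reduce, 3 x 3 at reduced width, 1 x 1 restore (biases ignored)."""
    return channels * reduced + 3 * 3 * reduced * reduced + reduced * channels


# If the learned layers output zeros, the block reduces to the identity,
# so the signal still passes through the skip connection.
print(residual_block([1.0, 2.0, 3.0], lambda v: [0.0 for _ in v]))

# Hypothetical widths: 256 channels reduced to 64 inside the bottleneck.
print(basic_block_weights(256), bottleneck_block_weights(256, 64))
```

With these illustrative widths, the bottleneck block needs far fewer weights than two full-width 3 × 3 layers, which is what makes the deeper variants affordable.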

DenseNet
Huang et al. [33] proposed DenseNet based on the observation that deep networks are more efficient to train if they contain shorter connections between layers close to the input and layers close to the output. DenseNet is made up of several dense blocks, in which each layer takes the feature maps of all preceding layers as input and passes its own feature maps to all subsequent layers. DenseNet combines the features from previous layers by concatenation instead of element-wise addition. In DenseNet, each layer has a small number of filters (12 filters), which makes the network thin and compact. In addition to having fewer weight parameters, DenseNet is easy to train because of the improved flow of information and gradients throughout the network.
Each layer produces k feature maps, so the number of input feature maps grows with depth. A 1 × 1 convolution layer is therefore used to reduce the number of input feature maps before the 3 × 3 convolution layer. With this design, DenseNet alleviates the vanishing gradient problem, strengthens feature propagation, and encourages feature reuse.
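The effect of concatenation on the channel count can be traced with a few lines of arithmetic. This is a minimal sketch rather than the DenseNet implementation: the growth rate k = 12 follows the text above, while the initial channel count of 24 and the block length of 4 are hypothetical values.

```python
def dense_block_input_channels(initial_channels, growth_rate, num_layers):
    """Input channels seen by each layer of a dense block: layer i receives
    the block input plus the k feature maps of each of the i earlier layers."""
    return [initial_channels + growth_rate * i for i in range(num_layers)]


def dense_block_output_channels(initial_channels, growth_rate, num_layers):
    """Channels leaving the block after all feature maps are concatenated."""
    return initial_channels + growth_rate * num_layers


# k = 12 as stated above; 24 initial channels and 4 layers are assumptions.
print(dense_block_input_channels(24, 12, 4))
print(dense_block_output_channels(24, 12, 4))
```

The linear growth of the input width is the reason a 1 × 1 convolution is placed before each 3 × 3 convolution.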

SENet
Researchers have tried to improve accuracy by stacking layers in different ways. Hu et al. [34] proposed a new architectural block, squeeze and excitation (SE), based on the observation that not all feature maps are equally important. In conventional convolutional networks, the output feature maps are equally weighted, whereas the SE block weights each channel adaptively in a content-aware manner. In more formal terms, the SE block employs global information to selectively emphasize informative features and suppress less useful ones. The SE block is made up of two operations: squeeze and excitation. The squeeze operation uses global average pooling to generate channel-wise statistics in the form of an n-dimensional feature vector, where n is the number of channels. The excitation operation passes this n-dimensional feature vector through two fully connected layers and generates a vector of the same length. The resulting vector is used to weight the original feature maps. The squeeze and excitation block can be embedded into any state-of-the-art object classification model at a slight additional cost. The squeeze and excitation network won first place in the ILSVRC 2017 classification task and reduced the top-5 error to 2.251%.
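The squeeze and excitation operations can be sketched in plain Python. This is an illustrative sketch, not the implementation used in this study: feature maps are nested lists, the ReLU and sigmoid activations follow our reading of the original SE design rather than anything stated above, and all weights below are hypothetical placeholders.

```python
import math


def squeeze(feature_maps):
    """Global average pooling: one statistic per channel."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
            for ch in feature_maps]


def dense(vec, weights, bias):
    """Fully connected layer; `weights` holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def excite(stats, w1, b1, w2, b2):
    """Two fully connected layers (ReLU, then sigmoid) that map the
    n channel statistics to n channel weights in (0, 1)."""
    hidden = [max(0.0, h) for h in dense(stats, w1, b1)]
    return [1.0 / (1.0 + math.exp(-z)) for z in dense(hidden, w2, b2)]


def se_block(feature_maps, w1, b1, w2, b2):
    """Rescale every channel by its excitation weight."""
    scales = excite(squeeze(feature_maps), w1, b1, w2, b2)
    return [[[s * v for v in row] for row in ch]
            for ch, s in zip(feature_maps, scales)]


# Two 2 x 2 channels and one hidden unit; the weights are placeholders.
maps = [[[1.0, 3.0], [5.0, 7.0]], [[2.0, 2.0], [2.0, 2.0]]]
w1, b1 = [[0.1, -0.2]], [0.0]
w2, b2 = [[1.0], [-1.0]], [0.0, 0.0]
print(squeeze(maps))
print(se_block(maps, w1, b1, w2, b2))
```

The hidden layer is narrower than the channel count, which is what keeps the additional cost of the block small.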

MnasNet
MnasNet [36], proposed by Google Brain, is an automated mobile neural architecture search approach based on reinforcement learning, which can identify a model that achieves a good trade-off between accuracy and latency. MnasNet introduced a hierarchical search space that provides layer diversity throughout the network instead of repeatedly stacking the same cells. The main components of MnasNet include (i) an RNN controller used for sampling model architectures; (ii) a trainer used to train the models sampled by the RNN controller; and (iii) a mobile phone-based inference engine for measuring latency. MnasNet has been evaluated on the ImageNet [37] and COCO [38] datasets. In this work, we used the architecture found by MnasNet on the ImageNet [37] dataset.

Implementation
Dataset preparation. In order to evaluate the performance of different models on the classification of polyps, we collected and labeled the following datasets. With the help of three endoscopists from the medical school of Jilin University and the University of Kansas Medical Center, we labeled the polyp classification results of all videos in datasets 1, 2, and 4. We also annotated the bounding boxes for all the polyps in datasets 3 and 4. During the annotation process, the endoscopists could not reach an agreement on some sequences, since these may need further biopsy verification. Those videos were removed from the datasets. We finally obtained a dataset of 157 videos (35,981 frames) with labeled ground truth for polyp histology and bounding boxes.
For the labeled dataset, we randomly split all the videos into training, validation, and test sets, which contain 119, 16, and 22 video sequences, respectively. The study focuses on evaluating the performance of state-of-the-art classification models. We assume the polyps have been accurately detected and generate two separate datasets for the evaluation. As shown in Fig 2, set-1 contains only the patches of the cropped polyps, and set-2 contains not only the cropped polyps but also about 55% background around the polyps.
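Splitting at the level of whole videos, rather than individual frames, keeps all frames of one polyp in a single subset. A minimal sketch of such a split is given below; the integer video identifiers and the random seed are hypothetical, and the function is not the script used to produce the split reported here.

```python
import random


def split_by_video(video_ids, n_train, n_val, seed=0):
    """Shuffle video identifiers and cut them into train, validation, and
    test lists, so frames of one video never appear in two subsets."""
    ids = list(video_ids)
    random.Random(seed).shuffle(ids)  # seeded for reproducibility
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]
    return train, val, test


# 157 videos split into 119 / 16 / 22, as described above.
train, val, test = split_by_video(range(157), n_train=119, n_val=16)
print(len(train), len(val), len(test))
```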

Training
In this study, we implemented and compared a total of six classical classification models: VGG19 with and without batch normalization [14], ResNet50 [30], DenseNet121 [33], SE-ResNet50 [34], and MnasNet [36]. The training dataset contains 119 sequences (27,048 images). We trained all the models using NVIDIA Tesla K80 or P100 GPUs. The hyperparameters used to train the models are tabulated in Table 1. All models were initialized with pre-trained ImageNet weights, and the training time of each model ranges from 1 to 3 hours.

Evaluation metrics
In the experiments, we train each model until it achieves its best performance on the validation set. To evaluate model performance, we calculate the top-1 classification error. In order to make a fair comparison of different models, the performance has also been evaluated in terms of sensitivity, specificity, accuracy, precision, and F1-score. The definitions of these metrics are listed in Table 2. We evaluate the performance of all models on each sequence individually for both datasets.
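For reference, the metrics named above can be computed from the four confusion-matrix counts. The sketch below uses the standard textbook definitions, which we assume match those in Table 2; the counts in the example are invented and are not results from this study.

```python
def _safe_div(numerator, denominator):
    """Return 0.0 instead of raising when a denominator is zero."""
    return numerator / denominator if denominator else 0.0


def binary_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts,
    with one class (for example, adenomatous) treated as positive."""
    sensitivity = _safe_div(tp, tp + fn)   # recall of the positive class
    specificity = _safe_div(tn, tn + fp)   # recall of the negative class
    precision = _safe_div(tp, tp + fp)
    accuracy = _safe_div(tp + tn, tp + fp + tn + fn)
    f1 = _safe_div(2 * precision * sensitivity, precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "f1": f1,
        "top1_error": 1.0 - accuracy,
    }


# Invented counts, for illustration only.
print(binary_metrics(tp=8, fp=2, tn=6, fn=4))
```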

Results
In this section, we report the classification results of all comparative models using the two datasets. All input images are resized to 224 × 224 for a fair comparison. All models include batch normalization except VGG-19. The test set contains a total of 22 sequences (4719 frames), where 13 sequences (2890 frames) belong to adenomatous and 9 sequences (1829 frames) belong to hyperplastic. All models employ softmax as the classifier to yield the scores for the two classes, and the model outputs the class corresponding to the higher score. The top-1 error, precision, recall (individual class accuracy), and F1-score for both categories are shown in Table 3. To alleviate the influence of the variation of illumination, all images in the datasets were normalized with respect to their mean and standard deviation. The mean and standard deviation of both datasets are listed in Table 4.
Table 3. Overall performance of all models on set-1 and set-2 based on individual frames, irrespective of sequence. https://doi.org/10.1371/journal.pone.0236452.t003
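The normalization step can be sketched as follows. This is an illustration under stated assumptions, not the preprocessing code of this study: pixels are RGB tuples scaled to [0, 1], statistics are computed per channel using the population standard deviation, and the toy pixel values are invented (the actual values are those listed in Table 4).

```python
import math


def channel_stats(pixels):
    """Per-channel mean and population standard deviation of RGB tuples."""
    n = len(pixels)
    means = tuple(sum(p[c] for p in pixels) / n for c in range(3))
    stds = tuple(
        math.sqrt(sum((p[c] - means[c]) ** 2 for p in pixels) / n)
        for c in range(3)
    )
    return means, stds


def normalize_pixel(pixel, means, stds):
    """Subtract the dataset mean and divide by the dataset standard
    deviation, channel by channel (stds must be non-zero)."""
    return tuple((v - m) / s for v, m, s in zip(pixel, means, stds))


# Invented pixel values standing in for a dataset.
toy_pixels = [(0.0, 0.25, 1.0), (1.0, 0.75, 0.0)]
means, stds = channel_stats(toy_pixels)
print(means, stds)
print(normalize_pixel(toy_pixels[1], means, stds))
```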

Frame-based performance
We first report the comparative performance of different models based on each individual frame. Frame-based performance is measured without considering which sequence a frame belongs to. It measures the overall accuracy in the same way as generic classification evaluation on other datasets. As shown in Table 3, VGG19 outperforms all other models with an overall accuracy of 75.71% and 79.78% for set-1 and set-2, respectively. The precision of the adenomatous class is higher than that of the hyperplastic class for every model in both datasets, except for VGG-19 with batch normalization (on set-1) and ResNet50 (on set-2). If we consider precision and F1-score for every model in both datasets, the precision of Adenomatous is always higher than that of Hyperplastic. VGG-19 has also achieved the highest recall for both classes on set-2. The more recently proposed models, such as ResNet, SENet, and MnasNet, did not perform well on either dataset, although they perform better than VGG-19 on generic image classification datasets. From Table 3 we also observe that VGG-19 outperforms VGG-19 with batch normalization in most metrics. This contradicts what has been observed on other datasets. A possible reason is that, in polyp classification, the exact intensity values of the pixels may be more useful for discriminating between polyp types than they are in generic image classification, while the batch normalization layer rescales values with respect to each batch, which may alter the intensity information and degrade performance.
To better visualize the performance, we use the receiver operating characteristic (ROC) curve and the area under the curve (AUC) to demonstrate the frame-based performance. The AUC of the ROC curve represents the degree of separability of a classification problem, that is, the capability of a model to differentiate between classes. Figs 3 and 4 show the ROC curves of different models for set-1 and set-2, respectively. The results show that, in general, the models achieve better classification performance on set-2 than on set-1, except for ResNet. We can also see that VGG-19 achieves the highest AUC and the best accuracy on set-2.
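The notion of separability measured by the AUC can be made concrete: the AUC equals the probability that a randomly chosen positive frame receives a higher score than a randomly chosen negative frame, with ties counted as one half. The sketch below computes it by direct pairwise comparison; the scores and labels are invented, and this is not the evaluation code used for Figs 3 and 4.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positive
    (label 1) and negative (label 0) scores; ties count as 0.5."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Invented softmax scores for the positive class, for illustration only.
print(roc_auc([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0]))
```

The pairwise form is quadratic in the number of frames; for large test sets a rank-based computation gives the same value more efficiently.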

Sequence-based performance
Based on the classification of each frame, we can measure the performance on each sequence. The sequence-by-sequence performance for the two datasets is shown in Figs 5 and 6, respectively. The results are not consistent among all frames within the same sequence of the same polyp. This is because the appearance of the polyp may change significantly due to variations in viewpoint, zoom scale, and illumination. Fig 7 shows some sample frames of a sequence under different viewpoints and lighting conditions. In this case, even experienced endoscopists cannot make an accurate prediction from a single frame. As a result, not all frames can be correctly classified. In practice, we calculate the percentage of correctly classified frames for each sequence. We then set a threshold on this percentage, and a sequence is considered correctly classified if the percentage of correctly classified frames is greater than the specified threshold. The test sequences 1, 3, 10, 12, 13, 14, 18, 19, 21, and 22 are correctly classified by all models for both datasets, while the results for sequences 2, 4, 5, 6, 7, 9, 11, 17, and 20 are not consistent because the percentage of correctly classified frames lies between 40% and 50%. Sequences 5 and 6 were not classified well across the models. Some sample frames of sequences 5 and 6 are shown in Fig 8; they exhibit large variations in appearance, which makes classification difficult. Table 5 shows the threshold-based performance of all models on the two datasets. The results indicate the consistency of the predictions of different models, from which we can see that the VGG models achieve relatively better performance than the other models.
For example, VGG-19 achieves around 70%, 80%, and 90% accuracy at the thresholds of 70%, 60%, and 50%, respectively. Comparing Tables 3 and 5, we find that at a threshold of 50%, the sequence-based accuracy is much higher than the frame-based accuracy, especially for the VGG models. However, at a higher threshold of 70%, the frame-based accuracy is higher than the sequence-based accuracy, which reflects how consistent the predictions are within each sequence. To better visualize the sequence-based performance, we include box plots, which show the distribution of per-sequence accuracy over the 22 test sequences. Fig 9 shows the box plots of different models on set-1 and set-2, respectively. The maximum accuracy of all models is 100% because at least one sequence has been correctly classified in all of its frames by each of the models. The upper quartile range depends on the median value: a high median shrinks the upper half of the range, which reflects the ability of the model to classify sequences correctly and consistently. On set-1, VGG-19 achieves the highest median value, which indicates that half of the sequences are correctly classified even at a very high threshold. On set-2, ResNet-50 yields the most consistent results with the highest median value. We can also see from the results that the upper quartile ranges are smaller than the lower quartile ranges, which indicates that the spread of accuracy below the median value is very high.
Table 5. Accuracy per sequence for all models at different thresholds on set-1 / set-2. The value before '/' is the accuracy on set-1 and the value after '/' is the accuracy on set-2.
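The threshold rule for sequence-based accuracy can be sketched as follows. The data structure (a dictionary mapping a sequence name to per-frame correctness flags) and the toy values are assumptions made for illustration; the function is not the evaluation script behind Table 5.

```python
def sequence_accuracy(frame_correct_by_sequence, threshold):
    """Fraction of sequences whose share of correctly classified frames
    is strictly greater than `threshold` (a value between 0 and 1)."""
    correct_sequences = 0
    for flags in frame_correct_by_sequence.values():
        share = sum(flags) / len(flags)  # True counts as 1
        if share > threshold:
            correct_sequences += 1
    return correct_sequences / len(frame_correct_by_sequence)


# Three invented sequences with 80%, 60%, and 30% of frames correct.
toy = {
    "seq_a": [True] * 8 + [False] * 2,
    "seq_b": [True] * 6 + [False] * 4,
    "seq_c": [True] * 3 + [False] * 7,
}
for threshold in (0.5, 0.6, 0.7):
    print(threshold, sequence_accuracy(toy, threshold))
```

Raising the threshold can only keep or lower the sequence-based accuracy, which matches the trend described above for thresholds of 50%, 60%, and 70%.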

Polyp crops vs crops with background
In order to test the effect of background information on polyp classification, we generated two datasets for the experiments: set-1 contains only polyp crops, and set-2 contains polyp crops with about 55% background. From Table 3 we can see that, in terms of frame-based performance, all models except the VGG models achieve higher accuracy on set-1 than on set-2. In terms of the overall AUC-ROC score, set-2 yields better performance, which means the two classes are easier to distinguish in set-2 than in set-1. In the sequence-based analysis, the performance on all sequences is very similar for both datasets. For consistency-based performance, the consistency improves on set-2 for VGG-19, VGG-19 with batch normalization, and DenseNet, whereas for the other models, the overall threshold-based accuracy is very close. If we consider the box plots and take the median as the threshold, the consistency of correctly classifying sequences improves on set-2 for ResNet, DenseNet, and SENet.

Conclusion
In this paper, we have established two datasets and compared six state-of-the-art deep learning-based classification models. We have evaluated the results both at the frame level and at the polyp level. Our results show that VGG-19, in general, outperforms the other models in both cases for both datasets. Some more advanced classification models, such as ResNet, DenseNet, SENet, and MnasNet, did not perform well in our experiments, although they have advantages on other benchmark datasets. The poor performance may be caused by the limited size of the polyp dataset. This study provides a good baseline for future research to develop more accurate and more robust polyp classification models.