Colonoscopy polyp detection and classification: Dataset creation and comparative evaluations

Colorectal cancer (CRC) is one of the most common types of cancer with a high mortality rate. Colonoscopy is the preferred procedure for CRC screening and has proven to be effective in reducing CRC mortality. Thus, a reliable computer-aided polyp detection and classification system can significantly increase the effectiveness of colonoscopy. In this paper, we create an endoscopic dataset collected from various sources and annotate the ground truth of polyp location and classification results with the help of experienced gastroenterologists. The dataset can serve as a benchmark platform to train and evaluate the machine learning models for polyp classification. We have also compared the performance of eight state-of-the-art deep learning-based object detection models. The results demonstrate that deep CNN models are promising in CRC screening. This work can serve as a baseline for future research in polyp detection and classification.

rely on early detection is due to the nature of the symptoms and development of CRC. Although no symptoms can be easily observed before the tumor reaches a certain size (typically several centimeters) [5], it typically takes several years to as long as a decade for CRC to develop [6], starting from precancerous polyps. Both facts add up to show the significance and potential of diagnosing CRC by regular screening at an early stage, even before polyps become cancerous.

CRC screening options
There are several common CRC screening options, which can be roughly divided into two categories: visual examinations and stool-based tests. Each method has its advantages and limitations. The evaluation needs to take into account a broad range of factors, including statistical data and psychological effects. The most important metric, as in many other screening tests, is 'sensitivity' [5], also called 'recall' in some other fields, defined as the percentage of patients with the disease who are actually detected. From sensitivity, we know the probability of a patient walking out of the clinic with a lesion undetected, the consequences of which are severe. Therefore, in many instances, it is the single most important metric to optimize.
Another statistical measure that often accompanies sensitivity is 'specificity', measured as the fraction of healthy people who are correctly identified. It indicates the potential of a test to falsely detect lesions in healthy clients. False detections cause mental stress for the clients, and the subsequent treatment might result in unnecessary physical harm and financial burdens. Thus, a test with high specificity is also preferred. For a screening method in real clinical settings, there is generally a trade-off between sensitivity and specificity. Because the consequences of missing a lesion are much graver than those of a false diagnosis, sensitivity is usually preferred over specificity. A higher-specificity screening can always follow a high-sensitivity test to filter out the falsely diagnosed cases [5]. Other factors include how easy the preparation is, how accessible the facility is, how much the test costs, etc. Since individuals who need to be screened are oftentimes asymptomatic, the experience will affect their compliance, which is an important part of an effective screening program [5]. In the following section, some common CRC screening methods and their properties are discussed.
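As a small illustration of the two metrics defined above, the following sketch computes sensitivity and specificity from the four outcome counts of a screening test. The counts are hypothetical and chosen only to make the arithmetic easy to follow; they are not taken from any study cited here.

```python
def sensitivity(true_positives: int, false_negatives: int) -> float:
    """Fraction of patients with the disease that the test detects (recall)."""
    return true_positives / (true_positives + false_negatives)


def specificity(true_negatives: int, false_positives: int) -> float:
    """Fraction of healthy individuals that the test correctly clears."""
    return true_negatives / (true_negatives + false_positives)


# Hypothetical outcome for 1,000 screened individuals, 50 of whom have lesions.
tp, fn = 45, 5      # lesions detected / lesions missed
tn, fp = 855, 95    # healthy individuals cleared / healthy individuals flagged

print(f"sensitivity = {sensitivity(tp, fn):.2f}")
print(f"specificity = {specificity(tn, fp):.2f}")
print(f"probability of a missed lesion = {1 - sensitivity(tp, fn):.2f}")
```

The last line makes explicit that the chance of a patient leaving with an undetected lesion is one minus the sensitivity.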
Colonoscopy is the recommended CRC visual examination screening method. The advantages of colonoscopy include high sensitivity, the ability to remove lesions at detection, and full access to the proximal and distal portions of the colon [5]. Colonoscopy can reach a sensitivity of 95% in detecting CRC according to Rex et al. [7]. The disadvantages are mostly related to the way colonoscopy is conducted [5,8]. It requires a complicated bowel preparation starting at least one day before the test, during which the participant must change diet and take medicine to cause diarrhea. During the test, sedation or anesthesia might be administered, and there is a risk of post-colonoscopy bleeding. Thus, the suggested 10-year screening interval has a low compliance rate [5]. Narrow-Band Imaging (NBI) is a newly developed technique that modifies the light source using optical filters in an endoscope system [9]. Compared to normal colonoscopy, intensified light of certain wavelengths can better present the mucosal morphology and vascular pattern [10]. Studies show that NBI performs better in CRC detection than conventional colonoscopy [9,10].
Computed Tomography (CT) Colonoscopy is a structural radiologic examination that employs software to reconstruct 3D views of the entire colon to detect lesions. Although it has a slightly lower sensitivity of > 90%, the less invasive nature of CT colonoscopy results in a higher participation rate [11,12]. The limitations include unpleasant bowel preparation before the test, uncomfortable inflation of the colon with air during the test, and safety concerns over the use of radiation. Compared to colonoscopy, CT colonoscopy has not been studied as thoroughly; for example, the appropriate screening interval is uncertain [5]. Given that CT colonoscopy requires a follow-up colonoscopy when a lesion is detected and that its sensitivity is highly dependent on radiologists' expertise, it is only recommended for individuals whose physical conditions are not fit for the invasive examination of the colon [5]. A similar screening method, double-contrast barium enema, is also not recommended due to similar limitations and even more complicated procedures [13].
Sigmoidoscopy is similar to colonoscopy, but it can only access the distal part of the colon. It shares the same high sensitivity as colonoscopy and can remove lesions at detection. In addition, it requires less complicated bowel preparation and usually does not need sedation [13]. However, because sigmoidoscopy cannot reach the proximal colon, it is less effective given the higher risk of proximal CRC among elderly individuals and women [13]. Therefore, it is recommended to pair sigmoidoscopy with other screening methods [5].
Wireless Capsule Endoscopy uses a miniaturized camera in a swallowable capsule to transmit gastrointestinal images to portable receiver units that can be easily worn [14]. Although a typical examination takes about 7 hours [15], the process does not impact patients' quality of life as much as other methods. The wireless capsule can also examine the entire small bowel, which is not accessible to other endoscopy practices [15]. Nevertheless, wireless capsule endoscopy has some drawbacks as well. For example, it has no therapeutic capability [15]. Also, it does not take images in a distended bowel as other methods do [15], and the practitioners need training to interpret the images.
Fecal Occult Blood Test (FOBT) and Fecal Immunochemical Test (FIT) both detect hemoglobin in the stool to indicate whether a lesion exists. Both tests are non-invasive and easy to carry out even at home, but their sensitivities suffer for earlier stages of lesions due to less frequent bleeding [5]. In addition, some dietary intakes can alter the test results, reducing the performance of FOBT and FIT.
There are other screening tests like the DNA test, wireless capsule endoscopy, etc. However, due to low sensitivity and a lack of sufficient supportive studies, they normally need a subsequent colonoscopy when the result is positive.

Goals
As the reference CRC screening test, colonoscopy has obvious advantages over its alternatives. However, its performance depends on several variables, like the bowel preparation, the number of polyps, and the part of the colon where the polyps are located [16][17][18]. Furthermore, human factors can influence the screening sensitivity and specificity. Inexperienced gastroenterologists have higher miss rates compared to those who are well-trained. According to Leufkens et al. [17], participants before training showed significantly lower performance than after training. Colonoscopy is also subject to the physical and mental fatigue of the gastroenterologists. The screening process requires prolonged concentration and is usually repeated throughout the day. A study by Chan et al. [16] showed that 20% more polyps are detected in early morning screenings.
A fine-grained deep learning framework that automatically detects polyps is therefore needed to help physicians locate and classify the lesions. Such a framework can assist physicians during screening in real time and highlight the detected region and polyp category. Thus, a computer-aided system can help reduce the miss rate due to physical and mental fatigue and allow the gastroenterologists to focus on regions where lesions actually exist. This automated system can also help maintain high performance in clinics where access to experienced gastroenterologists is difficult. An accurate detection system using Convolutional Neural Network (CNN) models can also improve the detection rate of smaller pre-cancerous polyps. The sensitivity of current colonoscopy suffers as the size of the polyp becomes smaller [5,6,13]. This can be improved because the state-of-the-art CNN models can extract features from objects at different scales.
Deep learning models require large datasets to exploit their full potential. Recent benchmark datasets for general computer vision tasks all have more than 10k images [19]. We want to build a polyp classification dataset based on videos from the colonoscopy procedure with a reasonable number of samples to train deep neural network models. The images in the dataset contain polyps from different stages and are representative of different types of polyps. We label each frame with accurate polyp locations and categories. Although constructing such a dataset is time-consuming and labor-intensive, it will benefit the research community in developing more accurate and robust deep learning models to achieve a higher detection rate and to reduce the CRC mortality rate. The dataset could also standardize and facilitate the training of medical professionals in endoscopy.
Using the developed dataset, we have evaluated and compared the performance of the state-of-the-art deep learning models for polyp detection and classification. The dataset and the corresponding annotations can be downloaded via https://doi.org/10.7910/DVN/FCBUOR.

Related Work
Deep learning has attracted increasing attention in recent years, with wide applications across a variety of areas. It boosts performance by a significant margin in tasks like computer vision, speech recognition, natural language processing, data analysis, etc. [20][21][22][23][24][25]. The success is largely owing to the development of deep Convolutional Neural Networks (CNN), which have proven especially effective in extracting high-level features. Among all these areas, deep learning has achieved huge success in computer vision applications, with early CNN models almost halving the error rate in the ImageNet classification challenge compared to classic models [21]. In recent years, CNN-based models have demonstrated outstanding capabilities in many complicated vision tasks, like object detection, image segmentation, object tracking, etc. [26][27][28][29][30].

Computer Vision in Medical Applications
Researchers have been trying to use computer vision techniques in medical applications since as early as 1970 [31]. At that time, image processing was limited to low-level tasks like edge finding and basic shape fitting. As the handcrafted models became more sophisticated, some studies showed success in areas like salient object detection and segmentation [32,33]. The ability of these models to analyze surface pattern and appearance prompted their application in a wide range of medical fields, such as neuro, retinal, digital pathology, cardiac, and abdominal imaging [31]. Bernal et al. [34] proposed a model that considers polyps as protruding surfaces and utilizes valley information along with completeness, robustness against spurious responses, continuity, and concavity boundary constraints to generate an energy map related to the likelihood of polyp presence. In [35], the model exploits a color feature extraction scheme based on wavelet decomposition and then uses linear discriminant analysis to classify the region of interest. Other handcrafted feature approaches can be found in [36].
The limiting factor of hand-engineered models is the need for researchers to understand and design the filters. They tend to perform better for low-level features. Deep learning models can automatically learn parameters with deeper layers and extract high-level semantic features. Especially in recent years, many new models [37,38] and techniques [39][40][41][42] have been published that set new records in various computer vision tasks. The model in [43] employs a multi-scale architecture with 3 CNN layers and 3 max-pooling layers followed by fully connected layers. Another model takes a slightly different approach, feeding three kinds of extracted features (color and texture clues, temporal features, and shape) to an ensemble of 3 CNN models [44]. Deep learning models have been widely applied to medical problems like anatomical classification, lesion detection, and polyp detection and classification in colonoscopy [45][46][47][48][49][50]. In [45], six classical image classification models are compared for determining the categories of detected polyps. That study assumes all polyps have been detected and cropped out from the original sequences. An enhanced U-Net structure has been proposed in [51] for polyp segmentation. In this paper, we focus on polyp detection from endoscopic sequences to assist gastroenterologists in both polyp detection and classification. We evaluate and benchmark the state-of-the-art detection models for colonoscopy images.

Object Detection
Different computer vision techniques can be adapted to perform polyp detection, such as object detection, segmentation, and tracking. Object detection takes images as input and generates classification results for the objects present in the images along with their corresponding location information. Object locations are most commonly defined by rectangular bounding boxes. The output of image segmentation contains more detailed information, namely a classification result for each pixel in the original image, while object detection usually only produces the corner coordinates of each bounding box. Thus, image segmentation is usually more time-consuming. In practice, pixel-level classification is not necessary for polyp detection and classification. In this study, we focus on object detection techniques. The state-of-the-art deep learning-based object detection models can be broadly classified into two main categories: two-stage detectors and one-stage detectors.
A two-stage detector consists of a region proposal stage followed by a classification stage. Each of the two stages has its own dedicated deep CNN, which generally produces higher accuracy compared to one-stage detectors. However, this also leads to more processing time. The region proposal stage used to be the bottleneck because it was often slow, but state-of-the-art two-stage detectors adopt new structures that share part of the CNN to speed up processing for real-time applications [52].
A one-stage detector removes the separate region proposal stage and fuses it with the classification stage, resulting in a one-stage framework. It directly predicts bounding boxes by densely sampling the entire image in a single network pass. With a simpler architecture, it often achieves real-time performance. Although earlier one-stage models had lower detection accuracy than two-stage detectors, they are catching up and can now produce comparable results.

Evaluation Models for Detection and Classification
In this section, we briefly introduce the eight state-of-the-art object detection and classification models that are implemented and evaluated in this comparative study.
Faster RCNN [52]: Faster RCNN is a two-stage framework and a member of the RCNN family of networks [53,54]. It improves on the Fast RCNN network by replacing the slow selective search algorithm with a region proposal network, resulting in faster detection. Furthermore, the region proposal network is trainable, which can potentially achieve better performance.
Faster RCNN is mainly composed of two modules, the region proposal network (RPN) module and the classification module. The features extracted by the backbone network are shared by both the RPN module and the classification module. In the RPN branch, a sliding window is applied to regress the bounding box locations and the probability scores of object and non-object. At each location, the sliding window predicts k pre-defined anchor boxes, centered at itself with different sizes and ratios to achieve multi-scale learning. With the introduction of RPN, the inference time on PASCAL VOC is reduced to 198 ms on a K40 GPU with VGG-16 as the backbone [52]. Compared to selective search, it is almost 10 times faster. The computational time of the proposal stage is reduced from 1,510 ms to only 10 ms. Combined, the new Faster RCNN can achieve 5 frames per second (fps).
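The k anchor boxes placed at each sliding-window position can be sketched as follows. The three scales and three aspect ratios (giving k = 9) follow the defaults reported in the Faster RCNN paper; the function name and box layout are illustrative assumptions rather than the implementation evaluated in this study.

```python
import math


def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return k = len(scales) * len(ratios) boxes (x1, y1, x2, y2) centered at (cx, cy)."""
    anchors = []
    for s in scales:
        for r in ratios:
            # r is height / width; the box area is kept at s * s for every ratio.
            w = s / math.sqrt(r)
            h = s * math.sqrt(r)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


# One sliding-window position at pixel (300, 200) of the input image.
for box in make_anchors(300, 200):
    print(tuple(round(v, 1) for v in box))
```

Each position therefore contributes the same set of shapes, and the RPN regresses offsets relative to these boxes rather than predicting coordinates from scratch.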
YOLOv3 [56]: YOLOv3 is an iterative improvement of YOLO (You Only Look Once). It improves the performance of its previous versions by introducing a new backbone network, multi-scale prediction, and a modified class prediction loss function.
YOLO is the first model of the YOLO series [56][57][58]. It is one of the pioneering works to get rid of the region proposal stage. The detector splits the image into an S × S grid. Each cell is responsible for predicting ground truth objects whose centers are located inside the cell, and each cell in the grid predicts B × (4 + 1 + C) values, where B is the number of anchor boxes in each cell, 4 + 1 represents the four bounding box coordinates plus one object confidence score, and C is the total number of classes. The second version, YOLOv2, along with YOLO9000, introduced several optimization tricks to improve performance, such as batch normalization, a high-resolution classifier, a new network, multi-scale training, etc. Among the optimizations, the most effective technique is the use of dimension priors, which constrain the regressed bounding boxes to stay close to their original anchors. Without it, the regressed boxes can go anywhere in the image, resulting in unstable training [58]. YOLOv3 progressively developed a deeper CNN, DarkNet-53, from DarkNet-19 [56]. It also predicts objects at different scales. YOLOv3 achieves real-time performance. However, it often has lower detection accuracy compared to Faster RCNN.
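To make the size of the prediction tensor concrete, the following sketch evaluates B × (4 + 1 + C) for a single cell and for the whole grid. The grid size S = 13 and B = 5 anchors are assumed values used only for illustration; C = 2 matches the two polyp categories (hyperplastic and adenomatous) used in this dataset.

```python
def values_per_cell(num_anchors: int, num_classes: int) -> int:
    """B * (4 + 1 + C): four box coordinates, one confidence, C class scores per anchor."""
    return num_anchors * (4 + 1 + num_classes)


def total_outputs(grid_size: int, num_anchors: int, num_classes: int) -> int:
    """Number of values predicted for an S x S grid."""
    return grid_size * grid_size * values_per_cell(num_anchors, num_classes)


S, B, C = 13, 5, 2  # S and B are illustrative assumptions; C = 2 polyp classes
print(values_per_cell(B, C))
print(total_outputs(S, B, C))
```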
YOLOv4 [59]: YOLOv4 is the latest improvement of YOLO. It explores the bag of freebies and bag of specials and selects some of them for the new detection model. The basic rules for a detection model are high-resolution input images for detecting relatively small objects, deeper layers for a larger receptive field, and more parameters for detecting various objects. Based on those rules, YOLOv4 selects various effective bag-of-freebies and bag-of-specials techniques to enhance the performance of the model while maintaining high-speed inference. In addition, instead of using DarkNet53 as the backbone as in YOLOv3, an enhanced version of DarkNet53 (CSPDarknet53 [60]) is selected as the backbone for YOLOv4. A larger receptive field is extremely important to detectors; thus, an SPP [61] block is added over the CSPDarknet53 backbone [60], since this block provides larger receptive fields with almost the same inference time. YOLOv3 utilizes FPN [62] to aggregate the information from different feature levels, while YOLOv4 [59] employs PANet [63] to extract information for the detector heads. Bag of freebies and bag of specials are indispensable for object detection, and properly selecting and adding them to detection models may greatly boost the performance of the detectors without adding too much inference cost.
SSD [64]: Single Shot Detector (SSD), as one of the most successful one-stage detectors, has become the foundation of many other studies. It takes advantage of the different sizes of feature maps and utilizes a simple architecture to generate predictions at different feature map scales. SSD can achieve a fast detection rate with competitive accuracy.
As shown in Figure 2, SSD combines multi-scale convolutional features to improve prediction. In a CNN, feature maps progressively decrease in size from input to output. The layers closer to the input are shallow layers, which have higher resolution and are better at detecting smaller objects, while the deeper layers have lower resolution but contain more semantic information. SSD takes advantage of this natural structure of CNNs and yields comparable results for objects of all sizes. SSD is an anchor-based detector. It divides the image into m × n grids, similar to the YOLO series. At each grid cell, the model generates per-class scores and bounding box offsets for each of the k pre-defined anchors with different ratios and scales, similar to the RPN in Faster RCNN. It also introduces the use of convolutional layers for prediction, which makes the detector fully convolutional, unlike YOLO [57], which uses fully connected layers for detection.
SSD makes a good trade-off between speed and accuracy. The simple one-stage architecture results in fast performance, achieving a real-time detection rate. Furthermore, the use of anchor boxes and multi-scale prediction enables good detection accuracy.
RetinaNet [65]: RetinaNet is a one-stage framework based on the SSD model. RetinaNet improves performance by using the Feature Pyramid Network (FPN) [62] for feature extraction and the focal loss function to solve the class imbalance problem. In the SSD model, the multi-scale prediction mechanism suffers from an architectural weakness in which high-level layers do not share information with low-level layers, so the low-level layers lack high-level semantic information when detecting smaller objects. FPN combines feature maps from layers at different depths to improve detection at each scale. Another major contribution of this model is the use of focal loss to solve the class imbalance problem. Class imbalance refers to the imbalance between the background and foreground classes. It is more extreme in one-stage models because the detector scans through the entire image indiscriminately. In practice, the number of candidate locations can go up to 100k without the filtering of a region proposal module. Therefore, the focal loss is introduced to assign higher weights to difficult foreground objects and lower weights to easy background cases. The focal loss is defined in Equation (1), where the balance variant $\alpha_t$ and the focusing parameter $\gamma$ are two hyper-parameters and $p$ is the estimated probability.
\[ \mathrm{FL}(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t) \qquad (1) \]
In Equation (1), $p_t$ is closer to 1 when the model is more correct (i.e., a correct prediction with a higher confidence score or a wrong prediction with a lower confidence score). With the original cross entropy loss written as $\mathrm{CE} = -\alpha_t \log(p_t)$, focal loss effectively multiplies it by a factor $(1 - p_t)^{\gamma}$, whose value is small when the model is correct (easy cases) and large when the model is wrong (hard cases).
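The effect of the modulating factor can be checked numerically with a minimal sketch of the binary focal loss for a single prediction. The default values α = 0.25 and γ = 2 are the settings reported in the RetinaNet paper, not values stated in this text, and the function is an illustration rather than the training code used for the experiments.

```python
import math


def focal_loss(p: float, y: int, alpha: float = 0.25, gamma: float = 2.0) -> float:
    """Binary focal loss for one prediction.

    p is the predicted foreground probability; y is 1 for foreground, 0 for background.
    """
    p_t = p if y == 1 else 1.0 - p              # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy = focal_loss(0.95, 1)  # confident and correct: strongly down-weighted
hard = focal_loss(0.10, 1)  # confident and wrong: keeps most of its loss
print(f"easy example loss: {easy:.6f}")
print(f"hard example loss: {hard:.6f}")
```

Setting gamma to 0 removes the modulating factor and recovers the α-weighted cross entropy, which is a convenient sanity check.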
DetNet [66]: DetNet is a backbone network specifically designed to extract features for detection, which makes it different from the other detectors discussed in this section. It is designed to tackle three existing problems in previous backbone networks:
• Backbone networks have a different number of stages;
• Feature maps used to detect large objects usually come from deeper layers, which have a larger receptive field but cannot localize objects accurately due to their low resolution;
• Small objects are lost as the layers go deeper and resolutions become lower.
Li et al. [66] proposed DetNet-59 based on ResNet-50. It has 6 stages, with the first 4 stages the same as in ResNet-50. In stages 5 and 6, the spatial resolution is fixed instead of decreasing. A fixed resolution means that a convolution filter has a smaller receptive field compared to one applied to a lower-resolution feature map. A dilated [67] bottleneck, as shown in Figure 3(b), is used for compensation. In this paper, we apply the DetNet backbone to the Faster RCNN detector.
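How dilation compensates for the lost receptive field can be seen from the standard formula for the extent of a dilated kernel, k + (k − 1)(d − 1). The sketch below evaluates it; the dilation rates shown are illustrative and are not claimed to be the exact DetNet configuration.

```python
def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Spatial extent along one axis covered by a dilated convolution kernel."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


for d in (1, 2, 4):
    extent = effective_kernel_size(3, d)
    print(f"3x3 kernel, dilation {d}: covers {extent}x{extent} pixels")
```

A dilated 3 × 3 kernel thus sees a wider neighborhood with the same nine weights, which is why the resolution can stay fixed without shrinking the receptive field.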
RefineDet [68]: RefineDet is an SSD-based detector that aims to overcome three limitations of single-stage detectors compared to two-stage ones. The architecture of RefineDet is shown in Figure 4. It consists of three modules: the anchor refinement module (ARM), the transfer connection block (TCB), and the object detection module (ODM). As in SSD, ARM takes feature maps from different layers. From each layer, it produces coarsely adjusted anchors and binary class scores (object and non-object classes). The anchors with a non-object score greater than a certain threshold θ are filtered out, which reduces the class imbalance. The TCB is then designed to combine features from deeper layers with the current-level ARM features by element-wise addition. Deconvolution is used to facilitate the addition by increasing the resolution of the deeper-layer feature maps to match the shallow features. As a result, the shallow layers will have semantic information. By taking the filtered anchors from ARM and the feature maps produced by TCB, ODM regresses the already refined anchors and generates multi-class scores. The results are improved because the input of ODM contains multi-level information and the predicted bounding boxes are refined in two steps.
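The anchor filtering step performed by ARM can be sketched in a few lines. The threshold θ = 0.99 is the value we recall from the RefineDet paper and should be treated as an assumption; the anchor labels and scores below are made up for illustration.

```python
def filter_anchors(anchors, non_object_scores, theta=0.99):
    """Keep anchors whose non-object (background) score does not exceed theta."""
    return [a for a, s in zip(anchors, non_object_scores) if s <= theta]


# Hypothetical anchors with their predicted non-object scores.
anchors = ["anchor_0", "anchor_1", "anchor_2", "anchor_3"]
scores = [0.10, 0.995, 0.50, 0.999]
print(filter_anchors(anchors, scores))
```

Only the confidently-background anchors are discarded, so ODM trains on a much less imbalanced set of candidates.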
ATSS [69]: ATSS (Adaptive Training Sample Selection) investigates anchor-based and anchor-free object detectors and points out that the essential difference between them is how positive and negative samples are defined during training. For instance, the anchor-free detector FCOS [70] first finds positive candidate samples in each feature level and then selects the final positive candidates among all feature levels, while the anchor-based RetinaNet [65] uses the IoU (Intersection over Union) between pre-defined anchors and the ground truth bounding boxes to directly select the final positive samples among all feature levels [69]. Based on this analysis, ATSS automatically defines positive and negative candidates based on the statistical properties of the objects in the images.
For each object in the image, ATSS selects, on each feature level, the k anchor boxes whose centers are closest to the center of the ground-truth box. There are a total of k × L candidate positives if the number of feature pyramid levels is L. Then the IoU between these candidates and the ground truth is calculated, along with its mean $m_g$ and standard deviation $v_g$, so that the IoU threshold is obtained as $t_g = m_g + v_g$. Finally, the candidates whose IoU is larger than or equal to the threshold and whose centers are inside the ground-truth box are selected as the final positive samples. ATSS thus introduces a mechanism that dynamically selects the positive and negative samples and bridges the gap between anchor-based and anchor-free approaches.
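The selection procedure described above can be sketched for a single ground-truth box as follows. This is a simplified illustration, not the reference implementation: boxes are plain (x1, y1, x2, y2) tuples, the population standard deviation is used for $v_g$ (an assumption; a library implementation may use the sample estimate), and the default k = 9 is assumed.

```python
from statistics import mean, pstdev


def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union


def center(box):
    return ((box[0] + box[2]) / 2, (box[1] + box[3]) / 2)


def atss_positives(gt, anchors_per_level, k=9):
    """Return (positive anchors, IoU threshold t_g) for one ground-truth box."""
    gcx, gcy = center(gt)
    candidates = []
    for anchors in anchors_per_level:
        # On each pyramid level, take the k anchors whose centers are closest to the object.
        by_distance = sorted(
            anchors,
            key=lambda a: (center(a)[0] - gcx) ** 2 + (center(a)[1] - gcy) ** 2,
        )
        candidates.extend(by_distance[:k])
    ious = [iou(a, gt) for a in candidates]
    threshold = mean(ious) + pstdev(ious)  # t_g = m_g + v_g
    positives = []
    for anchor, value in zip(candidates, ious):
        cx, cy = center(anchor)
        inside = gt[0] < cx < gt[2] and gt[1] < cy < gt[3]
        if value >= threshold and inside:
            positives.append(anchor)
    return positives, threshold


gt_box = (0, 0, 10, 10)
levels = [[(0, 0, 10, 10), (5, 5, 15, 15), (20, 20, 30, 30)]]
print(atss_positives(gt_box, levels, k=3))
```

Because the threshold is computed per object from its own candidates, no fixed IoU cutoff has to be tuned by hand.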

Dataset Build
The performance of a CNN model is highly dependent on the dataset. During training, a CNN model learns from a large number of examples how to extract semantic features, on which localization and classification are based. Therefore, CNN detectors perform better when the dataset consists of representative examples of all categories, for example, images that are taken from different viewpoints, under various illumination conditions, at multiple sizes, etc. The more representative the dataset is, the more likely the CNN models are to learn meaningful features for detection and classification. Then, at inference time, the trained CNN models will generalize their feature extraction better to new input images.
In the research community, there are several small collections of endoscopic video datasets for different research purposes, such as the MICCAI 2017, Gastrointestinal Lesions in Regular Colonoscopy Data Set (GLRC) [71], and CVC colon DB [34] datasets. However, after careful observation and analysis, we found that these datasets differ greatly from each other in terms of resolution and color temperature, as shown in Figure 5. This is largely due to the setups and characteristics of the different imaging equipment used for data collection. As pointed out in [72], two of the main reasons why current CNN models perform worse in the real world than on benchmark test sets are the variance in image backgrounds and image quality. If we train the models using only one of these datasets, the models may have poor generalization ability, and their performance will suffer when applied to colonoscopy images from different devices in another medical facility, as demonstrated in the Experiments section and the Results and Analysis section. More recently, several large colonoscopy datasets have been published [73,74], like Hyper-Kvasir [75] and Kvasir-SEG [76]. Hyper-Kvasir is a general-purpose dataset for gastrointestinal endoscopy. It covers 23 different classes of findings in its images and videos, including polyps, angiectasia, Barrett's, etc. [75]. However, it does not provide the hyperplastic and adenomatous classification. Similarly, Kvasir-SEG provides labels in segmentation format. Thus, neither can be used to train detection models to predict polyp categories.
Another big limiting factor is the lack of distinct training examples. Although the available datasets seem to have many images, these images are actually extracted from a small number of video sequences. Each endoscopic video sequence only contains a single polyp viewed from different viewpoints. If we inspect the polyp frame by frame, we can see that most of the frames are taken from an almost identical viewpoint and distance, as shown in Figure 6. Some video sequences do not have noticeable movement across 1000 frames. Thus, there are significant redundancies in these datasets, especially for polyp classification, which requires a large collection of distinct videos (polyps) to train the classifier. Considering that recent benchmark datasets like MS COCO [77] have over 300k distinct images, more colonoscopy data are needed to achieve reasonable performance.
In order to make the best use of the recent developments in deep learning technologies for object detection, we collected and created an endoscopic dataset and compared the performance of the state-of-the-art detectors for polyp detection and classification. The source datasets come from various sources and serve different purposes, as will be discussed in the following subsection. To integrate them, we follow the PASCAL VOC [19] object detection task to standardize the annotation. The dataset contains only two categories of polyps: hyperplastic and adenomatous polyps. It is important to train a model that can reliably differentiate them, since adenomatous polyps are commonly considered precancerous lesions that require resection while hyperplastic polyps are not [71].
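For readers unfamiliar with PASCAL VOC style annotations, the following sketch writes one frame's label as a VOC-like XML record using only the Python standard library. The file name, image size, class string, and box coordinates are hypothetical, and the exact fields present in the released annotation files may differ.

```python
import xml.etree.ElementTree as ET


def voc_annotation(filename, width, height, objects):
    """Build a PASCAL VOC style annotation string.

    objects is a list of (class_name, (xmin, ymin, xmax, ymax)) tuples.
    """
    root = ET.Element("annotation")
    ET.SubElement(root, "filename").text = filename
    size = ET.SubElement(root, "size")
    for tag, value in (("width", width), ("height", height), ("depth", 3)):
        ET.SubElement(size, tag).text = str(value)
    for label, (xmin, ymin, xmax, ymax) in objects:
        obj = ET.SubElement(root, "object")
        ET.SubElement(obj, "name").text = label
        box = ET.SubElement(obj, "bndbox")
        for tag, value in (("xmin", xmin), ("ymin", ymin), ("xmax", xmax), ("ymax", ymax)):
            ET.SubElement(box, tag).text = str(value)
    return ET.tostring(root, encoding="unicode")


# Hypothetical frame containing one adenomatous polyp.
xml_text = voc_annotation("frame_0001.jpg", 640, 480, [("adenomatous", (120, 80, 260, 210))])
print(xml_text)
```

Using one common record format lets frames originating from different source datasets be loaded by the same training pipeline.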

Datasets Selection and Annotation
In this study, we have collected all publicly available endoscopic datasets in the research community, as well as a new dataset from the University of Kansas Medical Center. All datasets are deidentified and do not reveal patient information. With the help of three endoscopists, we annotated the polyp classes of all collected video sequences and the bounding boxes of the polyp in every frame. Below is an introduction to each dataset.
MICCAI 2017: This dataset is designed for Gastrointestinal Image ANAlysis (GIANA), a sub-challenge of the Endoscopic Vision Challenge [78].
CVC colon DB: The dataset has 15 short colonoscopy videos with a total of 300 frames [34]. The labels are in the form of segmentation masks, and there are no classification labels. We extracted the bounding boxes and labeled the polyp class.
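Extracting a bounding box from a segmentation mask amounts to finding the extreme rows and columns that contain foreground pixels. The sketch below illustrates the idea on a binary mask stored as nested lists; it is an assumed, simplified version of this conversion and handles only a single polyp per mask.

```python
def mask_to_bbox(mask):
    """Return (xmin, ymin, xmax, ymax) in inclusive pixel indices, or None if the mask is empty."""
    rows = [y for y, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    cols = [x for row in mask for x, value in enumerate(row) if value]
    return (min(cols), rows[0], max(cols), rows[-1])


# Tiny hypothetical mask: 1 marks polyp pixels, 0 marks background.
mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
]
print(mask_to_bbox(mask))
```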
GLRC Dataset: The Gastrointestinal Lesions in Regular Colonoscopy Dataset (GLRC) contains 76 short video sequences with class labels [71].There is no label for polyp location.We manually annotated the bounding box of each polyp frame by frame.
KUMC Dataset: The dataset was collected from the University of Kansas Medical Center.It contains 80 colonoscopy video sequences.We manually labeled the bounding boxes as well as the polyp classes for the entire dataset.
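For datasets such as CVC colon DB, where the labels are segmentation masks, a bounding box can be derived as the tightest rectangle around the foreground pixels. The following is a minimal sketch of that idea, not the authors' actual tooling; the function name `mask_to_bbox` and the nested-list mask layout are assumptions made for illustration, and it assumes one polyp per frame.

```python
def mask_to_bbox(mask):
    """Return (xmin, ymin, xmax, ymax) around the nonzero pixels of a mask.

    `mask` is a 2-D list of rows (0 = background, nonzero = polyp).
    Coordinates are 0-based and inclusive (an assumption; some annotation
    tools use 1-based coordinates). Returns None for an empty mask.
    """
    # Rows that contain at least one foreground pixel.
    rows = [y for y, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    # Column indices of every foreground pixel.
    cols = [x for row in mask for x, value in enumerate(row) if value]
    return (min(cols), rows[0], max(cols), rows[-1])


example_mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
]
print(mask_to_bbox(example_mask))
```

The resulting box can then be stored together with the class label assigned by the endoscopists.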

Frame Selection
The video sequences from these datasets contain different numbers of frames. For example, CVC colon DB has only 300 frames in total, averaging 20 frames per video sequence, while the number of frames in MICCAI 2017 varies from 400 to more than 1000 with a median value of around 300 in each sequence. This extreme imbalance among different lesions would reduce the representativeness of the dataset. In addition, many frames in a long sequence are redundant since they are taken with very small camera movement. To prevent long videos from overwhelming the others, we adopt an adaptive sampling rate that extracts frames from each video sequence based on the camera movement and the video length, which reduces the redundancy and homogenizes the representativeness of each polyp. After sampling, we keep around 300 to 500 frames for long sequences to maintain a balance among different sequences, while for short sequences such as those in CVC colon DB, we simply keep all image frames.
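The exact sampling rule is not specified here, so the following is only a sketch of one way to sample adaptively by camera movement and sequence length. It assumes a per-frame motion score is already available (for example, a mean absolute difference to the previous frame); the name `adaptive_sample`, the `motion` input, and the thresholding rule are all hypothetical.

```python
def adaptive_sample(motion, target):
    """Pick at most `target` frame indices, spaced by accumulated motion.

    motion[i] is a non-negative estimate of camera movement between
    frame i-1 and frame i (hypothetical input). `target` must be >= 1.
    """
    n = len(motion)
    if n <= target:
        return list(range(n))            # short sequence: keep every frame
    total = sum(motion)
    if total == 0:
        step = n / target                # static camera: uniform fallback
        return [int(k * step) for k in range(target)]
    step = total / target                # motion "budget" per kept frame
    picked, accumulated, next_threshold = [], 0.0, 0.0
    for i, m in enumerate(motion):
        accumulated += m
        if accumulated >= next_threshold and len(picked) < target:
            picked.append(i)
            # Skip every threshold already passed by a large jump.
            while next_threshold <= accumulated:
                next_threshold += step
    return picked


print(adaptive_sample([0, 0, 0, 0, 10, 0, 0, 0, 0, 10], target=2))
```

With this rule, runs of near-identical frames contribute a single sample, while stretches with real camera movement are sampled more densely.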
After extracting all frames, we carefully checked the generated dataset and manually removed frames that contain misleading or uninformative content. For example, when there is a sharp movement of the camera, the captured images may be severely blurred, out of focus, or subject to significant illumination change, as shown in Figure 7. These images cannot be accurately labeled, so they are removed. Some less flawed frames are kept, however, to improve the model's robustness under imperfect and noisy conditions.
Polyp classification by visual examination alone is a big challenge. As reported in [71], the accuracy is normally below 70% even for experienced endoscopists. In clinical practice, the results have to be confirmed by further biopsy tests. However, since we only have video sequences, when the endoscopists could not reach an agreement on the classification of a sequence, we simply removed that sequence from the dataset. Otherwise, the models might not learn the correct information for classification. Eventually, the dataset contains 155 video sequences (37,899 image frames) with the labeled ground truth of the polyp classes and bounding boxes.

Dataset Split
To train and evaluate the performance of different learning models, we need to divide the combined dataset into training, validation, and test sets. For most benchmark datasets for generic object detection, the split is normally based on images. However, this does not apply to the endoscopic dataset. Because all frames in one video sequence correspond to the same polyp, if we split the dataset at the image level, then the same polyp will simultaneously appear in the training, validation, and test sets. This will falsely increase the classification performance since the models have already seen the polyps to be tested during the training stage. Therefore, we split the dataset at the video level.
The final dataset is combined from four different datasets captured by different equipment with different data distributions. To increase the representativeness of the dataset, as well as the balance between the two polyp classes, we make the division for each dataset and polyp class independently. For each class in one dataset, we randomly select 75%, 10%, and 15% of the sequences to form the training, validation, and test sets, respectively. For example, the GLRC [71] has 41 videos, with 26 adenomatous and 15 hyperplastic sequences. We split the 26 adenomatous sequences and the 15 hyperplastic sequences independently according to the same ratio to guarantee the class balance in the final dataset.
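The video-level, per-dataset and per-class split described above can be sketched as follows. This is an illustrative sketch rather than the authors' script: the function name `split_by_video`, the tuple layout, the fixed seed, and the rounding rule for small groups are assumptions, so the resulting counts need not match the published split.

```python
import random
from collections import defaultdict


def split_by_video(videos, ratios=(0.75, 0.10, 0.15), seed=0):
    """Split video IDs into train/val/test within each (dataset, class) group.

    `videos` is a list of (video_id, dataset_name, polyp_class) tuples.
    All frames of a video follow its ID, so no polyp is shared across sets.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for video_id, dataset, polyp_class in videos:
        groups[(dataset, polyp_class)].append(video_id)

    splits = {"train": [], "val": [], "test": []}
    for key in sorted(groups):
        ids = sorted(groups[key])
        rng.shuffle(ids)
        n_val = round(len(ids) * ratios[1])    # rounding rule is an assumption
        n_test = round(len(ids) * ratios[2])
        n_train = len(ids) - n_val - n_test
        splits["train"] += ids[:n_train]
        splits["val"] += ids[n_train:n_train + n_val]
        splits["test"] += ids[n_train + n_val:]
    return splits


# The GLRC example from the text: 26 adenomatous and 15 hyperplastic videos.
glrc = [(f"ad{i}", "GLRC", "adenomatous") for i in range(26)]
glrc += [(f"hp{i}", "GLRC", "hyperplastic") for i in range(15)]
glrc_splits = split_by_video(glrc)
print({name: len(ids) for name, ids in glrc_splits.items()})
```

Splitting on video IDs rather than frames is what prevents the same polyp from appearing in more than one set.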
In summary, we have generated 116 training, 17 validation, and 22 test sequences, with 28,773, 4,254, and 4,872 frames, respectively. Some sample frames from the dataset are shown in Figure 8. For the training set, we combine all frames from the 116 sequences into one folder and shuffle them. For the validation and test sets, we keep the sequence split in order to evaluate the model performance based on polyps (i.e., sequences). The details of the dataset organization are shown in Table 1. The dataset can be accessed at https://doi.org/10.7910/DVN/FCBUOR.

Experiments
Using the generated dataset, we evaluated eight state-of-the-art object detection models: Faster RCNN [52], YOLOv3 [56], SSD [64], RetinaNet [65], DetNet [66], RefineDet [68], YOLOv4 [59], and ATSS [69]. To set the benchmark performance, three different experiment setups are tested: frame-based two-class polyp detection, frame-based one-class polyp detection, and sequence-based two-class polyp classification. The performance of the two frame-based detection setups is measured using regular object detection metrics. For the sequence-based classification, regular detection models are applied to each frame, and a voting process then picks the most frequently predicted polyp category as the final classification result. More specific details are presented below.
The eight detection models were mostly proposed for generic object detection tasks, where they perform well. They are adopted from the originally published setups, with slightly modified hyperparameters to optimize their performance on the polyp dataset. The hyperparameter setups are listed in Table 2. We employ the following three metrics to evaluate the performance of each model: precision, recall, and F-score.

$$\mathrm{Precision} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}}$$
• Precision measures the percentage of predictions that are correct. In polyp detection, it indicates the confidence in the prediction when a positive detection occurs. Higher precision reduces the chance of a false alarm, which would cause financial and mental stress for the patient.
• Recall is the fraction of the objects that are detected. It is very important in polyp detection since a higher recall ensures that more patients receive a further check and appropriate treatment in time. It can also reduce mortality and prevent excessive cost to patients.
• F-score takes both precision and recall into consideration. It measures how well a model balances false positives and false negatives.
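The three metrics follow directly from the counts of true positives (TP), false positives (FP), and false negatives (FN): precision is TP / (TP + FP), recall is TP / (TP + FN), and the F1 score is the harmonic mean of the two. The sketch below assumes these counts have already been obtained by matching predictions to ground-truth boxes; the matching rule itself is not part of the sketch, and the function name is illustrative.

```python
def detection_metrics(tp, fp, fn):
    """Return (precision, recall, f1) from raw detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts, for illustration only.
precision, recall, f1 = detection_metrics(tp=80, fp=20, fn=20)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```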

Frame-based Two-class Polyp Detection
This experiment predicts polyps for individual frames. It tests a model's localization and classification ability. The CNN models are trained using our training set, which consists of a mix of frames from different video sequences. During the validation and test phases, we treat each frame individually and evaluate the performance.
State-of-the-art CNN detectors are fast enough to run in real time. This allows them to help endoscopists find lesions and to provide category suggestions during colonoscopy. As human operators may suffer from fatigue and loss of focus after long hours of work, this automated process could alert and assist the endoscopists to focus on suspected lesions and avoid missed detections.
To test the effectiveness of the proposed combined dataset relative to a single dataset mentioned above, we also perform the frame-based two-class detection using a single training dataset. In this controlled experiment, we train all the models using only the KUMC dataset. Since this dataset contains more sequences and video frames than the other datasets, it guarantees the convergence of all involved models. After training, we test the models on the same combined test set as in the other experiments. As shown in the results, the performance of all models drops significantly when they are trained using only a single dataset. This experiment verifies the effectiveness of the combined dataset.

Frame-based One-class Polyp Detection
This experiment has almost the same setup as the frame-based two-class polyp detection except for the number of classes. Hyperplastic and adenomatous polyps are treated as a single polyp class. Instead of providing a separate set of annotation files, we read the same ground truth as in the previous experiment and discard the polyp category information during training and inference.
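Discarding the category information amounts to relabeling every ground-truth box with one shared class while leaving the box coordinates untouched. A minimal sketch, assuming each annotation has been parsed into a dictionary with a `name` field and a `bbox` field (a hypothetical in-memory layout, not the files shipped with the dataset):

```python
def to_one_class(annotations, label="polyp"):
    """Return copies of the annotations with every class name set to `label`."""
    return [{**annotation, "name": label} for annotation in annotations]


two_class = [
    {"name": "adenomatous", "bbox": (10, 20, 110, 140)},
    {"name": "hyperplastic", "bbox": (5, 5, 50, 60)},
]
one_class = to_one_class(two_class)
print(one_class)
```

Because the function returns copies, the same two-class ground truth can still be reused unchanged for the other experiments.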
In colorectal cancer screening, it is more important to accurately detect whether polyps are present than to classify polyp categories, because further screening and diagnosis always follow once colonoscopy finds suspected lesions. This experiment aims to test whether a higher performance could be achieved by only localizing polyps in general. Without the more challenging task of classifying polyp categories, CNN models could be trained to extract more generalized features to distinguish polyps. Methods with higher precision, such as biopsy or polypectomy, could then follow to determine the categories of lesions.

Sequence-based Two-class Polyp Classification
This experiment adopts the same setup as the frame-based two-class tests. However, we make only one prediction for each video sequence, since every frame in a sequence shows the same polyp. During the test period, the model first generates a prediction for each individual frame. We then collect the results from every frame of a video sequence and classify the video as the most frequently predicted polyp category. Although there may be better ways to classify a video sequence, such as weighting by the confidence score of the prediction for each frame, we adopt only this basic approach as a benchmark to see how much improvement we can achieve with sequence-based prediction.
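The voting step can be sketched as a simple majority count over the per-frame class predictions. The sketch assumes each frame has already been reduced to a single class label (or `None` when nothing is detected); how that per-frame label is chosen, and how ties are broken, are assumptions not stated in the text.

```python
from collections import Counter


def classify_sequence(frame_predictions):
    """Return the most frequently predicted class for one video sequence.

    `frame_predictions` holds one class label per frame, or None for frames
    without a detection. Ties go to the class seen first (an assumption).
    """
    votes = Counter(p for p in frame_predictions if p is not None)
    if not votes:
        return None
    return votes.most_common(1)[0][0]


frames = ["adenomatous"] * 3 + ["hyperplastic"] * 2 + [None]
print(classify_sequence(frames))
```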
Sequence-based classification reflects clinical practice, since all frames in a sequence observe the same polyp from different viewpoints. It also has the potential to achieve better performance. Classifying a polyp from a single frame is difficult. For example, the polyp may be partly occluded in some frames or appear small when viewed from a far distance. All these scenarios make accurate classification hard. In video-based classification, however, we combine information from different viewpoints, which can reduce the influence of those hard frames. For the same reason, at the clinic, the endoscopist usually records the colonoscopic video from various viewpoints to ensure a reliable classification of the polyps.

Results and Analysis
In the experiments, the frame-based and sequence-based two-class detection and classification share the same CNN model. All hyperparameters for the compared models are summarized in Table 2. The final models chosen for the test are selected based on validation performance. Precision, recall, and F1 scores are all calculated at a confidence threshold of 0.5 to ensure a fair comparison. The best-performing CNN models are mostly produced before epoch 10. An exception is RefineDet in one-class detection, whose chosen model comes from 130k iterations, or around 45 epochs. However, it had already achieved a similar validation performance (88.05% mAP) at 30k iterations, compared with 88.12% at 130k iterations. This suggests that the best CNN model for polyp detection is usually obtained at an early training stage.

Frame-based Two-class Polyp Detection
The results are shown in Table 3. Overall, all detectors achieve better performance for adenomatous polyps since they are larger in size and their shape and texture are easier to distinguish from the colonic wall. RefineDet achieves the best combined performance, with the highest mean F1 score, mAP, and mean recall among all models. YOLOv3 yields the best precision by sacrificing its recall, which is abnormally lower than that of the other detectors. Figure 9 shows some examples of the detection results. We pick a confidence threshold of 0.5. As shown in the examples, the models are very confident about the predictions. They mostly have only one prediction with a confidence score over 0.5 on each frame. The predicted bounding boxes are very tight and precise on the lesions, which shows great potential in assisting colonoscopy practice.
To analyze the difference in recall between YOLOv3 and the other detectors at the confidence threshold of 0.5, we have plotted the count of true positives (TP) and false positives (FP) over different confidence scores. In Figure 10, we only show the charts from RefineDet and YOLOv3 since RefineDet has similar patterns to the other four detectors. RefineDet and the other detectors show a clear maximum peak for TP count at confidence > 0.9 and another weaker peak for confidence < 0.1, whereas YOLOv3 has fewer predictions with high confidence. Therefore, although YOLOv3's conservative predictions have high accuracy, it misses a large proportion of lesions, which results in its low recall. SSD yields the best recall, F1 score, and AP value for adenomatous polyp detection. Overall, its mAP (67.6%) ranks third, closely matching the most recent detector, ATSS, and leading the next detector, Faster RCNN (mAP of 57.7%), by a considerable margin. For the harder task of hyperplastic polyp detection, RefineDet yields the highest scores for recall, F1, and AP. These results show that the SSD-based detectors (SSD, RetinaNet, and RefineDet) generally do well in detecting polyps. RefineDet, by roughly adjusting anchors first, obtains better localization knowledge before generating final predictions. Faster RCNN has a similar two-step architecture and therefore also has decent performance. This indicates the possibility to improve polyp detection performance by adding more refined location information before making

Generalizability and Comparison with Previous Dataset
Generalization ability refers to how well the trained models adapt to new, previously unseen data. This is crucial in practical applications since the test images may have different distributions from the ones used to create the model. To test whether the newly generated dataset can increase the generalizability of the trained models, we compare our results with models trained on only a single dataset. We conduct the frame-based two-class polyp detection using a single training dataset, the KUMC dataset. The models are trained using the images from KUMC and tested on the full combined test set used in the other experiments, which consists of frames from different datasets. The results of different models are shown in Table 4. On average, the performance drops by 8% compared with the results in Table 3, where all models are trained using the proposed dataset. The performance drop is mainly caused by the reduced representativeness and number of training samples. Although KUMC contains more varied sequences and frames than the other datasets combined, the color and illumination of different datasets may differ greatly, as shown in Figure 5. Therefore, models trained on a single dataset may generalize poorly.

Frame-based One-class Polyp Detection
The results for detection without classification are shown in Table 5. We can see that YOLOv3 achieves the highest precision among all detectors, which is consistent with the two-class results. With a reasonable recall, it also yields a high F1 score. Compared with its two-class detection performance, this is evidence that YOLOv3 is better at detecting than at classifying polyps. YOLOv3 generates classification scores and bounding box adjustments at the same time. Since classification performance is based on the anchor information, YOLOv3's original anchors might not contain sufficient portions of a polyp due to its small size. We suggest that refined location information is more important for distinguishing polyp categories than for locating them.
Table 6 shows the detailed localization results for adenomatous and hyperplastic polyps. We can see that Faster RCNN achieves the best recall, which is the most important metric in clinical settings. For adenomatous polyps, Faster RCNN achieves 93.3% recall, on a par with recent clinical screening results. It is one of only three detectors (with RefineDet and ATSS) that achieve over 80% recall for the hyperplastic polyps. In the two-class detection above, Faster RCNN also achieves one of the top three recall scores. Thanks to the region proposals, two-stage detectors usually have more chances to detect the polyps. YOLOv3 also achieves competitive performance in one-class detection, yielding the highest precision with a reasonable recall score.
RefineDet still yields the best overall performance, with the highest F1 score and AP. We also evaluated the inference time of different models in frame-based one-class detection. All models are evaluated on an NVIDIA Tesla P100 GPU. As shown in Table 5, the single-stage detectors (SSD, YOLOv3, RetinaNet, and RefineDet) are faster than the two-stage detectors (Faster RCNN and DetNet). SSD and YOLOv3 achieve the fastest inference time of 17 ms per frame, which corresponds to roughly 59 frames per second (fps). Even the slowest model, DetNet, takes 64 ms per frame, which is still above 15 fps. Please note that a deeper backbone network requires more inference time than a shallower one. For example, RetinaNet with ResNet-50 takes 61 ms, compared with 17 ms for SSD with VGG-16.
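Per-frame inference time converts to throughput as fps = 1000 / latency in milliseconds. The short sketch below applies this to the three latencies quoted above; the helper name is illustrative.

```python
def fps_from_latency_ms(latency_ms):
    """Convert a per-frame inference time in milliseconds to frames per second."""
    return 1000.0 / latency_ms


for model, latency_ms in [("SSD / YOLOv3", 17), ("RetinaNet", 61), ("DetNet", 64)]:
    print(f"{model}: {latency_ms} ms -> {fps_from_latency_ms(latency_ms):.1f} fps")
```

By this conversion, 17 ms gives about 58.8 fps, 61 ms about 16.4 fps, and 64 ms about 15.6 fps.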

Sequence-based Two-class Polyp Classification
From Table 7, we can see that SSD, DetNet, and YOLOv4 stand out in terms of precision, recall, and F1 score. This means that they are better at predicting the correct polyp categories. Another interesting observation is that, although some detectors produce more consistent results for different frames in the same sequence, they do not yield higher precision. This becomes obvious when we plot the percentage of the dominant predicted category in each video sequence in Figure 11. We only show the plots for RetinaNet and RefineDet as examples. DetNet, Faster RCNN, and RetinaNet are not very consistent in predicting the polyp class for some of the video sequences, with the dominant class close to 50%. This means the predictions are not robust, since only a few frames are enough to swing the result. RefineDet, SSD, YOLOv4, and ATSS, on the other hand, are relatively more robust in predicting the polyp class, with most sequences above 70%.
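The consistency measure discussed above, the share of frames assigned to the dominant class of a sequence, can be computed as sketched below. The function name and the list-of-labels input are illustrative assumptions, and every frame is assumed to carry exactly one predicted label.

```python
from collections import Counter


def dominant_share(frame_predictions):
    """Return (dominant_class, fraction of frames predicted as that class)."""
    if not frame_predictions:
        return None, 0.0
    label, count = Counter(frame_predictions).most_common(1)[0]
    return label, count / len(frame_predictions)


# A robust sequence versus one where a few frames could swing the vote
# (hypothetical predictions, for illustration only).
robust = ["adenomatous"] * 7 + ["hyperplastic"] * 3
fragile = ["adenomatous"] * 51 + ["hyperplastic"] * 49
print(dominant_share(robust))
print(dominant_share(fragile))
```

A share close to 0.5 marks a sequence whose final class depends on only a handful of frames, while a share above 0.7 indicates a more stable prediction.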

Conclusion
In this paper, we have developed a relatively large endoscopic dataset for polyp detection and classification. We have also evaluated and compared the performance of eight state-of-the-art deep learning-based object detectors. Our results show that deep CNN models are promising in assisting CRC screening. Without much modification, general object detectors already achieve an adenomatous polyp detection sensitivity of 91% in the one-class detector and around 70% precision in the classification task. Among all the detectors we have tested, YOLOv4, ATSS, and RefineDet perform relatively well in all tests, with balanced precision and recall scores and consistent results for the same lesions. Our experiments also show that refining the location information before classification effectively boosts the performance.
This study can serve as a baseline for future research in polyp detection and classification. The developed dataset can serve as a standardized platform and help

Fig 1. Faster R-CNN structure. The region proposal network (RPN) shares the same base CNN with a Fast R-CNN network. The region proposals are generated by sliding a small convolutional network over the shared feature maps, and these proposals are used to produce the final detection results.

Fig 2. SSD structure. The base network is truncated from a standard network. The detection layer computes confidence scores for each class and offsets to the default boxes.

Fig 3. DetNet structure. The diagram shows the basic building blocks of ResNet [55] and DetNet [66]. (a) After each ResNet block, the resolution is reduced by half. (b) DetNet preserves the feature map resolution and increases the receptive field by using dilated convolutions.

Fig 5. Sample frames from different colonoscopy datasets. (a) has a higher resolution and a warm color temperature; (b) has a lower resolution and a green tone; (c) is more natural in color tone but has a transparent cover around the frame edges.

Fig 6. A colonoscopy sequence. From frame 1 to frame 146, the camera shows no noticeable movement.

Fig 7. Some bad examples of colonoscopy frames.

Fig 8. Six sample frames from the generated dataset.

Fig 9. Three examples of the detection results with the predicted classes and confidence scores.

Fig 10. True positive (green plot) and false positive (red plot) count w.r.t. confidence. We discard any predictions with a confidence score below 0.01 since they tend to be random predictions.
All SSD-based detectors perform almost equally well. The focal loss of RetinaNet does not show significant improvement over the original SSD model. DetNet does not show

Fig 11. Percentage of the dominant class. Detectors predict the polyp category in each individual frame. The category assigned to more than 50% of all frames is the dominant category for that video sequence. The charts show the percentage of frames classified as the dominant class in each test sequence. The labels (ad) and (hp) at the bottom denote the ground truth classes adenomatous and hyperplastic, respectively. Correct predictions are in green and misclassifications are in red.

Table 3. Results for Frame-based Two-class Polyp Detection

Table 4. Results for training on KUMC and testing on the full combined test set

Table 5. Results for Frame-based One-class Polyp Detection

Table 6. Frame-based One-class Polyp Detection Results for each Class

Table 7. Results for Sequence-based Two-class Polyp Classification
Faster RCNN, however, it makes the detector more balanced by increasing the precision by 20%+, resulting in a better F1 score.