Comparison of diagnostic performance between convolutional neural networks and human endoscopists for diagnosis of colorectal polyp: A systematic review and meta-analysis

Prospective randomized trials and observational studies have revealed that early detection, classification, and removal of neoplastic colorectal polyps (CPs) significantly improve the prevention of colorectal cancer (CRC). However, the current diagnostic performance of colonoscopy remains unsatisfactory, with unstable accuracy. Convolutional neural network (CNN) systems based on artificial intelligence (AI) have demonstrated the potential to help endoscopists increase diagnostic accuracy. Nonetheless, the CNN system has several limitations, and controversy remains over whether it provides better diagnostic performance than human endoscopists. Therefore, this study sought to address this issue. Online databases (PubMed, Web of Science, Cochrane Library, and EMBASE) were searched for studies published up to April 2020. The quality assessment of diagnostic accuracy studies-2 (QUADAS-2) tool was used to evaluate the quality of the enrolled studies, and publication bias was assessed using the Deeks' funnel plot. In total, 13 studies published between 2016 and 2020 were enrolled in this meta-analysis. The CNN system showed satisfactory diagnostic performance for CP detection (sensitivity: 0.848 [95% CI: 0.692-0.932]; specificity: 0.965 [95% CI: 0.946-0.977]; AUC: 0.98 [95% CI: 0.96-0.99]) and CP classification (sensitivity: 0.943 [95% CI: 0.927-0.955]; specificity: 0.894 [95% CI: 0.631-0.977]; AUC: 0.95 [95% CI: 0.93-0.97]). Compared with human endoscopists in the field of CP classification, the CNN system was comparable to the expert but significantly better than the non-expert (CNN vs. expert: RDOR: 1.03, P = 0.9654; non-expert vs. expert: RDOR: 0.29, P = 0.0559; non-expert vs. CNN: RDOR: 0.18, P = 0.0342). Therefore, the CNN system exhibited satisfactory diagnostic performance for CP and could serve as a potential clinical diagnostic tool during colonoscopy.

All article sections were carefully reviewed. Subsequently, bibliographies of the retrieved articles were screened to identify any potential source of relevant studies.

Study selection
The inclusion criteria were: (1) studies that included patients with CP; (2) colonoscopy was performed to detect or classify colorectal polyps; (3) a CNN system was applied to improve the diagnostic performance of colonoscopy; (4) precise diagnostic data were presented in the article; and (5) if the colorectal polyps were classified, the final pathology results were provided. The exclusion criteria were: (1) abstracts, reviews, letters, comments, and case reports; (2) articles in which precise data were unavailable; and (3) animal studies and non-English publications.

Data extraction and quality assessment
A total of 2 independent researchers (Ye and Xi) extracted data from the included studies. The extracted information included the first author's name, publication year, country, diseases concerned, training material, testing material, types of diagnostic performance, and the diagnostic performance of the CNN system, experts, and non-experts. The diagnostic performance was categorized as true-positive (TP), false-positive (FP), true-negative (TN), and false-negative (FN) values retrieved from each article. Any inconsistency between the 2 reviewers (Ye and Xi) was resolved through discussion with a third investigator.
Renner et al. [15] provided data for "standard-confidence predictions" and "high-confidence predictions". Because including both might introduce duplication of data, only the "standard-confidence predictions" were included after careful consideration. Guo et al. [16] provided both per-frame and per-video data, and the per-frame data were selected for two reasons: first, nearly all of the articles enrolled for analysis used colonoscopy images instead of videos, so selecting the per-frame data ensured the consistency of the whole analysis; secondly, the authors did not provide enough per-video data for analysis. Additionally, Wang et al. [12] used 4 datasets to validate the diagnostic performance of the CNN system, but precise data were provided only for Dataset A; as a result, this study included only the data of Dataset A. Kudo et al. [17] provided both white-light imaging (WLI) and narrow-band imaging (NBI) data for each lesion and tested the CNN system in the different imaging modes. Because including both WLI and NBI images might cause duplication of data, the NBI data were excluded. Finally, Renner et al. [15], Kudo et al. [17], and Ozawa et al. [18] reported data on diminutive CPs. We considered it inappropriate to add these to the general analysis because of the potential risk of duplication of data. We initially intended to perform a subgroup analysis on them, but the STATA software could not perform any analysis with a sample size smaller than 4.
The methodological quality and applicability of the included studies were evaluated using the quality assessment of diagnostic accuracy studies-2 (QUADAS-2) tool [19].

Outcomes of interests
First, pooled sensitivity, specificity, and other diagnostic indices were calculated based on the values of TP, FP, TN, and FN for the CNN system, experts, and non-experts. Secondly, the diagnostic odds ratio (DOR) and the area under the summary receiver operating characteristic (SROC) curve (AUC), which represent overall diagnostic performance, were examined and compared among the different groups. Finally, to identify whether the differences in diagnostic performance were statistically significant, the relative diagnostic odds ratio (RDOR) was compared between each pair of groups (CNN system vs. expert; CNN system vs. non-expert; expert vs. non-expert).
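As a minimal illustration of how these per-study indices follow from the extracted 2x2 counts, the sketch below computes sensitivity, specificity, PLR, NLR, and DOR from TP/FP/TN/FN values. The counts used are invented for illustration and are not taken from any included study.

```python
def diagnostic_indices(tp, fp, tn, fn):
    """Standard diagnostic indices derived from a 2x2 contingency table."""
    sensitivity = tp / (tp + fn)           # true-positive rate
    specificity = tn / (tn + fp)           # true-negative rate
    plr = sensitivity / (1 - specificity)  # positive likelihood ratio
    nlr = (1 - sensitivity) / specificity  # negative likelihood ratio
    dor = plr / nlr                        # diagnostic odds ratio, equal to (tp*tn)/(fp*fn)
    return {"sens": sensitivity, "spec": specificity,
            "PLR": plr, "NLR": nlr, "DOR": dor}

# Illustrative (hypothetical) counts for one study
result = diagnostic_indices(tp=90, fp=5, tn=95, fn=10)
```

Under the criterion used in this review, a tool with PLR above 5 and NLR below 0.2, as in this hypothetical example, would be considered to have strong diagnostic value.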

Statistical analyses
Statistical analyses were performed to establish the diagnostic efficacy. The sensitivity, specificity, positive likelihood ratio (PLR), negative likelihood ratio (NLR), DOR, AUC of SROC, and RDOR were pooled with their 95% confidence intervals (CIs). A diagnostic tool was considered to have a strong diagnostic value if its PLR was above 5 and its NLR was below 0.2 [20]. The heterogeneity among studies was evaluated by Cochran's Q and Higgins' I² statistics [21]. If the value of I² was more than 50% and the value of P was less than 0.05, indicating statistically significant heterogeneity, a random-effects model was selected for pooling the data [22]; otherwise, a fixed-effects model was utilized.
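The heterogeneity assessment described above can be sketched as follows: Cochran's Q is the inverse-variance-weighted sum of squared deviations of the per-study effect estimates from the pooled estimate, and Higgins' I² = max(0, (Q - df)/Q). The effect estimates and variances below are made-up values for illustration, not data from the included studies.

```python
def cochran_q_and_i2(estimates, variances):
    """Cochran's Q and Higgins' I^2 (as a percentage) from per-study
    effect estimates (e.g. log-DORs) and their variances."""
    weights = [1.0 / v for v in variances]          # inverse-variance weights
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, estimates))
    df = len(estimates) - 1
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return q, i2

# Hypothetical log-DOR estimates and variances for 4 studies
q, i2 = cochran_q_and_i2([2.1, 3.4, 1.8, 4.0], [0.05, 0.10, 0.08, 0.12])
```

With these illustrative inputs, I² exceeds 50%, so under the rule above (together with a significant Q test) a random-effects model would be selected.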
The RDOR was compared between each pair of groups to identify statistically significant differences in diagnostic performance, based on multivariate meta-regression analysis [25,26]. The Deeks' funnel plot was used to assess publication bias.
Pooled sensitivity, specificity, accuracy, PLR, NLR, DOR, and AUC of SROC were calculated using Stata version 14.0. QUADAS-2 assessment was performed using Review Manager version 5.3. The result with a P-value of less than 0.05 (p<0.05) was considered statistically significant.

Search strategy
Following the initial search of the different databases, a total of 189 articles were identified (102 in PubMed, 31 in Web of Science, 12 in Cochrane Library, and 44 in EMBASE). First, 146 duplicate studies were removed and the remaining 43 articles were screened. In total, 15 articles that did not meet the inclusion criteria (non-English publications, reviews, abstracts, and case reports) were excluded. Subsequently, articles with imprecise data and irrelevant subjects were excluded after the full texts were assessed. Eventually, 13 studies were enrolled in this meta-analysis [12-18,27-32] (Fig 1). The PRISMA flow diagram and checklist are shown in S1 and S2 Tables, respectively.

Cohort characteristics and quality of included studies
Among the enrolled studies, 7 focused on CP detection, while the remaining studies focused on CP classification. Of these, 5 studies were conducted in Japan, 4 in China, 1 in Germany, 1 in the USA, 1 in Norway, and 1 in Canada. All articles were published in the last four years (Table 1). All studies included precise data on the diagnostic performance of the CNN system; 4 studies provided precise data on the performance of experts, and 3 studies on the performance of non-experts. All data on the diagnostic performance of human endoscopists concern CP classification, for which histological examination results were the gold standard. The diagnostic performance was categorized as TP, FP, FN, and TN (Table 2).
Based on the QUADAS-2 assessment, the quality of all 13 included studies was considered moderate (Fig 2). A total of 11 studies were considered high-quality, with low risk in at least 5 of the 7 QUADAS-2 domains. For the patient selection domain, 2 studies introduced bias because a case-control design was not avoided [18,27]. Moreover, 3 studies showed high concern for applicability [13,17,18]. For the index test domain, 2 studies had a high risk of bias [13,17]. Finally, only one study raised high concern regarding reference standard applicability [14].

Application in the field of colorectal polyp detection

Diagnostic performance of CNN system
The results of the diagnostic performance of the CNN system are shown in Table 3. Moreover, the PLR and NLR results of the CNN system confirmed that it is an effective method for detecting colorectal polyps.
Subgroup analysis without the data of short or full videos. The study of Guo et al. [16] included video data with a large sample size. Considering that including it might skew the general result, we chose to perform a subgroup analysis without it.

Application in the field of colorectal polyp classification

Diagnostic performance of CNN system. First, the pooled sensitivity and specificity were 0.943 (95%CI: 0.927-0.955) and 0.894 (95%CI: 0.631-0.977), respectively (Fig 4). The heterogeneity of sensitivity (I² = 94.77, P = 0.00) and specificity (I² = 98.91, P = 0.00) was significant. Meanwhile, the PLR, NLR, DOR, and AUC of SROC were also pooled; the heterogeneity of sensitivity (I² = 91.38, P = 0.00) and specificity (I² = 88.75, P = 0.00) was significant. All data are summarized in Table 3.

The comparison of diagnostic performance among CNN system, expert, and non-expert. For CP classification, the AUC of SROC of the CNN system, expert, and non-expert was 0.95 (95%CI: 0.93-0.97), 0.96 (95%CI: 0.94-0.98), and 0.90 (95%CI: 0.87-0.93), respectively (Fig 5). Comparing them in pairs according to the RDOR, we found that the diagnostic performance of the CNN system was comparable to that of the expert but significantly better than that of the non-expert (Table 4).

PLOS ONE

Publication bias and identification of sources of heterogeneity
According to the Deeks' funnel plot asymmetry test, no publication bias was detected in the pooled results of the CNN system (CP detection: P > |t| = 0.430; CP classification: P > |t| = 0.196; Fig 6A and 6B). Since notable heterogeneity was observed in the pooled analyses of the CNN system in the fields of CP detection and classification, meta-regression was conducted to identify the source of heterogeneity. Nonetheless, no potential sources of heterogeneity were identified.

Discussion
This work systematically reviewed the current status of the CNN system applied in the field of CP detection and classification. Moreover, we conducted a quantitative comparison of the diagnostic value between the CNN system and human endoscopists. Our major finding was that the diagnostic performance of the CNN system was comparable to that of the expert in the field of CP classification. In contrast, the performance of the CNN system was significantly superior to that of the non-expert.
The American Society for Gastrointestinal Endoscopy published the Preservation and Incorporation of Valuable Endoscopic Innovations (PIVI) statement in 2015 to address the resect-and-discard strategy [33]. This approach set the threshold of a diagnose-and-leave strategy for small colorectal polyps at an NPV of at least 90%. At the same time, the threshold of a resect-and-discard strategy was above 90% agreement with histopathology for post-polypectomy surveillance intervals [34]. These standards are high and hard to achieve, even for experienced endoscopists. Moreover, the task is time-consuming and labor-intensive for endoscopists. A few studies have shown that endoscopic detection and prediction yielded a rather low diagnostic accuracy rate, particularly for non-experts [35,36]. Hence, this calls for technological support: evidence has shown that computer-aided diagnosis of endoscopic images using AI has the potential to surpass the diagnostic accuracy of trained specialists. AI might also provide more accurate results without interobserver differences, especially between experts and non-experts.
A considerable number of studies have focused on developing CNN systems that assist human endoscopists. In the field of colonoscopy, their function is primarily divided into 2 categories: detection and classification. For CP detection, we found that the PLR, NLR, and AUC of the CNN system were 8.911 [95%CI: 2.110-37.622], 0.064 [95%CI: 0.043-0.094], and 0.95 [95%CI: 0.93-0.97], respectively. These results suggested that the CNN system is a good diagnostic tool for CP detection. Guo et al. [16] provided video data with a large sample size. Considering that including them carried a potential risk of skewing the general result for the CNN system, we subsequently performed a subgroup analysis without them. The result changed only slightly, indicating that the pooled estimate was stable with or without the data of Guo et al. [16].
Unfortunately, we did not find data on human endoscopists in the field of CP detection. However, some studies demonstrated that non-expert endoscopists achieved better diagnostic performance during endoscopy after an AI training course [37,38]. Hence, AI technologies have application potential both as a clinical ancillary diagnostic tool and as an endoscopist training method.
Furthermore, it would be highly beneficial if endoscopic observation could distinguish neoplastic CP from hyperplastic CP, because the removal of lesions without malignant potential is expensive and associated with post-procedure complications [39]. Thus, a precise classification of CP significantly improves the cost-effectiveness of colonoscopy. Nonetheless, precisely classifying the different types of CP remains rather difficult. For instance, conventional adenomas with indistinct borders or flat and depressed features are challenging to distinguish from the surrounding normal mucosa, especially when bowel preparation is inadequate or the mucosa is capped with mucus or intestinal residue [11]. Kuiper et al. revealed that the sensitivity/specificity of classification of diminutive CP was only 77.0%/78.8%, which was far from satisfactory [40]. In this study, we found that the sensitivity/specificity of non-experts in the field of CP classification was 85.9%/81.1%. As such, the benefits of optical CP classification might remain limited to experts; however, not every endoscopist is an expert. The emergence of AI technology helps resolve this limitation: we found that the diagnostic performance of the CNN system was significantly better than that of the non-expert. However, due to the complexity of the classification task, the DOR of the CNN system for CP classification (139.052 [95%CI: 22.978-449.202]) was lower than that for CP detection (152.325 [95%CI: 51.654-449.202]). Also, a similar CNN-DL system was used for the diagnosis and classification of proximal gastric precancerous conditions, including chronic atrophic gastritis, intestinal metaplasia, and dysplasia [41]. This system achieved a sensitivity of 93.5% and an accuracy of 98.3%, which were much better than both the less and more experienced endoscopists.
However, the CNN system is, in essence, a type of algorithm and cannot make logical decisions like humans. It can be used as a training or auxiliary tool to enhance the diagnostic performance of endoscopists, but several limitations remain. First, most of the images and videos used for CNN system training are of high quality, which usually introduces selection bias. These systems are frequently unable to distinguish lesions in low-quality material, and their diagnostic performance is excellent on the training set but weaker in clinical practice.
Secondly, the identification of images and videos of rare lesions, including subtle flat colonic lesions and uncommon morphology types, is challenging. Such materials are scarce in both hospitals' independent databases and online databases, leading to inadequate training of the CNN system, which in turn results in high misdiagnosis rates for infrequent diseases.
Thirdly, most studies included in the present review trained their CNN systems with still images or image frames extracted from colonoscopy videos, which might hinder real-time implementation of the CNN system. Moreover, due to the limited computing power of processors and the complexity of the technical processes, the latency of the decision-making process in most systems was unsatisfactory, disturbing the endoscopist during colonoscopy. Therefore, the ability to work in real time during endoscopy should be incorporated.
Finally, the CNN system and other forms of artificial intelligence are essentially algorithms that make decisions based on past information, which means they cannot reason logically as humans do. Notably, AI excels when data and training are abundant and exhaustive. However, its performance degrades when it faces previously unseen features and objects, since it struggles to extrapolate knowledge gathered from the past to a new environment [42]. In this scenario, humans appear to perform better than AI [43].
With the rapid advancement of AI technology, an ideal CNN system may be developed to overcome these limitations. It might precisely distinguish different lesions, including rare ones, from the normal surrounding mucosa. Meanwhile, it might assist endoscopists during endoscopy with almost undetectable latency. Furthermore, it might provide the type, location, size, depth, and other relevant information of lesions.
Some limitations of the present study should be acknowledged. First, studies in this field are limited since the application of the CNN system in endoscopy has not yet matured. Secondly, the sample size of the comparison between the CNN system and human endoscopists was small, which might cause selection bias. Thirdly, although no publication bias was detected, selective reporting bias might still exist because letters, reviews, and articles not published in English were excluded. Fourthly, although meta-regression analysis was performed to identify potential sources of heterogeneity, the exploration of heterogeneity might remain inadequate due to the limited sample size and the variables collected from the included studies. Finally, the majority of the included studies were retrospective and used different types of training and testing materials, hence a potential bias.

Conclusion
In conclusion, our systematic review and meta-analysis suggested that the CNN system achieved satisfactory diagnostic performance in the field of CP detection. In the field of CP classification, the CNN system achieved diagnostic performance comparable to that of the expert and better than that of the non-expert. Despite its limitations, the CNN system can be popularized in clinical practice with relatively high diagnostic accuracy, consequently enhancing the diagnostic performance of endoscopists.
Supporting information S1