The development of medical assisting tools based on artificial intelligence advances is essential in the global fight against the COVID-19 outbreak and for the future of medical systems. In this study, we introduce ai-corona, a radiologist-assistant deep learning framework for COVID-19 infection diagnosis using chest CT scans. Our framework incorporates an EfficientNetB3-based feature extractor. We employed three datasets: the CC-CCII set, the Masih Daneshvari Hospital (MDH) cohort, and the MosMedData cohort. Overall, these datasets constitute 7184 scans from 5693 subjects and include the COVID-19, non-COVID abnormal (NCA), common pneumonia (CP), non-pneumonia, and Normal classes. We evaluate ai-corona on test sets from the CC-CCII set, the MDH cohort, and the entirety of the MosMedData cohort, on which it gained AUC scores of 0.997, 0.989, and 0.954, respectively. Our results indicate that ai-corona outperforms all the alternative models. Lastly, our framework’s diagnosis capabilities were evaluated as an assistant to several experts. Accordingly, we observed an increase in both the speed and accuracy of expert diagnosis when incorporating ai-corona’s assistance.
Citation: Yousefzadeh M, Esfahanian P, Movahed SMS, Gorgin S, Rahmati D, Abedini A, et al. (2021) ai-corona: Radiologist-assistant deep learning framework for COVID-19 diagnosis in chest CT scans. PLoS ONE 16(5): e0250952. https://doi.org/10.1371/journal.pone.0250952
Editor: Gulistan Raja, University of Engineering & Technology, Taxila, PAKISTAN
Received: February 10, 2021; Accepted: April 17, 2021; Published: May 7, 2021
Copyright: © 2021 Yousefzadeh et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: Data cannot be shared publicly because of ethical restrictions. CT scan data of this research from the Masih Daneshvari Hospital, including images and results, are anonymous and non-personally identifiable, due to sensitive human study participant data. Data are available from the Institutional Data Access / Ethics Committee of Masih Daneshvari Hospital (contact via firstname.lastname@example.org) for researchers who meet the criteria for access to confidential data. The MDH cohort is also available upon request to the corresponding author’s email address. The CC-CCII set and the MosMedData cohort are publicly available at http://ncov-ai.big.ac.cn/download and https://mosmed.ai/en/, respectively.
Funding: The authors received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
Since the beginning of 2020, the novel Coronavirus Disease 2019 (COVID-19) has spread widely across the globe. As of April 29, 2021, there are more than 118 million reported cases and 2.5 million deaths. Patients infected with COVID-19 commonly display symptoms such as fever, cough, fatigue, breathing difficulties, and muscle ache [2–4]. Vaccination has been underway in many countries since early 2021, but has been facing many challenges.
Currently, the most common method of testing for COVID-19 is Real-Time Polymerase Chain Reaction (RT-PCR), which detects viral nucleotides in upper respiratory specimens obtained by nasopharyngeal, oropharyngeal, or nasal mid-turbinate swab. It has been shown that RT-PCR has several drawbacks. Reports suggest that since oropharyngeal swabs tend to detect COVID-19 less frequently than nasopharyngeal swabs, RT-PCR tends to have a high false-negative rate. Furthermore, RT-PCR has demonstrated a decrease in sensitivity to below 70% due to a low viral nucleic acid load and inefficiencies in its detection. This might be caused by the immature development of nucleic acid detection technology, variation in detection rate when different gene region targets are used, or a low patient viral load. Moreover, the availability of test kits and of expert personnel to administer them is still suboptimal in some countries, and the extended time required to complete the test further rules out RT-PCR as a reliable early detection and screening method [8–10]. In contrast to RT-PCR, diagnosis from other measurements, such as chest Computed Tomography (CT) and blood factors, has been shown to be an effective early detection and screening method with high sensitivity, both in detecting the disease and in anticipating its severity.
A chest CT scan of a COVID-19 infected patient reveals bilateral peripheral involvement in multiple lobes, with areas of consolidation and ground-glass opacity that progress to “crazy-paving” patterns as the disease develops. Asymmetric bilateral subpleural patchy ground-glass opacities and consolidation with a peripheral or posterior distribution, mainly in the middle and lower lobes, are described as the most common imaging findings of COVID-19. Additional common findings include interlobular septal thickening, air bronchogram, and a crazy-paving pattern in the intermediate stages of the disease. The most common patterns in the advanced stage are subpleural parenchymal bands, fibrous stripes, and subpleural resolution. Nodules, cystic change, pleural effusion, pericardial effusion, lymphadenopathy, cavitation, CT halo sign, and pneumothorax are some of the uncommon but possible findings [11, 14]. Recent studies indicate that organizing pneumonia, which occurs in the course of viral infection, is pathologically responsible for the clinical and radiological manifestations of Coronavirus pneumonia.
Deep learning is an area of Artificial Intelligence (AI) that has demonstrated tremendous capabilities in image feature extraction and has been recognized as a successful tool in medical imaging-based diagnosis, performing exceptionally well with modalities such as X-Ray, Magnetic Resonance Imaging (MRI), and CT [15–21]. Recently, research on AI-assisted respiratory diagnosis, especially of pneumonia, has gained a lot of attention. One of the well-established standards in this research is the comparison of AI with expert medical and radiology professionals. A pioneering work in this field introduced a radiologist-level deep learning framework, trained and validated on the ChestX-ray8 dataset, for the detection of 14 abnormalities, including pneumonia, in chest X-Ray images; it was further developed into a deep learning framework with pneumonia detection capabilities equivalent to those of expert radiologists. Moreover, a subsequent work introduced a novel dataset of chest X-Ray images annotated with 14 abnormalities (7 shared with ChestX-ray8) together with a state-of-the-art deep learning framework. Lastly, another work proposed a deep learning framework with a feature extractor based on AlexNet to create a model capable of accurately diagnosing knee injuries from MRI scans, and further showcased the positive impact of AI assistance on expert diagnosis.
In COVID-19-related research, a sensitivity of 0.59 has been reported for the RT-PCR test kit and 0.88 for CT-based diagnosis in patients with COVID-19 infection, along with a radiologist sensitivity of 0.97 in diagnosing COVID-19 infected patients with RT-PCR confirmation. Furthermore, a deep learning framework has been introduced that achieves a 0.96 AUC score in the diagnosis of RT-PCR confirmed COVID-19 infected patients. Zhang et al. proposed a model that, on a dataset of 4154 subjects, achieved an AUC score of 0.98 for diagnosing COVID-19 against two other classes, Normal and CP (Common Pneumonia, i.e. non COVID-19 viral and bacterial pneumonia). They further made their dataset, CC-CCII, publicly available. In addition, the model proposed by Jin et al., developed on a dataset of 9025 subjects that amalgamates their own data with several other public datasets (e.g. LIDC–IDRI, Tianchi-Alibaba, MosMedData, and CC-CCII), gained an accuracy of 0.975 for diagnosing between COVID-19 and three other classes (non-pneumonia, non-viral community-acquired pneumonia, and Influenza-A/B), 0.921 for diagnosing between COVID-19 and the CP and Normal classes on the CC-CCII dataset, and 0.933 for diagnosing COVID-19 against non-pneumonia on the MosMedData cohort. Notably, this work also manages to diagnose between COVID-19 and influenza type-A, a surprising result given the small amount of influenza data in their study.
In this paper, we present ai-corona, a radiologist-level deep learning framework for COVID-19 diagnosis in chest CT scans. Our framework was developed on a set of 7184 lung CT scans from 5693 subjects, of which 2032 subjects are from the Masih Daneshvari Hospital (MDH) cohort and the rest belong to the CC-CCII set and the MosMedData cohort. This data was gathered from three countries: China, Iran, and Russia. In this work, our framework diagnoses between the COVID-19, CP (common pneumonia), NCA (non COVID-19 abnormal), non-pneumonia, and Normal classes. We evaluate and compare the performance of ai-corona with experts and RT-PCR in COVID-19 diagnosis, and further compare our framework with the AI models proposed by Zhang et al. and Jin et al. Finally, we examine the impact of AI as assistance to expert diagnosis.
In short, the main advantages and novelties of this study are as follows:
- Introducing a comprehensive and reliable methodology for annotating cases, especially COVID-19 infections, in the MDH dataset.
- Proposing a deep learning framework that is capable of accurately diagnosing chest CT scans for COVID-19, while being robust to the number of slices in the scan and having a low computational load.
- Thoroughly evaluating the diagnosis performance of ai-corona on multiple datasets and comparing it to radiologists, RT-PCR, and two other similar works.
- Evaluating and elucidating the impact of ai-corona’s assistance on radiologists’ diagnosis performance.
Materials and methods
Three datasets were employed in this work: the MDH cohort, the CC-CCII set, and the MosMedData cohort. An overall summary of all the data employed in our work can be found in Table 1.
The first dataset was obtained by our group from patients hospitalized at the Masih Daneshvari Hospital (MDH) (Tehran, Iran). The cascade structure of this cohort can be found in S1 Fig. This cohort consists of 2121 lung CT scans from 2032 subjects annotated into 3 classes: (1) Normal; (2) Non-COVID Abnormal (NCA); and (3) COVID-19. Since differentiating between the COVID-19 and Normal classes is easier than between COVID-19 and NCA (especially when they share similar imaging features), having the NCA class is very important; it includes abnormalities such as atelectasis, cardiomegaly, emphysema, hydropneumothorax, pneumothorax, cardiopulmonary edema, cavity, fibrocavitary changes, fibrobronchiectasis, mass, and nodule. Using the search function of the hospital’s PACS and by reviewing reports by two board-certified radiologists, we gathered a preliminary dataset with a balanced distribution over all three classes.
All the participants in the MDH cohort gave written consent and our work has received the ethical license of IR.SBMU.NRITLD.REC.1399.024 from the Iranian National Committee for Ethics in Biomedical Research.
Cases in the Normal and NCA classes are from prior to the start of the Coronavirus global pandemic. A subset of the data in these two classes was randomly selected for testing. This portion was re-annotated by a different expert radiologist. Only the cases with consistent labels (i.e. same label as in the initial report) were retained in the test set. The MDH Normal and NCA cases that were not included in the test subset were further divided randomly into a training subset and a tuning subset.
The MDH COVID-19 test scans were taken in the early stages of the infection and comprise 119 lung CT scans from 109 patients hospitalized for more than three days. These scans were selected based on the consensus of several criteria that indicate COVID-19 infection: (1) a report on the scan by at least one radiologist; (2) confirmation of infection by two pulmonologists; (3) clinical presentation; and (4) the RT-PCR report.
Furthermore, unlike other works that take a positive RT-PCR as the sole criterion to annotate a case with the COVID-19 label, and since our evaluation includes comparing the diagnosis performance of ai-corona with experts and RT-PCR, we clearly could not use a dataset annotated solely on the basis of RT-PCR test results. Our annotation strategy is, therefore, more comprehensive and incorporates additional available metadata.
The MDH COVID-19 training (1518 subjects, 1590 scans) and tuning (168 subjects, 174 scans) sets were annotated using the aforementioned reports by the two radiologists.
The CT scans in the MDH cohort contained between 21 and 46 slices, acquired in axial orientation with a slice thickness between 8 and 10 mm. The histogram of the number of slices is shown in S2(a) Fig, while S2(b) and S2(c) Fig illustrate the age and sex distribution of the MDH cohort.
Moreover, as the NCA class of the MDH cohort includes many samples with non COVID-19 pneumonia, we can take this class as the equivalent of the CC-CCII set’s CP class for our model’s training.
The second dataset employed in this work is the publicly available CC-CCII dataset. After quality control (e.g. removing non-standard scans, such as those with a small number of slices), this set contains 3953 CT scans from 2551 subjects. The scans in CC-CCII are annotated into three classes: Normal, Common Pneumonia (CP), and COVID-19. The CC-CCII dataset was randomly split into three subsets for: (1) training (2069 subjects, 3206 scans), (2) tuning (230 subjects, 352 scans), and (3) testing (252 subjects, 395 scans). The tuning subset was used for model checkpointing and selection of the best overall model.
The third dataset, the MosMedData cohort, is also publicly available and comprises 1110 CT scans from 1110 subjects. This dataset is annotated into two classes: non-pneumonia and COVID-19. We used the entire MosMedData cohort for external testing, that is, testing on a dataset that has not been used for model training or tuning. To evaluate our model on this cohort, we take the model’s predicted probability for the COVID-19 class (reducing the task to binary classification).
The public datasets LIDC–IDRI and Tianchi-Alibaba (which were used in the training of the model proposed by Jin et al.) were not used in our framework’s development, as these sets are intended for benign and malignant tumor diagnosis and might introduce uncertainties into our framework.
For the RT-PCR evaluation set, 2672 subjects, each hospitalized for more than three days, were tested 6419 times between February and October 2020. Respiratory samples, including pharyngeal swabs/washing, were obtained from the subjects. Nucleic acid was extracted from the samples using a QiaSymphony system (QIAGEN, Hilden, Germany), and SARS-CoV-2 RNA was detected using primer and probe sequences for screening and confirmation on the basis of the published sequence. An RT-PCR diagnosis is considered correct when a patient has at least one positive test result.
For all the image slices, pixel values above the 99.5th percentile (i.e. the brightest 0.5% of pixels) were clipped to that percentile value. Then, the intensities were linearly transformed to the range [0, 255]. Since we utilize models pre-trained on the ImageNet dataset, an additional ImageNet normalization was also carried out.
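The pre-processing described above can be sketched as follows. This is a minimal sketch, not the authors' exact code; the specific ImageNet mean/std constants (common Keras defaults on the 0–255 scale) are our assumption, as the paper does not list them.

```python
import numpy as np

def preprocess_slice(raw):
    """Prepare one CT slice: clip the brightest 0.5% of pixels, rescale
    to [0, 255], replicate to 3 channels, and apply ImageNet normalization."""
    img = raw.astype(np.float64)
    hi = np.percentile(img, 99.5)          # value at the 99.5th percentile
    img = np.minimum(img, hi)              # clip the top 0.5% of pixels
    lo = img.min()
    img = (img - lo) / max(hi - lo, 1e-8) * 255.0  # linear rescale to [0, 255]
    rgb = np.stack([img] * 3, axis=-1)     # grayscale -> identical RGB channels
    # Assumed ImageNet per-channel mean/std on the 0-255 scale
    mean = np.array([123.68, 116.779, 103.939])
    std = np.array([58.393, 57.12, 57.375])
    return (rgb - mean) / std
```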
We also opted not to perform any segmentation (i.e. patch extraction) in our pre-processing. Manually annotating each dataset (as Jin et al. did) is time- and resource-consuming, while automated methods, such as image processing techniques and pre-trained segmentation deep learning models, would introduce further unwanted error and uncertainty into our data and, subsequently, into the model’s inference.
Deep learning method
Inspired by prior work, ai-corona’s deep learning model consists of two main blocks: a feature extractor and a classifier, as shown in Fig 1. The main challenge is mapping a 3-dimensional CT scan, which is a series of image slices, to a probability vector with a length equal to the number of classes. Another challenge is that scans do not all have the same number of slices, and not all slices are useful for diagnosis. To address this, we take the middle 50% of the image slices in each scan and denote the number of selected slices with S. We also experimented with other slice selection strategies (e.g. a portion larger than 50%, the top/bottom 50%, etc.), none of which performed better.
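The middle-50% slice selection can be sketched as below; the exact boundary handling for odd slice counts is our assumption, as the paper does not specify it.

```python
import numpy as np

def middle_half(scan):
    """Select the middle 50% of slices from a scan of shape (n, H, W[, C]),
    skipping roughly the first and last quarter of the slices."""
    n = scan.shape[0]
    start = n // 4
    return scan[start:start + n // 2]
```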
Each selected slice is fed through the feature extractor block one by one, so that we end up with S feature vectors, which are then transformed into a single vector via an average pooling function. Afterwards, the result is passed through a fully connected network to reach the three output neurons corresponding to our three classes.
As shown in Fig 1, the feature extractor block is a pipeline that receives each slice with dimensions 512 × 512 × 3 (where 3 represents the number of color channels, all identical for each image) and outputs a vector of length 1536 through an average pooling function. After all S slices have passed through the feature extractor block, another average pooling is applied to the resulting S vectors, which yields a single vector of length 1536.
This pipelined design ensures that our framework is independent of the number of slices in a CT scan, as we always end up with a single vector of length 1536 at the end of the feature extractor block: the pipeline receives a varying number of slices, extracts their features, and outputs a single vector of known length. Moreover, the use of only a single feature extractor significantly reduces the computational load of our framework, resulting in much faster training and prediction times.
Convolutional Neural Networks (CNN) were used for the feature extraction block. We experimented with different CNN models, such as DenseNet, ResNet, Xception, and EfficientNetB0 through EfficientNetB5 [36–39], taking into account their accuracy and accuracy density on the ImageNet dataset. All of these models were initialized with their respective pre-trained ImageNet weights. In the end, the EfficientNetB3 model, stripped of its last dense layers, was chosen as the primary feature extractor for our deep learning framework. The vector output of the EfficientNetB3 feature extraction block is then passed through the classifier block, which contains yet another average pooling layer connected via a dense network to the model’s output neurons corresponding to the classes.

ai-corona is implemented with Python 3.7 and the Keras 2.3 framework and was trained on an NVIDIA GeForce RTX 2080 Ti for 60 epochs in a total of three hours. The Pydicom package was used to read the DICOM files of the cases.
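The data flow through the two blocks can be illustrated with a schematic mock. The random-projection "backbone" below merely stands in for EfficientNetB3 and is not the real network; the point is only to show how per-slice features are average-pooled into one 1536-length vector and classified, independently of the slice count S.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the EfficientNetB3 backbone: any map from a slice to a
# 1536-dim feature vector suffices to illustrate the pipeline.
W_feat = rng.normal(size=(3, 1536))

def extract_features(slice_img):              # (H, W, 3) -> (1536,)
    pooled = slice_img.mean(axis=(0, 1))      # global average pool -> (3,)
    return pooled @ W_feat

W_cls = rng.normal(size=(1536, 3))            # dense classifier head
b_cls = np.zeros(3)

def predict(scan):                            # (S, H, W, 3) -> (3,) class probs
    feats = np.stack([extract_features(s) for s in scan])   # (S, 1536)
    pooled = feats.mean(axis=0)               # average pool over slices
    logits = pooled @ W_cls + b_cls
    exp = np.exp(logits - logits.max())       # softmax over the 3 classes
    return exp / exp.sum()
```

Because the slice axis is collapsed by the mean, scans with 8 or 20 slices both yield a length-3 probability vector, mirroring the slice-count independence described above.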
Class activation maps
To generate the class activation map of an image slice, we computed a weighted average across the 1536 values of the feature vector, using weights from the classification block, to obtain a 10 × 10 image. The resulting map was then mapped to a color scheme, upsampled to 512 × 512 pixels, and overlaid on the original input image slice. Because the feature vectors are weighed by parameters from the classification block, more predictive features appear brighter; consequently, the regions of the image slice that most influence the model’s prediction appear brightest. The class activation maps thus highlight which pixels in an image slice are important for the model’s prediction.
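A minimal sketch of this computation is given below. The 10 × 10 × 1536 feature-map shape is inferred from the text, and nearest-neighbor upsampling is used for simplicity; the authors' interpolation method and color mapping are not specified.

```python
import numpy as np

def class_activation_map(feature_map, class_weights):
    """feature_map: (10, 10, 1536) backbone activations for one slice;
    class_weights: (1536,) classifier weights for the class of interest.
    Returns a (512, 512) map normalized to [0, 1], ready to overlay."""
    cam = feature_map @ class_weights        # weighted sum over channels -> (10, 10)
    cam = cam - cam.min()
    cam = cam / (cam.max() + 1e-8)           # normalize to [0, 1] for display
    idx = np.arange(512) * 10 // 512         # nearest-neighbor upsample indices
    return cam[np.ix_(idx, idx)]             # (512, 512)
```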
In order to quantify the reliability of our findings and the performance of our model’s detection of COVID-19 in chest CT scans, we provide a thorough comparison with the diagnoses of expert practicing radiologists. To achieve a conservative discrimination strategy, we compute the following evaluation criteria: sensitivity (true positive rate), specificity (true negative rate), F1-score, Cohen’s kappa, and AUC. Moreover, the confusion matrix over all classes is calculated for each individual study.
We assign a positive label to the presence of the underlying class and a negative label to the rest of the classes. Incorporating error propagation and using Bayesian statistics, we calculate the marginalized confidence region at the 95% level for each computed quantity. The significance of diagnostic results is examined by systematically computing p-value statistics. To achieve a conservative decision, the 3σ significance level is usually considered.
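The paper's exact Bayesian procedure is not spelled out; as one hedged illustration of a 95% credible region for a single proportion (e.g. sensitivity), a Beta posterior with a uniform prior can be evaluated on a grid:

```python
import numpy as np

def credible_interval(successes, trials, level=0.95):
    """Central credible interval for a binomial proportion under a
    uniform Beta(1, 1) prior, computed numerically on a grid."""
    p = np.linspace(1e-6, 1 - 1e-6, 100001)
    # Log-posterior (up to a constant): s*log(p) + (n - s)*log(1 - p)
    log_post = successes * np.log(p) + (trials - successes) * np.log(1 - p)
    post = np.exp(log_post - log_post.max())   # unnormalized posterior
    cdf = np.cumsum(post)
    cdf /= cdf[-1]
    lo = p[np.searchsorted(cdf, (1 - level) / 2)]
    hi = p[np.searchsorted(cdf, 1 - (1 - level) / 2)]
    return lo, hi
```

As expected, the interval contains the observed rate and narrows as the number of trials grows, which is the behavior the reported confidence regions rely on.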
Since the radiologists’ diagnosis is given as a “Yes” or “No” statement for each class, it is necessary to convert the probability values computed by our model to binary values. Hence, we selected an operating point for distinguishing a given class from the others and computed the true positive rate (sensitivity) versus the false positive rate (1 − specificity). This operating point was selected such that the model would yield a high specificity. In addition to the other mentioned evaluation criteria, the Receiver Operating Characteristic (ROC) curve is also estimated for our studies. All of our criteria were calculated using the scikit-learn package.
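The thresholding step can be sketched in plain NumPy (the paper itself used scikit-learn; this standalone version only illustrates how an operating point turns probabilities into Yes/No calls and yields sensitivity and specificity):

```python
import numpy as np

def binarize_and_score(covid_probs, labels, threshold):
    """Binarize COVID-19 class probabilities at an operating point and
    report (sensitivity, specificity); labels: 1 = COVID-19, 0 = other."""
    preds = (np.asarray(covid_probs) >= threshold).astype(int)
    labels = np.asarray(labels)
    tp = int(np.sum((preds == 1) & (labels == 1)))
    fn = int(np.sum((preds == 0) & (labels == 1)))
    tn = int(np.sum((preds == 0) & (labels == 0)))
    fp = int(np.sum((preds == 1) & (labels == 0)))
    return tp / (tp + fn), tn / (tn + fp)
```

Raising the threshold trades sensitivity for specificity, which is how the high-specificity operating point described above is chosen.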
Our team of experts annotated cases in the CC-CCII test set and the MDH test set with “Yes” and “No” labels for each class. To prevent a loss in the experts’ diagnosis performance due to fatigue, they were asked to work in short time blocks. Their performance was then evaluated and recorded. Next, to evaluate the impact of AI assistance on the experts’ performance, after an appropriate amount of time and after shuffling the sets (to prevent recall of individual cases), the experts re-annotated the two sets a second time, this time with access to the output of the model. They incorporated the model’s opinion for suspicious cases at their own discretion. Their performance was evaluated and recorded again.
Our team of four experts includes two practicing academic senior radiologists with 15 years of experience each, referred to in our study as Senior Radiologist 1 and Senior Radiologist 2. Another expert is a practicing academic radiologist with 5 years of experience, referred to as the Junior Radiologist. The last member is a senior radiology resident, referred to as the Radiology Resident. The team was chosen such that a wide range of experience and background knowledge would be present, in order to make our studies more comprehensive.
Training, evaluation, and testing datasets
To develop ai-corona, we utilized data from three different sources: (1) the MDH cohort, (2) the publicly available CC-CCII dataset, and (3) the publicly available MosMedData cohort. The combined data came from multiple international sites and comprised 7184 CT scans from 5693 subjects, categorized into five classes: Normal, CP, NCA, non-pneumonia, and COVID-19. For a better comparison of the diagnosis performance between RT-PCR and CT scans, the RT-PCR test records of 2672 patients over a 7-month period were gathered.
The MDH and the CC-CCII data were used for training, evaluation (tuning), and testing. The MosMedData was used entirely for testing. Overall, 5322 scans from 3985 subjects were used for training and tuning, and three sets were used for testing: (1) CC-CCII test set (105 Normal, 147 CP, and 143 COVID-19 scans), (2) MDH test set (121 Normal, 117 NCA, and 119 COVID-19 scans), and (3) the entire MosMedData cohort (254 non-pneumonia and 856 COVID-19 scans).
Taking into consideration the ground truth annotation of all the works involved, the CC-CCII test set was used to compare ai-corona with the models proposed by Zhang et al. and Jin et al., and with expert radiologists. Furthermore, the MDH test set was used to compare ai-corona with the radiologists and RT-PCR. Lastly, the MosMedData cohort was used to compare ai-corona with the model proposed by Jin et al.
Since the ground truth annotation methodology described in the Data subsection yields accurate labels, it was used to annotate a separate set for RT-PCR evaluation. This set is used to showcase the evolution of RT-PCR’s sensitivity over a period of 7 months in Fig 2 (the sensitivity for each day is calculated as the average sensitivity over a 15-day period centered on that day). RT-PCR’s sensitivity oscillates in the range [0.351, 0.722]. The decrease in sensitivity to 0.351 on April 29, 2020 is due to a change in the specimen collection method to oropharyngeal wash; nasopharyngeal and oropharyngeal swabs were used again later. The highest RT-PCR sensitivity in this evaluation is considered its best, denoted RT-PCR Best.
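The centered 15-day smoothing could be computed as below. Whether the authors pooled counts within the window or averaged daily sensitivities is not stated; this sketch assumes pooled true-positive and false-negative counts.

```python
import numpy as np

def windowed_sensitivity(daily_tp, daily_fn, window=15):
    """Per-day sensitivity, with counts pooled over a centered window
    (shorter windows at the edges of the date range)."""
    kernel = np.ones(window)
    tp = np.convolve(daily_tp, kernel, mode="same")   # windowed true positives
    fn = np.convolve(daily_fn, kernel, mode="same")   # windowed false negatives
    return tp / np.maximum(tp + fn, 1e-12)            # guard against empty windows
```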
Performance evaluation and comparison
With three test sets, our framework’s COVID-19 diagnosis performance is evaluated on the CC-CCII test set, the MDH test set, and the MosMedData cohort for all the studies (an operating point was selected for each study). The confusion matrices for our evaluation results can be found in Fig 3. Moreover, for the COVID-19 class, ROC curves are showcased in Fig 4, and a more thorough look using the four metrics is depicted in Fig 5a and 5b. Finally, the complete numerical reports for this evaluation can be found in Table 2. Values denoted with “-” in the table indicate a lack of report.
Bottom row, left and middle: confusion matrices for ai-corona on the MosMedData cohort. Bottom row, right: confusion matrix for the model proposed by Jin et al. on the MosMedData cohort.
Diagrams in the bottom row correspond to a zoom-in of their respective curves. Hollow shapes represent an expert un-aided by AI, while filled shapes represent an expert with AI assistance. As a specificity value for RT-PCR was not available, its sensitivity is shown as a solid line in (b).
Hollow shapes represent an expert un-aided by AI, while filled shapes represent an expert with AI assistance.
A “-” value indicates a lack of data. Reports in sections A, B, and C correspond to the CC-CCII test set, the MDH test set, and the MosMedData cohort, respectively.
Fig 3a through Fig 3c show that ai-corona performs better in all three classes (Normal, CP, COVID-19) compared to Zhang et al. and Jin et al. on the CC-CCII test set, achieving an AUC score of 0.997, a sensitivity of 0.972, and a specificity of 0.968 on the COVID-19 class. The confusion matrix in Fig 3d showcases our framework’s performance on the MDH test set for the three classes of Normal, NCA, and COVID-19. For this dataset, our framework gains scores of 0.989, 0.924, and 0.983 for AUC, sensitivity, and specificity, respectively. In addition, Fig 3e and 3f showcase that our framework surpasses the model proposed by Jin et al. on the MosMedData cohort, with an AUC of 0.954. Although both have similar sensitivities in COVID-19 diagnosis, ai-corona outperforms Jin et al.’s model in non-pneumonia diagnosis with 83.07% accuracy, reporting fewer false positives.
The better diagnosis performance on the CC-CCII test set indicates that diagnosing NCA against the other classes is indeed more difficult than diagnosing CP against the other classes. This is because each of the different abnormalities present in the NCA class has its own unique imaging features.
Comparison with experts and RT-PCR
Fig 4(a) and the top diagram of Fig 5 showcase the COVID-19 diagnosis performance of ai-corona compared with that of the experts on the CC-CCII test set. As shown, our framework performs better in all cases (except for the specificity of Senior Radiologist 1). Furthermore, Fig 4(a) and the bottom diagram of Fig 5 showcase the same comparison for the MDH test set. Here, the framework performed similarly to the radiologists in specificity, but outperformed them in the other metrics. In this comparison, 93.3% of the COVID-19 cases in the MDH test set (111 of 119) were diagnosed as infected by at least one expert. Of the other 8 that were not, our framework managed to report one and RT-PCR reported three as infected. If RT-PCR were the only criterion for the truth annotation, the overall sensitivity of the radiologists would improve to 97%, further confirming previously reported findings. The complete reports for these two evaluations are in sections A and B of Table 2.
In Fig 4(b), the sensitivity of RT-PCR-based diagnosis and CT-based diagnosis is compared. The figure shows that the RT-PCR Best sensitivity of 0.722 is lower than that of every expert diagnosing via CT. The RT-PCR Best sensitivity is an upper bound: had every admitted COVID-19 patient been tested, instead of only those hospitalized for more than three days, RT-PCR’s sensitivity would be much lower than 0.722.
Model as expert assistant
The goal of any AI assistant model is to improve the diagnosis performance of experts. For this evaluation, the radiologists first annotated the test set. After an appropriate amount of time, they re-annotated the set a second time, this time with ai-corona’s diagnosis available for the entire set. The test set was also shuffled the second time to eliminate any recall of cases. The experts’ diagnosis performance is depicted in Fig 5. For the CC-CCII test set, all the experts (except the Radiology Resident) improved in sensitivity, and a significant improvement in the other metrics is also seen for everyone (except Senior Radiologist 1). For the MDH test set, an improvement in sensitivity can be seen for Senior Radiologist 1 and the Junior Radiologist; specificity improved only for the Radiology Resident and remained unchanged for the others. In every other evaluation criterion, the AI model had a positive impact on the experts’ performance.
Interpretation of ai-corona
To ensure that ai-corona was learning the correct imaging features, class activation maps were generated (Fig 6), following the methodology described in the Materials and methods section. In a class activation map of a slice, more predictive areas (those that hold the correct imaging features) appear brighter. Thus, the brightest areas of the class activation map correspond to the regions that most influence the model’s prediction.
Additional evaluations were made as well, the results of which can be found in the Supporting Information section. First, on the MDH test set, the performance of diagnosis between the NCA and Normal classes was evaluated using the four metrics and compared to the experts. Furthermore, all possible comparisons between every pair of classes were made to ensure the thoroughness and completeness of our evaluation; these are showcased in S1–S6 Tables. As an example, this extra study showed that radiologists perform better than the AI model in diagnosing NCA from Normal.
Lastly, it is important to note the speed at which the different methodologies perform diagnosis. As shown in Table 3, RT-PCR is extremely slow, while our framework is faster than the best radiologist by 25 orders of magnitude.
Conclusion and discussion
We introduce ai-corona, a radiologist-assistant deep learning framework capable of accurate COVID-19 diagnosis in chest CT scans. Our deep learning framework was developed (training and tuning) on 5322 scans from 3985 subjects, gathered from cohorts from two countries, China and Iran, and was tested against three sets: the CC-CCII test set from China (395 scans, 252 subjects), the MDH test set from Iran (357 scans, 346 subjects), and the MosMedData cohort from Russia (1110 scans, 1110 subjects). Our framework learned to diagnose patients infected with COVID-19 and to distinguish between COVID-19, other types of common pneumonia (CP), such as viral and bacterial, and other non COVID-19 abnormalities (NCA). Moreover, a set of 2672 subjects was used to calculate the sensitivity of RT-PCR.
The use of multiple datasets, each with scans differing in the number of slices, together with a lack of slice-specific labeling, presented a challenge for this work. To address this, we dynamically select the middle 50% of slices in each scan and feed them to a single EfficientNetB3-based feature extractor, which, after an average pooling operation, yields a single feature vector to be classified. This method, alongside the use of only one 2D CNN, not only makes our framework more robust, but also makes its predictions faster and capable of running on slower hardware.
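The slice-selection and pooling step can be sketched as follows. This is a minimal NumPy sketch under stated assumptions: the per-slice feature extractor is stubbed with a hypothetical random function standing in for the actual EfficientNetB3 backbone, and the feature dimension of 1536 is illustrative.

```python
import numpy as np

def middle_slices(scan: np.ndarray, fraction: float = 0.5) -> np.ndarray:
    """Select the middle `fraction` of slices from a scan of shape (n_slices, H, W)."""
    n = scan.shape[0]
    keep = max(1, int(round(n * fraction)))
    start = (n - keep) // 2
    return scan[start:start + keep]

def extract_features(slices: np.ndarray, feat_dim: int = 1536) -> np.ndarray:
    """Hypothetical stand-in for the per-slice CNN feature extractor.
    A real implementation would run each slice through EfficientNetB3
    and return one feature vector per slice."""
    rng = np.random.default_rng(0)
    return rng.standard_normal((slices.shape[0], feat_dim))

def scan_feature_vector(scan: np.ndarray) -> np.ndarray:
    """Middle-50% selection -> per-slice features -> average pooling."""
    sel = middle_slices(scan)
    feats = extract_features(sel)        # (n_selected, feat_dim)
    return feats.mean(axis=0)            # one fixed-length vector per scan

scan = np.zeros((37, 512, 512), dtype=np.float32)  # a scan with 37 slices
vec = scan_feature_vector(scan)
print(vec.shape)  # (1536,)
```

Because the average pooling collapses any number of slices into one fixed-length vector, the same classifier head works for scans of different lengths, which is how a single 2D CNN can serve datasets with heterogeneous slice counts.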
Our framework was compared to two other AI models, proposed by Zhang et al. [29] and Jin et al. [30], respectively. Its diagnosis performance was also compared to that of experts and other means of diagnosis in order to achieve a comprehensive and sensible image of the framework's abilities. In the end, ai-corona managed to outperform the two other AI models in COVID-19 diagnosis. Our framework achieves high sensitivity while also maintaining high specificity.
Our framework achieved an AUC score of 0.997 on the CC-CCII test set and performed better than the models proposed by Zhang et al. [29] and Jin et al. [30] on all four metrics. On the MDH test set, ai-corona gained an AUC score of 0.989 and performed better than the experts in most of the metrics. It is worth mentioning that for our framework, diagnosing between the COVID-19 and CP classes was easier than between COVID-19 and NCA, while for the experts it was the opposite. RT-PCR, as another method of diagnosis, had a sensitivity of 0.722 at best, worse than all the experts and the AI. Lastly, our framework gained a 0.954 AUC score on the MosMedData cohort, outperforming Jin et al. [30]. A complete report of these evaluations can be found in Fig 3 through Fig 5 and Table 2.
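For reference, an AUC score of this kind can be computed from per-scan probabilities as sketched below. This is a pure-NumPy sketch using the Mann-Whitney rank-statistic equivalence; the actual evaluation in this work uses scikit-learn [45], and the arrays here are illustrative toy data.

```python
import numpy as np

def auc_score(labels: np.ndarray, scores: np.ndarray) -> float:
    """AUC via the Mann-Whitney U statistic: the probability that a
    randomly chosen positive scan is scored above a random negative one."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    # Pairwise comparisons between all positive/negative pairs; ties count half.
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

labels = np.array([1, 1, 1, 0, 0])            # toy ground truth
scores = np.array([0.9, 0.8, 0.4, 0.3, 0.2])  # toy model probabilities
print(auc_score(labels, scores))  # 1.0 (perfect ranking)
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to a model that scores every positive scan above every negative one.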
ai-corona's impact on assisting the experts' COVID-19 diagnosis was also evaluated, and it mostly indicates a positive improvement in at least their sensitivity or specificity. This improvement is most noticeable for the junior radiologist and the radiology resident. Additionally, incorporating the class activation maps into the experts' diagnosis can help them better examine the involved regions.
Regarding this positive impact on the experts' diagnosis, two cases are discussed here to showcase how ai-corona made experts change their minds for the better in suspicious cases. At least one expert misdiagnosed the case in Fig 7(a) as NCA at first, but upon seeing the AI's diagnosis, correctly diagnosed it as COVID-19. This expert cited seeing peribronchovascular distribution, which is not common in COVID-19 (no subpleural distribution), as the reason for their misdiagnosis. In addition, the case in Fig 7(b) was initially misdiagnosed as COVID-19 by at least one expert, but was correctly changed to NCA upon seeing the AI's correct diagnosis. They cited that cavities, centrilobular nodules, masses, and mass-like consolidations are not commonly seen in COVID-19 pneumonia and might indicate other diagnoses. Fig 7
(a), (b), and (c) are the chest CT scans of patients who were initially misdiagnosed by at least one radiologist but were then diagnosed correctly upon incorporating ai-corona's correct prediction. (d) shows the chest CT scan of a patient that was misdiagnosed by both ai-corona and the radiologists.
The success of AI in medical imaging-based diagnosis has been proven by this work and many others before it. ai-corona can positively influence an expert's opinion and speed up the subject screening process, helping critical cases receive the urgent care they need faster.
Our work has its own drawbacks and shortcomings. Since gathering a dataset with better labeling (one that, alongside accurate annotations, also includes localization and slice labels) is time- and resource-consuming, we opted for an approach that favors robustness and is capable of learning from a simpler dataset. Developing our framework on a better dataset would certainly improve its performance. In addition, the CP class contains all kinds of conditions and diseases that cause pneumonia. As each of these has its own distinct imaging features, having separate classes for them, especially Influenza-A, would improve the framework's performance. Lastly, our framework's learning would certainly benefit from more cases that are positive for COVID-19 yet have a negative RT-PCR result. As these cases are mostly in the early stages of the infection, diagnosing them is more difficult. Moreover, classifying cases with a negative RT-PCR result as non-COVID-19 is unreliable, and a different labeling protocol should be used for them.
In the future, approaches that better incorporate clinical reports with the imaging data should be explored. In conclusion, given the individual drawbacks of diagnosis based on clinical presentation, RT-PCR, and CT, a method combining all three would yield the most accurate diagnosis of COVID-19.
S1 Fig. The cascade structure of the MDH cohort. The number of subjects and scans in each split and set is indicated.
The preliminary dataset was cleaned by removing abdomen and high-resolution CT scans. The train and tuning sets were labeled by two expert radiologists. The NCA and Normal classes of the test set were re-annotated by three expert radiologists (one new). The COVID-19 class comprises patients who meet our criteria and were hospitalized for more than three days.
S2 Fig. The left panel shows the distribution of the number of image slices per case in the MDH cohort, the middle panel shows the age distribution, and the right panel illustrates the sex distribution of cases in the MDH cohort.
S3 Fig. The ROC diagram representing the performance of various pipelines for the different combinations of comparison.
The solid black line is for ai-corona, obtained by varying the discrimination threshold used to convert the continuous probability into a binary "Yes" or "No" result. The filled triangle symbols mark the (1-specificity, sensitivity) points of the individual clinical experts, while the filled circle symbols mark the model-assisted radiologists. The inset plots magnify the region of highest sensitivity and specificity.
S1 Table. The quantitative evaluation of ai-corona, radiologists, and AI-assisted radiologists’ performance results for differentiating between the COVID-19 class and the Normal class at a 95% confidence interval.
S2 Table. The quantitative evaluation of ai-corona, radiologists, and AI-assisted radiologists’ performance results for differentiating between the COVID-19 class and the NCA class at a 95% confidence interval.
S3 Table. The quantitative evaluation of ai-corona, radiologists, and AI-assisted radiologists’ performance results for differentiating between the NCA class and the other classes at a 95% confidence interval.
S4 Table. The quantitative evaluation of ai-corona, radiologists, and AI-assisted radiologists’ performance results for differentiating between the Normal class and the other classes at a 95% confidence interval.
S5 Table. The quantitative evaluation of ai-corona, radiologists, and AI-assisted radiologists’ performance results for differentiating between the NCA class and the Normal class at a 95% confidence interval.
Our framework is available for free and unlimited use to medical professionals and public healthcare via the website ai-corona.com, where a chest CT scan can be uploaded and diagnosed for COVID-19 infection. The authors would like to express their gratitude to the Masih Daneshvari Hospital, Zahra Yousefi, Abbas Danesh, Negar Bandegani, and Shahram Kahkouee for all their hard work and assistance in this project. We thank Prof. Babak A. Ardekani at the Nathan S. Kline Institute and Erfan Zabeh, a Ph.D. student at Columbia University, for their excellent comments and for editing the manuscript. The computational part of this work was carried out on the Brain Engineering Research Center and the High-Performance Computing Cluster of the Institute for Research in Fundamental Sciences (IPM).
- 1. “Coronavirus Cases.” Worldometer, www.worldometers.info/coronavirus/.
- 2. Huang C, Wang Y, Li X, Ren L, Zhao J, Hu Y, et al. Clinical features of patients infected with 2019 novel coronavirus in Wuhan, China. The lancet. 2020 Feb 15;395(10223):497–506.
- 3. Chen N, Zhou M, Dong X, Qu J, Gong F, Han Y, et al. Epidemiological and clinical characteristics of 99 cases of 2019 novel coronavirus pneumonia in Wuhan, China: a descriptive study. The lancet. 2020 Feb 15;395(10223):507–13. pmid:32007143
- 4. Kujawski SA, Wong KK, Collins JP, Epstein L, Killerby ME, Midgley CM, et al. First 12 patients with coronavirus disease 2019 (COVID-19) in the United States. MedRxiv. 2020 Jan 1.
- 5. Forni G, Mantovani A. COVID-19 vaccines: where we stand and challenges ahead. Cell Death Differentiation. 2021 Feb;28(2):626–39.
- 6. Centers for Disease Control and Prevention. Interim Guidelines for Collecting, Handling, and Testing Clinical Specimens from Persons Under Investigation (PUIs) for Coronavirus Disease 2019 (COVID-19). 2020. www.cdc.gov/coronavirus/2019-ncov/lab/guidelines-clinical-specimens.html. Published February 14, 2020. Accessed April 14, 2020.
- 7. Wang W, Xu Y, Gao R, Lu R, Han K, Wu G, et al. Detection of SARS-CoV-2 in different types of clinical specimens. Jama. 2020 May 12;323(18):1843–4. pmid:32159775
- 8. Ai T, Yang Z, Hou H, Zhan C, Chen C, Lv W, et al. Correlation of chest CT and RT-PCR testing for coronavirus disease 2019 (COVID-19) in China: a report of 1014 cases. Radiology. 2020 Aug;296(2):E32–40. pmid:32101510
- 9. Fang Y, Zhang H, Xie J, Lin M, Ying L, Pang P, et al. Sensitivity of chest CT for COVID-19: comparison to RT-PCR. Radiology. 2020 Aug;296(2):E115–7. pmid:32073353
- 10. Surkova E, Nikolayevskyy V, Drobniewski F. False-positive COVID-19 results: hidden problems and costs. The Lancet Respiratory Medicine. 2020 Dec 1;8(12):1167–8.
- 11. Shi H, Han X, Jiang N, Cao Y, Alwalid O, Gu J, et al. Radiological findings from 81 patients with COVID-19 pneumonia in Wuhan, China: a descriptive study. The Lancet infectious diseases. 2020 Apr 1;20(4):425–34. pmid:32105637
- 12. Mahdavi M, Choubdar H, Zabeh E, Rieder M, Safavi-Naeini S, Khanlarzadeh V, et al. Early detection of COVID-19 mortality risk using non-invasive clinical characteristics.
- 13. Revel MP, Parkar AP, Prosch H, Silva M, Sverzellati N, Gleeson F, et al. COVID-19 patients and the Radiology department–advice from the European Society of Radiology (ESR) and the European Society of Thoracic Imaging (ESTI). European radiology. 2020 Sep;30(9):4903–9. pmid:32314058
- 14. Bernheim A, Mei X, Huang M, Yang Y, Fayad ZA, Zhang N, et al. Chest CT findings in coronavirus disease-19 (COVID-19): relationship to duration of infection. Radiology. 2020 Feb 20:200463. pmid:32077789
- 15. Liu X, Faes L, Kale AU, Wagner SK, Fu DJ, Bruynseels A, et al. A comparison of deep learning performance against health-care professionals in detecting diseases from medical imaging: a systematic review and meta-analysis. The lancet digital health. 2019 Oct 1;1(6):e271–97. pmid:33323251
- 16. Ardila D, Kiraly AP, Bharadwaj S, Choi B, Reicher JJ, Peng L, et al. End-to-end lung cancer screening with three-dimensional deep learning on low-dose chest computed tomography. Nature medicine. 2019 Jun;25(6):954–61. pmid:31110349
- 17. Coudray N, Ocampo PS, Sakellaropoulos T, Narula N, Snuderl M, Fenyö D, et al. Classification and mutation prediction from non–small cell lung cancer histopathology images using deep learning. Nature medicine. 2018 Oct;24(10):1559–67. pmid:30224757
- 18. Fourcade A, Khonsari RH. Deep learning in medical image analysis: A third eye for doctors. Journal of stomatology, oral and maxillofacial surgery. 2019 Sep 1;120(4):279–88.
- 19. Maas B, Zabeh E, Arabshahi S. QuickTumorNet: Fast Automatic Multi-Class Segmentation of Brain Tumors. arXiv preprint arXiv:2012.12410. 2020 Dec 22.
- 20. Chowdhury ME, Rahman T, Khandakar A, Mazhar R, Kadir MA, Mahbub ZB, et al. Can AI help in screening viral and COVID-19 pneumonia?. IEEE Access. 2020 Jul 20;8:132665–76.
- 21. Ibrahim AU, Ozsoz M, Serte S, Al-Turjman F, Yakoi PS. Pneumonia classification using deep learning from chest X-ray images during COVID-19. Cognitive Computation. 2021 Jan 4:1–3.
- 22. Rajpurkar P, Irvin J, Zhu K, Yang B, Mehta H, Duan T, et al. Chexnet: Radiologist-level pneumonia detection on chest x-rays with deep learning. arXiv preprint arXiv:1711.05225. 2017 Nov 14.
- 23. Wang X, Peng Y, Lu L, Lu Z, Bagheri M, Summers RM. Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In Proceedings of the IEEE conference on computer vision and pattern recognition 2017 (pp. 2097–2106).
- 24. Rajpurkar P, Irvin J, Ball RL, Zhu K, Yang B, Mehta H, et al. Deep learning for chest radiograph diagnosis: A retrospective comparison of the CheXNeXt algorithm to practicing radiologists. PLoS medicine. 2018 Nov 20;15(11):e1002686. pmid:30457988
- 25. Irvin J, Rajpurkar P, Ko M, Yu Y, Ciurea-Ilcus S, Chute C, et al. Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison. In Proceedings of the AAAI Conference on Artificial Intelligence 2019 Jul 17 (Vol. 33, No. 01, pp. 590–597).
- 26. Bien N, Rajpurkar P, Ball RL, Irvin J, Park A, Jones E, et al. Deep-learning-assisted diagnosis for knee magnetic resonance imaging: development and retrospective validation of MRNet. PLoS medicine. 2018 Nov 27;15(11):e1002699. pmid:30481176
- 27. Krizhevsky A, Sutskever I, Hinton GE. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems. 2012;25:1097–105.
- 28. Li L, Qin L, Xu Z, Yin Y, Wang X, Kong B, et al. Artificial intelligence distinguishes COVID-19 from community acquired pneumonia on chest CT. Radiology. 2020 Mar 19.
- 29. Zhang K, Liu X, Shen J, Li Z, Sang Y, Wu X, et al. Clinically applicable AI system for accurate diagnosis, quantitative measurements, and prognosis of COVID-19 pneumonia using computed tomography. Cell. 2020 Jun 11;181(6):1423–33. pmid:32416069
- 30. Jin C, Chen W, Cao Y, Xu Z, Tan Z, Zhang X, et al. Development and evaluation of an artificial intelligence system for COVID-19 diagnosis. Nature communications. 2020 Oct 9;11(1):1–4. pmid:33037212
- 31. Armato SG III, McLennan G, Bidaut L, McNitt-Gray MF, Meyer CR, Reeves AP, et al. The lung image database consortium (LIDC) and image database resource initiative (IDRI): a completed reference database of lung nodules on CT scans. Medical physics. 2011 Feb;38(2):915–31.
- 32. Tianchi Competition. https://tianchi.aliyun.com/competition/entrance/231601/information (2017).
- 33. Morozov SP, Andreychenko AE, Pavlov NA, Vladzymyrskyy AV, Ledikhova NV, Gombolevskiy VA, et al. Mosmeddata: Chest ct scans with covid-19 related findings dataset. arXiv preprint arXiv:2005.06465. 2020 May 13.
- 34. Corman VM, Landt O, Kaiser M, Molenkamp R, Meijer A, Chu DK, et al. Detection of 2019 novel coronavirus (2019-nCoV) by real-time RT-PCR. Eurosurveillance. 2020 Jan 23;25(3):2000045.
- 35. Deng J, Dong W, Socher R, Li LJ, Li K, Fei-Fei L. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition 2009 Jun 20 (pp. 248–255). IEEE.
- 36. Huang G, Liu Z, Van Der Maaten L, Weinberger KQ. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition 2017 (pp. 4700–4708).
- 37. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition 2016 (pp. 770–778).
- 38. Chollet F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition 2017 (pp. 1251–1258).
- 39. Tan M, Le Q. Efficientnet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning 2019 May 24 (pp. 6105–6114). PMLR.
- 40. Bianco S, Cadene R, Celona L, Napoletano P. Benchmark analysis of representative deep neural network architectures. IEEE Access. 2018 Oct 24;6:64270–7.
- 41. Van Rossum G, Drake FL Jr. Python reference manual. Amsterdam: Centrum voor Wiskunde en Informatica; 1995 May.
- 42. Chollet F. keras. (2015).
- 43. Mason D. SU-E-T-33: Pydicom: an open source DICOM library. Med Phys. 2011;38:3493.
- 44. Zhou B, Khosla A, Lapedriza A, Oliva A, Torralba A. Learning deep features for discriminative localization. In Proceedings of the IEEE conference on computer vision and pattern recognition 2016 (pp. 2921–2929).
- 45. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: Machine learning in Python. the Journal of machine Learning research. 2011 Nov 1;12:2825–30.
- 46. Mohammadi A, Esmaeilzadeh E, Li Y, Bosch RJ, Li JZ. SARS-CoV-2 detection in different respiratory sites: A systematic review and meta-analysis. EBioMedicine. 2020 Sep 1;59:102903.