Visual versus Automated Evaluation of Chest Computed Tomography for the Presence of Chronic Obstructive Pulmonary Disease

Background Incidental CT findings may provide an opportunity for early detection of chronic obstructive pulmonary disease (COPD), which may prove important in a CT-based lung cancer screening setting. We aimed to determine the diagnostic performance of human observers in visually evaluating COPD presence on CT images, in comparison to automated evaluation using quantitative CT measures.
Methods This study was approved by the Dutch Ministry of Health and the institutional review board. All participants provided written informed consent. We studied 266 heavy smokers enrolled in a lung cancer screening trial. All subjects underwent volumetric inspiratory and expiratory chest computed tomography (CT). Pulmonary function testing was used as the reference standard for COPD. We evaluated the diagnostic performance of eight observers and one automated model based on quantitative CT measures.
Results The prevalence of COPD in the study population was 44% (118/266), of whom 62% (73/118) had mild disease. The diagnostic accuracy was 74.1% in the automated evaluation, and ranged between 58.3% and 74.3% for the visual evaluation of CT images. The positive predictive value was 74.3% in the automated evaluation, and ranged between 52.9% and 74.7% for the visual evaluation. Interobserver variation was substantial, even within the subgroup of experienced observers. Agreement within observers yielded kappa values between 0.28 and 0.68, regardless of the level of expertise. The agreement between the observers and the automated CT model showed kappa values of 0.12–0.35.
Conclusions Visual evaluation of COPD presence on chest CT images provides at best modest accuracy and is associated with substantial interobserver variation. Automated evaluation of COPD subjects using quantitative CT measures appears superior to visual evaluation by human observers.


Introduction
Emphysema and airways disease are common incidental findings on computed tomography (CT) performed for other reasons, offering the potential to identify subjects with undetected chronic obstructive pulmonary disease (COPD) [1]. COPD is one of the leading causes of death [2,3], and is expected to account for one in every 25 deaths in the developed world [2]. The disease is predominantly caused by tobacco exposure and is characterized by chronic airflow obstruction caused by emphysema and airways disease [4]. Since early smoking cessation prevents COPD disease progression [5,6] and evidence suggests that early intervention improves outcome [7,8], early diagnosis is crucial in managing this disease [9,10]. Unfortunately, symptoms occur late in the course of the disease and early stages are substantially underdiagnosed [11,12]. Additionally, COPD is a predictor of cardiovascular mortality [13] and lung cancer [14,15]. Given these facts, and given that chest imaging is among the most commonly ordered radiological examinations, often ordered by non-pulmonary specialists in patients with an unknown COPD status, there has been considerable interest in the use of chest imaging to identify subjects with COPD. However, the general conclusion is that conventional chest radiography is insensitive in identifying mild to moderate COPD-related abnormalities [16][17][18][19]. In contrast, COPD-related abnormalities (i.e. airways disease and emphysema) are probably more readily detectable on chest CT than on conventional radiography. The Lung Screening Study supports this superior accuracy by showing that chest CT depicted 2.5 times more COPD-related changes than chest radiography [20].
Recently, it has been reported that using an automated CT model based on quantitative measures of emphysema and air trapping, identification of COPD subjects in a lung cancer screening setting was feasible with reasonable accuracy [21]. However, the reliability and accuracy of human observers to visually evaluate COPD presence on CT images is unknown. Therefore, we aimed to determine the diagnostic performance of human observers with various levels of expertise to visually evaluate COPD presence on CT images, and compare this to the performance of automated evaluation based on quantitative CT measures.

Ethics statement
This study was performed within the setting of the population-based Dutch Belgian Lung Cancer Screening Trial (NELSON trial; ISRCTN63545820) [22], which was approved by the Dutch Ministry of Health and by the local ethical review board ('Medisch Ethische Toetsingcommissie University Medical Center Utrecht'). To study COPD, expiratory CT acquisition was added to the screening protocol (i.e. inspiratory CT and pulmonary function testing) in our center, starting July 2007. This addition was separately approved by the local ethical review board of the University Medical Center Utrecht (approval 03-040/C). Written informed consent was obtained from each participant.

Study population
The NELSON trial enables valuable research into the early stages of COPD, which is more difficult in clinical routine because early COPD is not an indication for chest CT in our routine practice. Participants were all current or former heavy smokers meeting the inclusion criteria of the screening trial, as described previously [22]. Briefly, participants were heavy smokers between the ages of 50 and 75 years with at least 16.5 pack-years of smoking history who were also physically fit enough to undergo potential surgery. For the present study we included a random sample of 266 male individuals who had lung function testing and a paired inspiratory and expiratory CT scan obtained on the same day between July 2007 and September 2008. This cohort is a representative sample of the total screening population. The comparison between our study population and the total screening trial population is shown in Table 1. A flow diagram of the study is shown in Figure 1.

CT scanning
Volumetric CT in inspiration and at end-expiration was obtained from lung bases to lung apices after standardized breathing instructions by a trained radiographer. CT images were acquired with 16 × 0.75 mm collimation (Brilliance 16P; Philips Medical Systems, USA), and images with slice thickness of 1.0 mm at 0.7 mm increment were reconstructed using a smooth kernel (B-filter; Philips). Dose settings were adjusted to body weight: subjects weighing 80 kg or less received 120 kVp at 30 mAs for the inspiratory acquisition and 90 kVp at 20 mAs for the expiratory acquisition. Subjects weighing over 80 kg received 140 kVp at 30 mAs for the inspiratory acquisition and 120 kVp at 20 mAs for the expiratory acquisition.

Pulmonary function testing
Pulmonary function testing without bronchodilator administration was performed on the same day as CT imaging. Spirometry was performed with ZAN equipment (ZAN Messgeräte GmbH, Oberthulba, Germany), according to the American Thoracic Society and European Respiratory Society guidelines [23]. The lung function testing included forced expiratory volume in the first second (FEV1) and the ratio of FEV1 over forced vital capacity (FEV1/FVC). The reference standard for COPD was an FEV1/FVC ratio of less than 0.70 [4].
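As a minimal sketch of the reference standard described above, the following Python snippet flags airflow obstruction from two spirometry values. The function and argument names are illustrative assumptions and are not part of the study's software; only the 0.70 threshold comes from the text.

```python
def has_copd(fev1_l: float, fvc_l: float, threshold: float = 0.70) -> bool:
    """Return True when the FEV1/FVC ratio is below the threshold.

    fev1_l and fvc_l are assumed to be in liters (hypothetical argument
    names); 0.70 is the cut-off stated in the text.
    """
    if fvc_l <= 0:
        raise ValueError("FVC must be positive")
    return (fev1_l / fvc_l) < threshold


# Example: FEV1 = 2.0 L, FVC = 4.0 L gives a ratio of 0.50, below 0.70.
print(has_copd(2.0, 4.0))
```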

Visual evaluation of CT images
Eight observers with various levels of expertise in evaluating chest CT images [24] participated in this study. See Table 2 for more detailed information on the observers.
All CT images were anonymized and presented to the observers in a randomized order on a 3D research workstation (iXviewer, Image Sciences Institute, Utrecht, The Netherlands). For each case, the paired inspiratory and expiratory CT scans were shown alongside each other. The observers were able to view each scan completely and in any direction, corresponding to clinical routine. The observers were asked to judge whether lung function impairment was present in the case presented (i.e. COPD present or absent), based on their evaluation of the presence and extent of emphysema, air trapping, airway abnormalities or any other finding on CT imaging. They were also provided with some basic patient characteristics, similar to those used in the automated evaluation (i.e. age, body mass index, smoking status and smoking history; see next paragraph). To closely resemble daily practice, visual evaluation of the cases was performed without a prior

Automated evaluation of CT images
COPD presence was automatically evaluated, using a CT model that includes quantitative measures of CT emphysema and CT air trapping, age, body mass index, smoking status and smoking history. The model has previously been described in detail elsewhere [21]. In summary, the predicted probability for COPD presence was calculated using a regression equation:
\[ \text{Probability COPD} = -11.400 + 0.9036 \cdot \text{CT emphysema (log)} + 0.1519 \cdot \text{CT air trapping} - 0.0645 \cdot \text{BMI} + 0.0083 \cdot \text{pack-years} \quad (-0.7115 \text{ if former smoker}) \]
Based on the calculated probability, subjects were dichotomized as either COPD subjects or non-COPD subjects according to an optimal cut-off value [21].
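The regression equation can be sketched in Python as follows, reading the constant, the BMI coefficient and the former-smoker term as negative (−11.400, −0.0645 and −0.7115). This is an illustrative sketch rather than the authors' implementation: the function and argument names are hypothetical, the units of the CT measures are those defined in [21] and are not restated here, and the cut-off is passed in as a parameter because its value is not given in this text. The direction of the comparison in `classify` is likewise an assumption.

```python
def copd_score(ct_emphysema_log: float,
               ct_air_trapping: float,
               bmi: float,
               packyears: float,
               former_smoker: bool) -> float:
    """Evaluate the regression equation as printed in the text."""
    score = (-11.400
             + 0.9036 * ct_emphysema_log
             + 0.1519 * ct_air_trapping
             - 0.0645 * bmi
             + 0.0083 * packyears)
    if former_smoker:
        # The -0.7115 term applies only to former smokers.
        score -= 0.7115
    return score


def classify(score: float, cutoff: float) -> bool:
    """Dichotomize: True = COPD, False = non-COPD.

    The cutoff value comes from [21] and is NOT stated in this text,
    so no default is provided. Using >= is an assumption.
    """
    return score >= cutoff
```

Keeping the cut-off as an explicit argument avoids hard-coding a value the text does not report.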

Statistical Analyses
Kappa (κ) values were calculated in order to assess intraobserver and interobserver agreement. Agreement was classified as poor when κ was 0.20 or less, fair when between 0.21 and 0.40, moderate when between 0.41 and 0.60, good when between 0.61 and 0.80, and very good when higher than 0.80 [25]. Both the automated and the visual evaluation were compared to the reference standard of pulmonary function testing, and diagnostic performance was calculated in terms of the sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV) and accuracy, all with 95% confidence intervals. Results are presented separately for the less experienced and experienced observers.
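To illustrate the agreement analysis, the sketch below computes an unweighted Cohen's kappa for two sets of binary readings and maps the result onto the agreement categories listed above. The text does not state which kappa variant was used, so the unweighted form is an assumption; the function names and the toy ratings are hypothetical and are not study data.

```python
def cohen_kappa(ratings_a, ratings_b):
    """Unweighted Cohen's kappa for two equal-length lists of 0/1 ratings."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(ratings_a)
    # Observed agreement: fraction of cases rated identically.
    p_observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement from each rater's marginal rate of positive calls.
    p_a = sum(ratings_a) / n
    p_b = sum(ratings_b) / n
    p_chance = p_a * p_b + (1 - p_a) * (1 - p_b)
    if p_chance == 1:
        return 1.0  # both raters gave the same single label throughout
    return (p_observed - p_chance) / (1 - p_chance)


def agreement_category(kappa):
    """Classify kappa using the bands given in the text [25]."""
    if kappa <= 0.20:
        return "poor"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "good"
    return "very good"


# Toy example (not study data): two readers, four cases.
reader_1 = [1, 1, 0, 0]
reader_2 = [1, 0, 0, 0]
k = cohen_kappa(reader_1, reader_2)
print(k, agreement_category(k))
```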
Diagnostic performance was compared between each observer and the automated evaluation by the CT model [26].
All analyses were performed with SPSS Version 15.0 for Windows (SPSS, Chicago, Illinois, USA). A p-value below 0.05 was considered statistically significant.

Study population
Our study population consisted of 266 heavily smoking male subjects with a mean ± standard deviation age of 62.5 ± 5.0 years. Detailed study population characteristics are presented in Table 3.

Observer agreement in CT-based evaluation of COPD presence
The intraobserver agreement ranged from a κ value of 0.28 to 0.68 (median 0.64) for the less experienced observers, and from 0.49 to 0.53 (median 0.49) for the experienced observers. The interobserver agreement for the less experienced observers yielded κ values between 0.18 and 0.55 (median 0.36). The interobserver agreement for the experienced observers yielded κ values between 0.35 and 0.57 (median 0.40).
The agreement between each less experienced observer and the automated CT model yielded κ values between 0.12 and 0.30 (median 0.28). For the experienced observers this ranged between 0.20 and 0.35 (median 0.33). Results on the observer agreement are listed in Table 4.

Diagnostic performance for CT-based evaluation of COPD presence
In our study population, 44.4% (118/266) of the subjects had COPD according to the reference standard. The percentage of subjects with suspected COPD after visual evaluation of the CT images by the human observers ranged from 25.9% to 60.2%. The accuracy of the less experienced observers ranged from 58.3% to 62.4%, and the positive predictive value ranged from 52.9% to 60.9%. For the experienced observers, the accuracy ranged from 64.7% to 73.3% and the positive predictive value from 64.6% to 74.7%.
The percentage of subjects with suspected COPD after automated evaluation by the CT model was 38.0%. The automated CT model had an accuracy of 74.1% and a positive predictive value of 74.3%. Table 5 specifies the diagnostic performance measures for each observer and for the automated CT model.
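The performance measures can be illustrated with a short Python sketch that derives them from a 2×2 table. The counts used here are an assumption: they were back-calculated from the percentages reported above for the automated model (266 subjects, 118 with COPD, 38.0% test-positive, PPV 74.3%, accuracy 74.1%) and are not taken from Table 5, which remains the authoritative source. The function name is hypothetical, and confidence intervals are omitted because the interval method is not stated in the text.

```python
def diagnostic_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard diagnostic performance measures from a 2x2 table."""
    total = tp + fp + fn + tn
    return {
        "prevalence": (tp + fn) / total,
        "test_positive_rate": (tp + fp) / total,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "accuracy": (tp + tn) / total,
    }


# ASSUMPTION: counts reconstructed from the reported percentages for the
# automated CT model; they are illustrative, not published table values.
metrics = diagnostic_metrics(tp=75, fp=26, fn=43, tn=122)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

With these reconstructed counts the sketch reproduces the reported prevalence (44.4%), test-positive rate (38.0%), PPV (74.3%) and accuracy (74.1%); the remaining values it prints are implied by the reconstruction rather than quoted from the paper.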
Comparison between the automated evaluation by the CT model and the visual evaluation by the human observers shows that all but two observers had a significantly worse diagnostic performance in either sensitivity or specificity, or both (0.001 < p < 0.05). Only the specialized chest radiologist clearly approached the diagnostic performance of the CT model (p = 0.79), while a clear trend was seen for the other, less experienced observer (p = 0.06).

Discussion
In this study we report the diagnostic performance of human observers in identifying subjects with COPD using visual evaluation of lung cancer screening chest CT scans. Their performance was compared to the performance of automated evaluation of CT images. Accuracy of visual evaluation for COPD presence was modest, and the accuracy of the automated evaluation was higher than that of the observers. Diagnostic performance of the human observers seems to improve slightly with level of expertise, and approaches that of the automated model for the specialized chest radiologist. Nevertheless, intraobserver and interobserver variation was substantial, even in the most experienced observers. Our study demonstrates that although CT images contain diagnostic information related to COPD in a population with mainly early stages of disease, the reliability and diagnostic accuracy of visual evaluation is limited and certainly not better than automated evaluation.
The fairly low accuracy of visual evaluation for COPD presence shows that human observers experience difficulty in judging which lung abnormalities are functionally relevant. In addition, the limited intraobserver and interobserver agreement indicates that human observers have their own subjective and inconsistent understanding of what COPD looks like on CT (i.e. what type of abnormalities, and to what extent, will result in airflow obstruction and abnormal lung function). This finding is in line with previous literature showing that visual evaluation of emphysema, air trapping and airway wall thickening is prone to considerable interobserver variability [27][28][29][30][31]. This, together with the modest diagnostic accuracy, has clinical implications: the extensive and increasing use of CT imaging [32], combined with the commendable practice of radiologists to report all imaging findings, including the incidental and unrequested ones, may lead to an increase in subjects who are wrongfully stigmatized based on the presence of COPD-related abnormalities on CT. Consequently, our study urges radiologists to remain cautious in interpreting these abnormalities and in reporting previously unknown disease. Whenever COPD is suspected based on CT findings, confirmatory lung function testing is required and should always be suggested.
Table 4. Intra- and interobserver agreement for CT-based identification of COPD.
Since CT-based lung cancer screening in heavy smokers is now recommended in the US [33,34], the chances to detect early COPD in high-risk subjects using screening CT images are increasing. At this stage, better understanding of functionally relevant CT abnormalities and improvement of observer agreement should be sought, which may lead to improved accuracy. On the other hand, identification of COPD can be based on automated evaluation using quantitative CT analysis, which we believe will become more important than visual evaluation; it is fast and inexpensive, and the basic CT model, which at this stage includes only simple lung density measures and a few patient characteristics, already performs better than the human observers. Its performance is approached only by the specialized chest radiologist, and it is unlikely that in daily practice the large number of lung cancer screening CT scans will be reviewed by a specialized chest radiologist. Nevertheless, the quantitative approach needs to be further validated and improved, and clinical use might require more standardized CT operating procedures to limit differences between CT scanners and differences in breath hold procedures.
Our study is of importance since it addresses a common clinical problem, related to a disease with major healthcare impact. The main strengths are that we used a representative sample of CT readers with various levels of expertise, and closely resembled clinical practice by providing 3D inspiratory and expiratory CT data and some clinical information. Also, we were able to provide data on a substantial number of subjects with early stages of COPD, who are difficult to include in studies based on routine practice.
Our study has limitations. Firstly, spirometry was performed without administration of a bronchodilator, although bronchodilator administration is recommended to exclude asthma. However, we believe it is unlikely that this has significantly influenced the results because the prevalence of asthma in men between the ages of 50 and 75 years is only approximately 2% in the general population of the Netherlands [35], and our study population comprised only heavy smokers. Secondly, our study was limited to male subjects. This may limit the generalizability of our findings. Thirdly, our study evaluated functionally relevant lung abnormalities at the time of imaging. Given the cross-sectional nature of our study we cannot comment on whether observers identified subclinical abnormalities that may lead to abnormal lung function in the future. Lastly, we were unable to include more than one or two observers at each level of expertise, which impedes analysis within a group of similarly experienced observers. Nevertheless, our results are based on a fairly large group of observers subdivided into a less experienced and an experienced subgroup.
In conclusion, this study reports modest diagnostic accuracy of human observers in the visual evaluation for COPD presence on volumetric inspiratory and expiratory CT images in heavy smokers. Moreover, visual evaluation for COPD presence is associated with substantial observer variation. Our findings suggest that visual evaluation of CT scans for COPD presence is of limited diagnostic value, while there may be a role for automated evaluation. This may be important for the additional identification of COPD subjects in a CT-based lung cancer screening setting.