Predicting all-cause and lung cancer mortality using emphysema score progression rate between baseline and follow-up chest CT images: A comparison of risk model performances

Purpose Normalized emphysema score is a protocol-robust CT biomarker of mortality. We aimed to improve mortality prediction by including the emphysema score progression rate–its change over time–into the models. Method and materials CT scans from 6000 National Lung Screening Trial CT arm participants were included. Of these, 1810 died (445 lung cancer-specific). The remaining 4190 survivors were sampled with replacement up to 24432 to approximate the full cohort. Three overlapping subcohorts were formed which required participants to have images from specific screening rounds. Emphysema scores were obtained after resampling, normalization, and bullae cluster analysis of the original images. Base models contained solely the latest emphysema score. Progression models included emphysema score progression rate. Models were adjusted by including baseline age, sex, BMI, smoking status, smoking intensity, smoking duration, and previous COPD diagnosis. Cox proportional hazard models predicting all-cause and lung cancer mortality were compared by calculating the area under the curve per year follow-up. Results In the subcohort of participants with baseline and first annual follow-up scans, the analysis was performed on 4940 participants (23227 after resampling). Area under the curve for all-cause mortality predictions of the base and progression models 6 years after baseline were 0.564 (0.564 to 0.565) and 0.569 (0.568 to 0.569) when unadjusted, and 0.704 (0.703 to 0.704) to 0.705 (0.704 to 0.705) when adjusted. The respective performances predicting lung cancer mortality were 0.638 (0.637 to 0.639) and 0.643 (0.642 to 0.644) when unadjusted, and 0.724 (0.723 to 0.725) and 0.725 (0.725 to 0.726) when adjusted. Conclusion Including emphysema score progression rate into risk models shows no clinically relevant improvement in mortality risk prediction. This is because scan normalization does not adjust for an overall change in lung density. Adjusting for changes in smoking behavior is likely required to make this a clinically useful measure of emphysema progression.


Introduction
Visual assessment of emphysema extent in CT images of the lungs is a time-consuming process that requires training and is prone to subjectivity [1]. To solve these issues, a quantitative measure of emphysema was introduced [2]. Emphysema score (ES) is defined as the percentage voxels below -950 HU in the lungs [3,4]. However, its value as a clinical risk predictor is limited due to its dependence on different imaging protocols and reconstruction algorithms [5,6]; ES from different centers cannot be compared. Recently, normalized ES was introduced to overcome this issue by applying a normalization algorithm before calculating the ES [7]. It was subsequently shown using images from the National Lung Screening Trial (NLST) CT cohort that mortality risk groups based on normalized ES could better distinguish high-from low-risk individuals compared to those based on the original ES [8,9]. From this point on in this paper, "normalized ES" is referred to as "ES" unless stated otherwise.
Having shown that ES is a protocol-robust CT biomarker, we wanted to investigate whether mortality predictions would improve if a previous CT image was available, enabling the simple calculation of ES change per year (ES progression rate, EPR). This would be especially relevant in a lung cancer CT screening setting where a large, high-risk group is scanned on a yearly basis; even small improvements in a risk model's accuracy can have affect many people. Compared to having measurements at only one time point, disease prognosis should become more accurate when the disease progression rate is known. We hereby hypothesized that EPR could be used as an additional variable to improve the prediction accuracy of all-cause mortality and lung cancer mortality. In the situation where more than one prior CT image is available, it was assumed that a larger absolute difference in ES resulting from more time between the scans would decrease the proportion associated with within-participant random measurement errors. Our second hypothesis was that measuring EPR between scans separated by a larger time interval would improve the risk prediction accuracy.

Scans and data
Permission to use anonymized scans and data from the NLST study was obtained by the National Cancer Institute Cancer Data Access System under project ID NLST-422. In summary, the NLST was a multicenter randomized controlled trial which enrolled 53454 heavy smokers in the USA with the aim to investigate whether annual low-dose chest CT screening significantly reduces the lung cancer mortality compared to chest radiography [8]. The trial took place between 2002 and 2010; participants were screened in three annual rounds (T0, T1, and T2) and followed for another five years for outcome measures. In the CT arm, 26309, that information in some way. Only approved persons may therefore make usage of the data. We provide the pid (personal ID) numbers of the anonymized subjects from the NLST included in our study along with the emphysema score and emphysema score progression rate that we calculated. Data on all other variables may be obtained with permission from the National Cancer Institute Cancer Data Access System and derived based on the pid. National Cancer Institute contact information can also be found here: https:// biometry.nci.nih.gov/cdas/. 24715, and 24102 participants were screened in T0, T1, and T2, respectively; 1810 participants with at least one CT image available were confirmed dead, of which 445 died specifically of lung cancer (if not endpoint verification certified, is stated as cause of death on the death certificate). Note that published results may differ slightly from those in the database due to updates and corrections, and that incomplete or corrupted images were excluded from the study.
The NLST also kept track of the time between each screening round and the event, enabling survival analysis to be performed (see "Statistical analysis" section). Seven patient characteristics as collected from baseline questionnaires were selected to be included in the prediction model as adjustment factors to test whether ES and EPR are independent risk factors. These known predictors of emphysema were: age (years), sex, BMI (five categories: <18.5, 18.5-25, 25-30, 30-35, and �35), smoking status, cigarette smoking intensity (pack years), cigarette smoking duration (years), and previous diagnosis of COPD.

Data selection and subcohorts
The same 6000 participants used by Gallardo-Estrella and colleagues [9] were considered, which consisted of all deceased participants and a random selection of non-deceased participants from the NLST CT arm. Three overlapping study subcohorts were formed based on which screening round images per participant were available: the T0-T1, T0-T2, and T0-T1-T2 subcohorts. The T0-T1 and T0-T2 subcohorts were used to test the first hypothesis: that EPR is of added value for mortality risk prediction regardless of whether EPR was calculated between T0 and T1 or between T0 and T2 scans. The T0-T1-T2 subcohort excluded the largest number of available participants and therefore had a lower statistical power than the prior two cohorts, but a common data set was required to formally test the second hypothesis: that a two-year EPR would improve the prediction accuracy compared to a one-year EPR.

Emphysema quantification
ES was calculated using CIRRUS Lung Quantification (Diagnostic Image Analysis Group; Fraunhofer MEVIS). This process starts with the segmentation of the lungs using an algorithm based on region growing and morphological smoothing [10]. Subsequently, the images are resampled to 3mm slice thickness, normalized [7], and bullae analysis [11] is performed. The percentage of lung voxels less than -950 HU is calculated from the final image as ES [7,12]. EPR was defined as the difference between the latest available ES and the ES from the prior scan, adjusted by the number of years between the two scans, given by the following equation:

Statistical analysis
Data was collected and organized in Microsoft Excel; statistical analysis was performed using R statistical analysis package V.3.4.3. As we did not have the complete NLST non-deceased CT cohort, this subcohort was resampled to simulate the full NLST CT cohort (n = 24432) before excluding participants to form this study's three subcohorts (see "Data selection and subcohorts"). Cox proportional hazards regression analysis was performed (R package "survival") to predict the risk of two outcomes: all-cause mortality and lung cancer mortality. For each subcohort, a base model and a progression model were created; two progression models were created using the T0-T1-T2 subcohort. The base model contained one variable: the latest ES (at T1 or T2). The progression models additionally included EPR (average ES change per year  between T0 and T1, T0 and T2, or T1 and T2).
The analysis was subsequently performed on the adjusted version of the above-mentioned models, which included seven clinical patient characteristics (see section "Scans and data"). In total, 28 models were created; the ES and EPR beta coefficients and hazard ratios including 95% confidence intervals and Wald statistic p-values were recorded for each model. Beta coefficients represent the change in the natural log of the hazard ratio for each unit change in the associated risk factor; the estimated hazard ratio is calculated by multiplying each risk factor with its beta coefficient, summing them, and solving the exponential function of that value. Kaplan-Meier curves were computed with 95% confidence intervals using R package "ggfortify" to visualize the ability of each model to distinguish between different risk group percentiles of participants. The same risk group percentiles were used as Gallardo-Estrella et al. [9] (percentiles 0 to 60, 60 to 80, and 80 to 100), but were further subdivided into more groups of no smaller than 10% of the study population when there was a good visual separation.
The proportional hazards assumption was tested using Schoenfeld residuals (R package "survival"); a p-value below 0.05 is interpretable as a violation of this assumption. Martingale residual plots were used to visually assess whether each continuous variable had a linear relationship with the log hazard (R package "survminer"). The likelihood ratio test, Wald test, and logrank test p-values were obtained for each Cox model (R package "survival") to test goodness of fit.
To compare the predictive accuracy of each model, two time-dependent statistical methods were applied at each year follow-up. Area under the receiver operating characteristic curve (AUC) was used with the support of R package "ROCR" [13]; estimates were from applying the models to the bootstrap data. A value of 0.5 indicates that the model has no predictive ability. When comparing the AUCs of two models, a value closer to 1 indicates a higher accuracy (granted that the 95% confidence intervals do not encompass the other model's AUC). Average continuous net reclassification improvement (NRI)-defined as "1/2 NRI(>0)" by Pencina and colleagues [14]-was calculated using R package "survIDINRI" [15,16]. To calculate NRI, the net percentages of subjects with and without the event of interest correctly reclassified with a higher and lower risk score, respectively, are summed and halved (maximum range: -100% to 100%); positive scores indicate that the new model is more accurate (granted that the 95% confidence intervals do not encompass 0). 95% confidence intervals were obtained from 1000 bootstrap iterations with replacement for both methods.
To simulate the models' application in a screening setting, we calculated the sensitivity, specificity, positive predictive value, negative predictive value, and the ratio of non-deceased to deceased at several pragmatic percentile cut-off points. 95% confidence intervals were calculating using the Wilson procedure with continuity correction [17].

Study participants
Compared to non-deceased participants, more of the deceased participants were older, male, had a higher BMI, were active smokers with a more extensive smoking history, and were diagnosed with COPD (Table 1). Regarding adjustment factors, 17 participants were missing values for the variable "BMI" and 14 were missing "previous diagnosis of COPD"; these participants were excluded from the analysis.
Out of the 6000 available participants, 4940 (82.3%) were included in the T0-T1 subcohort, 4614 (76.9%) were included in the T0-T2 subcohort, and 4420 (73.7%) were included in the T0-T1-T2 subcohort (Fig 1). The respective percentage of deaths per subcohort were 26.4% Among all participants with a baseline scan, the derived ESs for each screening round as well as the EPRs between T0 and T1 and T0 and T2 can be found in the supplementary dataset (S1 Dataset). Figs 2 and 3 show the normalized T0 and T1 CT images with and without ES overlays of two participants, the former with an EPR of 21.6 and the latter -10.3.

Cox model assumptions
All variables in all models had a p-value greater than 0.05 when testing for the proportional hazard assumption. All models had p<0.0001 for the goodness of fit tests (likelihood ratio test, Wald test, and logrank test). The distribution of ES was right skewed, but attempts to transform this variable (i.e., log transformation, absolute value transformation, square root transformation, and polynomial transformation) did not maintain linearity. Similarly, transforming EPR had little effect due to the small spread. The other continuous adjustment variables (age, pack years, and smoking duration) remained slightly nonlinear regardless of transformation and were ultimately also left untransformed. Table 2 displays the beta coefficients and hazard ratios of the ES and EPR for all models developed from the resampled T0-T1 and T0-T2 subcohorts. ES was a statistically significant variable in all models (p<0.001); this also applied to EPR in the all-cause mortality prediction models. EPR was less accurate for lung cancer mortality in the unadjusted models, and not of added benefit when adjusting for patient characteristics (p>0.05). Beta coefficients and hazard ratios of the adjustment factors are displayed in Tables 3 and 4 Time-dependent AUCs and NRIs of the base and progression models are displayed in Tables 5 and 6. For both outcomes and across most time points, the resulting AUCs and NRIs indicated that the progression models had significantly higher accuracies than the base models when not adjusted for patient characteristics. When adjusted for patient characteristics, progression models were not statistically superior to the base models at most time points. Measures of clinical outcomes at several hazard ratio cut-off points (based on percentiles) are displayed in Table 7. Without information on patient characteristics, 80% of the participants had an estimated hazard ratio above 1.00 using both the base and one-year progression models. At this cut-off point, 29.1% (95% confidence interval: 26.7 to 31.7%) of all the deaths within six years were correctly predicted using solely T1 ES; per death, there were 11.3 (10.1 to 12.6) surviving "overdiagnosed" participants. If only 2% of those with the highest risk scores were selected, 4.7% (3.6 to 6.0%) of all deaths would be observed; however, the ratio of nondeceased to deceased would also decrease to 6.6 (4.8 to 9.4). With the addition of EPR, this ratio decreases to 5.4 (4.0 to 7.5); the ratios for the adjusted base and the adjusted one-year progression models are 3.5 (2.6 to 4.6) and 3.3 (2.5 to 4.4), respectively.

One-versus two-year progression models
Two ES progression models were developed from the resampled T0-T1-T2 subcohort: the one-and two-year EPR calculated from T1 and T2 scans and T0 and T2 scans, respectively. With or without adjustment factors, the absolute value of the two-year EPR beta coefficients (unadjusted: -0.08460 [-0.12219 to -0.04702]; adjusted: -0.06288 [-0.10910 to -0.01665]) were  Table 8). At most time points, the AUCs of the unadjusted progression models indicated a small gain in accuracy when using the two-year progression model compared to the one-year progression model (Table 9); the difference in AUCs ranged from 0.002 to 0.013. The differences in beta coefficients for predicting lung cancer mortality were smaller, had a larger variance, and did not lead to differences in model performance. No performance differences were seen among the adjusted models' AUCs. Adjusted or not, the time-dependent NRI comparisons did not show a statistically significant difference between the progression models (Table 10).

Discussion
To our best knowledge, no other study has yet attempted to measure pulmonary emphysema progression in CT scan using quantitative methods for potential use as a measure of disease progression over time. Our study demonstrated that when a prior chest CT scan is available, calculating the EPR in addition to the latest ES showed a very minor improvement to the allcause and lung cancer mortality prediction in a lung cancer screening population. On the other hand, when baseline patient information was available and included in the model-more specifically age, sex, weight, height, smoking history, and disease history-the value of EPR is negated completely. These results are also reflected in Table 7, which shows the predicted clinical outcomes when the population is separated into high versus low risk groups.
Though not an objective of the study, a positive result was that ES remained a highly significant predictor in all adjusted models developed (p < .001); this had not yet been demonstrated in the previous study by Gallardo-Estrella et al. [9]. ES should therefore continue to be considered for inclusion in future multivariable risk prediction models when available.
Among most progression models, it was observed that the absolute value of the models' two-year EPR beta coefficients were considerably larger than that of one-year EPR. A theory is that that calculating EPR using a larger time gap between two images would reduce the within- participant random measurement errors relative to the actual change in ES. We therefore hypothesized that two-year EPR is a superior mortality biomarker than one-year EPR and developed both models in a common data set where each participant had images from all screening rounds. Indeed, in the resampled T0-T1-T2 subcohort, we found a statistically significant higher AUC for the unadjusted two-year progression model than that of one-year; on the other hand, the magnitude was very small and again completely negated when adjusted for patient characteristics.
It is worth noting that most participants do not have an EPR large enough to have any influence on their risk score. Using a non-parametric test, there was no statistically significant difference in EPR between the deceased and non-deceased groups (Table 1). However, even among heavy smokers (at least 30 pack years, and no more than 15 years since quit among former smokers [8]), only 6.3% (359/5740) had the diagnosis COPD at baseline. The added value may therefore only apply for the small proportion with more severe emphysema. Careful consideration of the Kaplan-Meier curves reveals that the biggest change generally takes place in the highest risk group (p90-p100) (Figs 4-7). In most graphs, the five-or six-year survival probability of the p90-p100 risk group increases by about 1% and the p80-p90 group decreases by about 1% accompanied by a narrowing of confidence intervals for better separation from the p60-p80 group. Graphs of adjusted models-which could better distinguish between lower risk strata-additionally showed an improved separation between the p60-p70 and p70-p80 groups. In other words, the seemingly negligible benefit of EPR may in part be due to the small number of people that it applies to. Another discussion point is that emphysema is currently considered permanent and irreversible, so a moderate decrease in ES which is not explainable by random within-participant measurement errors should not be possible. Regardless, the EPR beta coefficients among all progression models were negative, meaning that a decrease in ES since the last scan increases the mortality risk. This is a paradoxical result which has been observed in previous studies [18][19][20][21][22][23]. The current explanation for this is that active smokers' lungs are in a constant state of inflammation causing a diffuse increase in density of the lung parenchyma; smoking cessation leads to the clearance of the inflammatory component resulting in a general decrease of lung density and revealing areas of decreased lung density below the threshold of being counted as emphysema, hereby increasing the ES. In a study cohort where about half of the participants were diagnosed with COPD, Grydeland and colleagues [18] reported non-normalized ES medians of 9.72 and 4.86 among former and current smokers, respectively. In our study, 40 out of the 4420 participants (0.9%) in our T0-T1-T2 subcohort had a T0 to T1 EPR less than -5 and 103 (2.3%) met these criteria from T1 to T2. On the other side of the spectrum, the numbers with an EPR greater than 5 were 135 (3.1%) and 93 (2.1%), respectively.
In our study, adjusting for whether the participant was an active smoker or not did not change the sign; this information was taken from a baseline questionnaire, hence it is not known whether the participants' smoking behavior had changed. It may therefore be conjectured that former smokers relapsed after the screening program failed to diagnose them with lung cancer [24]. Scan normalization does not change the overall scan density and can therefore not automatically adjust for changes in inflammation over time. To solve this issue, different density cut-off points may be used depending on the time since smoking abstinence. Another solution may be to adjust for the mean lung density, which would not require information on smoking behavior. This study has other limitations. Due to the very small spread, it is still not clear what kind of relationship EPR has with mortality risk; experimenting with various transformation methods did not appear to improve the linearity nor the model accuracy (results not included). Our inclusion criteria required participants to have survived one or two years following the baseline scan, which introduced selection bias.
The value of the current study is that it provides a thorough description and reasoning of the unexpected negative findings, which may contribute to ensuing investigations. A future study on validating EPR should either utilize a data set including patient characteristics reported at each scan time point or include another CT measure to adjust for differences in overall lung density.

Conclusions
It is expected that the prognosis accuracy of a disease should improve considerably when the rate of disease progression is known, which can only be recorded when measurements are made at multiple time points. This was not the case in our study, where we have shown that simply calculating the change in normalized emphysema score between two CT images does not clinically improve the prediction of all-cause and lung cancer mortality. We surmised that recent smoking behavior has a crucial impact on the emphysema score: Though the normalization method remains useful for comparing CT images created using different protocols and algorithms, it does not adjust for changes in the overall lung density which, for each individual, is prone to change over time. It is expected that a future study utilizing follow-up clinical information or adjusting for the mean lung density is necessary before emphysema score progression rate may be clinically useful as a quantitative CT biomarker. Table 5. Time-dependent receiver operating characteristic area under the curve comparing the emphysema score base models to the progression models predicting all-cause mortality and lung cancer mortality (resampled T0-T1 and T0-T2 subcohorts).

Unadjusted models Adjusted models
Years after T1

Base model AUC (95% CI)
One-year progression model AUC (95% CI) a All-cause mortality prediction    Table 9. Time-dependent receiver operating characteristic area under the curve comparing the base models against the progression models (resampled T0-T1-T2 subcohort).

Unadjusted models Adjusted models
Years after T2